At Ibotta, our mission is to Make Every Purchase Rewarding. Helping our users (whom we call Savers) find and activate relevant offers through our direct-to-consumer (D2C) app, browser extension, and website is a critical part of this mission. Our D2C platform helps millions of shoppers earn cashback from their everyday purchases—whether they’re unlocking grocery deals, earning bonus rewards, or planning their next trip. Through the Ibotta Performance Network (IPN), we also power white-label cashback programs for some of the biggest names in retail, including Walmart and Dollar General, helping over 2,600 brands reach more than 200 million consumers with digital offers across partner ecosystems.
Behind the scenes, our Data and Machine Learning teams power critical experiences like fraud detection, offer recommendation engines, and search relevance to make the Saver journey personalized and secure. As we continue to scale, we need data-driven, intelligent systems that support every interaction at every touchpoint.
Across D2C and the IPN, search plays a pivotal role in engagement and needs to keep pace with our business scale, evolving offer content, and changing Saver expectations.
In this post we’ll walk through how we significantly refined our D2C search experience: from an ambitious hackathon project to a robust production feature now benefiting millions of Savers.
We believed our search could better keep up with our Savers
User search behavior has evolved from simple keywords to incorporating natural language, misspellings, and conversational phrases. Modern search systems must bridge the gap between what users type and what they actually mean, interpreting context and relationships to deliver relevant results even when query terms don’t exactly match the content.
At Ibotta, our original homegrown search system at times struggled to keep pace with the evolving expectations of our Savers, and we recognized an opportunity to refine it.
The key areas for opportunity we saw included:
- Improving semantic relevance: Focusing on understanding Saver intent over exact keyword matches to connect them with the right offers.
- Enhancing understanding: Interpreting the full nuance and context of user queries to provide more comprehensive and truly relevant results.
- Increasing flexibility: More rapidly integrating new offer types and adapting to changing Saver search patterns to keep our discovery experience rewarding.
- Boosting discoverability: Building more robust tools to ensure specific types of offers and key promotions were consistently visible across a wide array of relevant search queries.
- Accelerating iteration and optimization: Enabling faster, impactful improvements to the search experience through real-time adjustments and performance tuning.
In short, we believed the system could better keep pace with changing offer content, search behaviors, and evolving Saver expectations, and we saw opportunities to increase the value for both our Savers and our brand partners.
From hackathon to production: reimagining search with Databricks
Addressing the limitations of our legacy search system required a focused effort. This initiative gained significant momentum during an internal hackathon where a cross-functional team, including members from Data, Engineering, Marketing Analytics, and Machine Learning, came together with the idea to build a modern, alternative search system using Databricks Vector Search, which some members had learned about at the Databricks Data + AI Summit.
In just three days, our team developed a working proof-of-concept that delivered semantically relevant search results. Here’s how we did it:
- Collected offer content from multiple sources in our Databricks catalog
- Created a Vector Search endpoint and index with the Python SDK (sketched just after this list)
- Used pay-per-token embedding endpoints with five different models (BGE large, GTE large, GTE small, a multilingual open-source model, and a Spanish-language-specific model)
- Connected everything to our website for a live demo
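To give a sense of how little code the prototype required, here is a minimal sketch of the Vector Search setup; the endpoint, catalog, table, column, and embedding model names are placeholders rather than our production configuration:

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Create a Vector Search endpoint to host the index.
client.create_endpoint(name="offers-search", endpoint_type="STANDARD")

# Create a Delta Sync index that stays in sync with a source offers table and
# computes embeddings with a pay-per-token Foundation Model endpoint.
index = client.create_delta_sync_index(
    endpoint_name="offers-search",
    index_name="main.search.offers_index",
    source_table_name="main.search.offers",
    pipeline_type="TRIGGERED",
    primary_key="offer_id",
    embedding_source_column="offer_text",
    embedding_model_endpoint_name="databricks-bge-large-en",
)

# Query the index for semantically similar offers.
results = index.similarity_search(
    query_text="greek yogurt",
    columns=["offer_id", "title"],
    num_results=10,
)
```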
The hackathon project won first place and generated strong internal buy-in and momentum to take the prototype to production. Over the course of a few months, and with close collaboration from the Databricks team, we transformed it into a robust, full-fledged production search system.
From proof of concept to production
Moving the hackathon proof-of-concept to a production-ready system required careful iteration and testing. This phase was critical not only for technical integration and performance tuning, but also for evaluating whether our anticipated system improvements would translate into positive changes in Saver behavior and engagement. Given search’s essential role and deep integration across internal systems, we opted for the following approach: we modified a key internal service that called our original search system, replacing those calls with requests directed to the Databricks Vector Search endpoint, while building in robust, graceful fallbacks to the legacy system.
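In pseudocode, the routing change looked roughly like the sketch below; `query_vector_search` and `query_legacy_search` are hypothetical stand-ins for our internal service calls, not actual Ibotta code:

```python
import logging

logger = logging.getLogger("search")

def query_vector_search(query: str, limit: int) -> list[dict]:
    # Stand-in for the call to the Databricks Vector Search endpoint.
    raise NotImplementedError

def query_legacy_search(query: str, limit: int) -> list[dict]:
    # Stand-in for the call to the original homegrown search system.
    return []

def search_offers(query: str, limit: int = 20) -> list[dict]:
    """Serve results from Vector Search, degrading gracefully to the legacy system."""
    try:
        return query_vector_search(query, limit)
    except Exception:
        # Any failure (timeout, endpoint unavailable, etc.) falls back to the
        # legacy engine so Savers always get results.
        logger.exception("Vector Search failed; falling back to legacy search")
        return query_legacy_search(query, limit)
```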
Most of our early work focused on understanding how the new system performed with real Savers. In the first month, we ran a test with a small percentage of our Savers that did not achieve the engagement results we had hoped for. Engagement decreased, particularly among our most active Savers, as indicated by a drop in clicks, unlocks (when Savers express interest in an offer), and activations.
However, the Vector Search solution offered significant benefits including:
- Faster response times
- A simpler mental model
- Greater flexibility in how we indexed data
- New abilities to adjust thresholds and change embedding text
Pleased with the system’s underlying technical performance, we saw its greater flexibility as the key advantage needed to iteratively improve search result quality and overcome the disappointing engagement results.
Building a semantic evaluation framework
Following our initial test results, relying solely on A/B testing for search iterations was clearly inefficient and impractical. The number of variables influencing search quality was immense—including embedding models, text combinations, hybrid search settings, Approximate Nearest Neighbors (ANN) thresholds, reranking options, and many more.
To navigate this complexity and accelerate our progress, we decided to establish a robust evaluation framework. This framework needed to be uniquely tailored to our specific business needs and capable of predicting real-world user engagement from offline performance metrics.
Our framework was designed around a synthetic evaluation environment that tracked over 50 online and offline metrics. Offline, we monitored standard information retrieval metrics like Mean Reciprocal Rank (MRR) and precision@k to measure relevance. Crucially, this was paired with online real-world engagement signals such as offer unlocks and click-through rates. A key decision was implementing an LLM-as-a-judge. This allowed us to label data and assign quality scores to both online query-result pairs and offline outputs. This approach proved to be critical for rapid iteration based on reliable metrics and collecting the labeled data necessary for future model fine-tuning.
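For reference, the standard offline relevance metrics are straightforward to compute once the judge has labeled each query-result pair. A minimal sketch, using a boolean relevance label per ranked result:

```python
def mrr(ranked_relevance: list[list[bool]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant result per query."""
    total = 0.0
    for ranks in ranked_relevance:
        for i, relevant in enumerate(ranks, start=1):
            if relevant:
                total += 1.0 / i
                break
    return total / len(ranked_relevance)

def precision_at_k(ranked_relevance: list[list[bool]], k: int) -> float:
    """precision@k: fraction of the top-k results that are relevant, averaged over queries."""
    return sum(sum(ranks[:k]) / k for ranks in ranked_relevance) / len(ranked_relevance)

# Example: two queries; True marks a result the LLM judge labeled relevant.
labels = [[False, True, False], [True, True, False]]
print(mrr(labels))                # (1/2 + 1/1) / 2 = 0.75
print(precision_at_k(labels, 2))  # (1/2 + 2/2) / 2 = 0.75
```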
Along the way, we leaned into multiple parts of the Databricks Data Intelligence Platform, including:
- Mosaic AI Vector Search: Used to power high-precision, semantically rich search results for evaluation tests.
- MLflow patterns and LLM-as-a-judge: Provided the patterns to evaluate model outputs and implement our data labeling process (see the judge sketch after this list).
- Model Serving Endpoints: Efficient deployment of models directly from our catalog.
- AI Gateway: To secure and govern our access to third party models via API.
- Unity Catalog: Ensured the organization, management, and governance of all datasets used within the evaluation framework.
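The LLM-as-a-judge labeling referenced above can be sketched with the MLflow deployments client; the prompt, scoring scale, and endpoint name here are illustrative assumptions rather than our production configuration:

```python
import json

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

# Illustrative rubric: grade each query-result pair on a 0-3 relevance scale.
JUDGE_PROMPT = (
    "You are grading a cashback offer search engine.\n"
    "Query: {query}\nReturned offer: {offer}\n"
    'Reply with JSON: {{"score": <0-3>, "reason": "<one sentence>"}}.'
)

def judge_pair(query: str, offer: str,
               endpoint: str = "databricks-meta-llama-3-3-70b-instruct") -> dict:
    """Ask an LLM served behind a Databricks endpoint to score one pair."""
    response = client.predict(
        endpoint=endpoint,
        inputs={
            "messages": [{"role": "user",
                          "content": JUDGE_PROMPT.format(query=query, offer=offer)}],
            "temperature": 0.0,
        },
    )
    return json.loads(response["choices"][0]["message"]["content"])
```

Scores collected this way served double duty: they fed the offline metrics above and accumulated into the labeled dataset we later used for fine-tuning.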
This robust framework dramatically increased our iteration speed and confidence. We conducted over 30 distinct iterations, systematically testing major variable changes in our Vector Search solution, including:
- Different embedding models (foundational, open-weights, and third party via API)
- Various text combinations to feed into the models
- Different query modes (ANN vs. Hybrid; compared in the sketch after this list)
- Testing different columns for hybrid text search
- Adjusting thresholds for vector similarity
- Experimenting with separate indexes for different offer types
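Concretely, switching between ANN and hybrid modes is a per-query option in the SDK, and a similarity threshold can be applied client-side. A minimal sketch, with placeholder names and an arbitrary threshold:

```python
from databricks.vector_search.client import VectorSearchClient

index = VectorSearchClient().get_index(
    endpoint_name="offers-search",
    index_name="main.search.offers_index",
)

# Pure ANN (vector-only) query.
ann = index.similarity_search(
    query_text="greek yogurt",
    columns=["offer_id", "title"],
    num_results=20,
)

# Hybrid query: blends vector similarity with keyword matching on text columns.
hybrid = index.similarity_search(
    query_text="greek yogurt",
    columns=["offer_id", "title"],
    num_results=20,
    query_type="HYBRID",
)

# The similarity score is appended as the last column of each row, so a
# threshold can be applied client-side (0.6 is an arbitrary example value).
kept = [row for row in ann["result"]["data_array"] if row[-1] >= 0.6]
```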
The evaluation framework transformed our development process, allowing us to make data-driven decisions rapidly and validate potential improvements with high confidence before exposing them to users.
The search for the best off-the-shelf model
Following the initial broad test that showed disappointing engagement results, we shifted our focus to exploring the performance of specific models identified as promising during our offline evaluation. We selected two third-party embedding models for production testing, accessed securely through AI Gateway. We conducted short-term, iterative tests in production (lasting a few days) with these models.
Pleased with the initial results, we proceeded to run a longer, more comprehensive production test comparing our leading third-party model and its optimized configuration against the legacy system. This test yielded mixed results. While we observed overall improvements in engagement metrics and successfully eliminated the negative impacts seen previously, these gains were modest—mostly single-digit percentage increases. These incremental benefits were not compelling enough to fully justify a complete replacement of our existing search experience.
More troubling, however, was the insight gained from our granular analysis: while performance significantly improved for certain search queries, others saw worse results compared to our legacy solution. This inconsistency presented a significant architectural dilemma. We faced the unappealing choice of implementing a complex traffic-splitting system to route queries based on predicted performance—an approach that would require maintaining two distinct search experiences and introduce a new, complex layer of rule-based routing management—or accepting the limitations.
This was a critical juncture. While we had seen enough promise to keep going, we needed more significant improvements to justify fully replacing our homegrown search system. This led us to begin fine-tuning.
Fine-tuning: customizing model behavior
While the third-party embedding models explored previously showed technical promise and modest improvements in engagement, they also presented critical limitations that were unacceptable for a long-term solution at Ibotta. These included:
- Inability to train embedding models on our proprietary offer catalog
- Difficulty evolving models alongside business and content changes
- Uncertainty regarding long-term API availability from external providers
- The need to establish and manage new external business relationships
- Network calls to these providers added latency relative to self-hosted models
The clear path forward was to fine-tune a model specifically tailored to Ibotta’s data and the needs of our Savers. This was made possible thanks to the millions of labeled search interactions we had accumulated from real users via our LLM-as-a-judge process within our custom evaluation framework. This high-quality production data became our training gold.
We then embarked on a methodical fine-tuning process, leveraging our offline evaluation framework extensively.
Key elements were:
- Infrastructure: We used AI Runtime with A10 GPUs in a serverless environment, and Databricks ML Runtime for sophisticated hyperparameter sweeping.
- Model selection: We selected a BGE-family model over GTE because it demonstrated stronger performance in our offline evaluations and proved more efficient to train.
- Dataset engineering: We constructed numerous training datasets, including generating synthetic training data, ultimately settling on the following structure per query (see the fine-tuning sketch after this list):
  - One positive result (a verified good match from real searches)
  - ~10 negative examples per positive, combining:
    - 3-4 “hard negatives” (LLM-labeled, human-verified inappropriate matches)
    - “In-batch negatives” (a sampling of results from unrelated search terms)
- Hyperparameter optimization: We systematically swept learning rate, batch size, training duration, and negative sampling strategies to find optimal configurations.
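As a rough sketch of what such a training run can look like with open tooling (this uses the sentence-transformers library, an open BGE checkpoint, and toy data, not our actual pipeline), `MultipleNegativesRankingLoss` provides the in-batch negatives described above, while the third text in each example supplies an explicit hard negative:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from an open BGE checkpoint (the specific model we tuned is not shown here).
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Toy examples: (query, judge-verified positive, hard negative).
train_examples = [
    InputExample(texts=["greek yogurt", "Acme Greek Yogurt, 32 oz", "Acme Dog Treats"]),
    InputExample(texts=["diapers", "Acme Diapers, Size 4", "Acme Paper Towels"]),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=64)

# MultipleNegativesRankingLoss treats every other positive in the batch as an
# additional negative ("in-batch negatives") on top of the explicit hard negative.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```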
After numerous iterations and evaluations within the framework, our top-performing fine-tuned model beat our best third-party baseline by 20% in synthetic evaluation. These compelling offline results provided the confidence needed to accelerate our next production test.
Search that drives results—and revenue
The technical rigor and iterative process paid off. We engineered a search solution specifically optimized for Ibotta’s unique offer catalog and user behavior patterns, delivering results that exceeded our expectations and offered the flexibility needed to evolve alongside our business. Based on these strong results, we accelerated migration onto Databricks Vector Search as the foundation for our production search system.
In our final production test, using our own fine-tuned embedding model, we observed the following improvements:
- 14.8% more offer unlocks in search. This measures users selecting offers from search results, indicating improved result quality and relevance. More unlocks are a leading indicator of downstream redemptions and revenue.
- 6% increase in engaged users. This shows a greater share of users finding value and taking meaningful action within the search experience, contributing to improved conversion, retention, and lifetime value.
- 15% increase in engagement on bonuses. This reflects improved surfacing of high-value, brand-sponsored content, translating directly to better performance and ROI for our brand and retail partners.
- 72.6% decrease in searches with zero results. This significant reduction means fewer frustrating experiences and a major improvement in semantic search coverage.
- 60.9% fewer users encountering searches returning no results. This highlights the breadth of impact, showing that a large portion of our user base is now consistently finding results, improving the experience across the board.
Beyond user-facing gains, the new system delivered on performance. We saw 60% lower latency to our search system, attributable to Vector Search query performance and the fine-tuned model’s lower overhead.
Leveraging the flexibility of this new foundation, we also built powerful enhancements like Query Transformation (enriching vague queries) and Multi-Search (fanning out generic terms). The combination of a highly relevant core model, improved system performance, and intelligent query enhancements has resulted in a search experience that is smarter, faster, and ultimately more rewarding.
Query Transformation
One challenge with embedding models is their limited understanding of niche keywords, such as emerging brands. To address this, we built a query transformation layer that dynamically enriches search terms in flight based on predefined rules.
For example, if a user searches for an emerging yogurt brand the embedding model might not recognize, we can transform the query to add “Greek yogurt” alongside the brand name before sending it to Vector Search. This provides the embedding model with necessary product context while preserving the original text for hybrid search.
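A minimal version of such a rule-based transformation layer might look like the following; the rule table and brand names are invented for illustration:

```python
# Invented rule table: map niche brand terms to enriching category context.
QUERY_RULES = {
    "acme yogurt": "greek yogurt",   # hypothetical emerging yogurt brand
    "acme water": "sparkling water",
}

def transform_query(query: str) -> str:
    """Append category context for known niche terms, preserving the original
    text so it remains available for the keyword side of hybrid search."""
    enriched = query
    for term, context in QUERY_RULES.items():
        if term in query.lower() and context not in query.lower():
            enriched = f"{enriched} {context}"
    return enriched

print(transform_query("Acme Yogurt 32oz"))  # -> "Acme Yogurt 32oz greek yogurt"
```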
This capability also works hand-in-hand with our fine-tuning process. Successful transformations can be used to generate training data; for instance, including the original brand name as a query and the relevant yogurt products as positive results in a future training run helps the model learn these specific associations.
Multi-Search
For broad, generic searches like “baby,” Vector Search might initially return a limited number of candidates, potentially filtered down further by targeting and budget management. To address this and increase result diversity, we built a multi-search capability that fans out a single search term into multiple related searches.
Instead of just searching for “baby,” our system automatically runs parallel searches for terms like “baby food,” “baby clothing,” “baby medicine,” “baby diapers,” and so on. Because of the low latency of Vector Search, we can execute several searches in parallel without increasing the overall response time to the user. This provides a much broader and more diverse set of relevant results for wide-ranging category searches.
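A simplified fan-out might look like the sketch below, with a stand-in for the search call shown earlier; in practice the expansion table would be data-driven rather than hard-coded:

```python
from concurrent.futures import ThreadPoolExecutor

def search_offers(term: str, limit: int) -> list[dict]:
    # Stand-in for the Vector Search query shown in earlier sketches.
    return []

# Invented expansion table for broad category terms.
EXPANSIONS = {
    "baby": ["baby food", "baby clothing", "baby medicine", "baby diapers"],
}

def multi_search(query: str, limit: int = 10) -> list[dict]:
    """Fan a generic query out into parallel sub-searches and merge unique offers."""
    terms = [query] + EXPANSIONS.get(query.lower(), [])
    # Vector Search is fast enough that running sub-queries in parallel
    # does not increase the overall response time to the user.
    with ThreadPoolExecutor(max_workers=len(terms)) as pool:
        result_sets = list(pool.map(lambda t: search_offers(t, limit), terms))
    seen: set = set()
    merged: list[dict] = []
    for results in result_sets:
        for offer in results:
            if offer.get("offer_id") not in seen:
                seen.add(offer.get("offer_id"))
                merged.append(offer)
    return merged
```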
Lessons Learned
Following the successful final production test and the full rollout of Databricks Vector Search to our user base – delivering positive engagement results, increased flexibility, and powerful search tools like Query Transformation and Multi-Search – the project yielded several valuable lessons:
- Start with a proof of concept: The initial hackathon approach allowed us to quickly validate the core concept with minimal upfront investment.
- Measure what matters to you: Our tailored 50-metric evaluation framework was crucial; it gave us confidence that improvements observed offline would translate into business impact, enabling us to avoid repeated live testing until solutions were truly promising.
- Don’t jump straight to fine-tuning: We learned the value of thoroughly evaluating off-the-shelf models and exhausting those options before investing in the greater effort required for fine-tuning.
- Collect data early: Starting to label data from our second experiment ensured a rich, proprietary dataset was ready when fine-tuning became necessary.
- Collaboration accelerates progress: Close partnership with Databricks engineers and researchers, sharing insights on Vector Search, embedding models, LLM-as-a-judge patterns, and fine-tuning approaches, significantly accelerated our progress.
- Recognize cumulative impact: Each individual optimization, even seemingly minor, contributed significantly to the overall transformation of our search experience.
What’s next
With our fine-tuned embedding model now live across all direct-to-consumer (D2C) channels, we next plan to explore scaling this solution to the Ibotta Performance Network (IPN). This would bring improved offer discovery to millions more shoppers across our publisher network. As we continue to collect labeled data and refine our models through Databricks, we believe we are well positioned to evolve the search experience alongside the needs of our partners and the expectations of their customers.
This journey from a hackathon project to a production system proved that reimagining a core product experience rapidly is achievable with the right tools and support. Databricks was instrumental in helping us move fast, fine-tune effectively, and ultimately, make every search more rewarding for our Savers.