Let’s face it, returning good search results means making money. To this end, we’re often hired to tune search so that results are as close as possible to the intent behind a user’s query. Matching users’ intent to results, which we call “relevancy,” is what gets us up in the morning. It’s what drives us to think hard about the dark mysteries of tuning Solr or machine-learning topics such as recommendation-based product search.

While we can do amazing feats of wizardry to make individual improvements, it’s impossible with today’s tools to do much more than prove that one problem has been solved. Search engines rank results based on a single set of rules. This single set of rules is in charge of how all searches are ranked. It’s very likely that even as we solve one problem by modifying those rules, we create another problem – or dozens of them, perhaps far more devastating than the original problem we solved.

For example, say we’re a clothing site, and we’ve noticed that sales of ladies’ dresses are fairly profitable compared to our other wares. The search developers are given a problem to solve: boost the results so that dresses come up higher. When a user types in summer dress, we ought to make sure that instead of shorts, t-shirts, etc., we surface more flower dresses and the like in the search results. Knowing our customers, we’ve decided that “dress” means the literal garment, so we feel pretty confident in tasking our developers. Luckily our search developers are pretty smart folks, and they modify the search relevancy parameters, giving us great results:

These changes are great. Everyone goes out for a beer to celebrate the amazing new sales in the dress department.

Little did the search developers know that the single most popular item sold by our online store was dress shoes. Nobody bothered to run that query until sales suddenly plummeted. With customers leaving in droves and the company on the verge of bankruptcy, the search developers sit down to see what went wrong. Typing in the query dress shoes, they are greeted with:

The developers had no way of knowing, until changes were pushed to production, that solving the narrow problem of surfacing more dresses had caused a catastrophic drop in search quality for the site’s most important queries.

This, unfortunately, is the current state of search relevancy work. Much like software development, it’s easy to fix one simple bug and create two catastrophic ones. Search developers frequently chase their tails as users, merchandisers, and content experts stumble upon yet another broken search that yields confusing results bearing no relevance to what the user was hunting for. Sometimes, as in our parable of the dress shoes, this is discovered far too late. All the developer can do is play the search quality game of “whack-a-mole”, trying to beat down every problem until the complaints and disasters subside to a tolerable din.

What can possibly be done?

One strategy used to catch these issues is to track statistics on production search quality. By tracking users’ clicks and queries, it’s fairly easy to identify queries that cause most users to return to the search repeatedly or give up without a sale. We can also, hopefully, point out the successful queries that result in conversions and happy users.
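As a rough illustration of this kind of tracking, here is a minimal Python sketch. The log format (a list of query and converted-or-not pairs), the function name `abandonment_rates`, and the `min_searches` threshold are all assumptions made up for this example, not the output of any particular analytics tool.

```python
from collections import defaultdict


def abandonment_rates(events, min_searches=2):
    """Return {query: fraction of searches that ended without a sale}.

    `events` is a hypothetical log: an iterable of (query, converted) pairs.
    Queries seen fewer than `min_searches` times are skipped as too noisy.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for query, converted in events:
        totals[query] += 1
        if not converted:
            failures[query] += 1
    return {
        query: failures[query] / totals[query]
        for query in totals
        if totals[query] >= min_searches
    }


# Made-up sample data for the clothing site in our example.
events = [
    ("summer dress", True),
    ("summer dress", True),
    ("dress shoes", False),
    ("dress shoes", False),
    ("dress shoes", True),
]

rates = abandonment_rates(events)
worst = max(rates, key=rates.get)  # the query users give up on most often
print(worst, round(rates[worst], 2))
```

A report like this only tells us which queries are already hurting in production, which is exactly the limitation discussed next.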

While this is a useful tool for detecting existing issues, it’s the equivalent of waiting for the user to find the crash in your software. Can you detect quickly enough that queries for your most popular item, dress shoes, are failing? Can you respond with a remedy fast enough? More importantly, how do you know that the rushed fix you provide doesn’t create an even deeper catastrophic failure of your search quality?

Another heavily used strategy is to feed the query tracking data back to the search algorithm itself. Basically, the thinking goes: so what if our search is only mildly competent? We can track successful results (those that resulted in satisfied users or customer sales) for individual queries and boost the successful results over the unsuccessful ones.

This is a valid strategy; however, there are several significant hurdles to overcome. First, you have to deal with subtle differences in queries: dress shoes and dressy shoes will each accumulate their own separate set of quality metrics. Second, your search index now must expand dramatically to store all of this information. Storing every possible query a user typed to get to a document can become impractical, given the extreme variation in human language.
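To make the first hurdle concrete, here is a small sketch of click-based boosting. The `ClickStore` class, its methods, and the document ids are hypothetical, and the normalization (lowercasing and collapsing whitespace) is just one simple choice we assume for illustration. Note how it merges trivial variants but still treats dressy shoes as a brand new query with no history.

```python
from collections import Counter, defaultdict


def normalize(query):
    """Lowercase and collapse whitespace; deliberately naive."""
    return " ".join(query.lower().split())


class ClickStore:
    """Hypothetical per-query store of clicks on documents."""

    def __init__(self):
        self.clicks = defaultdict(Counter)

    def record(self, query, doc_id):
        self.clicks[normalize(query)][doc_id] += 1

    def rerank(self, query, doc_ids):
        """Move frequently clicked documents up; ties keep the engine's order."""
        counts = self.clicks.get(normalize(query), Counter())
        return sorted(doc_ids, key=lambda doc_id: -counts[doc_id])


store = ClickStore()
store.record("Dress Shoes", "oxford")
store.record("dress  shoes", "oxford")
store.record("dress shoes", "loafer")

engine_order = ["sundress", "loafer", "oxford"]
boosted = store.rerank("dress shoes", engine_order)
unboosted = store.rerank("dressy shoes", engine_order)  # no history for this variant
print(boosted)
print(unboosted)
```

Closing that gap means stemming, spelling correction, or some other query clustering, and each of those brings its own tuning decisions.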

Most importantly, now you’ve created a new relevancy “meta problem”. Should you prefer the matches inside the actual content? Or should you prefer the query tracking? (a blog post in its own right…)

Both of these potential solutions require us to get to production before finding our problems. What can be done before rolling changes out to measure their impact on search results? Well, if this were software, I’d be advocating a suite of automated tests that captured every problem solved in the past. When a new problem needed solving, I would work intimately with this suite of tests to ensure old features didn’t break. I’d be a good steward and create tests for my new feature as I developed it.
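To show what such a suite might look like if we treated search like software, here is a toy sketch in Python. Everything in it is an assumption for illustration: the three-item catalog, the `search` function (which just counts matching title terms and multiplies by a per-category boost), and the `run_suite` helper. A real suite would send the same queries to your actual search engine.

```python
# Hypothetical three-item catalog for the clothing site.
CATALOG = [
    {"id": "floral-sundress", "title": "floral summer dress", "category": "dresses"},
    {"id": "oxford", "title": "leather dress shoes", "category": "shoes"},
    {"id": "beach-shorts", "title": "summer shorts", "category": "shorts"},
]


def search(query, category_boosts):
    """Toy ranking: matching title terms times the boost for the item's category."""
    terms = query.lower().split()
    scored = []
    for item in CATALOG:
        title_terms = set(item["title"].split())
        matches = sum(1 for term in terms if term in title_terms)
        score = matches * category_boosts.get(item["category"], 1.0)
        if score > 0:
            scored.append((score, item["id"]))
    scored.sort(key=lambda pair: -pair[0])  # stable sort: ties keep catalog order
    return [doc_id for _, doc_id in scored]


def run_suite(cases, category_boosts):
    """Return the queries whose expected top result is no longer ranked first."""
    failing = []
    for query, expected_top in cases.items():
        results = search(query, category_boosts)
        if not results or results[0] != expected_top:
            failing.append(query)
    return failing


# Every relevancy problem solved in the past becomes a test case.
CASES = {
    "summer dress": "floral-sundress",
    "dress shoes": "oxford",
}

print(run_suite(CASES, {}))                # current settings
print(run_suite(CASES, {"dresses": 3.0}))  # proposed dress boost
```

With the proposed boost, the suite flags dress shoes before the change ever reaches production, which is the whole point of keeping old queries around as tests.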

Why can’t we do this with search?

Introducing Quepid

This is exactly why we built Quepid.

Quepid is our instant search quality testing product. Born out of our years of experience tuning search, Quepid has become our go-to tool for relevancy problems. Built around the idea of Test Driven Relevancy, Quepid allows the search developer to collaborate with product and content experts to:

- Identify, store, and execute important queries
- Provide statistics/rankings that measure the quality of a search query
- Tune search relevancy
- Immediately visualize the impact of tuning on queries
- Rinse & repeat instantly
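To give a feel for what measuring the quality of a query can mean, here is a minimal sketch in Python. It is not Quepid’s actual scoring; the 0 to 3 rating scale, the `mean_rating_at_k` function, and the sample judgments are assumptions chosen for illustration. The idea is that experts rate individual results for a query, and each tuning attempt is scored by the average rating of its top results.

```python
def mean_rating_at_k(results, ratings, k=3):
    """Average expert rating of the top-k results; unrated documents count as 0."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(ratings.get(doc_id, 0) for doc_id in top) / len(top)


# Hypothetical expert judgments for one query, on a 0 (bad) to 3 (perfect) scale.
judgments = {"oxford": 3, "loafer": 2, "sundress": 0}

before_tuning = ["oxford", "loafer", "sundress"]
after_tuning = ["sundress", "oxford", "loafer"]

print(mean_rating_at_k(before_tuning, judgments, k=2))
print(mean_rating_at_k(after_tuning, judgments, k=2))
```

Tracking a number like this for every stored query makes it obvious when a tuning change helps one search while quietly hurting another.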

The result is a tool that empowers search developers to experiment with the impact of changes across the search experience and prove to their bosses that nothing broke. Confident that data will prove or disprove their ideas instantly, developers are free to experiment more than they ever have before.

Quepid provides a canvas for the whole team to explore search relevancy and its impact across dozens of search queries instantly. Testers, marketing, and content experts can provide immediate search quality feedback to developers, directly shaping their technical decisions. When these groups actively participate in the process by adding queries and even rating individual results, their feedback leads to much quicker iterations.

In our Test Driven Relevancy work on Silverchair Information Systems’ medical search (which I’ll be speaking about at Lucene Revolution), Quepid has been a game changer. Silverchair had gone through several cycles of deploying search with relevancy issues. It wasn’t until we brought Quepid to the scene that we could tame the beast.

Much like the marketing wing of an eCommerce site, Silverchair’s medical experts would shower the developers with links to broken searches on the production site. We would work diligently to solve those problems, only to get hit again by a search we broke. (We replied with “We’re search consultants, not doctors, Jim!” but they didn’t seem to listen.) Tired of one-shot solutions that took days or longer and seemed to create more problems than they solved, we finally sat down and loaded Quepid up with queries. Working together, we crafted a holistic relevancy strategy that addressed all the representative queries instead of fighting them one at a time. We continued iterating, measuring our improvements in a development environment. Once we released to production, we were able to feel relatively confident that no lurking dress shoes issues existed in our relevancy parameters.

In the words of Rena Morse, Silverchair’s Director of Semantic Technology:

Quepid has been a game-changer for us in the arena of search relevancy testing. With Quepid, the team can see the impact of planned search tuning changes immediately instead of waiting until changes are made live. Quepid’s search version comparison capability also allows us to understand how potential downstream results may be affected by each change, so we can select the version that’s best for the product. We’ve solved and avoided numerous customer issues, and I’ve been able to feel more confident that my feedback on search quality directly funnels into our search relevancy algorithm. Thanks, Quepid!

After numerous client success stories, like Silverchair’s, we’re extremely happy to be offering Quepid directly as a product to help other search teams execute. So do you wish you had more say in your site’s search results? Do you wish you had better tools to test search quality? Let us know if you’d like to try Quepid today. We’d love to have a chat about how you can try this exciting product.

So contact us to set up a time for a demo and to talk about how Quepid and OpenSource Connections can help with your relevancy problems! Come by Lucene Revolution to see me, and hear Silverchair’s Rena Morse and me discuss how Quepid has fundamentally changed how Silverchair addresses its tough search problems.