As much as anyone, I’m a fan of resurrecting trends and memes and pretending they’re cool. In that vein, dear friends, I’ve exhumed the venerable “Falsehoods Programmers Believe” party from 4 years ago to bring you one about, no less, Search.

Search is a deceptively complex field, where competence is hard-won through training, practice, and experience. The list stands at a total of 105 falsehoods. I couldn’t mash up the ole 99-problems meme with this to cull 6 unworthy items, because they are all worthy. I will leave you with that brief introduction and, of course, the list:

Search engines work like databases

Search can be considered an additional feature just like any other

Search can be added as a well performing feature to your existing product quickly

Search can be added as a well performing feature to your existing product with reasonable effort

Choosing the correct search engine is easy and you will always be happy with your decision

Once set up, search will work the same way forever

Once set up, search will work the same way for a while

Once set up, search will work the same way for the next week

The default search engine settings will deliver a good search experience

Customers know what they are looking for

Customers who know what they are looking for will search for it in the way you expect

Customers who don’t know what they are looking for will search accordingly

A customer using the same query twice expects the same results for both searches

Customers only search for a few terms

Customers only search for fewer than some set number of terms

Customers never copy and paste a whole document into a search bar

Customers balance quotes and parentheses

Customers that don’t balance quotes or parentheses don’t expect phrasing or grouping

You can pass the customer query directly into your search engine

You can write a query parser that will always parse the query successfully

You will never have to return a query parse error to the customer

When you find the boolean operator ‘OR’, you always know it doesn’t mean Oregon

Customers notice their own misspellings

Customers don’t expect your search to correct misspellings

It is possible to create a list of all misspellings

It is possible to create an algorithm to handle all misspellings

A misspelled word is never the same as another correctly spelled word

All customers expect spelling correction to work the same

All customers want their misspellings corrected

A search should always return results, no matter how absurd

If you don’t have any results to show, customers won’t mind

When the perfect results are shown to the customer, they will notice it

You don’t need to monitor search queries, results, and clicks

Customers won’t get nervous that you are logging their search activity

Search queries are not affected by GDPR

Looking at the data, it is always possible to tell whether a customer found what they were looking for

Customers will click on what they are looking for when they’ve found it

You can build a search that works like Google

You can build a search that works like Google sometimes

You should use Google as a target for your search

Customers don’t mind if your search doesn’t work like Google

Customers don’t expect your search to work like Google

Customers won’t compare you to Google

A bad search, no matter how minor or how rare, will never reflect poorly on your product

Since Google doesn’t use facets, customers don’t need them

Facet hit counts are always correct

Facets have no impact on performance

You can just cache queries to get performant facets

Personalized search is easy

Learning to rank is easy and just requires a plugin

You have enough data for learning-to-rank

Over time, you can curate enough data for learning-to-rank

You don’t need to spend lots of time formatting content for it to work well in your search engine

Text extraction engines will always produce text that doesn’t need to be post-processed

All your markup will be stripped as you expect it to be

Content is well formed

Content is mostly well formed

Content is predictably well formed

Content sourced from a database and templates is always formed the same way

Content teams treat search as their top priority

Manually changing content to improve search is easy

Improving content can be automated with reasonable effort

Queries for ‘C programming’ and ‘C++ programming’ will produce different results

Queries for ‘401k’ and ‘401(k)’ will produce the same results

Tokenization as it works out of the box is right for your content and queries

Tokenization can be changed to meet the needs of your entire corpus and all queries

Tokenization can be changed to meet the needs of most of your corpus and most queries

Tokenization can be conditional

You should roll your own tokenizer

You will never have a debate about tokenization

Using regular expressions for tokenization is a good idea

Regular expressions have minimal performance impact

You will never have a debate about regular expressions

You should remove stop words

You should not remove stop words

You know what the list of stop words should be

Stop words will never change

When you find the stop word ‘in’, you know it doesn’t mean Indiana

It’s easy to make certain things case sensitive

Case sensitivity is a good idea

Synonyms are easy

Synonyms will improve recall in the way you want

Synonyms have the same relevance in all documents

Synonyms for abbreviations and acronyms always work as you expect

Synonyms can be extracted from your corpus with natural language processing

Using Word2Vec for synonyms is a good idea

Stemming will solve your recall problems

Lemmatization will solve your recall problems

Lemmatization dictionaries are static

Languages don’t change

Natural language processing (NLP) tools work perfectly

Incorporating NLP into your analysis pipeline is straightforward

Search queries are complete sentences and can be accurately tagged with parts of speech

Showing a list of search suggestions is easy

Suggestions should just use the out-of-the-box search engine suggestions

Suggestions should incorporate customer query logs

Customers would never type anything offensive into your search bar

Customers would never try to hack you through your search bar

Customers don’t need highlighting to find what they’ve searched for

Default highlighters are good enough for all your content and queries

Making a custom highlighter isn’t too difficult. It’s just matching strings, right?

Making a custom highlighter that is better than the default version will take less than a year

Turning on caching will solve your performance issues

Customers don’t expect near-real-time updates

A 30-second commit time is short enough for everyone
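
To make a couple of the tokenization items above concrete (the ‘C programming’ versus ‘C++ programming’ pair and the ‘401k’ versus ‘401(k)’ pair), here is a minimal sketch in Python. The `naive_tokenize` function is a hypothetical, deliberately simplistic tokenizer written for this illustration; it is not the default behavior of any particular search engine, and a real analyzer will differ in its details.

```python
import re


def naive_tokenize(text):
    """Hypothetical tokenizer: lowercase, then keep runs of ASCII letters/digits.

    Everything else (punctuation, symbols, whitespace) acts as a separator.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


# The '+' characters are dropped, so both queries yield identical tokens.
print(naive_tokenize("C programming"))    # ['c', 'programming']
print(naive_tokenize("C++ programming"))  # ['c', 'programming']

# The parentheses split one term into two, so these queries no longer match.
print(naive_tokenize("401k"))    # ['401k']
print(naive_tokenize("401(k)"))  # ['401', 'k']
```

Under this assumed tokenizer, two queries you would want to keep apart collapse into one, and two you would want to treat alike drift apart. Fixing either pair tends to mean special-casing, which is how the tokenization debates begin.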

Keen to avoid believing falsehoods about search? Let us help!