Apache Solr tips for beginners like me

Warning: I’m still a beginner when it comes to configuring Solr

While learning how to use Solr, I’ve come across many things I would’ve liked to have known beforehand. This post is a collection of problems I’ve encountered (and will encounter), together with the solutions I found, in the hope that some of you find them helpful as well. I’m still a beginner when it comes to using and configuring Solr, so take these things with a grain of salt and be sure to do your own research. You can start by reading the documentation, which is very extensive and well written.

Query a string with a whitespace

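I’ve wasted many hours on this issue. Case: I was querying a field with a search term that contained a whitespace and kept getting zero results. Search terms that didn’t contain a whitespace all worked flawlessly, so I was getting frustrated. The cause is that the query parser treats whitespace as a separator between terms, so only the first word is matched against the field you specified and the rest is searched for in the default field. To solve this problem, you need to tell Solr that the whitespace is part of the search term. This can be done in two ways: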

Using double quotation marks to tell Solr this is a single search term…

field:"string with whitespaces"

…or escaping the whitespace

field:string\ with\ whitespaces
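If you build queries in application code, both approaches can be wrapped in small helpers. The following is a minimal Python sketch, not an official client API: the function names are my own, and the list of escaped characters is an assumption based on the special characters of the standard query parser, so check it against the documentation for your Solr version.

```python
import re

# Whitespace plus the characters that have a special meaning in the
# standard query parser (assumed list, verify for your Solr version).
_SPECIAL = re.compile(r'([\\+\-!(){}\[\]^"~*?:|&;/\s])')


def escape_term(term: str) -> str:
    """Put a backslash in front of every special character or whitespace."""
    return _SPECIAL.sub(r"\\\1", term)


def escaped_query(field: str, term: str) -> str:
    """Build field:term with the whitespace escaped."""
    return f"{field}:{escape_term(term)}"


def quoted_query(field: str, term: str) -> str:
    """Build field:"term"; only backslashes and double quotes need escaping."""
    inner = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{inner}"'


print(escaped_query("field", "string with whitespaces"))
print(quoted_query("field", "string with whitespaces"))
```

The two print statements produce the same two queries shown above.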

Stored, indexed, and docValues fields

Stored fields are retrievable in queries, meaning you will be able to receive them in responses. Fields that have stored="false" will generally not be returned with the query responses (a field with docValues can be an exception, depending on the useDocValuesAsStored setting).

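Indexed fields are searchable, and they can also be used for sorting and faceting. This means that any field you plan to search in should have indexed="true". For sorting and faceting, an indexed field works for most field types, but docValues is usually the more efficient choice.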

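DocValues fields are written to a column-oriented structure at index time: a mapping from each document to its value(s) for that field. This is the reverse of the regular index, which maps terms to the documents that contain them. Because Solr can look up the value of a field for a given document directly, sorting, faceting, and function queries on that field become much more performant. To enable this, add docValues="true" to the field.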
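To make the three attributes concrete, here is a small Python sketch that generates a field definition as it could appear in schema.xml. The field name "location" and the field type "string" are assumptions for illustration only, so use whatever your own schema defines.

```python
import xml.etree.ElementTree as ET


def field_definition(name, field_type, *, indexed, stored, doc_values):
    """Return a <field .../> element for schema.xml as a string."""
    attrs = {
        "name": name,
        "type": field_type,
        "indexed": str(indexed).lower(),      # searchable
        "stored": str(stored).lower(),        # returned in responses
        "docValues": str(doc_values).lower(), # efficient sorting and faceting
    }
    return ET.tostring(ET.Element("field", attrs), encoding="unicode")


# Hypothetical field that is searchable, returned, and cheap to sort on.
print(field_definition("location", "string",
                       indexed=True, stored=True, doc_values=True))
```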

You can see all other field type definitions in the documentation.

Using schema.xml instead of managed-schema

You can switch from the default managed-schema to the static schema.xml by changing a setting in solrconfig.xml. You can find this file in the same folder as the managed-schema file.

Uncomment or add the following to the solrconfig.xml file:

<schemaFactory class="ClassicIndexSchemaFactory"/>

I use this to make sure I have exactly the fields I want, nothing more and nothing less. With the default managed-schema setup, Solr can add fields for you as new data comes in, and you can change the schema through the Schema API, which is great for unstructured data. But most of the time I know exactly what data is indexed, so I have no need for that flexibility. The managed-schema file is created and updated by Solr, while you’re the one making changes in the schema.xml file. So if you want full control over the type of data you’re indexing, I’d choose schema.xml.

Sorting documents based on the number of matched terms within a multiValued field

This is the desired result: the documents with the highest term density are on top.

After many hours of research, I finally found an answer on Stack Overflow that I could use to solve my problem. To sort by the occurrence of a string in a multiValued string field you have to use a function query as a way to sort documents.

sum(termfreq(field, 'search_term')) desc

This will sort the documents by the number of occurrences of the search term. To make the order even more relevant, you’ll also need to sort by the number of values in the multiValued field, because a document with the same number of matches in a smaller collection of values is more relevant. So this adds the following sort:

location_occurrence asc
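Both sort clauses go into a single sort parameter, separated by a comma. Below is a minimal Python sketch of how the request parameters could be assembled. The field name "locations" and the helper name are assumptions of mine, and the sketch does not handle search terms that contain a single quote.

```python
from urllib.parse import urlencode


def density_sort(field: str, term: str, count_field: str) -> str:
    """Most matches first, then the fewest values in the field."""
    return f"sum(termfreq({field},'{term}')) desc, {count_field} asc"


# "locations" is a hypothetical multiValued field; location_occurrence
# is the counter field described in this post.
params = {
    "q": "locations:split",
    "sort": density_sort("locations", "split", "location_occurrence"),
}

print(params["sort"])
print(urlencode(params))  # query string to append to the /select URL
```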

This field above is a field I added during indexing. It represents the number of strings in the multiValued field. I know I could be doing this in a different way, by using “updateRequestProcessorChain”, but I haven’t figured out how to do that just yet. Experts, I’d love your help with this. So until I find out how to do that, I’ll use this method.
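Adding the counter during indexing can be as simple as counting the values before the document is sent to Solr. This is a sketch of that idea in Python, assuming the multiValued field is called "locations" (a name I made up for the example) and that documents are plain dictionaries.

```python
def with_occurrence_count(doc: dict,
                          source_field: str = "locations",
                          count_field: str = "location_occurrence") -> dict:
    """Return a copy of doc with the number of values in source_field added."""
    values = doc.get(source_field, [])
    if isinstance(values, str):
        # A single value may arrive as a plain string instead of a list.
        values = [values]
    return {**doc, count_field: len(values)}


doc = {"id": "1", "locations": ["Split", "Primosten", "Zadar"]}
print(with_occurrence_count(doc))
```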

Before I implemented this sorting method, the matches would simply be: okay, we found a record that contains the search term, here you go. There was no way to boost the score* because Solr didn’t care about the number of matched terms within the multiValued field. So all documents whose multiValued field contained the search term got the same score.

*Note: I’ve, since writing this part, figured out that I could’ve used a DisMax Query parser to specify different boosts for different fields. The method I used here is still (mostly) correct for my use case, but I’ll update this later to work with the DisMax query parser.

This screenshot is a representation of what the query would look like in the Solr UI.

So the sorting methods above help to get the highest search term density to the top, which, in this case, is what we want. Different situations call for different relevancy rules, but for this “simple” location keyword suggestion box, the higher the search term density, the higher the relevancy to the user.

A few relevant links that you can use to do your own research

StackOverflow: Solr filtering on the number of matches in an “or query” to a multiValued field

Apache Solr Wiki: Function queries, termfreq

Apache Solr Lucene: Function queries

Build a case-insensitive custom field type with a way to escape special characters

Building a custom field type might seem a little intimidating, but it’s actually really easy. Have a look at the configuration below and you’ll almost be able to guess what happens to an input on a field.

The first “filter” is the KeywordTokenizer. This treats the entire input string as a single token, instead of trying to break it up into smaller parts. You can read more about it in the documentation.

The second filter is the “ASCIIFoldingFilterFactory” filter. It converts accented and other non-ASCII characters to their closest ASCII equivalents, where one exists. This will help match things like Primošten to Primosten and Kroatië to Kroatie.

The third filter, “TrimFilterFactory”, is used to remove any surrounding whitespace from the input string. Normally the tokenizer would take care of this by breaking the input into separate terms wherever it finds a whitespace. Since I treat the whole string as a single token, through the KeywordTokenizer, I can make good use of the trim filter.

The fourth filter makes all input strings lowercase, which makes any search case-insensitive. The last filter is there to make sure that no duplicate keywords are saved, because duplicates are not what I want for an accurate location count.
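To get a feel for what this chain does to a single value, here is a rough emulation in Python. It is not Solr code: Unicode decomposition only approximates the ASCII folding filter (which also handles characters that have no decomposition), and the duplicate removal step is left out.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding by stripping combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_keyword(value: str) -> str:
    """Keep the whole value as one token, then fold, trim, and lowercase."""
    return fold_to_ascii(value).strip().lower()


for raw in ["  Primošten ", "Kroatië", "primosten"]:
    print(repr(raw), "->", repr(normalize_keyword(raw)))
```

Because the indexed value and the search term go through the same steps, "Primošten", "primosten" and " PRIMOSTEN " all end up as the same token.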

In this case, I’ve duplicated the indexing filter rules to the query filter rules. This is not a necessity, but I found it useful, because it means a search term is normalized in the same way as the indexed values. There are many things you could change here. For example, if you enter a sentence as a query string, you could break it up into individual words and search for the most relevant matching multiValued field.

Work in progress

This is still a work in progress because I’m still learning new things every day. I will update this post as I find new things to document. If you know of any good resources on the topics in this post, please let me know. I’m always looking to find out more about Solr and how to use it in a better way than I’m doing now. If you have any questions regarding Solr, I’ll do my best to answer them, or else try to refer you to people who might be able to help you better than I can.