An optimal search engine knows the user's request before he types.

A “recommender system” gathers relations between people and things in order to propose the information a user wants.

In this article we propose some simple strategies to implement recommender systems with Elasticsearch.

All examples in this article can be executed with an Elasticsearch 2.x installation, using the Kibana Sense console. They are easy to adapt to other clients, for example curl commands.

Elasticsearch as a recommender engine?

The inverted index is at the core of the Lucene technology. Its job is to map terms to the documents that contain them, so that these documents can be found quickly.





Figure: Mapping back a set of ingredients to the original recipes. Flour is used in all the bakery products, eggs are only in the Sacher cake, water (ice) is mixed even into the bratwurst (proteins would “melt” during meat mincing).
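The term-to-document mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores its index; the function name build_inverted_index is made up for this sketch, and the three recipes are a subset of the example data used later in this article.

```python
from collections import defaultdict

# A subset of the article's recipes: document name -> list of terms.
recipes = {
    "kaiser roll": ["flour", "yeast", "malt", "water", "salt"],
    "Sacher Torte": ["flour", "water", "sugar", "eggs", "chocolate", "apricot jam"],
    "bratwurst": ["pork", "veal", "water", "salt"],
}


def build_inverted_index(docs):
    """Map every term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


index = build_inverted_index(recipes)
print(sorted(index["water"]))  # every recipe that uses water
print(sorted(index["eggs"]))   # only the Sacher Torte
```

Looking up a term is then a single dictionary access, which is why finding all recipes for an ingredient is cheap.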



Let's explore the Elasticsearch recommender capabilities with some simple examples.

First let’s create a schema:

POST recipes
{
  "settings": { "number_of_shards": 1 },
  "mappings": {
    "recipe": {
      "properties": {
        "sales_name": { "type": "string" },
        "ingredients": { "type": "string", "index": "not_analyzed" },
        "craft": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}

Code: the schema of our recipes consists of a sales name, a list of ingredients and a craft. Ingredients and craft are not analyzed, so each value is indexed as one whole term.



POST /recipes/recipe/_bulk
{ "index": {}}
{"sales_name":"pizza Margherita", "ingredients": ["flour", "water", "tomato", "mozzarella", "yeast", "salt"], "craft":"baker"}
{ "index": {}}
{"sales_name":"kaiser roll", "ingredients": ["flour", "yeast", "malt", "water", "salt"], "craft":"baker"}
{ "index": {}}
{"sales_name":"focaccia", "ingredients": ["flour", "yeast", "olive oil", "water", "salt"], "craft":"baker"}
{ "index": {}}
{"sales_name":"Sacher Torte", "ingredients": ["flour", "water", "sugar", "eggs", "chocolate", "apricot jam"], "craft":"baker"}
{ "index": {}}
{"sales_name":"bratwurst", "ingredients": ["pork", "veal", "water", "salt"], "craft":"butcher"}
{ "index": {}}
{"sales_name":"hamburger", "ingredients": ["beef", "bread", "pepper", "salt"], "craft":"butcher"}
{ "index": {}}
{"sales_name":"salami", "ingredients": ["pork", "pork fat", "pepper", "salt"], "craft":"butcher"}

Code: a bunch of recipes are imported into the index



Building User Profiles

The easiest recommender is a summary statistic without any personalisation. Elasticsearch aggregations are useful for this task. Rather than demonstrate this, we go one step further: with some explicit knowledge about the personal preferences of a user, we can improve our recommendation.

A user has a profile either collected implicitly from that user’s behavior or explicitly stated by the user, for example by selecting his craft or language.

{"name": "Heinz", "craft": "butcher", "language": "de", ...}
{"name": "Giorgio", "craft": "baker", "language": "it", ...}

Code: A user profile is a vector of attributes that describe a user. The craft may be set explicitly by the user or inferred via machine learning, for example from the typical ingredients that baker Giorgio uses.



Collaborative Filtering

In Recommender Systems Theory “Collaborative Filtering” is the discipline of guessing interesting items from similar items or similar users.

For example, if a baker creates a recipe with flour and water, what shall he add next? This can easily be determined by an aggregation.

User-Item Recommender

Based on the profile we can aggregate typical items for a specific craft:

GET recipes/_search?search_type=count
{
  "query": { "term": { "craft": "baker" } },
  "aggs": {
    "bestMatch": {
      "terms": {
        "field": "ingredients",
        "min_doc_count": 2
      }
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": { "total": 4, "max_score": 0, "hits": [] },
  "aggregations": {
    "bestMatch": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 2,
      "buckets": [
        { "key": "flour", "doc_count": 4 },
        { "key": "water", "doc_count": 4 },
        { "key": "salt", "doc_count": 3 },
        { "key": "yeast", "doc_count": 3 }
      ]
    }
  }
}

Result: what do bakers need? Flour and water.
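To make clear what the terms aggregation computes here, the following Python sketch reproduces the counting on the seven recipes from the bulk request. It is an illustration only and does not talk to Elasticsearch; the function name terms_aggregation is hypothetical.

```python
from collections import Counter

# (sales_name, ingredients, craft), copied from the article's bulk request.
RECIPES = [
    ("pizza Margherita", ["flour", "water", "tomato", "mozzarella", "yeast", "salt"], "baker"),
    ("kaiser roll", ["flour", "yeast", "malt", "water", "salt"], "baker"),
    ("focaccia", ["flour", "yeast", "olive oil", "water", "salt"], "baker"),
    ("Sacher Torte", ["flour", "water", "sugar", "eggs", "chocolate", "apricot jam"], "baker"),
    ("bratwurst", ["pork", "veal", "water", "salt"], "butcher"),
    ("hamburger", ["beef", "bread", "pepper", "salt"], "butcher"),
    ("salami", ["pork", "pork fat", "pepper", "salt"], "butcher"),
]


def terms_aggregation(recipes, craft, min_doc_count=2):
    """Count in how many recipes of one craft each ingredient occurs."""
    counts = Counter()
    for _name, ingredients, recipe_craft in recipes:
        if recipe_craft == craft:
            # set() so that an ingredient counts once per recipe (a doc count)
            counts.update(set(ingredients))
    # Drop ingredients that occur in fewer than min_doc_count recipes.
    return {term: n for term, n in counts.items() if n >= min_doc_count}


baker_terms = terms_aggregation(RECIPES, "baker")
print(baker_terms)
```

For the craft "baker" this yields the same four buckets as the response above: flour and water in four recipes, salt and yeast in three.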



This is a lousy recommender because it only states the obvious. It gets more interesting when we look at similar items and guess what the next ingredient might be.



Item-Item Recommender

What is the item most often used together with flour and water? An aggregation counts all the ingredients in recipes that contain both “flour” and “water”.



GET recipes/_search?search_type=count
{
  "query": {
    "bool": {
      "filter": {
        "and": {
          "filters": [
            { "term": { "ingredients": "water" } },
            { "term": { "ingredients": "flour" } }
          ]
        }
      }
    }
  },
  "aggs": {
    "bestMatch": {
      "terms": {
        "field": "ingredients",
        "exclude": ["water", "flour"],
        "min_doc_count": 2
      }
    }
  }
}

Code: search_type=count suppresses the recipe hits, so only the aggregation is returned. An “and” filter restricts the search to documents that contain both water and flour. The “aggs” part counts the ingredient terms of those documents in an aggregation called “bestMatch”. “exclude” drops the two terms we searched for, and “min_doc_count” sets a threshold that cuts off less frequent terms from the result.



The result:



{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "bestMatch": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 2,
      "buckets": [
        { "key": "salt", "doc_count": 3 },
        { "key": "yeast", "doc_count": 3 }
      ]
    }
  }
}

Code: “salt” and “yeast” are the most commonly used terms together with “flour” and “water”
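The same co-occurrence counting can be sketched in plain Python, which may help when reasoning about what the filter and the aggregation do together. This is an illustration on the seven recipes from the bulk request, not a replacement for the query; the function name co_occurring is hypothetical.

```python
from collections import Counter

# sales_name -> ingredients, copied from the article's bulk request.
RECIPES = {
    "pizza Margherita": ["flour", "water", "tomato", "mozzarella", "yeast", "salt"],
    "kaiser roll": ["flour", "yeast", "malt", "water", "salt"],
    "focaccia": ["flour", "yeast", "olive oil", "water", "salt"],
    "Sacher Torte": ["flour", "water", "sugar", "eggs", "chocolate", "apricot jam"],
    "bratwurst": ["pork", "veal", "water", "salt"],
    "hamburger": ["beef", "bread", "pepper", "salt"],
    "salami": ["pork", "pork fat", "pepper", "salt"],
}


def co_occurring(recipes, seed, min_doc_count=2):
    """Rank ingredients that appear in recipes containing all seed items."""
    seed = set(seed)
    counts = Counter()
    for ingredients in recipes.values():
        items = set(ingredients)
        if seed <= items:                # the filter: recipe has every seed item
            counts.update(items - seed)  # the "exclude": skip the seed items
    frequent = [(t, n) for t, n in counts.items() if n >= min_doc_count]
    # Most frequent first; ties are broken alphabetically for a stable order.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


suggestions = co_occurring(RECIPES, ["flour", "water"])
print(suggestions)
```

As in the Elasticsearch response, salt and yeast each co-occur with flour and water in three recipes.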



Significant terms aggregation

With Elasticsearch 1.1 a feature called “significant terms aggregation” was introduced. The folks at Elastic.co brilliantly described it as finding the “uncommonly common”.

How does it differ from the “terms” aggregation?

{
  "took": 5,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": { "total": 5, "max_score": 0, "hits": [] },
  "aggregations": {
    "bestMatch": {
      "doc_count": 5,
      "buckets": [
        { "key": "yeast", "doc_count": 3, "score": 0.24000000000000002, "bg_count": 3 }
      ]
    }
  }
}

Result: when we replace “terms” with “significant_terms” in the previous query, “salt” is omitted and we only get “yeast”.



The significant terms feature factors out the base probability of a term and suggests terms that are both relevant and uncommon. Salt is very common in our recipe database, so it is omitted. Yeast is not so common but highly associated with flour and water. A great tip from our recommender!
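How the base probability is factored out can be illustrated with a small calculation. The sketch below assumes the aggregation's default JLH heuristic, which multiplies the absolute and the relative change between a term's frequency in the matching subset (foreground) and in the whole index (background). The function name jlh_score is hypothetical. The yeast counts are taken from the response above, with the background size assumed to be the seven indexed recipes; the salt background count of 6 is counted by hand from the bulk request.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """JLH-style score: (fg% - bg%) * (fg% / bg%), or 0 if not more frequent."""
    fg = fg_count / fg_total  # share of matching documents with the term
    bg = bg_count / bg_total  # share of all documents with the term
    if fg <= bg:
        # A term that is not more frequent in the subset is not significant.
        return 0.0
    return (fg - bg) * (fg / bg)


# yeast: 3 of the 5 matching documents, 3 of 7 documents overall
yeast = jlh_score(3, 5, 3, 7)
# salt: 3 of the 5 matching documents, but 6 of 7 documents overall
salt = jlh_score(3, 5, 6, 7)
print(yeast, salt)
```

Under these assumptions yeast scores about 0.24, in line with the score in the response, while salt gets no positive score because it is even more frequent in the index as a whole than in the flour-and-water subset.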



Other Elasticsearch features useful for recommendations

For unstructured text recommendations, the More Like This query is very handy. To demonstrate this, we’ve prepared some recipes from the openrecipes database: bulk_openrecipe.json

curl -XPOST localhost:9200/_bulk --data-binary "@bulk_openrecipes.json"

Code: bulk loading the openrecipes database



After bulk loading these, we inspect the index and see that all the ingredients are in a single text field. We apply the More Like This query and see if we get similar recipes.



GET openrecipes/_search
{
  "query": {
    "more_like_this": {
      "fields": ["ingredients"],
      "like": "2 whole Medium Onions, Halved And Sliced\n1-1/4 stick Butter\n2 pounds Cube Steak, Cut Into 1/2-inch Strips\n1 teaspoon Kosher Salt\n1 teaspoon Black Pepper\n1/4 cup Worcestershire Sauce\n5 dashes Tabasco Sauce\n4 whole Deli Rolls (crusty), Split\n8 slices (thick) Fresh Mozzarella\n8 slices (thick) Ripe Tomato\n1-1/2 cup Arugula",
      "min_term_freq": 1,
      "max_query_terms": 12
    }
  }
}

The result is satisfying:



{
  "_index": "openrecipes",
  "_type": "recipe",
  "_id": "AVHPrcQdYq29GAi8ew-S",
  "_score": 0.85339624,
  "_source": {
    "name": "The MM Sandwich, PW Style",
    "ingredients": "4 Tablespoons Butter\n2 pounds Cube Steak (round Steak That's Been Extra Tenderized)\nKosher Salt\nFreshly Ground Pepper\n1 whole Large Yellow Onion, Halved And Sliced Thick\n2 whole Green Bell Peppers, Sliced Into Rings\n2 whole Red Bell Peppers, Sliced Into Rings\n3 cloves Garlic, Minced\n16 ounces, weight White Mushrooms, Sliced\n2 Tablespoons (additional) Butter\n1-1/2 cup Sherry (regular Or Cooking Sherry Is Fine)\n4 Tablespoons Worcestershire Sauce\n4 dashes Tabasco (more To Taste)\n8 whole Deli Rolls (the Crustier The Better)\n2 Tablespoons (additional) Butter\n8 slices Cheese (Provolone, Swiss, Pepper Jack)",
    ...

Mahout integration

Elasticsearch does a good job for simple recommendation requirements. For more demanding ones, an integration with a scalable machine learning framework like Mahout is available.

Useful links

Item-Item Recommender with Lucene

http://sujitpal.blogspot.it/2013/12/using-lucene-similarity-in-item-item.html

Some industry experience with Lucene powered recommenders

http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine

The significant-terms guide

https://www.elastic.co/guide/en/elasticsearch/guide/current/significant-terms.html

In depth with significant terms

https://www.elastic.co/guide/en/elasticsearch/reference/2.x/search-aggregations-bucket-significantterms-aggregation.html

http://www.infoq.com/presentations/elasticsearch-revealing-uncommonly-common

Alternative similarities for the More Like This query

https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html

Integrating Elasticsearch with a Mahout recommender

https://github.com/codelibs/elasticsearch-taste

Description of the recommender algorithms in Mahout

https://mahout.apache.org/users/basics/algorithms.html

Openrecipes

https://github.com/fictivekin/openrecipes



