Elasticsearch provides a powerful, RESTful HTTP interface for indexing and querying data, built on top of the Apache Lucene library. Right out of the box, it provides scalable, efficient, and robust search, with UTF-8 support. It’s a powerful tool for indexing and querying massive amounts of structured data and, here at Toptal, it powers our platform search and will soon be used for autocompletion as well. We’re huge fans.

Since our platform is built using Ruby on Rails, our integration of Elasticsearch takes advantage of the elasticsearch-ruby project (a Ruby integration framework for Elasticsearch that provides a client for connecting to an Elasticsearch cluster, a Ruby API for Elasticsearch’s REST API, and various extensions and utilities). Building on this foundation, we’ve developed and released our own improvement (and simplification) of the Elasticsearch application search architecture, packaged as a Ruby gem that we’ve named Chewy (with an example app available here).

Chewy extends the Elasticsearch-Ruby client, making it more powerful and providing tighter integration with Rails. In this Elasticsearch guide, I discuss (through usage examples) how we accomplished this, including the technical obstacles that emerged during implementation.

Just a couple of quick notes before proceeding to the guide:

Both Chewy and a Chewy demo application are available on GitHub.

For those interested in more “under the hood” info about Elasticsearch, I’ve included a brief write-up as an Appendix to this post.

Why Chewy?

Despite Elasticsearch’s scalability and efficiency, integrating it with Rails didn’t turn out to be quite as simple as anticipated. At Toptal, we found ourselves needing to significantly augment the basic Elasticsearch-Ruby client to make it more performant and to support additional operations.

And thus, the Chewy gem was born.

A few particularly noteworthy features of Chewy include:

- Every index is observable by all the related models. Most indexed models are related to each other, and sometimes it’s necessary to denormalize this related data and bind it to the same object (e.g., if you want to index an array of tags together with their associated article). Chewy allows you to specify an updatable index for every model, so corresponding articles will be reindexed whenever a relevant tag is updated.
- Index classes are independent from ORM/ODM models. With this enhancement, implementing cross-model autocompletion, for example, is much easier. You can just define an index and work with it in an object-oriented fashion. Unlike other clients, the Chewy gem removes the need to manually implement index classes, data import callbacks, and other components.
- Bulk import is everywhere. Chewy utilizes the bulk Elasticsearch API for full reindexing and index updates. It also utilizes the concept of atomic updates, collecting changed objects within an atomic block and updating them all at once.
- Chewy provides an AR-style query DSL. By being chainable, mergeable, and lazy, it allows queries to be produced in a more efficient manner.

OK, so let’s see how this all plays out in the gem…

The basic guide to Elasticsearch

Elasticsearch has several document-related concepts. The first is that of an index (the analogue of a database in an RDBMS), which consists of a set of documents, which can be of several types (where a type is a kind of RDBMS table).

Every document has a set of fields. Each field is analyzed independently, and its analysis options are stored in the mapping for its type. Chewy utilizes this structure “as is” in its object model:

```ruby
class EntertainmentIndex < Chewy::Index
  settings analysis: {
    analyzer: {
      title: {
        tokenizer: 'standard',
        filter: ['lowercase', 'asciifolding']
      }
    }
  }

  define_type Book.includes(:author, :tags) do
    field :title, analyzer: 'title'
    field :year, type: 'integer'
    field :author, value: ->{ author.name }
    field :author_id, type: 'integer'
    field :description
    field :tags, index: 'not_analyzed', value: ->{ tags.map(&:name) }
  end

  {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
    define_type scope.includes(:director, :tags), name: type_name do
      field :title, analyzer: 'title'
      field :year, type: 'integer'
      field :author, value: ->{ director.name }
      field :author_id, type: 'integer', value: ->{ director_id }
      field :description
      field :tags, index: 'not_analyzed', value: ->{ tags.map(&:name) }
    end
  end
end
```

Above, we defined an Elasticsearch index called entertainment with three types: book, movie, and cartoon. For each type, we defined some field mappings and a hash of settings for the whole index.

So, we’ve defined the EntertainmentIndex and we want to execute some queries. As a first step, we need to create the index and import our data:

```ruby
EntertainmentIndex.create!
EntertainmentIndex.import
# EntertainmentIndex.reset! (which includes deletion,
# creation, and import) could be used instead
```

The .import method knows which data to import because we passed in scopes when we defined our types; thus, it will import all the books, movies, and cartoons held in persistent storage.

With that done, we can perform some queries:

```ruby
EntertainmentIndex.query(match: {author: 'Tarantino'}).filter{ year > 1990 }
EntertainmentIndex.query(match: {title: 'Shawshank'}).types(:movie)
EntertainmentIndex.query(match: {author: 'Tarantino'}).only(:id).limit(10).load
# the last one loads ActiveRecord objects for documents found
```

Now our index is almost ready to be used in our search implementation.

Rails integration

For integration with Rails, the first thing we need is to be able to react to RDBMS object changes. Chewy supports this behavior via callbacks defined within the update_index class method. update_index takes two arguments:

1. A type identifier supplied in the "index_name#type_name" format
2. A method name or block to execute, which represents a back-reference to the updated object or object collection

We need to define these callbacks for each dependent model:

```ruby
class Book < ActiveRecord::Base
  acts_as_taggable
  belongs_to :author, class_name: 'Dude'

  # We update the book itself on-change
  update_index 'entertainment#book', :self
end

class Video < ActiveRecord::Base
  acts_as_taggable
  belongs_to :director, class_name: 'Dude'

  # Update video types when changed, depending on the category
  update_index('entertainment#movie') { self if movie? }
  update_index('entertainment#cartoon') { self if cartoon? }
end

class Dude < ActiveRecord::Base
  acts_as_taggable
  has_many :books
  has_many :videos

  # If an author or director was changed, all the corresponding
  # books, movies, and cartoons are updated
  update_index 'entertainment#book', :books
  update_index('entertainment#movie') { videos.movies }
  update_index('entertainment#cartoon') { videos.cartoons }
end
```

Since tags are also indexed, we next need to monkey-patch some external models so that they react to changes:

```ruby
ActsAsTaggableOn::Tag.class_eval do
  has_many :books, through: :taggings, source: :taggable, source_type: 'Book'
  has_many :videos, through: :taggings, source: :taggable, source_type: 'Video'

  # Updating all tag-related objects
  update_index 'entertainment#book', :books
  update_index('entertainment#movie') { videos.movies }
  update_index('entertainment#cartoon') { videos.cartoons }
end

ActsAsTaggableOn::Tagging.class_eval do
  # Same goes for the intermediate model
  update_index('entertainment#book') { taggable if taggable_type == 'Book' }
  update_index('entertainment#movie') { taggable if taggable_type == 'Video' && taggable.movie? }
  update_index('entertainment#cartoon') { taggable if taggable_type == 'Video' && taggable.cartoon? }
end
```

At this point, every object save or destroy will update the corresponding Elasticsearch index type.

Atomicity

We still have one lingering problem. If we do something like books.map(&:save) to save multiple books, we’ll request an update of the entertainment index every time an individual book is saved. Thus, if we save five books, we’ll update the Chewy index five times. This behavior is acceptable for a REPL session, but certainly not for controller actions, where performance is critical.

We address this issue with the Chewy.atomic block:

```ruby
class ApplicationController < ActionController::Base
  around_action { |&block| Chewy.atomic(&block) }
end
```

In short, Chewy.atomic batches these updates as follows:

1. Disables the after_save callback.
2. Collects the IDs of saved books.
3. On completion of the Chewy.atomic block, uses the collected IDs to make a single Elasticsearch index update request.
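The batching pattern behind those three steps can be sketched in plain Ruby. This is a toy illustration of the idea, not Chewy’s actual implementation; the AtomicBatcher class and its methods are hypothetical names invented for this example:

```ruby
# Toy sketch of atomic batching: outside a batch, every update fires its
# own "request"; inside an atomic block, IDs are merely collected and
# flushed as a single request when the block completes.
class AtomicBatcher
  attr_reader :requests  # stands in for real Elasticsearch bulk requests

  def initialize
    @pending_ids = []
    @requests = []
  end

  # Called from model callbacks; inside a batch it only records the ID.
  def index_update(id)
    if @batching
      @pending_ids << id
    else
      @requests << [id]  # one request per object -- the slow path
    end
  end

  # The analogue of Chewy.atomic: suspend per-object updates, run the
  # block, then issue one combined request for all collected IDs.
  def atomic
    @batching = true
    yield
    @requests << @pending_ids.uniq unless @pending_ids.empty?
  ensure
    @batching = false
    @pending_ids = []
  end
end

batcher = AtomicBatcher.new
batcher.atomic { [1, 2, 3].each { |id| batcher.index_update(id) } }
batcher.requests # one bulk request [[1, 2, 3]] instead of three
```

Saving five books inside the block thus costs one index request rather than five, which is exactly the win Chewy.atomic provides in controller actions.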

Searching

Now we’re ready to implement a search interface. Since our user interface is a form, the best way to build it is, of course, with FormBuilder and ActiveModel. (At Toptal, we use ActiveData to implement ActiveModel interfaces, but feel free to use your favorite gem.)

```ruby
class EntertainmentSearch
  include ActiveData::Model

  attribute :query, type: String
  attribute :author_id, type: Integer
  attribute :min_year, type: Integer
  attribute :max_year, type: Integer
  attribute :tags, mode: :arrayed, type: String,
    normalize: ->(value) { value.reject(&:blank?) }

  # This accessor is for the form. It will have a single text field
  # for comma-separated tag inputs.
  def tag_list= value
    self.tags = value.split(',').map(&:strip)
  end

  def tag_list
    self.tags.join(', ')
  end
end
```

Query and filters tutorial

Now that we have an ActiveModel-like object that can accept and typecast attributes, let’s implement search:

```ruby
class EntertainmentSearch
  ...

  def index
    EntertainmentIndex
  end

  def search
    # We can merge multiple scopes
    [query_string, author_id_filter,
     year_filter, tags_filter].compact.reduce(:merge)
  end

  # Using the query_string advanced query for the main query input
  def query_string
    index.query(query_string: {
      fields: [:title, :author, :description],
      query: query,
      default_operator: 'and'
    }) if query?
  end

  # Simple term filter for author id. `:author_id` is already
  # typecast to integer and ignored if empty.
  def author_id_filter
    index.filter(term: {author_id: author_id}) if author_id?
  end

  # For filtering on years, we use a range filter.
  # Returns nil if neither min_year nor max_year was passed to the model.
  def year_filter
    body = {}.tap do |body|
      body.merge!(gte: min_year) if min_year?
      body.merge!(lte: max_year) if max_year?
    end
    index.filter(range: {year: body}) if body.present?
  end

  # Same as `author_id_filter`, but the `terms` filter is used.
  # Returns nil if no tags were passed in.
  def tags_filter
    index.filter(terms: {tags: tags}) if tags?
  end
end
```

Controllers and views

At this point, our model can perform search requests with passed attributes. Usage will look something like:

```ruby
EntertainmentSearch.new(query: 'Tarantino', min_year: 1990).search
```

Note that in the controller, we want to load exact ActiveRecord objects instead of Chewy document wrappers:

```ruby
class EntertainmentController < ApplicationController
  def index
    @search = EntertainmentSearch.new(params[:search])

    # In case we want to load real objects, we don't need any other
    # fields except for `:id` retrieved from the Elasticsearch index.
    # The Chewy query DSL supports the Kaminari gem and the corresponding API.
    # Also, we pass scopes for every requested type to the `load` method.
    @entertainments = @search.search.only(:id).page(params[:page]).load(
      book: {scope: Book.includes(:author)},
      movie: {scope: Video.includes(:director)},
      cartoon: {scope: Video.includes(:director)}
    )
  end
end
```

Now, it’s time to write up some HAML at entertainment/index.html.haml:

```haml
= form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
  = f.text_field :query
  = f.select :author_id, Dude.all.map { |d| [d.name, d.id] }, include_blank: true
  = f.text_field :min_year
  = f.text_field :max_year
  = f.text_field :tag_list
  = f.submit

- if @entertainments.any?
  %dl
    - @entertainments.each do |entertainment|
      %dt
        %h1= entertainment.title
        %strong= entertainment.class
      %dd
        %p= entertainment.year
        %p= entertainment.description
        %p= entertainment.tag_list
  = paginate @entertainments
- else
  Nothing to see here
```

Sorting

As a bonus, we’ll also add sorting to our search functionality.

Assume that we need to sort on the title and year fields, as well as by relevance. Unfortunately, the title One Flew Over the Cuckoo's Nest will be split into individual terms, so sorting by these disparate terms will be too random; instead, we’d like to sort by the entire title.
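Before looking at the fix, a tiny plain-Ruby illustration of the problem may help. The two “analyzers” below are crude stand-ins (lowercasing plus splitting versus lowercasing the whole phrase), not real Elasticsearch analysis, but they show why an analyzed field gives you disparate terms while a keyword-style field gives you a single sortable term:

```ruby
# Toy comparison of standard-like vs keyword-like analysis for sorting.
titles = ["One Flew Over the Cuckoo's Nest", 'Old Boy', 'Oldboy']

# Standard-like analysis: lowercase, then split into independent terms.
# The full title is gone; only fragments like "cuckoo" remain.
standard_terms = titles.flat_map { |t| t.downcase.split(/\W+/) }

# Keyword-like analysis: the entire lowercased phrase is one term,
# so lexicographic sorting behaves the way a user expects.
keyword_terms = titles.map(&:downcase)

standard_terms.include?('cuckoo')  # => true -- fragments only
keyword_terms.sort.first           # => "old boy" -- whole-title order
```

Sorting on the standard-analyzed field would effectively order documents by one of those arbitrary fragments, which is why we need a separate keyword-tokenized subfield.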

The solution is to use a special title field and apply its own analyzer:

```ruby
class EntertainmentIndex < Chewy::Index
  settings analysis: {
    analyzer: {
      ...
      sorted: {
        # The `keyword` tokenizer will not split our titles and
        # will produce the whole phrase as the term, which
        # can be sorted easily
        tokenizer: 'keyword',
        filter: ['lowercase', 'asciifolding']
      }
    }
  }

  define_type Book.includes(:author, :tags) do
    # We use the `multi_field` type to add the `title.sorted` field
    # to the type mapping. Also, we still use just the `title`
    # field for search.
    field :title, type: 'multi_field' do
      field :title, index: 'analyzed', analyzer: 'title'
      field :sorted, index: 'analyzed', analyzer: 'sorted'
    end
    ...
  end

  {movie: Video.movies, cartoon: Video.cartoons}.each do |type_name, scope|
    define_type scope.includes(:director, :tags), name: type_name do
      # For videos as well
      field :title, type: 'multi_field' do
        field :title, index: 'analyzed', analyzer: 'title'
        field :sorted, index: 'analyzed', analyzer: 'sorted'
      end
      ...
    end
  end
end
```

In addition, we’re going to add both these new attributes and the sort processing step to our search model:

```ruby
class EntertainmentSearch
  # We are going to use the `title.sorted` field for sorting
  SORT = {title: {'title.sorted' => :asc}, year: {year: :desc}, relevance: :_score}
  ...
  attribute :sort, type: String, enum: %w(title year relevance),
    default_blank: 'relevance'
  ...

  def search
    # We have added the `sorting` scope to the merge list
    [query_string, author_id_filter, year_filter,
     tags_filter, sorting].compact.reduce(:merge)
  end

  def sorting
    # We have one of the 3 possible values in the `sort` attribute,
    # and the `SORT` mapping returns the actual sorting expression
    index.order(SORT[sort.to_sym])
  end
end
```

Finally, we’ll modify our form, adding a sort options selection box:

```haml
= form_for @search, as: :search, url: entertainment_index_path, method: :get do |f|
  ...
  / `EntertainmentSearch.sort_values` will just return
  / enum option content from the sort attribute definition.
  = f.select :sort, EntertainmentSearch.sort_values
  ...
```

Error handling

If your users perform incorrect queries like ( or AND, the Elasticsearch client will raise an error. To handle that, let’s make some changes to our controller:

```ruby
class EntertainmentController < ApplicationController
  def index
    @search = EntertainmentSearch.new(params[:search])
    @entertainments = @search.search.only(:id).page(params[:page]).load(
      book: {scope: Book.includes(:author)},
      movie: {scope: Video.includes(:director)},
      cartoon: {scope: Video.includes(:director)}
    )
  rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
    @entertainments = []
    @error = e.message.match(/QueryParsingException\[([^;]+)\]/).try(:[], 1)
  end
end
```
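The regex in the rescue clause can be checked in isolation: it captures everything between "QueryParsingException[" and the closing bracket before the first semicolon. The sample message below is illustrative (not a verbatim Elasticsearch transcript), and plain Ruby’s String#[] with a capture group is used in place of Rails’ .try(:[], 1):

```ruby
# Extracting a readable error from a QueryParsingException-style message.
# The sample message here is a made-up example of the format.
message = 'QueryParsingException[Failed to parse query [( AND]]; nested: ...'

# String#[](regexp, capture) returns the capture group, or nil if the
# regex does not match -- so malformed messages degrade gracefully.
error = message[/QueryParsingException\[([^;]+)\]/, 1]
# error => "Failed to parse query [( AND]"

'unrelated error text'[/QueryParsingException\[([^;]+)\]/, 1]
# => nil, so @error simply stays blank for unexpected messages
```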

Further, we need to render the error in the view:

```haml
...
- if @entertainments.any?
  ...
- else
  - if @error
    = @error
  - else
    Nothing to see here
```

Testing Elasticsearch queries

The basic testing setup is as follows:

1. Start the Elasticsearch server.
2. Clean up and create our indices.
3. Import our data.
4. Perform our query.
5. Cross-reference the result with our expectations.

For step 1, it’s convenient to use the test cluster defined in the elasticsearch-extensions gem. Just add the following line to your project’s Rakefile after installing the gem:

```ruby
require 'elasticsearch/extensions/test/cluster/tasks'
```

Then, you’ll get the following Rake tasks:

```shell
$ rake -T elasticsearch
rake elasticsearch:start  # Start Elasticsearch cluster for tests
rake elasticsearch:stop   # Stop Elasticsearch cluster for tests
```

Elasticsearch and RSpec

First, we need to make sure that our index is updated in sync with our data changes. Luckily, the Chewy gem comes with the helpful update_index RSpec matcher:

```ruby
describe EntertainmentIndex do
  # No need to clean up Elasticsearch, as requests are
  # stubbed when the `update_index` matcher is used.
  describe 'Tag' do
    # We create several books with the same tag
    let(:books) { create_list :book, 2, tag_list: 'tag1' }

    specify do
      # We expect that after modifying the tag name...
      expect do
        ActsAsTaggableOn::Tag.where(name: 'tag1').update_attributes(name: 'tag2')
      # ...the corresponding type will be updated with the previously-created books.
      end.to update_index('entertainment#book').and_reindex(books, with: {tags: ['tag2']})
    end
  end
end
```

Next, we need to test that the actual search queries are performed properly and that they return the expected results:

```ruby
describe EntertainmentSearch do
  # Just defining helpers to simplify testing
  def search attributes = {}
    EntertainmentSearch.new(attributes).search
  end

  # Import helper as well
  def import *args
    # We are using `import!` here to be sure all the objects are imported
    # correctly before examples run.
    EntertainmentIndex.import! *args
  end

  # Deletes and recreates the index before every example
  before { EntertainmentIndex.purge! }

  describe '#min_year, #max_year' do
    let(:book) { create(:book, year: 1925) }
    let(:movie) { create(:movie, year: 1970) }
    let(:cartoon) { create(:cartoon, year: 1995) }

    before { import book: book, movie: movie, cartoon: cartoon }

    # NOTE: The sample code below provides a clear usage example but is not
    # optimized code. Something along the following lines would perform better:
    # `specify { search(min_year: 1970).map(&:id).map(&:to_i)
    #    .should =~ [movie, cartoon].map(&:id) }`
    specify { search(min_year: 1970).load.should =~ [movie, cartoon] }
    specify { search(max_year: 1980).load.should =~ [book, movie] }
    specify { search(min_year: 1970, max_year: 1980).load.should == [movie] }
    specify { search(min_year: 1980, max_year: 1970).should == [] }
  end
end
```

Test cluster troubleshooting

Finally, here is a guide for troubleshooting your test cluster:

To start, use an in-memory, one-node cluster. It will be much faster for specs. In our case:

```shell
TEST_CLUSTER_NODES=1 rake elasticsearch:start
```

There are some existing issues with the elasticsearch-extensions test cluster implementation itself, related to the one-node cluster status check (the status is yellow in some cases and will never be green, so a start check that waits for green status will fail every time). The issue has been fixed in a fork, and hopefully it will be fixed in the main repo soon.

For each dataset, group your requests in specs (i.e., import your data once and then perform several requests). Elasticsearch takes a long time to warm up and uses a lot of heap memory while importing data, so don’t overdo it, especially if you’ve got a bunch of specs.

Make sure your machine has sufficient memory or Elasticsearch will freeze (we required around 5GB for each testing virtual machine and around 1GB for Elasticsearch itself).

Wrapping up

Elasticsearch is self-described as “a flexible and powerful open source, distributed, real-time search and analytics engine.” It’s the gold standard in search technologies.

With Chewy, our Rails developers have packaged these benefits as a simple, easy-to-use, production-quality, open source Ruby gem that provides tight integration with Rails. Elasticsearch and Rails – what an awesome combination!

Appendix: Elasticsearch internals

Here’s a very brief introduction to Elasticsearch “under the hood”…

Elasticsearch is built on Lucene, which itself uses inverted indices as its primary data structure. For example, if we have the strings “the dogs jump high”, “jump over the fence”, and “the fence was too high”, we get the following structure:

"the" [0, 0], [1, 2], [2, 0] "dogs" [0, 1] "jump" [0, 2], [1, 0] "high" [0, 3], [2, 4] "over" [1, 1] "fence" [1, 3], [2, 1] "was" [2, 2] "too" [2, 3]

Thus, every term contains both references to, and positions in, the text. Furthermore, we choose to modify our terms (e.g., by removing stop-words like “the”) and apply phonetic hashing to every term (can you guess the algorithm?):

"DAG" [0, 1] "JANP" [0, 2], [1, 0] "HAG" [0, 3], [2, 4] "OVAR" [1, 1] "FANC" [1, 3], [2, 1] "W" [2, 2] "T" [2, 3]

If we then query for “the dog jumps”, it’s analyzed in the same way as the source text, becoming “DAG JANP” after hashing (“dog” has the same hash as “dogs”, as is true with “jumps” and “jump”).
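The inverted index described above can be sketched in a few lines of plain Ruby. This toy version handles tokenization and stop-word removal (just “the”, matching the example), but omits the phonetic-hashing step; note that positions are assigned before stop-words are dropped, which is why they match the tables above:

```ruby
# Build a toy inverted index: each term maps to a list of
# [document_number, position] pairs, with stop-words removed.
STOP_WORDS = %w(the).freeze

def inverted_index(documents)
  index = Hash.new { |hash, term| hash[term] = [] }
  documents.each_with_index do |doc, doc_number|
    # Positions count every token, including stop-words, so that
    # phrase queries can still reason about term distances.
    doc.downcase.split(/\W+/).each_with_index do |term, position|
      next if STOP_WORDS.include?(term)
      index[term] << [doc_number, position]
    end
  end
  index
end

docs = ['the dogs jump high', 'jump over the fence', 'the fence was too high']
inverted_index(docs)['jump']  # => [[0, 2], [1, 0]]
inverted_index(docs)['fence'] # => [[1, 3], [2, 1]]
```

Querying then reduces to analyzing the query string the same way and intersecting the posting lists of its terms, which is why the “the dog jumps” example above must pass through the identical analysis chain as the source text.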