Recently I helped my data scientist colleagues engineer a search system based on their deep learning models. Their project created document embeddings with a deep learning model and then used those embedding vectors in our search system to find similar documents.

A document embedding is essentially just a (long) array of numbers, and finding similar documents means finding other (long) arrays of numbers that are close. Similarity can be measured, for example, by Euclidean distance.
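As an illustration, here's how that Euclidean distance could be computed with NumPy (toy 4-dimensional vectors; real embeddings are much longer):

```python
# Illustrative example: measuring similarity between two document
# embeddings with Euclidean distance (smaller = more similar).
import numpy as np

doc_a = np.array([0.12, -0.45, 0.33, 0.08])  # toy 4-dim embeddings;
doc_b = np.array([0.10, -0.40, 0.35, 0.05])  # real ones are much longer

distance = np.linalg.norm(doc_a - doc_b)
print(distance)
```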

What you can do with this is find similar documents. But because it's not based directly on keywords but on "embeddings", you automatically get functionality comparable to synonym expansion: it finds related documents even if they use different keywords. So this can work much better than keyword search.

There are existing tools for such a problem, for example the FAISS library by Facebook. This library is very fast and supports various clever methods for fast searching with such embedding vectors. However, it doesn't integrate nicely with a search engine such as Elasticsearch.

For Elasticsearch there are also some plugins offering similar functionality, but they aren't nearly as fast, because they only calculate vector similarity and don't filter first.

So we engineered our own, better solution ;-)

Fast Nearest Neighbours

For fast search, usually some kind of "index" is used: a data structure that allows efficiently filtering down to the relevant matches without evaluating each match individually. For searching on keywords, an "inverted index" is used. For searching on geo coordinates, a data structure called a KDTree is used. We need some such mechanism that quickly filters down to the most relevant matches, so we only have to calculate the exact score on this smaller set. This is important because calculating distances over a large set of high-dimensional vectors is an expensive (slow) operation.
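As a rough sketch of the idea, here's what a KDTree lookup looks like with scikit-learn (toy random data; the point is that the tree avoids scoring every document):

```python
# Sketch: nearest-neighbour lookup with a KDTree (scikit-learn),
# avoiding a brute-force scan over all vectors.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(42)
docs = rng.normal(size=(10_000, 8))   # toy corpus: 10k 8-dim vectors

tree = KDTree(docs)                    # build the index once
query = rng.normal(size=(1, 8))

# Retrieve the 5 nearest neighbours without scoring every document.
dist, idx = tree.query(query, k=5)
print(idx[0])   # indices of the 5 closest documents
```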

The FAISS library mentioned above solves this problem in a few different ways:

Reducing dimensionality with PCA

KMeans clustering

Locality Sensitive Hashing

Probably more ways I don't know yet

Each of these approaches enables an efficient indexing approach, where you can quickly filter down to the near-ish neighbours and then calculate exact distances to find the nearest neighbours. After dimensionality reduction, one can use a KDTree, after clustering or Locality Sensitive Hashing an inverted index.
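A minimal sketch of the clustering variant, using scikit-learn's KMeans as a coarse "inverted index" (all data and parameters here are made up for illustration):

```python
# Sketch of the clustering idea: KMeans centroids act as a coarse
# "inverted index" -- at query time, compute exact distances only
# within the cluster closest to the query.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
docs = rng.normal(size=(5_000, 8))    # toy corpus

kmeans = KMeans(n_clusters=50, n_init=5, random_state=0).fit(docs)
labels = kmeans.labels_                      # cluster id per document

query = rng.normal(size=(1, 8))
nearest_cluster = kmeans.predict(query)[0]   # coarse filtering step

# Exact distances only inside the matching cluster (a small subset).
candidates = np.where(labels == nearest_cluster)[0]
dists = np.linalg.norm(docs[candidates] - query, axis=1)
best = candidates[np.argsort(dists)[:5]]
print(best)
```

In practice you would search the few closest clusters rather than just one, to reduce the chance of missing true neighbours near a cluster boundary.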

The plot above shows how filtering down the dataset speeds up computation. There is a linear relation between the number of docs where an exact distance needs to be computed and the computation time. This shows how important it is to efficiently filter out non-similar documents.

All of these approaches can potentially be implemented in Elasticsearch as well, of course. The advantage that gives is easy integration with the rest of the search system: you can combine queries based on keywords, or any other criteria, with the deep learning results.

Experimentation showed that on our dataset, reducing dimensions with PCA and then indexing with a KDTree gives us the best combination of speed and accuracy.
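A rough sketch of that pipeline with scikit-learn (toy random data; the number of components and candidates are illustrative, not our actual settings):

```python
# Sketch of the pipeline: reduce with PCA, filter with a KDTree on
# the reduced vectors, then re-rank candidates by exact distance on
# the full vectors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KDTree

rng = np.random.default_rng(1)
full = rng.normal(size=(20_000, 64))        # toy 64-dim embeddings

pca = PCA(n_components=8).fit(full)
reduced = pca.transform(full)               # 8-dim vectors for filtering

tree = KDTree(reduced)
query_full = rng.normal(size=(1, 64))
query_reduced = pca.transform(query_full)

# Step 1: cheap filter -- 500 candidates in the reduced space.
_, candidates = tree.query(query_reduced, k=500)
candidates = candidates[0]

# Step 2: exact distances on the full vectors, candidates only.
exact = np.linalg.norm(full[candidates] - query_full, axis=1)
top10 = candidates[np.argsort(exact)[:10]]
print(top10)
```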

The plot above shows how filtering down the dataset affects result accuracy. It shows that filtering too aggressively means we lose some of the nearest neighbours. But if we filter down to between 50k and 75k documents, we find all the nearest neighbours, while computation takes only a fraction of the time of brute-force computing all distances.

Elasticsearch Plugin

In the Lucene library, the basis of Elasticsearch, the KDTree data structure is already available, but it's not yet exposed by the Elasticsearch API. For calculating exact vector distances there are already plugins, so we only needed to build a small plugin that allows using this index data structure. See here.

Plugging things together

Making it all work is now just a matter of putting the puzzle pieces together in the right order:

Install the Elasticsearch plugins

PCA for dimensionality reduction (Python/sklearn or Java/Smile)

Index the reduced and full vectors (and any other fields) in Elasticsearch

Ready to go! ;-)

To install the plugin, create the index, and add documents, please see here. After following those instructions we can search using our embedding vectors! Notice the range query on pca_reduced_vector: that's what our new plugin enables.

POST my_index/_search

{
  "query": {
    "function_score": {
      "query": {
        "range": {
          "pca_reduced_vector": {
            "from": "-0.5,-0.5,-0.5,-0.5,-0.5,-0.5,-0.5,-0.5",
            "to": "0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5"
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "inline": "vector_scoring",
              "lang": "binary_vector_score",
              "params": {
                "vector_field": "full_vector",
                "vector": [0.0, 0.0716, 0.1761, 0.0, 0.0779, 0.0, 0.1382, 0.3729]
              }
            }
          }
        }
      ],
      "boost_mode": "replace"
    }
  },
  "size": 10
}
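If you build such queries programmatically, a small helper along these lines could generate the body. The field names (pca_reduced_vector, full_vector) and script settings come from the example above; the helper name and the 0.5 filter radius are just illustrative assumptions:

```python
# Sketch: building the Elasticsearch query body from a PCA-reduced
# query vector. The 0.5 filter radius is an assumed tuning parameter,
# not a recommended value.
import json

def build_query(reduced_vector, full_vector, radius=0.5, size=10):
    # The range filter expects the reduced vector bounds as
    # comma-separated strings.
    lower = ",".join(str(round(v - radius, 4)) for v in reduced_vector)
    upper = ",".join(str(round(v + radius, 4)) for v in reduced_vector)
    return {
        "query": {
            "function_score": {
                "query": {
                    "range": {
                        "pca_reduced_vector": {"from": lower, "to": upper}
                    }
                },
                "functions": [{
                    "script_score": {
                        "script": {
                            "inline": "vector_scoring",
                            "lang": "binary_vector_score",
                            "params": {
                                "vector_field": "full_vector",
                                "vector": full_vector,
                            },
                        }
                    }
                }],
                "boost_mode": "replace",
            }
        },
        "size": size,
    }

body = build_query(
    [0.0] * 8,
    [0.0, 0.0716, 0.1761, 0.0, 0.0779, 0.0, 0.1382, 0.3729],
)
print(json.dumps(body)[:80])
```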

Conclusion

We showed how to efficiently search with deep learning vectors. Applications include anything where you want to find similar documents but plain keywords aren't good enough. To create the embedding vectors you could, for example, try Doc2Vec.

I hope you enjoyed this post. If you have feedback or questions, leave a comment or send me a message!