Docker

We are going to split our setup into multiple components, and Docker is the easiest way to isolate each component's environment and make it portable and suitable for production. Docker is especially useful for running various NLP modules, because those often have unusual dependencies that are hard to support in a shared environment.

The Encoder

Universal Sentence Encoder is not the only network that can generate vector representations, but in our internal tests it has performed best (as of July 2019; the NLP world is evolving fast!). As a bonus, it's available in a multilingual variant. This means the network has been trained on an international dataset and simultaneously acts as a translator! Remember the German -> English example from the beginning of this article? Both sentences translate into close representations using USE and would easily match each other.

Universal Sentence Encoder is available as a pre-trained model on the Tensorflow Hub. You can easily download it and experiment locally or launch a notebook right on the hub. For a production environment, however, it’s better to wrap the model into a separate service, so that you can easily update it, scale up and down, etc. Perhaps the most popular approach to running Tensorflow models is using the official Tensorflow Serving tool. It allows you to take a dump of any TF model, and start a multi-threaded, production-ready server with a REST API in front of it. Tensorflow Serving comes in the form of a Docker image, and all you have to do to get up and running is provide a path to your saved model.
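For reference, launching a saved model with the stock Tensorflow Serving image looks roughly like this (the paths and the model name below are placeholders):

$ docker run -p 8501:8501 \
    --mount type=bind,source=/path/to/saved_model,target=/models/my_model \
    -e MODEL_NAME=my_model -t tensorflow/serving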

Unfortunately, we ran into some complications running Universal Sentence Encoder on top of Tensorflow Serving. In particular, USE uses a custom TF operation called sentencepiece. TF Serving requires all custom operations to be statically linked with the main tf_serving binary, and that turned out to be difficult for a number of reasons. Luckily, we found a comment in the sentencepiece repository pointing to another project called Simple Tensorflow Serving. STS provides a similar way of running TF models but doesn't have any hard requirements for custom operations. Below you'll find a few snippets that will help you get your Universal Sentence Encoder up and running using Simple Tensorflow Serving.

$ git clone https://github.com/tobegit3hub/simple_tensorflow_serving
$ git clone https://github.com/google/sentencepiece

$ mkdir simple_tensorflow_serving/tf_sentencepiece

# Copy the correct version of the sentencepiece library into simple_tensorflow_serving
$ cp sentencepiece/tensorflow/tf_sentencepiece/_sentencepiece_processor_ops.so.1.13.1 simple_tensorflow_serving/tf_sentencepiece/sentencepiece_processor_ops.so

# Pin the desired tensorflow version to 1.13.1
# Note that the sed command is written for macOS; you may need to tweak it to work on Linux
$ sed -i.back 's/tensorflow.*$/tensorflow==1.13.1/g' simple_tensorflow_serving/requirements.txt

# Create a folder to store the USE model
$ mkdir simple_tensorflow_serving/models/use

Now, download the Universal Sentence Encoder and put it into the simple_tensorflow_serving/models/use folder using this Python snippet:
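Something along these lines should do it (a sketch written against TF 1.13; the module URL, the tf_sentencepiece import, and the input/output names are assumptions you may need to adjust):

import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece  # registers the sentencepiece op needed by the multilingual USE

MODULE_URL = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/1"
EXPORT_DIR = "simple_tensorflow_serving/models/use/1"  # "1" is the model version folder

with tf.Session(graph=tf.Graph()) as sess:
    module = hub.Module(MODULE_URL)
    text_input = tf.placeholder(dtype=tf.string, shape=[None])
    embeddings = module(text_input)
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    # Export as a SavedModel that Simple Tensorflow Serving can load
    tf.saved_model.simple_save(
        sess,
        EXPORT_DIR,
        inputs={"text": text_input},
        outputs={"embeddings": embeddings},
        legacy_init_op=tf.tables_initializer(),
    )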

Finally, modify the simple_tensorflow_serving/Dockerfile to look like this:
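The Dockerfile that ships with the repository is a good starting point; roughly, it should end up looking like the sketch below (the base image and exact commands may differ between versions, and it assumes Simple Tensorflow Serving's --custom_op_paths option for loading custom ops):

FROM python:3.6

ADD . /simple_tensorflow_serving
WORKDIR /simple_tensorflow_serving

RUN pip install -r requirements.txt
RUN python setup.py install

EXPOSE 8500

CMD simple_tensorflow_serving --model_base_path="./models/use" --custom_op_paths="./tf_sentencepiece"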

Now you can build the encoder image:

$ docker build . --tag use:latest

And run it like this:

$ docker run -it use:latest

Encoder client

New language models are being released by big companies every few months, and it makes a lot of sense to keep this part of our pipeline easily replaceable. To facilitate that, we're going to describe a simple encoder interface that our USE client will implement. Below is a snippet of the Encoder interface along with an implementation that connects to the REST API provided by Simple Tensorflow Serving.
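A minimal sketch, assuming the model was exported with a "text" input and an "embeddings" output as above, and that Simple Tensorflow Serving listens on its default port (the response key may differ depending on your export signature):

import abc
from typing import List

import numpy as np
import requests


class Encoder(abc.ABC):
    """Common interface so the underlying language model can be swapped out."""

    @abc.abstractmethod
    def encode(self, texts: List[str]) -> np.ndarray:
        """Return a matrix of shape (len(texts), vector_dim)."""


class USEEncoder(Encoder):
    """Encoder client talking to the Simple Tensorflow Serving REST API."""

    def __init__(self, endpoint: str = "http://localhost:8500"):
        self.endpoint = endpoint

    def encode(self, texts: List[str]) -> np.ndarray:
        payload = {"model_name": "default", "data": {"text": texts}}
        resp = requests.post(self.endpoint, json=payload)
        resp.raise_for_status()
        # The key under which STS returns the result depends on the output
        # name used when exporting the model ("embeddings" in our case)
        return np.asarray(resp.json()["embeddings"], dtype=np.float32)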

Building an index

We've got the encoding part covered, but how do we actually create an index and search? ElasticSearch uses a trie-based inverted index to quickly map a keyword to the list of documents containing it. A trie is an efficient data structure that can find exact word-to-word matches very quickly. Unfortunately, it won't work for us in vector space. With vectors, we're not looking for exact matches between them. Instead, we want to be able to quickly return the K closest vectors to the one representing a query. Each found record would then correspond to a title or some other field of an indexed document. There's no need for an extra scoring algorithm, as the distance itself serves as a score.

Facebook FAISS

Several libraries can help you build a vector-space index. The one that we use in OneBar is called FAISS. It's developed by Facebook AI Research and available under the MIT license.

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU.

Here’s a little example of how to use FAISS and the Encoder together:
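A minimal sketch (the titles are made up, and USEEncoder is the client from the previous snippet; the vectors are L2-normalized so that the inner product behaves like cosine similarity):

import faiss

titles = [
    "How do I reset my VPN password?",
    "Office wifi keeps disconnecting",
    "Requesting a new laptop",
]

encoder = USEEncoder()
vectors = encoder.encode(titles)
faiss.normalize_L2(vectors)

index = faiss.IndexFlatIP(vectors.shape[1])  # flat inner-product (cosine) index
index.add(vectors)

query = encoder.encode(["wifi is not working"])
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)  # two closest matches

for score, i in zip(scores[0], ids[0]):
    print(score, titles[i])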

Note that we've used a simple faiss.IndexFlatIP here, which is suitable for small amounts of data and doesn't provide any optimizations. Searching through such an index is an O(N) operation, so for millions of vectors it can be very slow. For large arrays of data, FAISS allows you to perform clustering first and then scan only the cluster closest to the query for matches (a sketch follows below). This makes the search roughly K times faster (where K is the number of clusters), but you may also lose some precision. For more information, check out the FAISS wiki.
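For illustration, a clustered index built over the same vectors could look like this (the nlist and nprobe values are made up, and IndexIVFFlat needs a reasonably large training set to learn good centroids):

dim = vectors.shape[1]
nlist = 100                                    # number of clusters
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(vectors)                           # learn the cluster centroids
index.add(vectors)
index.nprobe = 4                               # how many nearby clusters to scan per query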

At OneBar, we wrote a little wrapper around FAISS that creates several indexes per client. One index contains titles of OneBar Problems, another holds Problem descriptions split into separate sentences, and so on. When we run a query, we try to find a match in any of them, but give a slight priority to the one with titles.

Persisting data

Sentence encoder queries are rather expensive: encoding 1,000 sentences takes about five minutes on a single CPU. If you're going to store a substantial number of documents and restart your index frequently, it doesn't make sense to run them all through the encoder every time. Luckily, FAISS comes with a built-in way of dumping an index to disk and re-loading it.
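A couple of calls are enough for that (the file path here is just an example):

import faiss

# Dump the index to disk...
faiss.write_index(index, "/data/use_titles.index")

# ...and load it back later without re-encoding all the documents
index = faiss.read_index("/data/use_titles.index")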

You can use these low-level operations to build proper procedures for dumping your search service's state to disk and loading it back from there.

gRPC

We want the semantic index to act as an independent service in our infrastructure. We’d also like it to handle more than one request at a time, but don’t really want to deal with concurrent code. A cheap way to achieve this is to put a gRPC API in front of the index. It provides a standard way for other parts of the infrastructure to interact with our service and comes with a handy multi-threaded server. Here’s an example of a minimalistic search server gRPC interface.
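Only the service name below is taken from our setup; the method and message names are assumptions made for the sake of the example:

syntax = "proto3";

package victor;

service Victor {
  // Add documents to the index
  rpc Index (IndexRequest) returns (IndexReply) {}
  // Find the closest matches for a query
  rpc Search (SearchRequest) returns (SearchReply) {}
}

message Document {
  string id = 1;
  string text = 2;
}

message IndexRequest {
  repeated Document documents = 1;
}

message IndexReply {
  int32 indexed = 1;
}

message SearchRequest {
  string query = 1;
  int32 top_k = 2;
}

message SearchReply {
  repeated Document documents = 1;
  repeated float scores = 2;
}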

Put it in the ./protos subfolder and compile it using the following command:

$ python -m grpc_tools.protoc -I./protos/ --python_out=./protos/ --grpc_python_out=./protos/ ./protos/victor.proto

You can then inherit from victor_pb2_grpc.VictorServicer and create your own search service by implementing all of the API methods. Depending on your specific task, you can create one or more FAISS indexes to store different parts of the data you need to search across.
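A rough sketch of what the servicer could look like, reusing the encoder, FAISS index, and titles from the earlier snippets and the hypothetical victor.proto above (the Index method is omitted for brevity):

from concurrent import futures

import faiss
import grpc

import victor_pb2        # generated into ./protos; make sure it is on your PYTHONPATH
import victor_pb2_grpc


class SearchService(victor_pb2_grpc.VictorServicer):
    def __init__(self, encoder, index, documents):
        self.encoder = encoder        # the USEEncoder client
        self.index = index            # a FAISS index
        self.documents = documents    # texts, aligned with the FAISS ids

    def Search(self, request, context):
        query = self.encoder.encode([request.query])
        faiss.normalize_L2(query)
        scores, ids = self.index.search(query, request.top_k or 5)
        reply = victor_pb2.SearchReply()
        for score, i in zip(scores[0], ids[0]):
            reply.documents.add(id=str(i), text=self.documents[i])
            reply.scores.append(float(score))
        return reply


server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
victor_pb2_grpc.add_VictorServicer_to_server(SearchService(encoder, index, titles), server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()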

Note: Victor is the name of our own vector-space search service.

ElasticSearch

To save some space, I won't be talking about setting up ElasticSearch in this article. There's plenty of information on the web about this topic, so I'll just assume we have an ES server running somewhere. Each searchable document will then be indexed twice: once in ElasticSearch, and once more in our custom-built semantic index. ElasticSearch will still handle most of your "simple" search cases, but the semantic index will help resolve some of the trickiest ones.

Frontend and mixing the results

Finally, to complete our pipeline, we're going to create some kind of frontend service. This service will receive queries from clients (e.g., the web UI), run them through both the similarity index and ElasticSearch, pick the best result, and return it to the client. One simple strategy for choosing the best result is to set a confidence threshold on the distance and bump all the "confident" semantic matches to the top of the ES results. There are more advanced ways to do the mixing, but I'm not going to include them here for the sake of saving space.
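A minimal sketch of such mixing (the hit format and the threshold value are assumptions; tune them on your own data):

def mix_results(es_hits, semantic_hits, threshold=0.4):
    # Each hit is assumed to be a dict like {"id": ..., "score": ...}
    confident = [h for h in semantic_hits if h["score"] >= threshold]
    seen = {h["id"] for h in confident}
    return confident + [h for h in es_hits if h["id"] not in seen]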

Examples

Teeth implants ~ dental; covered ~ plan

event ~ meetup; building ~ office

When ~ at what time; dinner ~ eat

Just another mind-blowing translation example :)

Finally, regular ElasticSearch results returned for a rare keyword

Final words

The examples you’ve just seen are quite impressive, but of course, this approach comes with many limitations.