
Introduction


Search engines seem almost incomplete without some sort of auto-complete. Here’s a quick guide to building your own auto-complete from scratch, using Python, Postgres and spaCy. Elasticsearch is a very good alternative to Postgres, but for simpler tasks that don’t involve millions and millions of rows, Postgres, despite lacking in-memory capabilities, is not too shabby. Here is an interesting article highlighting the differences between the two.

An overview of the process we will be following:

Process textual data to extract the relevant phrases that will pop up as search suggestions.

Store this data in a search-optimized materialized view.

Create relevant indexes.

Query this view and parse results for auto-completion.

Extracting Key-Phrases

spaCy provides an industrial-strength text-parsing framework. We will use it to extract noun chunks from our text. The code is as simple as this:

Extraction of noun chunks
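A minimal sketch of this step, assuming spaCy and its small English model `en_core_web_sm` are installed; the function name is illustrative:

```python
import spacy

# Assumes the model was fetched first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_phrases(text):
    """Return lowercased noun chunks found in `text`."""
    doc = nlp(text)
    return [chunk.text.lower().strip() for chunk in doc.noun_chunks]
```

Each `doc.noun_chunks` item is a spaCy span; lowercasing here keeps the phrases consistent with the lowercase constraint discussed for the table below.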

Additionally, you can use Python’s Counter to compute the frequencies of these phrases for ranking and other purposes. The following snippet also shows a Postgres query that inserts these phrases into the database and takes the necessary action on conflicts. In our case, we update the frequencies of existing phrases by adding those of the newly found ones.

Storing the noun chunks
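A sketch of the counting and upsert logic; the table and column names (prompts, ngram, frequency) follow the schema described later, and the sample phrases are made up:

```python
from collections import Counter

# A hypothetical batch of extracted phrases (in practice, the output of
# the noun-chunk extraction step).
phrases = [
    "chargeable interest",
    "the exemption",
    "chargeable interest",
    "relevant charges",
]

# Frequencies, used later for ranking prompts.
counts = Counter(phrases)

# Upsert each phrase; on conflict, add the incoming frequency to the
# stored one.
UPSERT_SQL = """
    insert into prompts (ngram, frequency)
    values (%s, %s)
    on conflict (ngram)
    do update set frequency = prompts.frequency + excluded.frequency
"""

def store_counts(conn, counts):
    """Write phrase frequencies through an open psycopg2 connection."""
    with conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, list(counts.items()))
    conn.commit()
```

The `excluded` pseudo-table in the `on conflict` clause refers to the row that failed to insert, which is what lets us add the new frequency to the stored one.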

For transactional purposes, a table is better suited. We will pull a culled version of this table into a materialized view for analytical tasks, like searching.

The Materialized View

It is important to know the difference between the types of views Postgres has to offer, and we shall look at two: the (normal) view and the materialized view.

A materialized view stores the data it contains on disk, whereas a normal view runs its underlying query each time it is queried. This means materialized views have to be refreshed periodically. Refreshes can happen manually, through a mechanism like a trigger on the main table, or even via a listen/notify system. We will not get into that for now.

The main table, prompts, looks as follows:
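A sketch of the DDL, consistent with the constraints described below; the frequency column name is an assumption:

```sql
create table prompts (
    id        serial primary key,
    ngram     varchar(256) unique not null,
    frequency integer not null default 1
);
```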

You can add several conditions on the types of phrases that you want: n-gram size, presence of certain POS tags, lowercasing all phrases, excluding certain stop-words, and so on.

Here, id is the primary key of this table. There is a unique constraint on the ngram field, whose type is varchar(256). We don’t want key-phrases that are too long, and thus too specific.

We shall now create a materialized view of this table, and build an index on this view. Any changes to the underlying prompts table can be accounted for by simply refreshing this view.

Creation, Indexing and Refreshing of a Materialized View
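A sketch of these three steps; the view and index names are illustrative. Note that the expression in the index should use an explicit text-search configuration, because the one-argument form of to_tsvector is not immutable and cannot be indexed directly:

```sql
-- 1. Create the view from the transactional table
create materialized view mv_prompts as
    select ngram, frequency
    from prompts
    order by frequency desc;

-- 2. Index the tsvector expression the search query will use;
--    the explicit configuration keeps the expression immutable
create index idx_mv_prompts_tsv
    on mv_prompts
    using gin (to_tsvector('simple', ngram));

-- 3. Re-run whenever the prompts table has new data
refresh materialized view mv_prompts;
```

For the index to actually be used, the query’s to_tsvector call must match the indexed expression.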

For production-level applications, check out refreshing the view concurrently.
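A concurrent refresh keeps the view readable while it is being rebuilt; it requires a unique index on the view (index name illustrative):

```sql
-- A unique index on the view is a prerequisite
create unique index idx_mv_prompts_ngram on mv_prompts (ngram);

refresh materialized view concurrently mv_prompts;
```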

Fetching Prompts

There are several ways of searching the view, and we shall stick with the following:

select ngram
from <schema_name>.mv_prompts
where tsquery('charges' || ':*') @@ to_tsvector(ngram)
limit 10;

Another neat trick is the inclusion of || ‘:*’ when building the tsquery. This brings up search results for incomplete words as well, via prefix matching on the indexed lexemes (not regex).

Search Results

Here, ‘charges’ is an example search term. We have an index over the to_tsvector(ngram) expression that will optimize the search; for Postgres to use it, the indexed expression must match the one in the query.

Query execution time: 109 msec.

Note: Only 9 results were retrieved for this search term.

Prompts for search term ‘charg’

The adjacent image shows the top 10 results for the search term ‘charg’. This serves our purpose perfectly, since we also get prompts like ‘chargeable interest’ and ‘the chargeable accounting periods’, along with the ones we got upon searching ‘charges’.

Query execution time: 182 msec.

Dropping the wildcard search should bring down the execution time, but it will also limit the number of results. This is an application-specific trade-off.

Wrapping this up in an API


The back-end is now ready to serve. We now need recurring calls to an API endpoint that fetch search suggestions as the user types. A GET request from the front-end application should do the trick. The call can be triggered on every key-press, on hitting the space bar, after a typing delay past some threshold, or any combination of these.

Wrapping Up

A Python script with a Flask API handler and subsequent database querying would look something like this:
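A minimal sketch, assuming psycopg2 and Flask; the connection string, endpoint path, and query parameter name are placeholders, and single-word search terms are assumed:

```python
import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder connection string
conn = psycopg2.connect("dbname=search user=postgres")

@app.route("/autocomplete")
def autocomplete():
    term = request.args.get("q", "").strip()
    if not term:
        return jsonify([])
    with conn.cursor() as cur:
        # Same prefix query as before; appending ':*' enables
        # matching on incomplete words
        cur.execute(
            """
            select ngram from mv_prompts
            where to_tsquery(%s) @@ to_tsvector(ngram)
            limit 10
            """,
            (term + ":*",),
        )
        rows = cur.fetchall()
    return jsonify([row[0] for row in rows])

if __name__ == "__main__":
    app.run(port=5000)
```

Passing the term as a bound parameter (rather than string concatenation into the SQL) keeps the endpoint safe from injection.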


In Action

Straight outta iOS!

Further parts will cover including relevant verb phrases, parsing that preserves semantic similarity, and understanding why Medium won’t let writers custom-align their media.