Article based on my talk about Full-Text Search in Django with PostgreSQL.

I’ve given this talk in: PyCon IT 2017, EuroPython 2017, PGDay.IT 2017, PyRoma 2017

To show how I have used Django Full-Text Search and PostgreSQL in a real-world project.

To implement Full-Text Search using only Django and PostgreSQL, without resorting to external products.

These are the main topics of this article:

Full-Text Search derives from the need to do some searches in computer-stored documents. For example, to find all documents that contain specific words and their variations. If a document contains “house” or “houses” it will be the same for the search.

“… Full-Text Search* refers to techniques for searching a single computer-stored document or a collection in a Full-Text Database …” — Wikipedia

* FTS = Full-Text Search

FTS Features¶

This is a list of some features that we can find in a quite advanced Full-Text Search, to be used in a real-world website.

Stemming

Ranking

Stop-words removal

Multiple language support

Accent support

Indexing

Phrase search

Tested Products¶

Elastic and Solr are two software programs for the Full-Text Search that are popular today. There are others, but these are the only ones that I have used in my professional projects. They are Lucene based and written in Java.

Snap Market¶

Snap market was a startup where I worked which produced a mobile phone application for buying and selling used items with about half a million mobile users.

Management problems

In this project we used Elasticsearch which had already been set up on a FreeBSD system. We had some difficulty managing and synchronizing it.

Patching a Java plug-in

We had to apply some patches to a Java plugin that we used for the “decompound” of the words in German.

Java

@@ -52,7 +52,8 @@ public class DecompoundTokenFilter … { - posIncAtt.setPositionIncrement(0); + if (!subwordsonly) + posIncAtt.setPositionIncrement(0); return true; }

Apache Solr¶

GoalScout.com is website dedicated to showing sport videos uploaded by public users with about 30,000 videos.

Synchronization problems

The use of Solr for Full-Text Search in this project was a customer choice.

We have always had some problems synchronizing the data between PostgreSQL and Solr.

One way

After these problems we had to start doing all writes to PostgreSQL and all reads from Apache SOLR.

Existing Products¶

Full featured products

products Resources (documentations, articles, …)

The products that I have write about are full-featured and advanced, there are many online resources regarding it: documentations, articles and frequently answered questions.

Synchronization

Mandatory use of driver (haystack, bungiesearch, …)

I have found some problems in synchronization. I have always had to use a driver that is a bottleneck between the Django and the search engine. In some cases I have had to fix the code.

Ops Oriented¶

The focus is on system integrations. I am more a #dev than an #ops so I don’t like to be forced to integrate various systems. I prefer developing and solving problems by writing python code.

FTS in PostgreSQL¶

PostgreSQL has been supporting Full-Text Search since 2008. Internally it uses “tsvecor” and “tsquery” data type to process the data to search. It has some indexes that can be used to speed up the search: GIN and GiST. PostgreSQL added support for Phrase Search in 2016. Support for JSON[B] Full-Text Search was added in version 10.

FTS Support since version 8.3 (2008)

since (2008) TSVECTOR to represent text data

to represent TSQUERY to represent search predicates

to represent Special Indexes ( GIN , GIST )

( , ) Phrase Search since version 9.6 (2016)

since (2016) FTS for JSON [B] since version 10 (2017)

What are documents¶

The “document” is the general concept used in Full-Text Search and where the search is done. In a database a document can be a field on a table, the combination of many fields in a table or in different tables.

“… a document is the unit of searching in a Full-Text Search system; for example, a magazine article or email message …” — PostgreSQL documentation

Django Support¶

The module django.contrib.postgres contains the support to Full-Text Search in Django, since the version 1.10. BRIN and GIN indexes were added in the version 1.11. The GIN index is very useful to speed up Full-Text Search.

Module django.contrib.postgres

django.contrib.postgres FTS Support since version 1.10 (2016)

since version (2016) BRIN and GIN indexes since version 1.11 (2017)

Dev Oriented¶

The focus is on programming. The use of Postgres Full-Text Search in Django is more developer friendly.

Making queries¶

We can look at the functions of Full-Text Search in Django starting from the models present in the Django official documentation. We have a Blog and an Author class connected through an Entry class The Django ORM creates tables and constructs queries.

Python

from django.contrib.postgres.search import SearchVectorField from django.db import models class Blog ( models . Model ): name = models . CharField ( max_length = 100 ) tagline = models . TextField () lang = models . CharField ( max_length = 100 , default = 'english' ) def __str__ ( self ): return self . name class Author ( models . Model ): name = models . CharField ( max_length = 200 ) email = models . EmailField () def __str__ ( self ): return self . name class Entry ( models . Model ): blog = models . ForeignKey ( Blog , on_delete = models . CASCADE ) headline = models . CharField ( max_length = 255 ) body_text = models . TextField () pub_date = models . DateField ( auto_now_add = True ) mod_date = models . DateField ( auto_now = True ) authors = models . ManyToManyField ( Author ) n_comments = models . IntegerField ( default = 0 ) n_pingbacks = models . IntegerField ( default = 0 ) rating = models . IntegerField ( default = 5 ) search_vector = SearchVectorField ( null = True ) def __str__ ( self ): return self . headline

Standard queries¶

These are the basic searches that we can use on models in Django using “filter”.

Python

from blog.models import Author , Blog , Entry Author . objects . filter ( name__contains = 'Terry' ) . values_list ( 'name' , flat = True )

SQL

SELECT "blog_author" . "name" FROM "blog_author" WHERE "blog_author" . "name" :: text LIKE '%Terry%'

['Terry Gilliam', 'Terry Jones']

We can use “case-insensitive containment” in order to get more results.

Python

Author . objects . filter ( name__icontains = 'ERRY' ) . values_list ( 'name' , flat = True )

SQL

SELECT "blog_author" . "name" FROM "blog_author" WHERE UPPER ( "blog_author" . "name" :: text ) LIKE UPPER ( '%ERRY%' )

['Terry Gilliam', 'Terry Jones', 'Jerry Lewis']

Unaccented query¶

By activating the unaccent PostgreSQL module, we can use the “unaccent” extension.

Python

from django.contrib.postgres.operations import UnaccentExtension class Migration ( migrations . Migration ): ... operations = [ UnaccentExtension (), ... ]

SQL

CREATE EXTENSION unaccent ;

We can search without worrying about accented characters, useful in different languages.

Python

Author . objects . filter ( name__unaccent = 'Helene Joy' ) . values_list ( 'name' , flat = True )

SQL

SELECT "blog_author" . "name" FROM "blog_author" WHERE UNACCENT ( "blog_author" . "name" ) = ( UNACCENT ( 'Helene Joy' ))

['Hélène Joy']

Warning

Queries using this filter will generally perform full table scans, which can be slow on large tables.

Trigram similarity¶

By activating the trigram PosgreSQL module, we can use the “trigram” extension.

Python

from django.contrib.postgres.operations import TrigramExtension class Migration ( migrations . Migration ): ... operations = [ TrigramExtension (), ... ]

SQL

CREATE EXTENSION pg_trgm ;

A “trigram” is a group of three consecutive characters taken from a string. We can evaluate the similarity of two strings by the number of “trigrams” they share.

Python

Author . objects . filter ( name__trigram_similar = 'helena' ) . values_list ( 'name' , flat = True )

SQL

SELECT "blog_author" . "name" FROM "blog_author" WHERE "blog_author" . "name" % 'helena'

['Helen Mirren', 'Helena Bonham Carter']

Search lookup¶

This is the base search lookup of Django and with this we can execute a real Full-Text Search call on a field.

Python

Entry . objects . filter ( body_text__search = 'Cheese' ) . values_list ( 'headline' , flat = True )

SQL

SELECT "blog_entry" . "headline" FROM "blog_entry" WHERE to_tsvector ( COALESCE ( "blog_entry" . "body_text" , '' )) @@ ( plainto_tsquery ( 'Cheese' )) = true

['Cheese on Toast recipes', 'Pizza Recipes']

To execute a more complex query we have to use three new Django objects: SearchVector, SearchQuery, SearchRank.

We can use a “SearchVector” to build our document in more fields of the same object or connected objects too. Then we can filter on the document with a string.

Python

from django.contrib.postgres.search import SearchVector search_vector = SearchVector ( 'body_text' , 'blog__name' ) Entry . objects . annotate ( search = search_vector ) . filter ( search = 'Cheese' ) . values_list ( 'headline' , flat = True )

SQL

SELECT "blog_entry" . "headline" FROM "blog_entry" INNER JOIN "blog_blog" ON ( "blog_entry" . "blog_id" = "blog_blog" . "id" ) WHERE to_tsvector ( COALESCE ( "blog_entry" . "body_text" , '' ) || ' ' || COALESCE ( "blog_blog" . "name" , '' ) ) @@ ( plainto_tsquery ( 'Cheese' )) = true

['Cheese on Toast recipes', 'Pizza Recipes']

When we insert text into a Full-Text Search by using a “search query” we can apply “stemming” and “stop-word removal” even to the user texts. and to these we can apply basic logical operations.

Python

from django.contrib.postgres.search import SearchQuery search_query = ~ SearchQuery ( 'toast' ) search_vector = SearchVector ( 'body_text' ) Entry . objects . annotate ( search = search_vector ) . filter ( search = search_query ) . values_list ( 'headline' , flat = True )

SQL

SELECT "blog_entry" . "headline" FROM "blog_entry" WHERE to_tsvector ( COALESCE ( "blog_entry" . "body_text" , '' )) @@ ( !! ( plainto_tsquery ( 'toast' ))) = true

['Pizza Recipes', 'Pain perdu']

Python

search_query = SearchQuery ( 'cheese' ) | SearchQuery ( 'toast' ) search_vector = SearchVector ( 'body_text' ) Entry . objects . annotate ( search = search_vector ) . filter ( search = search_query ) . values_list ( 'headline' , flat = True )

SQL

SELECT "blog_entry" . "headline" FROM "blog_entry" WHERE to_tsvector ( COALESCE ( "blog_entry" . "body_text" , '' )) @@ (( plainto_tsquery ( 'cheese' ) || plainto_tsquery ( 'toast' ) )) = true

['Cheese on Toast recipes', 'Pizza Recipes']

Python

search_query = SearchQuery ( 'cheese' ) & SearchQuery ( 'toast' ) search_vector = SearchVector ( 'body_text' ) Entry . objects . annotate ( search = search_vector ) . filter ( search = search_query ) . values_list ( 'headline' , flat = True )

SQL

SELECT "blog_entry" . "headline" FROM "blog_entry" WHERE to_tsvector ( COALESCE ( "blog_entry" . "body_text" , '' )) @@ (( plainto_tsquery ( 'cheese' ) && plainto_tsquery ( 'toast' ) )) = true

['Cheese on Toast recipes']

We can use the PostgreSQL “rank” to calculate the score of a document in relation to a searched text, and we can use it to filter and sort it.

Python

from django.contrib.postgres.search import SearchRank search_vector = SearchVector ( 'body_text' ) search_query = SearchQuery ( 'cheese' ) search_rank = SearchRank ( search_vector , search_query ) Entry . objects . annotate ( rank = search_rank ) . order_by ( '-rank' ) . values_list ( 'headline' , 'rank' ))

SQL

SELECT "blog_entry" . "headline" , ts_rank ( to_tsvector ( COALESCE ( "blog_entry" . "body_text" , '' )), plainto_tsquery ( 'cheese' ) ) AS "rank" FROM "blog_entry" ORDER BY "rank" DESC

[ ('Cheese on Toast recipes', 0.0889769), ('Pizza Recipes', 0.0607927), ('Pain perdu', 0.0) ]

Search configuration¶

We can setup the “search vector” or “search query” to execute “stemming” or “stop words removal” for a specific language.

Python

language = 'french' search_vector = SearchVector ( 'body_text' , config = language ) search_query = SearchQuery ( 'œuf' , config = language ) Entry . objects . annotate ( search = search_vector ) . filter ( search = search_query ) . values_list ( 'headline' , flat = True )

SQL

SELECT "blog_entry" . "headline" FROM "blog_entry" WHERE to_tsvector ( 'french' :: regconfig , COALESCE ( "blog_entry" . "body_text" , '' )) @@ ( plainto_tsquery ( 'french' :: regconfig , 'œuf' ) ) = true

['Pain perdu']

We can get the language from a class field.

Python

from django.db.models import F language = F ( 'blog__lang' ) search_vector = SearchVector ( 'body_text' , config = language ) search_query = SearchQuery ( 'œuf' , config = language ) Entry . objects . annotate ( search = search_vector ) . filter ( search = search_query ) . values_list ( 'headline' , flat = True )

SQL

SELECT "blog_entry" . "headline" FROM "blog_entry" INNER JOIN "blog_blog" ON ( "blog_entry" . "blog_id" = "blog_blog" . "id" ) WHERE to_tsvector ( "blog_blog" . "lang" :: regconfig , COALESCE ( "blog_entry" . "body_text" , '' ) ) @@ ( plainto_tsquery ( "blog_blog" . "lang" :: regconfig , 'œuf' ) ) = true

['Pain perdu']

Queries weighting¶

It’s possible to set up the search to give a different weight to various fields and then use these values in searches.

Python

search_vector = SearchVector ( 'body_text' , weight = 'A' ) + SearchVector ( 'headline' , weight = 'B' ) search_query = SearchQuery ( 'cheese' ) search_rank = SearchRank ( search_vector , search_query ) Entry . objects . annotate ( rank = search_rank ) . order_by ( '-rank' ) . values_list ( 'headline' , 'rank' )

SQL

SELECT "blog_entry" . "headline" , ts_rank (( setweight ( to_tsvector ( COALESCE ( "blog_entry" . "body_text" , '' )), 'A' ) || setweight ( to_tsvector ( COALESCE ( "blog_entry" . "headline" , '' )), 'B' ) ), plainto_tsquery ( 'cheese' )) AS "rank" FROM "blog_entry" ORDER BY "rank" DESC

[ ('Cheese on Toast recipes', 0.896524), ('Pizza Recipes', 0.607927), ('Pain perdu', 0.0) ]

If we want to speed up and simplify the query execution we can add a “search vector field” to the model and then execute searches on this specific field.

Python

Entry . objects . filter ( search_vector = 'cheese' ) . values_list ( 'headline' , flat = True )

SQL

SELECT "blog_entry" . "headline" FROM "blog_entry" WHERE "blog_entry" . "search_vector" @@ ( plainto_tsquery ( 'cheese' )) = true

['Cheese on Toast recipes', 'Pizza Recipes']

We have to update this field manually, for example executing a command periodically, using Django signals or with a PostgreSQL triggers.

Python

search_vector = SearchVector ( 'body_text' ) Entry . objects . update ( search_vector = search_vector )

SQL

UPDATE "blog_entry" SET "search_vector" = to_tsvector ( COALESCE ( "blog_entry" . "body_text" , '' ))

“… today’s shows in the Capital”

www.concertiaroma.com is a website used for searching for shows, festivals, bands and venues in the city of Rome and has been online since 2014.

The numbers of the project:

about 1,000 venues

about 15,000 bands

more than 16,000 shows

about 200 festivals

about 30,000 user/month

The old version of the website was developed some years ago with Django 1.7 and it runs on Python 2.7. The data was managed by PostgreSQL version 9.1 and the search was performed by using the SQL LIKE syntax.

Python 2.7

Django 1.7

PostgreSQL 9.1

SQL LIKE

The new version, recently released, was developed with Django 1.11 and it runs on Python 3.6. The data is managed by PostgreSQL 9.6 and the search uses its Full-Text Search engine.

Python 3.6

Django 1.11

PostgreSQL 9.6

PG FTS

Band models¶

We can look at the functions of Full-Text Search in www.concertiaroma.com starting from the same models present in the project. We have a Genre class connected to a Band class.

Python

from django.db import models from .managers import BandManager class Genre ( models . Model ): name = models . CharField ( max_length = 255 ) class Band ( models . Model ): nickname = models . CharField ( max_length = 255 ) description = models . TextField () genres = models . ManyToManyField ( Genre ) objects = BandManager ()

Band Manager¶

This is an example of “Manager” for the Band class which defines a search method that contains all the Full-Text Search logic.

Python

from django.contrib.postgres.aggregates import StringAgg from django.contrib.postgres.search import ( SearchQuery , SearchRank , SearchVector , TrigramSimilarity , ) from django.db import models search_vectors = ( SearchVector ( 'nickname' , weight = 'A' , config = 'english' ) + SearchVector ( StringAgg ( 'genres__name' , delimiter = ' ' ), weight = 'B' , config = 'english' ) + SearchVector ( 'description' , weight = 'D' , config = 'english' ) ) class BandManager ( models . Manager ): def search ( self , text ): search_query = SearchQuery ( text , config = 'english' ) search_rank = SearchRank ( search_vectors , search_query ) trigram_similarity = TrigramSimilarity ( 'nickname' , text ) return self . get_queryset () . annotate ( search = search_vectors ) . filter ( search = search_query ) . annotate ( rank = search_rank + trigram_similarity ) . order_by ( '-rank' )

Band Test Setup¶

To better understand the mechanism, we can take into consideration an example of a simplified test.

In the test setup we defined the example data that we will use afterwards to test our search: two bands and two musical genres that we assigned to the two bands.

Python

from collections import OrderedDict from django.test import TestCase from .models import Band , Genre class BandTest ( TestCase ): def setUp ( self ): # Genres blues , _ = Genre . objects . get_or_create ( name = 'Blues' ) jazz , _ = Genre . objects . get_or_create ( name = 'Jazz' ) swing , _ = Genre . objects . get_or_create ( name = 'Swing' ) # Bands ella_fitzgerald , _ = Band . objects . get_or_create ( nickname = 'Ella Fitzgerald' , description = ( 'Ella Jane Fitzgerald (25 Apr 1917-15 Jun 1996)' ' was an American jazz singer often referred to' ' as the First Lady of Song, Queen of Jazz and ' 'Lady Ella. She was noted for her purity of ' 'tone, impeccable diction, phrasing and ' 'intonation, and a horn-like improvisational ' 'ability, particularly in her scat singing.' )) django_reinhardt , _ = Band . objects . get_or_create ( nickname = 'Django Reinhardt' , description = ( 'Jean Django Reinhardt (23 Jan 1910-16 May 1953)' ' was a Belgian-born, Romani French jazz ' 'guitarist and composer, regarded as one of the ' 'greatest musicians of the twentieth century. He' ' was the first jazz talent to emerge from ' 'Europe and remains the most significant.' )) louis_armstrong , _ = Band . objects . get_or_create ( nickname = 'Louis Armstrong' , description = ( 'Louis Armstrong (4 Aug 1901-6 Jul 1971), ' 'nicknamed Satchmo, Satch and Pops, was an ' 'American trumpeter, composer, singer and ' 'occasional actor who was one of the most ' 'influential figures in jazz. His career spanned' ' five decades, from the 1920s to the 1960s, ' 'and different eras in the history of jazz.' )) # Bands and Genres ella_fitzgerald . genres . add ( blues ) django_reinhardt . genres . add ( jazz ) louis_armstrong . genres . add ( blues , swing ) def test_band_search ( self ): # ...

Contents from “Wikipedia, The Free Encyclopedia”:

Band Test Method¶

In the search test on the bands we simply invoked the search method giving a search text and we got back the list of values for the fields ‘nickname’ and ‘rate’. ‘nickname’ is stored on the band table, while ‘rate’ is calculated by our search method at runtime.

Python

from collections import OrderedDict from django.test import TestCase from .models import Band , Genre class BandTest ( TestCase ): def setUp ( self ): # ... def test_band_search ( self ): band_queryset = Band . objects . search ( 'jazz' ) . values_list ( 'nickname' , 'rank' ) band_objects = list ( OrderedDict ( band_queryset ) . items ()) band_list = [ ( 'Django Reinhardt' , 0.265124 ), ( 'Ella Fitzgerald' , 0.0759909 ), ( 'Louis Armstrong' , 0.0759909 )] self . assertSequenceEqual ( band_objects , band_list )

In this example we compared our search results with a list of lists where we defined the pair composed of the band’s nickname and the numerical value that is the search rate, or in other words, the numerical value that defines the importance of that term.

We have seen a simplified use of the current features of Django and the PostgreSQL Full-Text Search.

Both of these software programs are getting better in these fields and these are some of the features that can be available in the new future.

Advanced Misspelling support

support Multiple language configuration

language configuration Search suggestions

SearchVectorField with triggers

with JSON / JSONB Full-Text Search

Full-Text Search RUM indexing

In conclusion, the following are the conditions for evaluating the implementation of a Full-Text Search with PostgreSQL in Django:

not having any extra dependencies

not doing very complex searches

managing all the components easily

all the components easily avoiding data synchronization between different systems

between different systems having PostgreSQL already in the stack

already in the stack operating in a Python-only environment

For all the support (www.20tab.com)

Marc Tamlyn¶

For django.contrib.postgres (github.com/mjtamlyn)

These are the resources that I used to prepare this article and to develop the search function I have showed you.

Thank you¶

CC BY - SA ¶

This article and related presentation is released with Creative Commons Attribution ShareAlike license.

creativecommons.org/licenses/by-sa

Source Code¶

I published the source code used in this article on GitHub.

github.com/pauloxnet/django_queries

You can download the presentation from my SpeakerDeck account.

speakerdeck.com/pauloxnet