Optimizing full-text search with Postgres materialized view in Rails

ruby rails postgresql

My recent side project is an aggregator for remote dev jobs https://remotestack.club. To keep things simple, I decided to use Postgres full-text search. It offers powerful search capabilities. More than enough for a side project and early startups.

I built the project with Ruby on Rails and I used pg_search gem to handle PostgreSQL’s full-text search. I wanted to offer a search across the job’s details, skills’ names, and the company’s name.

It is easy to search columns on associated models. Unfortunately, there is no simple solution to speed up those searches. This article shows how to optimize search with Postgres materialized view.

Quick intro to full-text search

Full-text search is a technique for searching natural-language documents that satisfy a query. In our case, a query is a text provided by a user.

A document is the unit of searching in a full text search system; for example, a magazine article or email message

Postgres text search intro

In PostgreSQL, a document usually is a text field or a combination of fields. Possibly stored across multiple tables. During the search, each document is converted into tsvector.

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word

Text search data types

Below we can see a tsvector in action.

postgres=# SELECT to_tsvector('Ruby on Rails, is a server-side web application framework'); to_tsvector ----------------------------------------------------------------------------------------- 'applic':10 'framework':11 'rail':3 'rubi':1 'server':7 'server-sid':6 'side':8 'web':9 (1 row)





Schema

class CreateJobPostsSkillsAndCompanies < ActiveRecord :: Migration def change create_table :companies do | t | t . string :name end create_table :job_posts do | t | t . string :title t . text :description t . belongs_to :company t . timestamps end create_table :skills do | t | t . string :name t . timestamps end create_table :job_post_skills do | t | t . belongs_to :skill t . belongs_to :job_post t . timestamps end end end





Seed data

Let’s create some seed data. 10_000 job posts should be enough.

https://github.com/caspg/optimizing-postgresql-full-text-search-rails/blob/master/db/seeds.rb





Full text with pg_search gem

Pg_search gem builds ActiveRecord named scopes that take advantage of PostgreSQL’s full-text search. We can configure a search scope using pg_search_scope . The first parameter is a scope that we will use for full-text search.

We want to search against columns in JobPost but also against columns on associated models, Skill and Company . pg_search supports searching through associations with :associated_against options.

class JobPost < ApplicationRecord has_many :job_post_skills , dependent: :destroy has_many :skills , through: :job_post_skills belongs_to :company include PgSearch :: Model pg_search_scope ( :search , against: [ :title , :description ], associated_against: { skills: :name , company: :name }, using: { tsearch: { dictionary: 'english' , }, }, ) end

After adding a couple of lines of code, we can already use a full-text search. As we can see below, performance is not that great.

> JobPost.search ( "ruby on rails" ) JobPost Load ( 1076.5ms ) SELECT "job_posts" . * FROM "job_posts" ...





Potential optimizations

We can use database indexes to speed up data retrieval. Postgres gives us two types of indexes for full-text searches.

GIN (Generalized Inverted Index)

(Generalized Inverted Index) GiST (Generalized Search Tree)

Acording to the documentation, GIN indexes are the preferred type.

CREATE INDEX name ON table USING GIN ( column ); -- or CREATE INDEX name ON table USING GIN ( to_tsvector ( 'english' , column ));

The column must be of tsvector type or must be converted to this type with to_tsvector function. We can populate the column of tsvector type using database triggers. With searches across associated tables, we have to do some extra work to build such indexes.

We could use database denormalization and triggers to ensure data integrity. This would give us up to date indexes but would introduce extra complexity and would slow down updates. Another solution is materialized view.





Quick intro to materialized views

A View is a virtual table created by a query based on one or more tables. It can be used for wrapping commonly used complex queries. Materialized Views are special kind of View that persist results in table-like form.

They give us faster access to data but increase database size and data are not always current. They are perfect in scenarios when data does not have to be always fresh or when we have more or less static data. For example, a job aggregator which imports new posts a couple of times per day. In this case, we can refresh data after each import.

We can generate fresh data with:

REFRESH MATERIALIZED VIEW my_materialized_view ;





Materialized views in Rails with Scenic gem

Scenic gem adds methods to create and manage database views (and materialized views) in Rails.

rails generate scenic:view job_post_search --materialized create db/views/job_post_searches_v01.sql create db/migrate/[TIMESTAMP]_update_job_post_searches_to_version_1.rb

job_post_searches_v01.sql defines a query we will use to build a materialized view. We have to build a view with two columns, job_post_id and tsv_document . tsv_document is a combination of associated fields in tsvector data type.

-- db/views/job_post_searches_v01.sql SELECT job_posts . id AS job_post_id , ( to_tsvector ( 'english' , coalesce ( job_posts . title , '' )) || to_tsvector ( 'english' , coalesce ( job_posts . description , '' )) || to_tsvector ( 'english' , coalesce ( companies . name , '' )) || to_tsvector ( 'english' , coalesce ( string_agg ( skills . name , ' ; ' ), '' )) ) AS tsv_document FROM job_posts JOIN companies ON companies . id = job_posts . company_id JOIN job_post_skills ON job_post_skills . job_post_id = job_posts . id JOIN skills ON skills . id = job_post_skills . skill_id GROUP BY job_posts . id , companies . id ;

The above query returns the following results:

job_post_id | tsv_document -------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ 1 | 'applic':17 'block':31 'dach':30 'develop':5 'framework':18 'licens':25 'mit':24 'parker':28 'rail':4,8,10 'rubi':2,6,21,32 'senior':1 'server':14 'server-sid':13 'side':15 'smith':27 'spacex':26 'von':29 'web':16 'written':19 ...more rows

Scenic adds create_view method. It creates a materialized view based on the corresponding SQL statement. Finally, we can also create a GIN index on tsv_document column.

class CreateJobPostSearches < ActiveRecord :: Migration def change create_view :job_post_searches , materialized: true # below line added by us add_index :job_post_searches , :tsv_document , using: :gin end end





Full-text search using materialized view

Thanks to ActiveRecord, a model can be backed by a view. We can define search scope on such model in the same way we did with JobPost model. This time, we want to search against tsvector type column, instead of using an expression (which is used by default). It won’t create tsvector during each search and will use a previously created index.

class JobPostSearch < ApplicationRecord self . primary_key = :job_post_id include PgSearch :: Model pg_search_scope ( :search , against: :tsv_document , using: { tsearch: { dictionary: 'english' , # specify tsvector column name tsvector_column: 'tsv_document' , }, }, ) end class JobPost < ApplicationRecord ... # omitted lines def self . faster_search ( query ) # protip: when using `select` instead of `pluck` we have one query less where ( id: JobPostSearch . search ( query ). select ( :job_post_id )) end end





Results

> JobPost.search ( "ruby on rails" ) JobPost Load ( 1076.5ms ) SELECT "job_posts" . * FROM "job_posts" ... > JobPost.faster_search ( "ruby on rails" ) JobPost Load ( 1.2ms ) SELECT "job_posts" . * FROM "job_posts" ...





Refreshing materialied view

There is one more thing that we will have to take care of. We will have to refresh the materialized view periodically. Scenic gives us a handy method to do that.

When the refresh is running in nonconcurrent mode, the view is locked for selects. We can avoid that with the concurrent mode. The concurrent mode requires at least PostgreSQL 9.4 and view to have at least one unique index that covers all rows.

class JobPostSearch < ApplicationRecord ... # omitted lines def self . refresh_materialized_view Scenic . database . refresh_materialized_view ( :job_post_searches , concurrently: true , ) end end

We can add an index to our view as to any other table.

class AddUniqueIndexToJobPostSerches < ActiveRecord :: Migration def change add_index :job_post_searches , :job_post_id , unique: true end end

Now we can run the below method when we want to generate fresh data.

> JobPostSearch.refresh_materialized_view ( 1411.2ms ) REFRESH MATERIALIZED VIEW CONCURRENTLY "job_post_searches" ;





You can reach my via email or discuss on Twitter.



