Introduction

I can’t remember how many times I’ve tried to search for something without remembering the exact search string that was needed, like, for example, a specific movie by Son Pen…Saun P…Sean Penne…(sigh)… Fortunately, web search engines do the dirty job for us, suggesting a number of options that look like our query text: it is called Approximate String Matching, but you may have heard it as Fuzzy String Searching.

Modern Databases support fuzzy searching. In this series of articles, we will dive into PostgreSQL’s fuzzy search functionalities, along with some general information about each technique.

Set up a test database

Before we start playing with search techniques, let’s create a simple database. As postgres execute

CREATE DATABASE fuzzy_test; GRANT ALL PRIVILEGES ON DATABASE fuzzy_test TO nikos;

(don’t forget to substitute “nikos” with your user name)

Now let’s create a table

CREATE TABLE people_quotes ( quote_id SERIAL PRIMARY KEY, first_name VARCHAR(15) NOT NULL, last_name VARCHAR(20) NOT NULL, catch_phrase TEXT )

and add some data in it

INSERT INTO people_quotes (first_name, last_name, catch_phrase) VALUES ('Martin', 'Golding', 'Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.'), ('Linus', 'Torvalds', 'Talk is cheap. Show me the code.'), ('Peter', 'Deutsch', 'To iterate is human, to recurse divine.'), ('Seymour', 'Cray', 'The trouble with programmers is that you can never tell what a programmer is doing until it’s too late.'), ('Alan', 'Kay', 'I invented the term `Object-Oriented`, and I can tell you I did not have C++ in mind.'), ('Larry', 'Wall', 'Most of you are familiar with the virtues of a programmer. There are three, of course: laziness, impatience, and hubris.'), ('Georges', 'Carrette', 'First learn computer science and all the theory. Next develop a programming style. Then forget all that and just hack.'), ('Donald', 'Knuth', 'People think that computer science is the art of geniuses but the actual reality is the opposite, just many people doing things that build on each other, like a wall of mini stones."'), ('Brian', 'Kernighan', 'Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.'), ('Bill', 'Gates', 'Measuring programming progress by lines of code is like measuring aircraft building progress by weight.'), ('Jim', 'McCarthy', 'You can’t have great software without a great team, and most software teams behave like dysfunctional families.'), ('Yogi', 'Berra', 'In theory, theory and practice are the same. In practice, they’re not.'), ('Dennis', 'Ritchie', 'C is quirky, flawed, and an enormous success.'), ('Thomas', 'Gale', 'Good design adds value faster than it adds cost.'), ('Edsger', 'Dijkstra', 'The use of COBOL cripples the mind; its teaching should therefore be regarded as a criminal offence.')

(taken from here)

LIKE, ILIKE, SIMILAR TO, and Regular Expressions

These four search methods are the simplest way to find what you’re looking for when you don’t know the exact query string. The “I” in ILIKE is for “case Insensitive”

Their syntax is similar and quite simple

column LIKE/ILIKE/SIMILAR TO pattern

where pattern could be a string/substring of the value we are looking for, together with special matching characters

underscore _ : any single character

percent sign % : any sequence of zero or more characters

For example,

SELECT first_name FROM people_quotes WHERE last_name LIKE 'Dijkstra'; first_name ------------ Edsger (1 row)

has the same results with the variations

LIKE 'Dij%' LIKE '_ij%' LIKE '%kstra' LIKE '%kst%' ILIKE 'dij%'

With SIMILAR TO you can add flexibility to your query by using more special characters in the pattern

vertical line | : alternation (either of two alternatives)

asterisk * : repetition of the previous item zero or more times

plus sign + : repetition of the previous item one or more times

question mark ? : repetition of the previous item zero or one time

curly brackets {m} : repetition of the previous item exactly m times {m,} : repetition of the previous item m or more times {m,n} : repetition of the previous item at least m and not more than n times

parentheses () : group items into a single logical item

brackets [] : specifies a character class, just as in POSIX regular expressions

For example, if instead of LIKE we use SIMILAR TO, the following cases will have the same result as before

SIMILAR TO 'Dijkstra' SIMILAR TO '(Dikj|Dijk)%'

Regular Expressions

I left regular expressions last, because they can cover a whole book and one can find innumerable sourced in the web. Here, instead, I’ll mention only the general syntax

tilde ~ : case sensitive regular expression matching

tilde and asterisk ~* : case insensitive regular expression matching

Putting an exclamation mark before these operators, will lead to their negative result.

Using ~ and ~* instead of LIKE and ILIKE, the following cases return the same result as before

~ 'Dij' ~ 'kst' ~* 'dij'

Inside the string that specifies the search pattern, you can use all the usual regular expressions (check the official docs for more details)

Levenshtein Distance

The Levenshtein Distance is the minimum number of single-character changes (addition, deletion, substitution) that can transform one word to another. For example, starting from the word “kitten” I can get the word “attempt” by performing 5 single-character changes:

Delete “k” Delete “n” Substitute “i” by “a” Add “m” at the end Add “t” at the end

or I can get the word “bitten” by making just one substitution. So, the Levenshtein Distances (LD) for these two cases are

LD(“kitten”, “attempt”) = 5

LD(“kitten”, “bitten”) = 1

(You can experiment with this online LD calculator)

Let’s see how to use this in PostgreSQL.

Before we start, we need to import the fuzzystrmatch module (which, apart from the Levenshtein distance function, contains other useful tools that will explain later). So as postgres user do

CREATE EXTENSION fuzzystrmatch;

First, let’s see how the function works, by trying the example we mentioned above

SELECT * FROM levensthein('kitten', 'attempt') levenshtein ------------- 5 (1 row) SELECT * FROM levensthein('kitten', 'bitten') levenshtein ------------- 1 (1 row)

Now, imagine I want to find the person who said something like “Talk is cheap, show me the code!”. To be more effective, I’ll use lower case for my search quote and for the `quote` column in the table and I’ll search for entries with

LD(actual_quote, search_quote) <= 5

SELECT first_name, last_name FROM people_quotes WHERE levenshtein(lower(quote), lower('Talk is cheap, show me the code!') first_name | last_name -----------+----------- Linus | Torvalds (1 row)

Notice that it found what we were looking for, because the Levenshtein distance between the stored quote (“Talk is cheap. Show me the code.”) and the search quote (“Talk is cheap, show me the code!”) is equal to 3. If I allow the LD limit to be greater than 5, the results may include entries that are not related to the quote (don’t forget that you can get any word/phrase by increasing the number of the one-character changes)

Right, what was this guy’s name… he was a professional baseball player and he said something that applies to programming as well… “Yogi Bear” or something… Let me check

SELECT first_name, last_name FROM people_quotes WHERE levenshtein(lower(format('%s %s', first_name, last_name)), lower('Yogi Bear')<=5; first_name | last_name | quote ------------+-----------+---------------------- Yogi | Berra | In theory, theory and | | practice are the same. | | In practice, they’re | | not. (1 row)

Ah, I almost had it…

What’s next

In the next article, we’ll get to know some more useful functions of the fuzzystrmatch module.