Introducing Vladimir Levenshtein

Vladimir Levenshtein was a Russian scientist who specialized in, among many thing, information theory. This is a form of distance metric called edit distance, which at a high level is about comparing the number of changes required to turn one string into another. The Levenshtein Distance was created in 1965 and is a measure of how many insertions, substitutions, and deletions there are from one string to another. So here’s a few examples of this in effect:

“Whisky” — “Whiskey”: Levenshtein Distance, 1. An ‘e’ is added.

“Drinjs” — “Drinks”: Levenshtein Distance: 1. The ‘k’ is substituted to a ‘j’.

“Chandler Bing” — “Miss Chanandler Bong”: Levenshtein Distance, 8. Add “Miss ” (5), add ‘an’ (2), substitute ‘i’ for ‘o’ (1).

Keep in mind that this is the base algorithm that gives equal weight to inserts, substitutions, and deletions. There are variants on this that give different weights to different operations, which you may want to do depending on your use case (for example, in OCR use cases you may want to give a lower weight to ‘l’/‘1’ and ‘O’/‘0’ substitutions). If you are curious on how the algorithm works, there is an excellent Medium post that walks you through the matrix operations to get the right distance. It’s definitely worth a read! When running two strings through the algorithm, the output of the calculation is a number which gives the net number of edits.

So how do we get this working? Another UDF! In the Soundex function we were able to string together a series of SQL statements, but now we’re looking for loops so we will have to use some JavaScript. If you’re not into reading the thorough explanation linked before, the folks at the University of Pittsburgh boiled it down to these steps:

1 — Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit.

If m = 0, return n and exit.

Construct a matrix containing 0..m rows and 0..n columns. 2 — Initialize the first row to 0..n.

Initialize the first column to 0..m. 3 — Examine each character of s (i from 1 to n). 4 — Examine each character of t (j from 1 to m). 5 — If s[i] equals t[j], the cost is 0. If s[i] doesn’t equal t[j], the cost is 1. 6 — Set cell d[i,j] of the matrix equal to the minimum of: a. The cell immediately above plus 1: d[i-1,j] + 1.

b. The cell immediately to the left plus 1: d[i,j-1] + 1.

c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost. 7 — After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

The previous link that explains how the algorithm works also has step by step matrix illustrations to show how it iterates through data, and there’s also code available online that has this algorithm implemented in JavaScript. As it’s distributed under the MIT OSI license, let’s put it into a BigQuery UDF!



dq.dq_fm_LevenshteinDistance(in_a string,

in_b string)

RETURNS INT64

LANGUAGE js AS

"""/*

* Data Quality Function - Fuzzy Matching

* dq_fm_LevenshteinDistance

* Based off of

* input: Two strings to compare the edit distance of.

* returns: Integer of the edit distance.

*/

var a = in_a.toLowerCase();

var b = in_b.toLowerCase();



if(a.length == 0) return b.length;

if(b.length == 0) return a.length;

var matrix = []; CREATE OR REPLACE FUNCTIONdq.dq_fm_LevenshteinDistance(in_a string,in_b string)RETURNS INT64LANGUAGE js AS"""/** Data Quality Function - Fuzzy Matching* dq_fm_LevenshteinDistance* Based off of https://gist.github.com/andrei-m/982927 * input: Two strings to compare the edit distance of.* returns: Integer of the edit distance.*/var a = in_a.toLowerCase();var b = in_b.toLowerCase();if(a.length == 0) return b.length;if(b.length == 0) return a.length;var matrix = []; // increment along the first column of each row

var i;

for(i = 0; i <= b.length; i++){

matrix[i] = [i];

} // increment each column in the first row

var j;

for(j = 0; j <= a.length; j++){

matrix[0][j] = j;

} // Fill in the rest of the matrix

for(i = 1; i <= b.length; i++){

for(j = 1; j <= a.length; j++){

if(b.charAt(i-1) == a.charAt(j-1)){

matrix[i][j] = matrix[i-1][j-1];

} else {

matrix[i][j] =

Math.min(matrix[i-1][j-1] + 1, // substitution

Math.min(matrix[i][j-1] + 1, // insertion

matrix[i-1][j] + 1)); // deletion

}

}

} return matrix[b.length][a.length];

"""

Let’s give this a whirl!

WITH

data AS (

SELECT

'Whiskey' a,

'whisky' b

UNION ALL

SELECT

'drinjs' a,

'Drinks' b

UNION ALL

SELECT

'Chandler Bing' a,

'Miss Chanandler Bong' b)

SELECT

a,

b,

`dq.dq_fm_LevenshteinDistance`(a, b) as ldistance

FROM

data

It looks like it works!

So let’s do a few things to see how these two work together.