Okay, I know in the last article I said we would use this one to start going into adding address elements and match groups (and I promise they’re still on their way), but I wanted to take a slight detour and add another set of functions before we get there.

Let’s revisit the Levenshtein Distance function. Earlier we used it to calculate the edit distance between two strings. It is designed to determine the number of changes that it takes for one string to become another, helping us figure out if two strings were actually supposed to be the same. We saw how we can use it with words, but what happens when we start to use these on increasingly complex strings?

Before we go into some examples, let’s first establish how to calculate the similarity ratio to more easily understand these comparisons. It’s a simple ratio of the common characters over the total characters.

(|a| + |b| - ldist(a, b)) / (|a| + |b|)

Let’s create a function to calculate this since we will be using this pretty frequently.

CREATE OR REPLACE FUNCTION
dq.dq_fm_ldist_ratio(a STRING, b STRING) AS (
/*
 * Data Quality Function - Fuzzy Matching
 * dq_fm_ldist_ratio
 * input: Two strings to compare.
 * returns: The Levenshtein similarity ratio.
*/
  (LENGTH(a) + LENGTH(b) - `dq.dq_fm_LevenshteinDistance`(a, b))
    / (LENGTH(a) + LENGTH(b))
);
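As a sanity check on the formula, here's a quick Python sketch of the same ratio. The `levenshtein` helper is an illustrative stand-in for the `dq.dq_fm_LevenshteinDistance` UDF, using a substitution weight of one (more on weighting below):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance; substitutions cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ldist_ratio(a, b):
    # (|a| + |b| - ldist(a, b)) / (|a| + |b|)
    total = len(a) + len(b)
    return (total - levenshtein(a, b)) / total

print(round(ldist_ratio('KITTEN', 'SITTING'), 2))  # 3 edits over 13 chars -> 0.77
```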

To see how these functions behave with increasingly complex strings, we’re going to need a dataset to test this out with. We are going to use a dataset of room descriptions from Expedia and Booking.com. We’ll load this into a new dataset (I’ll be using fuzzy_matching) and take a quick look.

So what are we supposed to call these, then?

These string pairs represent the same room, but you can see that they might be called different things. When we dig deeper through the data, we see some more complex combinations.

a != b

So let’s try using the tools that we have and apply them to this dataset to see how they fare (spoiler alert: it won’t be accurate).

Well, it figured one of them out at least.

Now before we go any further, it’s important to note that the Levenshtein Distance can be implemented with two different weightings. Some implementations treat a substitution as a single operation, weighted equally with insertions and deletions. Other implementations give substitutions a weight of two (one deletion plus one insertion). When comparing implementations, it’s important to note how each is weighted, or you may end up with differences in calculations you can’t quite explain. In ours, we gave substitutions a weight of one.

Back to the results. As expected, the edit distance is very high since we are treating the entire string as a single token. Because of that, the match ratios are going to be quite poor. What we need to do is tokenize these strings, and perform smarter comparisons with them.

Enter FuzzyWuzzy. The string matching library, not the bear who had no hair.

FuzzyWuzzy is a Python library that takes strings, tokenizes them, and then matches them in different ways. It’s a very powerful package, but BigQuery can’t run Python. So let’s see if we can’t replicate some of this functionality in SQL!

Let’s start off with the token sort ratio. This is handy when you have string tokens that are fairly similar but out of order. What the token sort ratio does is break each string into individual tokens, sort them alphabetically, concatenate them back into a single string, and then calculate the ratio on that. To accomplish this, we are first going to create a couple of helper functions that split, clean, and then sort the tokens.

CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_clean_tokenize(a STRING) AS(
/*
 * (Helper) Data Quality Function
 * dq_hf_gh_clean_tokenize
 * This function removes all non-alphanumeric characters.
 * input: Uncleaned string
 * returns: Array of cleaned tokens.
*/
ARRAY(
  SELECT tokens
  FROM UNNEST(
    SPLIT(
      REGEXP_REPLACE(
        UPPER(TRIM(a)), '[^a-zA-Z0-9 ]+', ''
      ), ' '
    )
  ) tokens
)
);

Add another function to sort it.

CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_clean_sort_tokenize(a STRING) AS(
/*
 * (Helper) Data Quality Function
 * dq_hf_gh_clean_sort_tokenize
 * input: Uncleaned string
 * returns: Array of cleaned tokens, sorted alphabetically.
*/
ARRAY(
  SELECT x
  FROM UNNEST(`dq.dq_hf_gh_clean_tokenize`(a)) AS x
  ORDER BY x
)
);
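For intuition, these two helpers amount to something like this in Python (an illustrative sketch, not part of the BigQuery code):

```python
import re

def clean_tokenize(s):
    # Trim, uppercase, strip non-alphanumerics (keeping spaces), then split.
    return re.sub(r'[^a-zA-Z0-9 ]+', '', s.strip().upper()).split(' ')

def clean_sort_tokenize(s):
    return sorted(clean_tokenize(s))

print(clean_sort_tokenize('Deluxe Room, 1 Queen Bed (High Floor)'))
# ['1', 'BED', 'DELUXE', 'FLOOR', 'HIGH', 'QUEEN', 'ROOM']
```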

From here we can re-assemble the arrays, and do our comparison.

CREATE OR REPLACE FUNCTION
dq.dq_fm_ldist_token_sort_ratio(a STRING, b STRING) AS(
/*
 * Data Quality Function - Fuzzy Matching
 * dq_fm_ldist_token_sort_ratio
 * input: Two strings to compare.
 * returns: The Levenshtein similarity ratio with sorted tokens.
*/
`dq.dq_fm_ldist_ratio`(
  ARRAY_TO_STRING(`dq.dq_hf_gh_clean_sort_tokenize`(a), ''),
  ARRAY_TO_STRING(`dq.dq_hf_gh_clean_sort_tokenize`(b), ''))
);
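The whole token sort ratio can be sketched in Python as follows (self-contained, with simplified stand-ins for the UDFs above):

```python
import re

def levenshtein(a, b):
    # DP edit distance with substitution weight 1, as in our UDF.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def ldist_ratio(a, b):
    total = len(a) + len(b)
    return (total - levenshtein(a, b)) / total

def clean_sort_tokenize(s):
    return sorted(re.sub(r'[^a-zA-Z0-9 ]+', '', s.strip().upper()).split(' '))

def token_sort_ratio(a, b):
    # Sort the tokens, glue them back together, then take the plain ratio.
    return ldist_ratio(''.join(clean_sort_tokenize(a)),
                       ''.join(clean_sort_tokenize(b)))

print(token_sort_ratio('Queen Bed Deluxe Room', 'Deluxe Room, Queen Bed'))  # 1.0
```

Reordering the tokens no longer costs anything; only genuine character differences lower the score.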

With these two in place, let’s go ahead and add this to the dataset to see how it behaves. We are looking for entries with a high edit distance (indicating that there is a high number of raw replacements as everything is out of order) and a high sort ratio, which means the individual tokens are fairly similar.

We’re starting to get some better matches!

These results show that you can use this to match strings with out-of-order tokens, while allowing for some level of tolerance for character changes.

What happens when the strings have some commonality between them, but both the token length and the actual content of the rest of the strings widely vary? This is where token sets come in! The folks at SeatGeek, who developed and open sourced the FuzzyWuzzy Python package, have put up a really good description of it on their blog. I recommend reading their blog post to get a greater understanding of how this all works, but here is the gist of it.

1 — For strings a and b , tokenize both strings.

2 — Create three strings:

t0 = sorted intersection of tokens from a and b

t1 = t0 + sorted remainder of tokens from a

t2 = t0 + sorted remainder of tokens from b

3 — Return the highest ratio from ldist(t0, t1), ldist(t0, t2), and ldist(t1, t2)

Why does this work? As the folks at SeatGeek explain it, “the intuition here is that because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar.” Let’s try this with one pair of room descriptions, and see how this works.

a = "Deluxe Room, 1 Queen Bed (High Floor)"

b = "Deluxe Queen Room - High Floor With Free Wi-Fi"

Cleaning, sorting, and tokenizing them gives us the following:

token_a =

['1','BED','DELUXE','FLOOR','HIGH','QUEEN','ROOM']

token_b =

['DELUXE','FI','FLOOR','FREE','HIGH','QUEEN','ROOM','WI','WITH']

The tokens that intersect are ‘DELUXE’, ‘FLOOR’, ‘HIGH’, ‘QUEEN’, and ‘ROOM’. This gives us the following three strings and comparisons:

t0 = 'DELUXEFLOORHIGHQUEENROOM'

t1 = 'DELUXEFLOORHIGHQUEENROOM1BED'

t2 = 'DELUXEFLOORHIGHQUEENROOMFIFREEWIWITH'

`dq.dq_fm_ldist_ratio`(a, b) = 0.64

`dq.dq_fm_ldist_ratio`(t0, t1) = 0.92

`dq.dq_fm_ldist_ratio`(t0, t2) = 0.8

`dq.dq_fm_ldist_ratio`(t1, t2) = 0.83

This gives us a confidence score of 0.92, which is much higher than the original 0.64. This is possible because the intersection makes up the majority of one of the original strings, so naturally, that comparison is going to score high. If the intersection is small, but the remainder of the tokens are still similar, we can still get a high confidence score by comparing t1 and t2. We are covered from multiple angles!
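To convince ourselves of the arithmetic, here is a self-contained Python sketch of the token set procedure. On this pair it lands on the same winning ratio of 0.92 (tokenization details, like the hyphen in “Wi-Fi”, can shift the two lesser ratios slightly):

```python
import re

def levenshtein(a, b):
    # DP edit distance with substitution weight 1, as in our UDF.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def ldist_ratio(a, b):
    total = len(a) + len(b)
    return (total - levenshtein(a, b)) / total

def clean_sort_tokenize(s):
    return sorted(re.sub(r'[^a-zA-Z0-9 ]+', '', s.strip().upper()).split(' '))

def token_set_ratio(a, b):
    ta, tb = clean_sort_tokenize(a), clean_sort_tokenize(b)
    t0 = ''.join(sorted(set(ta) & set(tb)))       # sorted intersection
    t1 = t0 + ''.join(sorted(set(ta) - set(tb)))  # + sorted remainder of a
    t2 = t0 + ''.join(sorted(set(tb) - set(ta)))  # + sorted remainder of b
    return max(ldist_ratio(t0, t1), ldist_ratio(t0, t2), ldist_ratio(t1, t2))

a = 'Deluxe Room, 1 Queen Bed (High Floor)'
b = 'Deluxe Queen Room - High Floor With Free Wi-Fi'
print(round(token_set_ratio(a, b), 2))  # 0.92
```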

Let’s now figure out how to put this into BigQuery functions. We already have a function from earlier that cleans, tokenizes, and sorts the strings. We now need to construct something that finds the sorted intersection.

CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_find_array_intersection(a ARRAY<STRING>, b ARRAY<STRING>) AS(
/*
 * (Helper) Data Quality Function
 * dq_hf_gh_find_array_intersection
 * input: Two arrays to compare
 * returns: Array with the common elements
*/
ARRAY(
  SELECT * FROM UNNEST(a) AS a
  INTERSECT DISTINCT
  SELECT * FROM UNNEST(b) AS b
)
);

We’re also going to need something that calculates the difference between the two sets of tokens.

CREATE OR REPLACE FUNCTION
dq.dq_hf_gh_find_array_difference(a ARRAY<STRING>, b ARRAY<STRING>) AS(
/*
 * (Helper) Data Quality Function
 * dq_hf_gh_find_array_difference
 * input: Two arrays to compare
 * returns: Array with elements a - b.
*/
ARRAY(
  SELECT * FROM UNNEST(a) AS a
  EXCEPT DISTINCT
  SELECT * FROM UNNEST(b) AS b
)
);
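The same two set operations look like this in Python (a sketch for intuition; sorting is explicit here, whereas the arrays fed into the SQL versions arrive pre-sorted):

```python
def find_array_intersection(a, b):
    # Distinct elements present in both arrays, in sorted order.
    return sorted(set(a) & set(b))

def find_array_difference(a, b):
    # Distinct elements of a that are not in b, in sorted order.
    return sorted(set(a) - set(b))

print(find_array_intersection(['BED', 'DELUXE', 'ROOM'], ['DELUXE', 'QUEEN', 'ROOM']))  # ['DELUXE', 'ROOM']
print(find_array_difference(['BED', 'DELUXE', 'ROOM'], ['DELUXE', 'QUEEN', 'ROOM']))   # ['BED']
```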

And now we just need to create the strings and find the maximum value of the three comparisons.

CREATE OR REPLACE FUNCTION
dq.dq_fm_ldist_token_set_ratio(a STRING, b STRING) AS(
/*
 * Data Quality Function - Fuzzy Matching
 * dq_fm_ldist_token_set_ratio
 * input: Two strings to compare.
 * returns: The Levenshtein similarity of the maximum ratio
 * between the different token sets.
*/
ARRAY(
  SELECT MAX(x)
  FROM UNNEST([
    # First ratio is sorted intersection and combined A diff B
    `dq.dq_fm_ldist_ratio`(
      ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
        `dq.dq_hf_gh_clean_sort_tokenize`(a),
        `dq.dq_hf_gh_clean_sort_tokenize`(b)), ''),
      CONCAT(
        ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
          `dq.dq_hf_gh_clean_sort_tokenize`(a),
          `dq.dq_hf_gh_clean_sort_tokenize`(b)), ''),
        ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_difference`(
          `dq.dq_hf_gh_clean_sort_tokenize`(a),
          `dq.dq_hf_gh_clean_sort_tokenize`(b)), ''))),
    # Second ratio is sorted intersection and combined B diff A
    `dq.dq_fm_ldist_ratio`(
      ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
        `dq.dq_hf_gh_clean_sort_tokenize`(a),
        `dq.dq_hf_gh_clean_sort_tokenize`(b)), ''),
      CONCAT(
        ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
          `dq.dq_hf_gh_clean_sort_tokenize`(a),
          `dq.dq_hf_gh_clean_sort_tokenize`(b)), ''),
        ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_difference`(
          `dq.dq_hf_gh_clean_sort_tokenize`(b),
          `dq.dq_hf_gh_clean_sort_tokenize`(a)), ''))),
    # Third ratio is A diff B and B diff A
    `dq.dq_fm_ldist_ratio`(
      CONCAT(
        ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
          `dq.dq_hf_gh_clean_sort_tokenize`(a),
          `dq.dq_hf_gh_clean_sort_tokenize`(b)), ''),
        ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_difference`(
          `dq.dq_hf_gh_clean_sort_tokenize`(a),
          `dq.dq_hf_gh_clean_sort_tokenize`(b)), '')),
      CONCAT(
        ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_intersection`(
          `dq.dq_hf_gh_clean_sort_tokenize`(a),
          `dq.dq_hf_gh_clean_sort_tokenize`(b)), ''),
        ARRAY_TO_STRING(`dq.dq_hf_gh_find_array_difference`(
          `dq.dq_hf_gh_clean_sort_tokenize`(b),
          `dq.dq_hf_gh_clean_sort_tokenize`(a)), '')))
  ]) AS x
)[OFFSET(0)]
);