The Fuzzy Join

The fuzzy join is actually quite simple. We need to get to a point where we can perform a join operation, given a record and its associations.

To do this, we will construct a new ‘fuzzy key’ field that allows us to establish the relationships across datasets. There are plenty of Stack Overflow posts about how to perform fuzzy string comparisons, but I couldn’t find much that addressed the relationship problem head-on.

To help visualise this, let’s consider only two datasets, A and B. Each dataset is a dump of user data from its respective website, containing fields such as fullname, username, location, and other general profile attributes.

You can then construct a new field for each record called ‘fuzzyKey’, where the fuzzy key is a string made up of the values of each significant attribute of the dataset. For example, the string might take the form ‘username,location.city,location.country,attribute#1,attribute#2,attribute#3’.
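As a rough illustration, building the key might look something like the sketch below. The helper name build_fuzzy_key, the KEY_FIELDS list, and the sample records are all made up for this example, and the lowercasing and whitespace trimming are my own assumption about what a sensible normalisation would be, so adjust them to suit your data.

```python
def build_fuzzy_key(record, fields):
    """Join the chosen attributes of a record into one comma-separated string."""
    parts = []
    for field in fields:
        value = record
        # Walk dotted paths such as "location.city" through nested dicts.
        for part in field.split("."):
            value = value.get(part, "") if isinstance(value, dict) else ""
        # Assumed normalisation: missing values become "", text is trimmed and lowercased.
        parts.append("" if value is None else str(value).strip().lower())
    return ",".join(parts)


# Hypothetical field list and sample records, for illustration only.
KEY_FIELDS = ["username", "location.city", "location.country"]

dataset_a = [{"username": "JSmith", "location": {"city": "London", "country": "UK"}}]
dataset_b = [{"username": "jsmith ", "location": {"city": "london", "country": "UK"}}]

# Attach the new field to every record in both datasets.
for dataset in (dataset_a, dataset_b):
    for record in dataset:
        record["fuzzyKey"] = build_fuzzy_key(record, KEY_FIELDS)
```

Keeping the attributes in a fixed order matters here: the same position in the string should always hold the same attribute, otherwise keys from A and B cannot be compared meaningfully.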

We now have two datasets, each with records possessing the newly constructed ‘fuzzy key’ field.