“That’s Not How My Name Is Spelled”

Getting a name right is important when making decisions about people, and that holds true across every aspect of how a company engages with a person. Human resources, targeted marketing, and customer engagement are all use cases where getting the name right matters.

Just think back to the last time a barista wrote your name on a cup.

In the context of enterprise systems, the need to match similar names is even more important. When a single transaction happens in isolation, there might be a bit of tolerance for error: when it says “Bryan” instead of “Brian” on my cup, I can let it go (usually). But if I appear in a CRM system under multiple permutations of my name, it becomes a bigger issue. Bank acquisitions are a good example.
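To make “similar” concrete: name-matching approaches typically score how close two strings are, and the classic measure is Levenshtein edit distance, the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other. Here is a minimal illustrative sketch in Python (not from the original post, and not how any particular product implements it):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the previous prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if chars match)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("Bryan", "Brian"))  # -> 1
```

“Bryan” and “Brian” differ by a single substitution, which is exactly the kind of near-miss a matching strategy needs to catch.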

I was once a customer of a regional retail bank, where I held checking and savings accounts. That bank was bought by a larger regional bank, which in turn was bought by a major bank holding company that made other large acquisitions in parallel, at some of which I was also a customer. As a result, I received direct marketing for savings accounts, checking accounts, and credit cards I already held through another acquisition, often with variations and typos in my name. The lack of a unified identifier for a single entity is the problem that Master Data Management (MDM) sets out to solve, and it has been around for a while.

Vendors have been trying to build applications to solve this for quite some time; MDM and identity resolution systems are nothing new. One of the more common architectures leverages middleware to perform fuzzy matching and consolidation in bulk, then writes the consolidated data to a target repository. While the algorithms are time-tested, one of the big issues is always performance: you tend to be limited by the infrastructure your middleware application runs on.
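As a rough illustration of what such middleware does in bulk, here is a hedged Python sketch that clusters name variants by fuzzy similarity using the standard library’s `difflib.SequenceMatcher`. The greedy first-match clustering and the `0.85` threshold are assumptions chosen for the example, not any vendor’s actual algorithm:

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two names (case-insensitive)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def consolidate(names, threshold=0.85):
    """Greedily cluster name variants: each name joins the first
    existing cluster whose representative is similar enough,
    otherwise it starts a new cluster of its own."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

print(consolidate(["Brian Smith", "Bryan Smith", "Alice Jones"]))
# -> [['Brian Smith', 'Bryan Smith'], ['Alice Jones']]
```

A real system would then pick (or survive) a canonical record per cluster and write it to the target repository; the point here is only that the matching itself is straightforward to express, and the hard part at enterprise volume is making it fast.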

Having been in the enterprise information management space for a while, and having worked at vendors that sold data integration (DI), data quality (DQ), and MDM tools, I’ve always been interested in this topic.

I have also come to appreciate the ability to push transformations down into the database layer. With Google BigQuery being a fully managed, petabyte-scale data platform, the question hit me: can we perform these tasks in the database itself and leverage all the scale that BigQuery brings?

So, why not give it a shot.

What I don’t intend to do is architect an entire MDM system (for now, at least). As part of an overall DQ and/or MDM strategy, you’ll want a toolkit of cleansing and matching techniques to assemble a solution for your particular problem, and that toolkit is what I’m hoping to help build here. I’m always interested in hearing whether something works or not, and in any feedback on how to make it better. I also don’t know how many of these I’ll get to, as there are a ton of algorithms out there, each solving for different things. Life is full of little adventures; let’s see where it goes.

As a general rule, I’m going to try to stick as close as I can to BigQuery Standard SQL. I’ve always believed that sticking with native dialects and data structures is fastest, since it doesn’t require additional engines and lets BigQuery do its thing optimally; besides, the fewer languages one needs to know, the easier it is to adopt.

So let’s try one out.