(Photo: Foggy morning, Oslo fjord)

Stian Danenbarger has been telling me for a while about entity resolution (as he and many others call it), or identity resolution (as Wikipedia calls it). Basically, it's the process of working out which records/entities/objects actually represent the same real-world things by comparing their properties. Once Stian confirmed that Bayesian inferencing was a common method for this, I suddenly saw how you can actually do a poor man's version of this with just a little basic scripting.

Background

The context is that I'm extracting data from an ERP system, and I needed to see if there was significant overlap between the supplier table and the customer table. And, while I was at it, if there were duplicates within the tables. I started with in-table checking, since that was the easiest, and after two hours of Python programming I had something that spewed out reams and reams of duplicates in the data (and some false positives, admittedly). Since it was so easy to get useful results, I thought it was interesting enough to write about.

The basic idea is almost ridiculously simple: you pick a set of properties which provide clues for establishing identity. To compare two records you compare the properties pairwise and for each you conclude from the evidence in that property alone the probability that the two records represent the same real-world thing. Bayesian inference is then used to turn the set of probabilities from all the properties into a single probability for that pair of records. If the result is above a threshold you define, then you consider them duplicates.
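The combination step can be sketched in a few lines. This is the standard naive Bayes combination rule (my reconstruction; the post doesn't show the formula itself): multiply the per-property probabilities together as evidence for a match, multiply their complements as evidence against, and normalize.

```python
def combine(probs):
    """Combine per-property probabilities with naive Bayes: the product
    of the evidence for a match, normalized against the product of the
    evidence against it."""
    p_match = 1.0
    p_differ = 1.0
    for p in probs:
        p_match *= p
        p_differ *= 1.0 - p
    return p_match / (p_match + p_differ)

combine([0.9, 0.9])  # two agreeing properties reinforce each other: 0.988
combine([0.9, 0.5])  # 0.5 is neutral evidence and changes nothing: 0.9
```

Note how a 0.5 leaves the result untouched, which is exactly why it's used below for "we learned nothing from this property".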

The code

The suppliers table that I'm working with contains 41 columns, most of which are no use whatever for finding duplicates. For example, the time a record was last modified doesn't help, nor does the primary key. And so on. As far as I can see, only six columns are of any help here. So apart from some data-cleaning functions, once I've imported the framework, this is all the Python code I need to check for duplicates:

THRESHOLD = 0.85
file = "suppliers-merged.csv"
relevant = [Property("NAME", 0.5, 0.9, name_clean, name_compare),
            Property("ASSOCIATION_NO", 0.1, 0.88, orgnr_clean),
            Property("ZIP_CODE", 0.4, 0.6, basic_clean),
            Property("COUNTRY", 0.1, 0.51, basic_clean),
            Property("ADDRESS1", 0.49, 0.75, basic_clean),
            Property("ADDRESS2", 0.49, 0.75, basic_clean)]

self_compare(file, THRESHOLD, relevant)

So, we set the probability threshold, give the file name, and define how we want to treat our six identity-discriminating properties. Let's walk through them one by one.
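The framework itself isn't shown in the post, but a minimal version might look like the sketch below. Property, self_compare, and the combination rule are my reconstructions from the description; the CSV layout and the cleaning functions are assumptions.

```python
import csv

def combine(probs):
    """Naive Bayes combination of per-property probabilities."""
    p_match, p_differ = 1.0, 1.0
    for p in probs:
        p_match *= p
        p_differ *= 1.0 - p
    return p_match / (p_match + p_differ)

class Property:
    def __init__(self, name, if_differ, if_same, clean, compare=None):
        self.name = name            # column name in the CSV
        self.if_differ = if_differ  # probability of a match when values differ
        self.if_same = if_same      # probability of a match when values agree
        self.clean = clean          # normalization function
        self.compare = compare      # optional fuzzy comparator

    def probability(self, v1, v2):
        v1, v2 = self.clean(v1), self.clean(v2)
        if not v1 or not v2:
            return 0.5              # missing value: no evidence either way
        if self.compare:
            return self.compare(v1, v2, self.if_differ, self.if_same)
        return self.if_same if v1 == v2 else self.if_differ

def self_compare(filename, threshold, properties):
    """Compare every pair of records in the file, printing likely duplicates."""
    with open(filename, newline="") as f:
        records = list(csv.DictReader(f))
    for ix, r1 in enumerate(records):
        for r2 in records[ix + 1:]:
            prob = combine([p.probability(r1[p.name], r2[p.name])
                            for p in properties])
            if prob >= threshold:
                print(prob, r1, r2)
```

The two numbers per property are all the "model training" there is, which is what makes the approach so quick to set up.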

NAME is the name of the organization. Here we provide a function to normalize the names ( name_clean ), which strips out extra whitespace, changes "Ltd." into "Limited", lowercases the name, and so on. The first number says that if the names of two records are different, the probability that the organizations are the same is 0.5 (that is, 50/50, meaning we've learned nothing). The second says that if the names are the same the probability that they are the same is 90%. This is above our threshold of 0.85, so unless we later find evidence indicating that the organizations are different we will consider them duplicates.
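The actual name_clean isn't shown; a sketch along the lines described might be the following (the abbreviation table is illustrative, not the real one):

```python
import re

# A few illustrative abbreviation expansions; a real list would be longer.
EXPANSIONS = {"ltd.": "limited", "ltd": "limited"}

def name_clean(name):
    """Lowercase, collapse whitespace, and expand common abbreviations."""
    name = re.sub(r"\s+", " ", name.lower().strip())
    tokens = [EXPANSIONS.get(t, t) for t in name.split(" ")]
    return " ".join(tokens)

name_clean("  Acme   Ltd. ")   # -> "acme limited"
```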

ASSOCIATION_NO is the unique organization number assigned to the organization by the government. You'd think we could just do SQL directly on this column and go home, but it's not that simple. The numbers are formatted differently, not all suppliers have one, and in some cases different parts of a large organization (like the City of Oslo) share a single number. We provide a function to clean up the data, then specify our probabilities: if both numbers are provided and they are different, there's only a 10% chance the organizations are the same. However, if we get a match there's an 88% chance that they are the same.
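The cleanup here could be as simple as stripping everything that isn't a digit; that is my assumption about what orgnr_clean does, since the post doesn't show it:

```python
import re

def orgnr_clean(orgnr):
    """Keep only the digits, so differently formatted organization
    numbers compare equal."""
    return re.sub(r"\D", "", orgnr)

orgnr_clean("NO 987 654 321 MVA")   # -> "987654321"
```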

So here you see immediately a benefit of the approach: let's say two organizations have the same name, but different organization numbers. That gives us 0.9 and 0.1 probability, which combines to 0.5. Having weighed the evidence, Bayes steps in and saves us from making what's probably a mistake. However, if two organizations have the same name and no organization number, we get 0.9 and 0.5, which combines to 0.9.

Anyway, ZIP_CODE is the next one. The zip code provides only a minor hint: if the two suppliers have different codes the likelihood that they are the same is slightly lower (0.4), and if they have the same one it is slightly higher (0.6). Let's say two records have the same name, no organization number, and different zip codes. This gives us 0.9, 0.5, and 0.4, which combines to 0.857, just above the threshold. If we find more counter-indicating evidence this pair is going to drop below the threshold.
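You can check that arithmetic directly with the combination rule: multiply the evidence for, multiply the evidence against, and normalize.

```python
# Same name (0.9), no org number (0.5), different zip codes (0.4):
p_match = 0.9 * 0.5 * 0.4    # evidence for a match
p_differ = 0.1 * 0.5 * 0.6   # evidence against a match
print(p_match / (p_match + p_differ))   # -> 0.857...
```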

COUNTRY is much the same. If the organizations are in different countries chances are very slim that they're the same one (although it could be a typing error), whereas if they are in the same country that barely signifies.

ADDRESS1 and ADDRESS2 are much like the zip code, except differences here (often due simply to spelling or different abbreviations for "street" or "mailbox") don't signify much. However, if they are the same that's a fairly strong indicator.

(Photo: Nobel Peace Center, Oslo)

Example data

Let's look at how this works on real data. I'll pass over the cases where the organization number is the same, and look at ones where the other attributes do the job. The data have been slightly modified so that I'm not exposing customer data on the web.

Property    Record 1    Record 2     Probability
Name        acme inc    acme inc     0.9
Assoc. no   177477707                0.5
Zip code    9161        9161         0.6
Country     norway      norway       0.51
Address 1   mb 113      mailbox 113  0.49
Address 2                            0.5

This works out to a combined probability of 0.93, which is well above the threshold, as indeed it should be. You'll note that my cleaning of the address is not sophisticated enough: it doesn't catch that ADDRESS1 is in fact the same. This is easy to fix.
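Running the six probabilities from the table through the same combination rule reproduces the figure:

```python
probs = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
p_match, p_differ = 1.0, 1.0
for p in probs:
    p_match *= p
    p_differ *= 1.0 - p
print(round(p_match / (p_match + p_differ), 2))   # -> 0.93
```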

The beauty of this approach is that it works astoundingly well for the amount of effort involved. After just two hours I find huge numbers of duplicates with low false positive rates. The main downside is that the pairwise comparison is slow for a table of 17,000 rows, but even this can be solved. And perhaps the best part is that once the basic (trivial) framework is up, doing another table takes just a few minutes.
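One common way to avoid comparing all 17,000 × 17,000 pairs is blocking: only records that share some cheap key (say, the zip code, or the first few characters of the cleaned name) are compared at all. This is a sketch of the general technique, not what my script does:

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key):
    """Group records by a blocking key, and yield only within-group pairs
    for full comparison."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    for group in blocks.values():
        for r1, r2 in combinations(group, 2):
            yield r1, r2

records = [{"NAME": "acme", "ZIP_CODE": "9161"},
           {"NAME": "acme inc", "ZIP_CODE": "9161"},
           {"NAME": "other", "ZIP_CODE": "0150"}]
list(block_pairs(records, key=lambda r: r["ZIP_CODE"]))
# only the two 9161 records are paired
```

The trade-off is that true duplicates with different blocking keys are never compared, so the key has to be chosen with care.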

Note that I'm not necessarily recommending that you do this yourself. There are real tools out there, both open source and commercial, which I'm sure do a vastly better job than a Python script cooked up in two hours, but I thought this was useful for illustrating what the concept is, and how much you can get done with simple methods.

More advanced stuff

Of course, simply comparing the names as strings is not very sophisticated. What if one company is called "acme inc" and another simply "acme"? This is what the name_compare function briefly referenced in the example takes care of. Essentially, it breaks up the names into tokens, then compares the sets of tokens. The comparison is still fairly primitive, but quite effective, although it does produce some false positives. My next task is to figure out how to improve that.
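A token-based comparison of the kind described might interpolate between the "different" and "same" probabilities based on how large a share of the tokens overlap. This is my sketch; the real name_compare is not shown in the post:

```python
def name_compare(name1, name2, if_differ, if_same):
    """Compare names as token sets, interpolating between the 'different'
    and 'same' probabilities by the Jaccard overlap of the tokens."""
    tokens1, tokens2 = set(name1.split()), set(name2.split())
    overlap = len(tokens1 & tokens2) / len(tokens1 | tokens2)
    return if_differ + (if_same - if_differ) * overlap

name_compare("acme", "acme inc", 0.5, 0.9)   # -> 0.7: half the tokens match
```

You can see how this would also produce false positives: "acme transport" and "acme consulting" share half their tokens too.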

This, by the way, is one area where the ready-made products shine. They understand misspellings, different transcriptions of the same names, matching of complex strings, and so on. So once you need these things (or performance) you may be better off with a ready-made product.

Update: In the end I built a proper entity resolution (or deduplication) engine in Java, now released as open source.