I made an engine called Duke that can automatically match records to see if they represent the same thing. For more background, see a previous post about it. The biggest problem people seem to have with using it is coming up with a sensible configuration. I stumbled across a paper that described using so-called genetic programming to configure a record linkage engine, and decided to basically steal the idea.

Generalities

To use it you need to make a Duke configuration that sets up a data source, cleans the data, and maps it into Duke records. (If this is not clear, see an example.) This part is easy, so no problem there. You also need a data set, and a list of the correct linkages within that data set. The last bit is harder to come by, but it's necessary, since otherwise Duke cannot evaluate its own performance. Given all this, the genetic algorithm tries to come up with the best possible configuration for doing the linkage.

Sea haze, Molde, Norway

Essentially, this is an optimization problem, or as Michael Sperberg-McQueen once memorably described it: the hill-climbing problem. Imagine the set of all configurations as a plane (it has more dimensions, but ignore that for now), and the quality of each configuration as the height. You get something rather like a landscape, with hills (the tops are the best configurations), valleys, and so on. Now, our program is a bit like a blind ant trying to find the highest spot in the landscape.

It can only feel which way is up and which way is down, but has no idea what heights or depths might lie further away. So a naive algorithm will lead the ant to the top of the nearest hill and stop there, and we'll have no idea whether higher hills exist elsewhere. To get around that, genetic programming produces many ants, moves them around randomly, kills off the worst-positioned ones, and reproduces more of the best-positioned ones. The hope is that one ant will find the highest hill, even if random modifications may push it downhill a bit first.

How it works

Implementing it was actually dead simple (here is the code). First we generate 100 random configurations. These are totally random: some get rejected by Duke as nonsensical, others produce no matches at all, and so on. However, we ignore that and blithely run through all 100, computing precision and recall for each configuration, then combining the two into the F-number (the harmonic mean of precision and recall, also known as the F1 measure). The F-number is our measure of the quality of each configuration.
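The scoring itself is trivial once you have the match counts. A minimal sketch in Python (Duke computes precision and recall internally; the function name and the example counts here are just illustrative):

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (the F-number)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# A configuration that finds 80 of 100 true links, with 20 false matches:
print(f_measure(80, 20, 20))  # 0.8
```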

For our purposes, a configuration consists of a flat list of properties, like this:

The probability threshold above which two records are considered to match.

The comparator used for property #1. This gets set from a list of known comparators.

The probability that the records match if property #1 does not match (low probability). This is set to a number between 0.0 and 0.5.

The probability that the records match if property #1 does match (high probability). This is set to a number between 0.5 and 1.0.

The comparator for property #2. And so on.
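The flat list above is easy to generate at random. A sketch, assuming a hypothetical dictionary representation and a shortened comparator list (Duke's real set of comparators is longer, and its internal representation differs):

```python
import random

# Illustrative subset of Duke's comparators.
COMPARATORS = ["ExactComparator", "Levenshtein", "NumericComparator",
               "DiceCoefficientComparator", "JaroWinklerTokenized"]

def random_configuration(property_names):
    """Build the flat list described above: a global threshold, then a
    (comparator, low, high) triple for every property."""
    config = {"threshold": random.random()}
    for name in property_names:
        config[name] = {
            "comparator": random.choice(COMPARATORS),
            "low": random.uniform(0.0, 0.5),   # probability if values differ
            "high": random.uniform(0.5, 1.0),  # probability if values match
        }
    return config

# The initial population: 100 completely random configurations.
population = [random_configuration(["NAME", "ASSOCIATION_NO", "ZIP_CODE"])
              for _ in range(100)]
```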

After running through all 100 configurations we sort them by their scores. We throw away the 25 worst-scoring configurations and duplicate the 25 best-scoring ones. (That was the original approach; now I throw away 30, and make extra copies of the top three configurations.) These two steps crudely mirror natural selection in evolution: the fittest reproduce the most, and the least fit don't reproduce at all. Thus, the best configurations should come to predominate the population.
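The selection step can be sketched like this, following the original 25/25 scheme (a Python illustration, not Duke's actual code; `scored` is assumed to be a list of (F-number, configuration) pairs):

```python
def select(scored):
    """Drop the 25 worst-scoring configurations, duplicate the 25 best.
    The population size stays constant at 100."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    survivors = ranked[:-25]   # everything except the 25 worst
    best = ranked[:25]         # extra copies of the 25 best
    return [config for score, config in survivors + best]

# Dummy scored population: config i scores i/100.
scored = [(i / 100.0, {"id": i}) for i in range(100)]
next_gen = select(scored)
print(len(next_gen))  # 100
```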

After this, the next step is to randomly modify every configuration in the population. (There's no point keeping configurations we've already tried; re-evaluating them teaches us nothing.) There are two ways to modify a configuration: one is to randomly change 1-3 of its properties. The other is to "mate" it with another configuration in the population, by going through each property and randomly choosing the value from one parent or the other.
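Both kinds of modification are a few lines each. A hedged sketch, using the same hypothetical dictionary representation as before (for simplicity, mutation here replaces a whole property triple rather than a single field within it):

```python
import random

COMPARATORS = ["ExactComparator", "Levenshtein", "NumericComparator"]

def mutate(config):
    """Randomly change 1-3 entries of the flat property list."""
    new = {k: (dict(v) if isinstance(v, dict) else v)
           for k, v in config.items()}
    for name in random.sample(list(new), random.randint(1, 3)):
        if name == "threshold":
            new["threshold"] = random.random()
        else:
            new[name] = {"comparator": random.choice(COMPARATORS),
                         "low": random.uniform(0.0, 0.5),
                         "high": random.uniform(0.5, 1.0)}
    return new

def mate(config, other):
    """For each entry, take the value from either parent at random."""
    return {k: random.choice([config[k], other[k]]) for k in config}

parent = {"threshold": 0.4,
          "NAME": {"comparator": "Levenshtein", "low": 0.1, "high": 0.9},
          "ZIP_CODE": {"comparator": "ExactComparator", "low": 0.2, "high": 0.8}}
child = mutate(parent)
cross = mate(parent, child)
```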

Having produced 100 new configurations in this way, we repeat: evaluate all 100, ditch the worst 25, double the best 25, then modify. Then evaluate again, and so on. As the generations go by, the offspring of the best configurations proliferate (first 2, then 4, then 8, then 16, and so on), and their modifications move around in the search space. Some are broken and get thrown away, which is fine.
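Tying the steps together, the whole generational loop fits in a few lines. In this sketch `evaluate` is a dummy fitness function standing in for a full Duke run (it just rewards thresholds near 0.5 so the loop has something to climb), and mutation only nudges the threshold; everything else is illustrative:

```python
import random

def evaluate(config):
    # Stand-in for a full Duke run returning the F-number.
    return 1.0 - abs(config["threshold"] - 0.5)

def mutate(config):
    # Nudge the threshold a little; a real mutation would also change
    # comparators and probabilities.
    t = config["threshold"] + random.uniform(-0.1, 0.1)
    return {"threshold": min(1.0, max(0.0, t))}

population = [{"threshold": random.random()} for _ in range(100)]
for generation in range(100):
    ranked = sorted(population, key=evaluate, reverse=True)
    population = ranked[:-25] + ranked[:25]       # ditch worst 25, double best 25
    population = [mutate(c) for c in population]  # modify every configuration

best = max(population, key=evaluate)
```

Even with this toy fitness function the population converges on the optimum within a few generations, which is the whole point of the scheme.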

Lake, near Ebeltoft, Denmark

Trying it out

But does it actually work? I tried making a configuration for linking countries in Mondial and DBpedia. That turned out to be too easy. It came up with a configuration scoring 0.983, better than the configuration I'd made manually. The reason I say it was too easy is that it did this in the first generation, completely by random. So that doesn't validate the approach at all.

What's interesting is that it came up with a configuration that would never have occurred to me to try. The threshold is ridiculously low (0.21), capital and country names are compared exactly, and area is compared numerically (as it should be), but with strong probability (0.85). So maybe there is something to be learned from this, after all.

Next I tried it on the semantic dog food corpus. Unfortunately, something about these data causes some configurations to run very, very slowly. That's a bit of a problem when you have to go through 10,000 configurations. So although I started this a couple of weeks ago, the process is still running (it's been suspended a few times while I try these other data sets), and has only gotten to the 38th generation. It scored 0.54 at random, and has climbed to 0.82 by now. I've managed 0.88 manually, so that's pretty good, but we're not finished yet.

It may seem strange that this approach should be so slow, but consider that evaluating 100 configurations over 100 generations means running Duke 10,000 times. So if one run takes 8 seconds, the full genetic programming run will take more than 22 hours.

So I tried another approach. I took some real data from a customer: 50,000 records representing suppliers and customers in their ERP system. I made a subset with 3253 records picked at random, then created a manually vetted list of duplicates for the subset. This list I deliberately made difficult, by treating different departments of big organizations (separate parts of the City of Oslo administration, for example) as different, even though they might have the same organization ID.

Configuring this particularly tricky case by hand, I have difficulty getting above 0.65. The genetic programming scored 0.766 on the first try, after running all evening and all night. Let's look at how that high score came about. The best scores in the first (completely random) generation were 0.47, 0.43, and 0.3. In the second generation the top scorer (the 0.47 config) was modified in four different ways, which scored 0.136, 0.467, 0.002, and 0.49. As you can see, most modifications were harmful, but one managed to improve it a little.

In the second generation the top scores were 0.49, 0.467, 0.4, and 0.38, so obviously things improved quite a bit this time around, and the highest score was a modification of the best random configuration. Then, in the third generation, the four modifications of our 0.49 score 0.001, 0.49, 0.46, and 0.25. So again most modifications mess it up, but in at least one case we inflict no harm. The best scores this time around are 0.49, 0.46, 0.45, and 0.42. So once again there is improvement, and once again the descendants of that initial 0.47 dominate (top two places), but this time we don't hit a new best score.

In the fourth generation the same thing happens to the 0.49: one version scores 0.49, the other three score less well. But then, all of a sudden, number 20, derived from a configuration that scored 0.21, gets 0.525. It turns out to descend from a configuration that originally scored 0.43 (second best in the first generation), then scored around 0.22 a few times, and has now come out on top. This time around the best scores are 0.52, 0.49, 0.488, and 0.488. So again there is improvement.

In the fifth generation we hit a new top score: 0.568, derived from 0.479. This is a completely different lineage, deriving from a configuration that originally scored 0.0006 (almost surprising it survived at all). We hit 0.602 in the sixth generation, 0.702 in the seventh.

It carries on like this for a while, increasing the number of high-scoring configurations but not setting any records. It's not until the 14th generation that we again see a score above 0.7, and this time it's a record (0.716). This little improvement promptly gets destroyed, so in the 15th generation nobody goes above 0.7, though a number of configurations sit just below. The next generation we see one above 0.7, but no new records. Then we have three above 0.7, but still no new records. It carries on like this until we score 0.722 and 0.725 in the 19th generation, and then 0.745 in the 21st.

After that progress is slow. We hit 0.747 in the 40th generation, then 0.756 in the 42nd, 0.758 in the 52nd, 0.759 in the 54th, 0.761 in the 55th, and finally 0.765 in the 60th generation. This final winner started out at 0.007, and has gradually improved over the generations until it hit the jackpot. By the time we finish the 100th generation the top 7 scorers all hit the same score, but none of the variants we've tried do any better.

The winning configuration is also interesting:

[GeneticConfiguration 0.98 [Property NAME Levenshtein 0.17 0.95] [Property ASSOCIATION_NO ExactComparator 0.06 0.69] [Property ADDRESS1 NumericComparator 0.02 0.92] [Property ADDRESS2 PersonNameComparator 0.18 0.76] [Property ZIP_CODE DiceCoefficientComparator 0.47 0.79] [Property COUNTRY Levenshtein 0.12 0.64]]

The NAME and ASSOCIATION_NO properties make perfect sense, but note the surprisingly low high probability for association numbers: 0.69. This is because many of the customers have the same association number, but still need to be treated as different. Note also that if the association numbers are different, we treat the companies as different (0.06). The comparators used for ZIP_CODE and COUNTRY may seem strange, but they give the same results as exact comparison would, so this is just an artifact of the genetic programming.

Perhaps the most startling thing is the treatment of ADDRESS1: it uses the numeric comparator, effectively ignoring the property. It's as if it's trying to tell us that this property is actively misleading, which is interesting. I'm not sure why this happens. I am sure, however, that it would never have occurred to me to try it myself. If I turn off this property in my own manual configuration, it reduces my F-measure quite drastically, so I'm not sure what conclusions to draw.

To see if I could improve on this first result I ran the genetic programming again, this time with a population size of 200, and 200 generations. It took a long time, but in the 81st generation it scored 0.776. This configuration looks as follows:

[GeneticConfiguration 0.95 [Property NAME Levenshtein 0.42 0.96] [Property ASSOCIATION_NO DiceCoefficientComparator 0.0 0.67] [Property ADDRESS1 NumericComparator 0.1 0.61] [Property ADDRESS2 Levenshtein 0.03 0.8] [Property ZIP_CODE DiceCoefficientComparator 0.35 0.69] [Property COUNTRY_DB JaroWinklerTokenized 0.44 0.68]]

What's remarkable is how similar it is to the first one. The biggest difference is the handling of ADDRESS2, which uses a different comparator and has substantially different probabilities. Even the choice of NumericComparator for ADDRESS1 is preserved, as a further hint that perhaps this property is not useful.

One thing is how these configurations score on the training set; quite another is how they score on the rest of the data. I made a different subset of the data, of the same size, and ran the two genetic configurations on it. I also ran my best manual configurations on both data sets. The results are summarized in the table below.

Configuration   Training   Test
Genetic #1      0.766      0.881
Genetic #2      0.776      0.859
Manual #1       0.57       0.838
Manual #2       0.64       0.803

The result is pretty clear: the genetic configurations are by far the best. The computer can configure Duke better than I can. That's almost shocking, but there you are. I guess I need to turn the script into an official feature.