Super-Donor: detecting hidden matches in a public sperm donor registry

Nathan participated in the Insight Health Data Science program in the summer of 2016, and is now a data scientist at Wayfair. Nathan earned his Ph.D. in mathematics, focusing on dynamical systems and differential equations. During postdoctoral work at MIT and Brown he studied systems neuroscience, conducting both computational and experimental research. In this blog post, Nathan describes a tool he developed to help donor-conceived people find the connections they are looking for.

Even though conception with donor sperm is increasingly common, data about this population are remarkably scarce. We don’t even have an accurate count of how many children born in the US each year are conceived via donor sperm from sperm banks. In the absence of any centralized data, crowd-sourced repositories have grown up to fill the information needs of this population.

For my Insight project, I worked with data from the Donor Sibling Registry (the DSR), a web site where people who are donor conceived can locate other people conceived via the same donor. The site includes both egg and sperm donation, but for this project I considered only sperm donation, which covers the vast majority of listings on the site.

The DSR is extremely effective at what it does — in particular, people who know their own donor’s bank and ID number can easily locate other people conceived via sperm with the same donor bank and ID number. But about half of the sperm donors listed on the site have only one offspring listed. In each of these cases, a user highly motivated to find a match instead found no one.

Surveys indicate that 25% of sperm donors report donating to more than one bank. But each bank only keeps track of its own information, and the banks do not share information on their donors in any kind of centralized location. Donors like this could generate multiple listings (one for each bank), but there is no way for a user to know that these listings represent the same person. For my Insight project, I used machine learning tools to locate these “hidden matches” in the DSR — distinct listings that actually represent the same donor — and built a tool to help more donor-conceived people find the connections they are looking for.

Super-Donor Example

My final product was a search tool available at super-donor.com. Here you can look up donors by Sperm Bank and ID number. If we search for The Sperm Bank of California, donor number 8, the input screen looks like this:

Example input for search for matches at super-donor.com

and produces this output:

Example search output at super-donor.com

The search returns a table of characteristics for this donor, including weight, eye color, blood type, and the average birth year of listed offspring, as well as individual descriptive words. It also gives a list of other donors in the database, ranked by similarity, and predicts whether each is likely to be a matched listing. As I built this tool, I noticed a pattern: sometimes other listings high on the list would have the same ID number (8 in this example, 2nd on the list), even though I of course did not include the ID number itself as a feature. It turns out the Northern California Sperm Bank changed its name to The Sperm Bank of California, producing two distinct listings with the same number. The fact that my algorithm detects this is a nice indication that the search really is working. To build this tool, I used a combination of language processing and metric learning tools that I’ll now examine in more depth.

Data

I worked with information that is listed publicly (and anonymously) on the DSR website by people actively seeking contact with other members of their donor group. (Note that Super-Donor links distinct anonymous donor listings to each other, but does not in any way link a listing back to identifying information for an individual sperm donor. That would be a much harder and more ethically questionable problem than the one I set out to solve.)

I did not have access to the DSR “back end,” so I had to work with what I could get publicly online. Listings on the DSR look like this:

The field in the middle is the one I worked with the most: a semi-structured free-text description of each donor, including physical features like eye color and weight, as well as other information like college major or religion. I scraped and organized data for 42 banks (focusing on active banks with multiple listings), resulting in data for 3,321 donors and 12,415 donor offspring.

A peek into eye color distribution of sperm donors. The colored bars represent the distribution of eye color for donors with the largest group size. The gray bars show the distribution of eye color for all donors in this database.

Training Data

I needed a training set of known pairs in order to develop a detection algorithm. I reached out to the DSR, and they provided me with about 160 Donor IDs linked to their known matches. After removing listings where the matching ID was unlisted or had no text description in the DSR, I had a training set of 58 distinct Donor IDs covering 27 donors.

Handling the Text Data

I set out to do this project in part because I wanted to gain experience handling messy real-world text in a meaningful way. There were multiple text entries per donor (each recipient family could write their own description), and the descriptions varied a lot in length and detail, and sometimes even on the facts. Two listings purporting to describe the exact same donor would often conflict on details you would think should be fixed, like birth year or blood type. I realized as I started exploring the data that any method I used would need to be robust to noise. I also needed a way to remove duplicate information.

I noticed that the description fields used ‘. ’ as a separator, so I started by splitting the text on that separator and building a list of these smaller strings from all descriptions for a single donor listing. Then I removed duplicate entries from this list. After this cleanup, the text description for one of these “known pairs” looked like this:
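A minimal sketch of this split-and-deduplicate step. The function name and sample strings here are my own, not from the DSR:

```python
def clean_description(descriptions):
    """Split free-text descriptions on '. ' and drop duplicate fragments.

    `descriptions` is a list of free-text strings, one per recipient
    family, all describing the same donor listing.
    """
    fragments = []
    seen = set()
    for text in descriptions:
        for fragment in text.split(". "):
            fragment = fragment.strip().rstrip(".")
            if fragment and fragment.lower() not in seen:
                seen.add(fragment.lower())
                fragments.append(fragment)
    return fragments

# Two families describing the same donor, with overlapping details;
# the repeated fragments survive only once.
frags = clean_description([
    "Donor has blue eyes. He weighs 180 lbs. Very athletic.",
    "He weighs 180 lbs. Studied economics. Donor has blue eyes.",
])
```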

In each of these substrings, I searched for information about “obvious” physical features, like weight (highlighted in pink), eye color (highlighted in blue), and blood type, and organized them into a dataframe with one row for each donor.
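This scan can be sketched with a few regular expressions. The patterns and word lists below are illustrative guesses at the kind of rules involved, not the ones used in the project:

```python
import re

# Illustrative vocabulary for the eye-color search.
EYE_COLORS = {"blue", "brown", "green", "hazel", "gray", "grey"}

def extract_features(fragments):
    """Scan each description fragment for weight, eye color, and blood type."""
    features = {"weight_lbs": [], "eye_color": set(), "blood_type": set()}
    for frag in fragments:
        frag_lower = frag.lower()
        # Weight: a 2-3 digit number followed by a pounds unit.
        m = re.search(r"(\d{2,3})\s*(?:lbs?|pounds)", frag_lower)
        if m:
            features["weight_lbs"].append(int(m.group(1)))
        # Eye color: a color word in a fragment that mentions eyes.
        for color in EYE_COLORS:
            if "eye" in frag_lower and re.search(rf"\b{color}\b", frag_lower):
                features["eye_color"].add(color)
        # Blood type: A/B/AB/O followed by + or -.
        m = re.search(r"\b(?:blood type\s*)?(a|b|ab|o)[+-]", frag_lower)
        if m:
            features["blood_type"].add(m.group(1).upper())
    return features

feats = extract_features(["Donor has blue eyes", "He weighs 180 lbs", "Blood type O+"])
```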

Because multiple listings often contained conflicting information (e.g. two different blood types), I took the approach of allowing multiple values for categorical features; for example, a donor could have both blue and brown eyes. For continuous features, like weight and birth year of offspring, I took the average to arrive at a single number for that feature.
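The combining rule — keep every categorical value, average the continuous ones — can be sketched as follows (function name and feature keys are my own):

```python
def combine_features(features):
    """Collapse per-fragment values into one row per donor listing.

    Categorical features keep every observed value (a donor can be
    listed with both blue and brown eyes); continuous features are
    averaged into a single number.
    """
    row = {}
    row["eye_color"] = sorted(features["eye_color"])    # multi-valued
    row["blood_type"] = sorted(features["blood_type"])  # multi-valued
    weights = features["weight_lbs"]
    row["weight_lbs"] = sum(weights) / len(weights) if weights else None
    return row

# Two conflicting weight reports average out; both eye colors survive.
row = combine_features(
    {"eye_color": {"brown", "blue"}, "blood_type": {"O"}, "weight_lbs": [170, 180]}
)
```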

Language Processing

These fixed features were not going to be enough to locate truly matched donors. The free-text descriptions also contained highly informative words. To incorporate this information, I took a “Bag-of-Words” approach. After removing stop words and stemming (using the NLTK package), I made a vector that included 152 nouns and adjectives that appeared more than once in the training set descriptions. I chose not to use an inverse document frequency weighting like TF-IDF, since some words that appeared often were actually quite informative (“curly,” for example), and not necessarily less informative than rarer words like “economics.”
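The resulting word indicators can be sketched as below. The actual pipeline used NLTK’s stop-word list and stemmer; here a tiny hand-rolled stop list stands in so the idea is self-contained:

```python
# Stand-in stop list (the project used NLTK's stop words plus stemming).
STOP_WORDS = {"a", "an", "the", "is", "he", "has", "and", "very"}

def bag_of_words_vector(fragments, vocabulary):
    """Return a 0/1 indicator vector over a fixed vocabulary."""
    tokens = set()
    for frag in fragments:
        for word in frag.lower().split():
            word = word.strip(".,")
            if word not in STOP_WORDS:
                tokens.add(word)
    return [1 if word in tokens else 0 for word in vocabulary]

# A 4-word toy vocabulary instead of the project's 152 words.
vocab = ["curly", "athletic", "economics", "music"]
vec = bag_of_words_vector(
    ["He is very athletic", "Studied economics", "Curly hair"], vocab
)
```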

I combined this vector of word indicators for each donor with the physical features into a pandas dataframe containing one row for each distinct donor ID. My task was now to figure out whether I could find rows in this dataframe that were so similar they were likely to represent the same person.

One way to guess whether two listings are similar is to take the Euclidean distance between them. The three-dimensional version of Euclidean distance simply measures the physical distance between two points. In this case, every feature is weighted equally. But some of the features I had for these donors were going to be more informative than others about whether a pair of listings was a likely match. If I wanted to do better than a Euclidean distance, I needed a method that could somehow learn which features were most important and which ones weren’t.
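To make the equal-weighting point concrete, here is Euclidean distance between two illustrative feature vectors (the feature layout is invented for the example):

```python
import numpy as np

# Two donor feature vectors (illustrative): [weight/100, mean birth-year
# offset, "curly" indicator, "economics" indicator].
a = np.array([1.80, 0.5, 1.0, 0.0])
b = np.array([1.75, 0.3, 1.0, 1.0])

# Plain Euclidean distance: every feature counts equally, so a noisy
# weight difference and an informative word indicator get equal say.
dist = np.linalg.norm(a - b)
```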

Large Margin Nearest Neighbor Metric Learning (LMNN)

The Large Margin Nearest Neighbor (LMNN) metric learning approach uses the training data of known pairs to learn a linear transform on our high-dimensional feature space. This method uses an error function that increases both when a non-matched listing is close by and when a matched pair is far away; the algorithm then searches for a minimum of this error function. I relied on the excellent metric-learn Python package for the implementation. Here is a normalized histogram of the resulting distances using leave-one-out cross-validation on the training set:

Distances between donors in the training set produced using LMNN; pairs are plotted in purple and non-pairs in green. Note that matches are closer overall than non-matches.
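The core idea — distance under a learned linear transform — can be sketched in a few lines. In the project the transform comes from fitting metric-learn’s LMNN on the known pairs; here it is a hand-picked diagonal matrix chosen purely to show how the transform reweights features:

```python
import numpy as np

def learned_distance(x, y, L):
    """Distance under a learned linear transform L: ||L @ (x - y)||."""
    return np.linalg.norm(L @ (x - y))

# Suppose the learned transform decided the last feature (an informative
# word indicator) matters three times more, and the first (noisy weight)
# far less. These weights are invented for illustration.
L = np.diag([0.2, 1.0, 1.0, 3.0])
x = np.array([1.80, 0.5, 1.0, 0.0])
y = np.array([1.75, 0.3, 1.0, 1.0])

d_euclid = np.linalg.norm(x - y)      # all features weighted equally
d_learned = learned_distance(x, y, L)  # informative features dominate
```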

I found that this approach ranked the known match first 55% of the time out of the entire database (using leave-one-out cross-validation). That might not sound great if you think of chance as 50/50, but here true chance is about 0.03%. Still, that is not quite a fair comparison: the real test of whether the metric-learning algorithm is working is whether LMNN produces more top-ranked matches than plain Euclidean distance (which needs no machine learning at all). Euclidean distance produces a top-ranked match only 43% of the time, so the LMNN approach yields a 12 percentage point improvement.
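The “top-ranked match” success rate can be sketched as a nearest-neighbor check over a pairwise distance matrix. This simplified version takes the distances as given, ignoring the per-fold refit of LMNN that true leave-one-out requires, and all names and numbers are illustrative:

```python
import numpy as np

def top_rank_rate(D, labels):
    """Fraction of listings whose nearest neighbor shares their donor label.

    D is an (n, n) pairwise distance matrix; labels[i] identifies the
    true donor behind listing i.
    """
    n = len(labels)
    hits = 0
    for i in range(n):
        d = D[i].copy()
        d[i] = np.inf                 # exclude the listing itself
        nearest = int(np.argmin(d))
        hits += labels[nearest] == labels[i]
    return hits / n

# Toy distance matrix: listings 0 and 1 are a true pair and close;
# listings 2 and 3 are nearest neighbors but NOT a true pair.
D = np.array([
    [0.0, 0.1, 5.0, 5.0],
    [0.1, 0.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 0.2],
    [5.0, 5.0, 0.2, 0.0],
])
labels = [0, 0, 1, 2]
rate = top_rank_rate(D, labels)
```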

Testing

With only 58 listings covering 27 individuals in my training set, I couldn’t hold any listings back for a true test set and hope to get anywhere. The success rates above for getting a top-ranked match come from leave-one-out cross-validation. This technique is useful when a training set is tiny and can provide a decent guess of how a method will perform outside the training set, but it is far from ideal, and I really needed a better way to validate the model.

Once I realized my approach was producing matches with identical donor ID numbers but different bank names (like our TSBC number 8 example), I knew I had a shot at a reasonable test set. Using public notes in the DSR database, I identified several banks that either were known to share samples or had changed names. By searching for identical donor IDs at these banks and then making a human check on the listed characteristics, I was able to generate a test set of 80 listings covering 40 distinct donors. On this test set, Super-Donor produces a top-ranked match 62% of the time, even better than in cross-validation.

Predicting pairs

While I was able to verify that my algorithm does a good job of ranking matched pairs first, there was a second problem to solve in this project: predicting whether the top-ranked donor is actually a match, which most of the time it will not be. During the 3 weeks I had to build this project at Insight, I didn’t solve this part of the problem as well as I wished. I used a rule-based approach, predicting that donors with distances below a threshold were likely pairs, provided the descriptions of both donors were long enough (very short descriptions produce shorter distances). If I’d had more time, I would have approached this second problem using anomaly detection.
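The rule itself is simple to sketch. The threshold values below are placeholders, not the ones tuned in the project:

```python
def predict_pair(distance, text_length, dist_threshold=1.0, min_length=200):
    """Rule-based match prediction: close enough AND described in enough
    detail, since very short descriptions produce artificially small
    distances. Both thresholds here are illustrative placeholders.
    """
    return distance < dist_threshold and text_length >= min_length

pred_detailed = predict_pair(distance=0.4, text_length=800)  # close, detailed
pred_short = predict_pair(distance=0.4, text_length=50)      # close but too short
```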

Since I wanted to make the product as useful as possible, I took the approach of providing a full list of the 10 most similar donors based on my metric-learning distance, not just those predicted to be matches. Even when a match was not listed first, it was often still in the top 5 or top 10. Individual families or donor offspring can then use this information to refer back to the DSR, access longer descriptions, and make their own judgment as to whether they wish to follow up with any of the other donor groups to determine if a candidate is a true match. Fortunately, with follow-up genetic testing it is possible to get a definitive yes/no answer.

Final Words and Resources

This project was my first experience with natural language processing, metric learning, interacting with SQL, and building any kind of web application. Working with my “fellow Fellows” under the guidance of the Insight mentors and program directors was a wonderful way to learn and grow. You can find my code for the web interface (using Flask and Bootstrap) and for generating the model on GitHub.

I relied heavily on the metric-learn, pandas, NLTK, NumPy, Matplotlib, seaborn, and scikit-learn Python packages. I also found the books Python Machine Learning by Sebastian Raschka and Python for Data Analysis by Wes McKinney very helpful.