We will find you: DNA search used to nab Golden State Killer can home in on about 60% of white Americans

If you’re white, live in the United States, and a distant relative has uploaded their DNA to a public ancestry database, there’s a good chance an internet sleuth can identify you from a DNA sample you left somewhere. That’s the conclusion of a new study, which finds that by combining an anonymous DNA sample with some basic information such as someone’s rough age, researchers could narrow that person’s identity to fewer than 20 people by starting with a DNA database of 1.3 million individuals.

Such a search could potentially allow the identification of about 60% of white Americans from a DNA sample—even if they have never provided their own DNA to an ancestry database. “In a few years, it’s really going to be everyone,” says study leader Yaniv Erlich, a computational geneticist at Columbia University.

The study was sparked by the April arrest of the alleged “Golden State Killer,” a California man accused of a series of decades-old rapes and murders. To find him, and more than a dozen other criminal suspects since then, law enforcement agencies first test a crime scene DNA sample, which could be old blood, hair, or semen, for hundreds of thousands of DNA markers: signposts along the genome that vary among people, but that in many cases are shared with blood relatives. They then upload the DNA data to GEDmatch, a free online database where anyone can share their data from consumer DNA testing companies such as 23andMe and Ancestry.com to search for relatives who have submitted their DNA. Searching GEDmatch’s nearly 1 million profiles revealed several relatives who were the equivalent of third cousins to the source of the crime scene DNA linked to the Golden State Killer. Other information such as genealogical records, approximate age, and crime locations then allowed the sleuths to home in on a single person.

Geneticists quickly speculated this approach could identify many people from an unknown DNA sequence. But to quantify just how many, Erlich and colleagues took a closer look at the MyHeritage database, which contains 1.28 million DNA profiles of individuals looking at their family history. (Erlich is chief science officer of the ancestry DNA testing company.) If you live in the United States and are of European ancestry, there’s a 60% chance you have a third cousin or closer relative in this database, the team projected. Their success rate was similar when they did searches for 30 random profiles in GEDmatch. (The odds drop to 40% for someone of sub-Saharan African ancestry in the MyHeritage database.)

Assuming you have a relative in one of these databases, what are the chances police could find you from an unidentified DNA sample, the way they nabbed the alleged Golden State Killer? To find out, Erlich and colleagues combined the MyHeritage database information with family trees and demographic data such as rough age and likely geographic location. On average, that allowed them to use a hypothetical DNA sequence to home in on 17 “suspects” from a pool of about 850 people, the team reports today in Science.

GEDmatch likely only encompasses about 0.5% of the U.S. adult population, but millions of Americans are using DNA ancestry testing services. Once the GEDmatch figure rises to 2%, more than 90% of people of European descent will have a third cousin or closer relative and could be found in this way. “It’s surprising how small the database needs to be,” says population geneticist Noah Rosenberg of Stanford University in Palo Alto, California, who was not involved with the work.
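The jump from about 60% at 0.5% coverage to more than 90% at 2% can be made plausible with a back-of-the-envelope calculation. The sketch below is not the study's model. It assumes that each of a person's third-cousin-or-closer relatives is in the database independently with probability equal to the coverage fraction, and it assumes that the roughly 60% match rate corresponds to roughly 0.5% coverage. The function names are hypothetical.

```python
import math

def relatives_needed(coverage, match_prob):
    """Solve 1 - (1 - coverage)**n = match_prob for n, the implied number
    of third-cousin-or-closer relatives (simplifying independence assumption)."""
    return math.log(1.0 - match_prob) / math.log(1.0 - coverage)

def match_probability(coverage, n_relatives):
    """Chance that at least one of n_relatives is in the database."""
    return 1.0 - (1.0 - coverage) ** n_relatives

# Assumed calibration point: ~60% match rate at ~0.5% coverage.
n = relatives_needed(0.005, 0.60)

# Projected match rate if coverage rises to 2%.
p_at_2_percent = match_probability(0.02, n)

print(f"implied relatives: {n:.0f}")
print(f"match rate at 2% coverage: {p_at_2_percent:.1%}")
```

Under these assumptions the implied number of detectable relatives is a bit under 200, and the projected match rate at 2% coverage comes out above 90%, in line with the figure quoted above. Real family sizes and database membership are not independent, so this is only a rough consistency check.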

Rosenberg and colleagues showed last year that a profile in a consumer DNA database can be matched up with the same person’s profile in law enforcement forensic DNA databases, even though the forensic databases use a different, smaller set of DNA markers. Today in Cell, they report that more than 30% of individuals in the forensic databases can also be linked to a sibling, parent, or child in a consumer database. The two types of databases combined could make it even easier to find a suspect from a DNA sample. The linked consumer DNA profile could also reveal physical appearance or medical information for a criminal or their relatives, such as genes for eye color or a disease, even though the forensic databases aren’t supposed to contain that kind of information. “More can be done with them than has been claimed,” Rosenberg says.

Although these studies are encouraging news for solving crimes, they raise privacy concerns for law-abiding citizens, Erlich says. One possible solution suggested by his team is that the consumer DNA testing companies digitally encrypt a customer’s data and that GEDmatch only allow these encrypted files to be uploaded. That way a law enforcement agency couldn’t upload DNA sequence data from its own lab without an ancestry company’s cooperation. (The police can’t just pretend to be a customer and send crime scene DNA samples to companies like 23andMe because those companies’ sequencing machines typically can’t process scant, degraded DNA samples.)
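The idea of a database that only accepts files vouched for by a testing company can be illustrated with a small sketch. This is not the scheme the team proposes, whose details are not given here. It swaps in a shared-secret message authentication code (HMAC) from Python's standard library as a stand-in for whatever cryptographic protection the companies would actually use, and the function names, key, and sample data are all hypothetical.

```python
import hashlib
import hmac

def tag_profile(profile_bytes, company_key):
    """Testing company side: compute an authentication tag for a raw DNA file.
    HMAC-SHA256 is an illustrative stand-in, not the proposed scheme."""
    return hmac.new(company_key, profile_bytes, hashlib.sha256).hexdigest()

def accept_upload(profile_bytes, tag, company_key):
    """Database side: accept the file only if the tag matches its contents."""
    expected = tag_profile(profile_bytes, company_key)
    return hmac.compare_digest(expected, tag)

# Hypothetical key and data, for illustration only.
company_key = b"example-secret-held-by-testing-company"
customer_file = b"rs123 AA\nrs456 AG\n"
customer_tag = tag_profile(customer_file, company_key)

# A file produced in an outside lab has no valid tag from the company.
outside_lab_file = b"rs123 AG\nrs456 GG\n"

print(accept_upload(customer_file, customer_tag, company_key))
print(accept_upload(outside_lab_file, customer_tag, company_key))
```

In this toy version the database must hold the same secret as the company. A real deployment would more likely use public-key signatures so that the database could verify files without being able to create valid ones.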

Erlich also thinks U.S. officials need to revisit federal rules protecting people who volunteer for research studies. A recently revised guideline for biomedical researchers, called the Common Rule, assumes that a research participant can’t easily be identified from their anonymized DNA profile. But in its paper, Erlich’s team used GEDmatch to identify a woman who had taken part in a study, working from her anonymized DNA profile and birth date, which is often publicly available to researchers.

Genetic policy experts agree that changes to how genealogy databases and DNA sequencing firms operate or are regulated are needed. The digital signature might be “a partial solution,” says law professor Natalie Ram of the University of Baltimore in Maryland. But all the players in the direct-to-consumer DNA sequencing industry would have to agree to this scheme, she notes. “If not, we’re back to square one.”