In 2013, a young computational biologist named Yaniv Erlich shocked the research world by showing it was possible to unmask the identities of people listed in anonymous genetic databases using only an Internet connection. Policymakers responded by restricting access to pools of anonymized biomedical genetic data. An NIH official said at the time, “The chances of this happening for most people are small, but they’re not zero.”

Fast-forward five years and the amount of DNA information housed in digital data stores has exploded, with no signs of slowing down. Consumer companies like 23andMe and Ancestry have so far created genetic profiles for more than 12 million people, according to recent industry estimates. Customers who download their own information can then choose to add it to public genealogy websites like GEDmatch, which gained national notoriety earlier this year for its role in leading police to a suspect in the Golden State Killer case.

Those interlocking family trees, connecting people through bits of DNA, have now grown so big that they can be used to find more than half the US population. In fact, according to new research led by Erlich, published today in Science, more than 60 percent of Americans with European ancestry can be identified through their DNA using open genetic genealogy databases, regardless of whether they’ve ever sent in a spit kit.

“The takeaway is it doesn’t matter if you’ve been tested or not tested,” says Erlich, who is now the chief science officer at MyHeritage, the third-largest consumer genetics provider behind 23andMe and Ancestry. “You can be identified because the databases already cover such large fractions of the US, at least for European ancestry.”

To make these estimates, Erlich and his collaborators at Columbia University and the Hebrew University of Jerusalem analyzed MyHeritage’s dataset of 1.28 million anonymous individuals, which is, like most of the world’s genetic databases, overwhelmingly white. Considering each one of those individuals as a human “target,” they counted the number of relatives with big chunks of matching DNA and found that 60 percent of searches turned up a third cousin or closer. That level of relatedness was all investigators needed to track down the Golden State Killer and to solve the 17 other cases that have so far been cracked with this approach, known to law enforcement as long-range familial searching. To validate their findings, Erlich’s team plugged 30 genetic profiles into GEDmatch and saw similar results, with 76 percent of searches netting relatives in the third-cousin-or-closer range.

That analysis provides a list of around 850 individuals, depending on how prolific a person’s forebears were. But from there, basic demographic information can prune the lineup pretty quickly. Public records indicating where someone lives to within 100 miles cut the candidate pool in half. Knowing their age to within five years excludes 9 out of 10 of the remaining candidates. Sex, which can be inferred from genetics, gets the list down to around 16 individuals. Knowing the exact birth year could get you down to just one or two people.
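The narrowing described above is simple multiplication, and a minimal sketch makes the arithmetic concrete. The function name and the keep fractions below are illustrative assumptions read off the article's rounded wording (the "half" for sex in particular is assumed, not stated); this is not the researchers' actual method. With these rounded inputs the list shrinks from 850 to about 21, the same ballpark as the roughly 16 reported, so the published figure presumably rests on more precise fractions than the prose gives.

```python
def prune_candidates(initial, steps):
    """Apply each (label, keep_fraction) filter in order.

    Returns a list of (label, estimated_remaining) pairs, starting with
    the unfiltered list size.
    """
    remaining = float(initial)
    trail = [("genealogy match list", remaining)]
    for label, keep_fraction in steps:
        remaining *= keep_fraction
        trail.append((label, remaining))
    return trail


# Keep fractions are assumptions taken from the article's rounded wording.
STEPS = [
    ("residence within 100 miles", 0.5),  # "cuts the candidate pool in half"
    ("age within five years", 0.1),       # "excludes 9 out of 10"
    ("sex inferred from genetics", 0.5),  # assumed: about half the list
]

trail = prune_candidates(850, STEPS)
for label, size in trail:
    print(f"{label:<28} about {size:.1f} candidates")
```

Each filter multiplies what is left, which is why a few coarse facts are enough: three filters that are individually weak together remove about 97 percent of the starting list.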

To demonstrate how easy it is, the researchers chose an anonymous female subject from the 1000 Genomes Project, an open-access sequencing project, who was married to the man Erlich had previously identified in his blockbuster 2013 paper. They reformatted her DNA data to resemble a typical consumer genetic profile and uploaded it to GEDmatch. Two relatives popped up, one in North Dakota and one in Wyoming. The matches suggested they were distantly related, four to six generations back. An hour of public record-combing later, the team had found the ancestral husband and wife who linked them. From there, the researchers traced the pedigrees of hundreds of descendants to arrive at the identity of their target. All in all, the effort took a single day.