There have been some media “explainers” about how genetics can’t speak to Elizabeth Warren’s Native American heritage. This is a complicated issue, and not all the assertions in the media pieces I’ve seen are wrong, but a lot of the details are very confused or wrong. In sum, this is very bad journalism from people who don’t know where to start, and who had no idea they were relaying confusions or falsehoods. (I’m being generous here in assuming they didn’t know that they were repeating falsehoods.)

The point of this post isn’t to get too involved in the political points, or even to argue that Elizabeth Warren should take a genetic test (I don’t think she should unless she wants to for other reasons besides the political sideshow, but that’s my personal opinion). Rather, I think that genetics is being distorted for the sake of political points and demerits. That is not optimal. Normally I don’t do many “fisking” type posts, but this is necessary at this point.

Let’s start with The Washington Post, Sorry, Scott Brown: A DNA test can’t tell us if Elizabeth Warren has Native American roots.

First, the title is false. If a few percent of Elizabeth Warren’s ancestry was derived from people whose ancestors lived in the New World before 1492, then it would be visible on a PCA with Europeans and Native Americans. She’d be shifted a bit toward Native Americans.
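
To make the PCA point concrete, here is a minimal sketch in Python on simulated genotypes. Everything in it is an assumption for illustration, not real data or anyone's actual pipeline: the marker count, the Fst value, the sample sizes, the 3% admixture fraction, and the simplification that markers are unlinked. Allele frequencies come from the Balding-Nichols model, and held-out individuals are used as the comparison points, since samples projected onto a PCA sit slightly closer to the center than the samples used to build it.

```python
import numpy as np

rng = np.random.default_rng(42)

N_SNPS = 60_000   # assumed marker count, well below a real SNP-chip
N_REF = 40        # reference individuals per population
N_HELD_OUT = 20   # unadmixed individuals kept out of the PCA fit
FST = 0.1         # assumed divergence from the shared ancestral frequencies
ADMIX = 0.03      # hypothetical Native American ancestry fraction

# Balding-Nichols model: each population's frequencies scatter around ancestral ones.
ancestral = rng.uniform(0.1, 0.9, N_SNPS)
a = ancestral * (1 - FST) / FST
b = (1 - ancestral) * (1 - FST) / FST
freq_eur = rng.beta(a, b)
freq_nat = rng.beta(a, b)

def genotypes(freqs, n):
    """Draw n diploid genotypes (0, 1 or 2 copies of an allele at each SNP)."""
    return rng.binomial(2, freqs, size=(n, freqs.size)).astype(float)

ref_eur = genotypes(freq_eur, N_REF)
ref_nat = genotypes(freq_nat, N_REF)
held_out_eur = genotypes(freq_eur, N_HELD_OUT)
held_out_nat = genotypes(freq_nat, N_HELD_OUT)
# Simplification: with unlinked SNPs, admixture just blends the two frequency sets.
admixed = genotypes((1 - ADMIX) * freq_eur + ADMIX * freq_nat, 1)

# Fit the PCA on the reference panel only, then project everyone else onto PC1.
panel = np.vstack([ref_eur, ref_nat])
center = panel.mean(axis=0)
_, _, vt = np.linalg.svd(panel - center, full_matrices=False)
pc1 = vt[0]

def project(g):
    """Coordinates of the rows of g along the first principal component."""
    return (g - center) @ pc1

eur_pos = project(held_out_eur).mean()
nat_pos = project(held_out_nat).mean()
# How far along the European -> Native American axis the admixed person sits.
shift = (project(admixed)[0] - eur_pos) / (nat_pos - eur_pos)
print(f"admixed individual sits {shift:.1%} of the way toward the Native American cluster")
```

Under these assumptions the shift should land in the neighborhood of the simulated 3%, with noise from the finite number of markers. Real admixture is inherited in contiguous segments rather than as independent markers, so treat this as a cartoon of the geometry rather than a model of inheritance.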

Second, the journalist at The Washington Post interviews someone with serious credentials to serve as a primary source:

Nanibaa’ Garrison is a bioethicist and assistant professor of pediatrics at Seattle Children’s Hospital. A Native American, she earned a PhD in the Department of Genetics at Stanford, with a dissertation focused on ancestry.

She certainly has done genetic research, but I’m not sure that she can speak to modern genomic inference, which has advanced a lot in the past ten years.

Next:

That’s because determinations of ancestry are based on “ancestry-informative markers” — genetic flags that offer probabilities of the likelihood of certain ancestries. Most of those markers, AIMs, are “based on global populations that are outside of the U.S.,” she said, “primarily people of European descent, people of Asian descent and people of African descent. Those three populations are not enough to determine how much Native American ancestry a person has.

AIMs were popular in the 2000s. Basically they are panels of usually fewer than 100 markers with very large differences in frequency between your populations of interest. But today most people would not use AIMs unless cost is a major issue (e.g., I’ve seen that AIMs are still used sometimes in work from developing nations because they can’t afford SNP-chips). So all the talk about AIMs is totally irrelevant to the question at hand.

Today you can download data sets with hundreds of thousands of markers, and in the case of the 1000 Genomes data, millions. These are still ascertained for polymorphisms (that is, variants). But they’re really not AIMs in the classical sense, as they are not targeted to a narrow set of populations, but look for variation across most human groups.

Also, panels are not restricted to three populations. You can get plenty of indigenous American samples from various public panels, as well as from the 1000 Genomes Peruvian data set. The focus on three populations is again an artifact of 2005, probably due to the HapMap era (CEU, YRI, CHB+JPT, if you know what I mean).

Then:

Warren’s understanding of her heritage was that she was part Cherokee, perhaps as little as 1/32nd based on outside sleuthing. (Brown dismissed that claim specifically on this week’s call.) The odds of identifying a particular tribal identity are essentially zero, according to Garrison, but such a small percentage of Native American blood would also make identification much harder, even if the necessary AIMs existed.

Again, AIMs are irrelevant. This is like explaining that Netflix won’t work because of 56K modem download speeds. Most people don’t use 56K modems anymore. The 1/32 fraction may be an issue, but not because ~3% is not detectable. It is. A few years ago I stumbled onto the fact that geneticist Dan MacArthur is ~2% South Asian. He checked, and his brother is in the same range, while his father is about double. It turns out that he had an ancestor who was an officer in the British army in India….

The bigger problem here is that as you proceed back generations you are less and less likely to have genetic segments from any given ancestor. So if you had an ancestor 200 years ago who was Native American, even if they were 100% Native American, you may not have any genetic segments from that individual.

So, the article says:

Even a test that was fine-tuned to pick out Native American identity might not find any on Warren’s genes, because the requisite markers simply may not have made the cut over multiple generations.

This is correct. But you probably do have segments from someone five generations back. There’s about a 5-10% chance that you wouldn’t inherit any segments from an ancestor at that remove. The expert consulted by The Washington Post states:

“It would be impossible to go back that far,” Garrison said. “One-32nd is low enough that, even if she does have Native American ancestry, just by chance the genes that show up on these AIM panels might not necessarily be passed down, even if she might have other genetic variants that are highly prevalent among Native Americans. It’s all just by chance, what you inherit from your parents.”

As I said, AIMs are irrelevant. Today you would use dense SNP-chip panels or even whole genome sequencing. But even with AIMs, if you had 100 well distributed throughout the genome, it would be quite possible to detect ancestry that diverges from the rest of the genome. It is not “impossible” as asserted. The source is just incorrect.

Next:

“There’s a confidence interval that’s associated with [the results],” Garrison said. “That confidence interval can be very wide, especially when you’re talking about such low ancestral contribution.” So maybe Warren gets the results back and it says that she’s Native American — but that it can only be determined with 20 percent confidence. Scott Brown might not be convinced.

This is only an issue with AIMs. You can get results of 3% back pretty robustly. And it would show up on PCA too.

Then there are weird tangents, which I think exist to make the author look like they’ve “done their research” and reassure the lay audience:

Huntington disease, for example, can be spotted in DNA — but the test wouldn’t tell you when the disease might develop, which doesn’t do you much good if you’re worried about a four-year window. “There are so many different environmental factors or dietary factors and other health behaviors that would feed into whether or not a disease might develop and what time in their life it would develop,” Garrison said, making that sort of prediction impossible. (For now, at least.)

I’m not a medical geneticist, but I think the example of Huntington’s is kind of strange to put here (perhaps because people know about it?). It’s really well genetically characterized. From the link provided in the article:

As the altered HTT gene is passed from one generation to the next, the size of the CAG trinucleotide repeat often increases in size. A larger number of repeats is usually associated with an earlier onset of signs and symptoms. This phenomenon is called anticipation. People with the adult-onset form of Huntington disease typically have 40 to 50 CAG repeats in the HTT gene, while people with the juvenile form of the disorder tend to have more than 60 CAG repeats. Individuals who have 27 to 35 CAG repeats in the HTT gene do not develop Huntington disease, but they are at risk of having children who will develop the disorder. As the gene is passed from parent to child, the size of the CAG trinucleotide repeat may lengthen into the range associated with Huntington disease (36 repeats or more).

Warren is old enough that she is unlikely to have 60 repeats or more. But Huntington’s is one of those diseases where we have a good sense of age of onset, because its triplet repeat length is inversely correlated with age of onset: more repeats, earlier onset.

Next we have an article in Slate, A DNA Test Won’t Explain Elizabeth Warren’s Ancestry. First:

But here’s the thing: DNA testing cannot definitively prove whether a person is Cherokee. Or a member of any community, at least not reliably. To assume it can is to assume that there’s something inherently different in the genetic makeup of tribal members and that this thing is universal within that community. That’s not true.

Strawman. We’re always talking probabilities. Then:

The problem is that DNA snippets, or markers, are inconsistent. Sometimes they are passed on and sometimes they are not, and whether they are or aren’t is random. Sure, a large percentage of Native Americans may share certain genetic markers. But many Native Americans may lack the same marker, and many non–Native Americans may carry it by coincidence.

I don’t have a good sense of what the author is trying to get at, though I think there’s something underlying all this verbiage. The issue that allele frequencies are not (usually) disjoint across populations is well known. That’s why modern SNP-chip panels use hundreds of thousands of markers. Much of the Slate article is engaging a strawman when it comes to genetics, because it acts as if we’d actually rely on a few markers. Perhaps the real target is the public’s perception of how these things work. In that case, the author could simply put in this sentence: “genetic tests to detect ancestry usually rely on hundreds of thousands of markers today, not only a few….”
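
Here is a toy illustration of why marker count matters so much. It is a sketch under invented assumptions: two hypothetical populations, A and B, whose allele frequencies overlap heavily at every marker (never differing by more than 0.2), markers treated as independent, and one simulated person who is truly from population A. No single marker says much, but the log-likelihood ratio adds up across markers.

```python
import math
import random

random.seed(1)

def simulate_markers(n):
    """Hypothetical (freq_in_A, freq_in_B) pairs that overlap heavily."""
    markers = []
    for _ in range(n):
        base = random.uniform(0.2, 0.8)
        delta = random.uniform(-0.1, 0.1)
        markers.append((base + delta, base - delta))
    return markers

def draw_genotype(markers):
    """Genotype (0, 1 or 2 allele copies per marker) for a person from population A."""
    return [sum(random.random() < pa for _ in range(2)) for pa, _ in markers]

def log_likelihood_ratio(genotype, markers):
    """log P(genotype | A) - log P(genotype | B), markers treated as independent."""
    total = 0.0
    for g, (pa, pb) in zip(genotype, markers):
        # Binomial genotype probabilities; the binomial coefficient cancels in the ratio.
        total += g * math.log(pa / pb) + (2 - g) * math.log((1 - pa) / (1 - pb))
    return total

markers = simulate_markers(100_000)
person = draw_genotype(markers)

for n in (10, 100, 1_000, 100_000):
    llr = log_likelihood_ratio(person[:n], markers[:n])
    print(f"{n:>7} markers: log-likelihood ratio favoring A = {llr:8.1f}")
```

With a handful of markers the ratio hovers near zero and can even point the wrong way; with a hundred thousand it becomes overwhelming. Real methods also have to deal with linkage between nearby markers, which this sketch ignores.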

This lack of specifics crops up over and over:

So when a DNA test comes back saying you are 28 percent Finnish, all it’s really saying is that of the DNA analyzed (most companies don’t analyze all of your DNA), 28 percent of it was most similar to that of a completely Finnish person. In the end, these comparisons are a fun but ultimately unreliable way to think about the possibilities of whom your ancestors might have been, rather than definitive proof of your ethnic background.

There’s a link in the piece that takes you to a 2007 piece on how DTC tests aren’t all they’re cracked up to be. 2007 is ages ago in genomics. So ignore that. Second, the selection of Finnish is unfortunate for the author, as Finns are actually one of the more genetically distinctive European populations out there because of a small effective population size. So, for example, one of my friends has a grandfather whose parents were from Finland. 23andMe says she is 19% Finnish. It’s simply wrong that it’s “unreliable.” With segment matching it’s quite reliable if you get a positive hit, assuming you set the genetic distance threshold high enough. Also, depending on how you delimit “ethnic background,” it can be quite definitive. Samples from Northern Europe never show much evidence of African ancestry. A minority of white Americans do. That’s not a coincidence.

As in The Washington Post, the author of the Slate piece has an authority who lays down the truth as they see it:

“Scientists who don’t know better claim that when more Natives are sampled they’ll have better data bases, i.e. more Native markers,” said Kim TallBear, professor of Native studies at the University of Alberta in a 47-tweet takedown of Brown’s remarks about Warren. “[Geneticists] think that with more markers, and greater historical-genetic resolution they’ll be able to pinpoint tribe-specific markers.” But this does not account for the fact that people are continuously moving and reproducing with other, diverse people. They mix their genetic code with other communities (as they always have, going back to the dawn of our species). If anything our DNA is getting more muddled, not more clear.

Can you read a paper like The genetic structure of the world’s first farmers, and believe this? Geneticists who work in historical population genomics are quite familiar with the ideas of migration and gene flow. More data is clarifying, just as it should be in science.

The first authority cited in The Washington Post did some legitimate science at some point, though a bit outside of the core area of expertise she was being consulted on, and her knowledge definitely seems out of date (the constant talk about AIMs is a good tell here). Kim TallBear’s publications are quite different….

The author of the Slate piece ends:

Another issue is limited and inconsistent data. Ancestry.com, for example, divides the world up into 26 genetic regions and uses just 115 samples to create the representative of each region—a very small sample size. And different companies place different weight on these samples, which come from burial grounds, modern isolated communities, and academically published data, like the Human Genome Diversity Project. For the consumer, this means if you don’t like your heritage results, try a different company. You’ll get a completely different breakdown. Whether there’s any harm in people basing their identity on faulty reasoning is unclear, but the success of these commercial endeavors proves that at the very least, consumers find it kind of fun. Genetic testing is basically just a low-cost way to get a blurry picture of whom your ancestors might have been related to.

First, the author needs to issue a correction. I immediately knew Ancestry.com didn’t use 115 samples; that’s just too low. Fifteen seconds of Google shows me that they have a sample size of 3,000. No idea where 115 samples comes from, and I don’t care. He’s wrong. Slate should correct this. [see addendum; I may have misunderstood or been too harsh here, but a different point then crops up….]

Second, it’s misleading to say the picture is “blurry.” No, arguably it’s overly precise, and misleads people. Many of these ancestry inferences are quite precise and robust. They don’t vary between replicates that much even though they have a stochastic parameter. But model-based clustering gives results conditioned on a model. The results themselves are then sensitive to the parameters you’re putting into the model. The different regions from different DTC companies and sample sets are these different conditions.

This isn’t mysterious or difficult to understand. If you want to separate your individuals into Africans and non-Africans all the non-Africans will go into one cluster. This is robust, precise, and highly reproducible. In fact, a non-African individual will never be clustered with Africans with normal SNP-chip densities. At least not in the thousands of iterations I’ve personally run and inspected. Similarly, as you separate populations further you’ll see reasonable and comprehensible divisions.

The problems crop up when you begin to slice and dice very close genetic groups, where there isn’t much between-population difference. This is what happens in Northern Europe, and this is where most of the DTC firms’ client base is from. So this causes problems, and often results that are difficult to interpret. Moderate changes in parameters can then produce divergent results because the question we’re trying to get at is really hard to resolve with the data on hand, less than one million SNPs.

There are ways to resolve this. And that has to do with more data. In particular, whole genome sequencing at high coverage can pick up very rare alleles, which are highly informative of more recent genealogical history, and so divide up even Northern Europeans in a way that is more comprehensible and historically accurate.

But really the problem isn’t with the data. We have very dense SNP-chip markers now. The problem isn’t with the methods. We have genotype and haplotype-based methods which can make pretty strong inferences, especially at the intercontinental level (e.g., a friend who is 1/4 Japanese genealogically comes out to be 24% Japanese genomically; the rest is European). The problem is that the public, including journalists, isn’t always clear on what the results are telling them. Sometimes the DTC companies themselves may be at fault because of their unclear communication. And to be frank, Henry Louis Gates Jr., in my opinion, has often sown a lot of confusion as well with his television show, informative as it may be.

Looping back to Elizabeth Warren, the biggest issue with her maybe not having any indigenous ancestry combined with a Cherokee ancestor five generations back is that the Cherokee nation in the 19th century was already genetically mixed. The great chief John Ross was 1/8th Cherokee by blood quantum. That is, 1/8th of his ancestors were present in the New World in 1492. So a simple reason for why Elizabeth Warren might be Cherokee, but without indigenous ancestry, is that her Cherokee ancestor may not have had much indigenous ancestry. It’s not because genetics can’t pick up indigenous ancestry; genetics can. It’s just that this is a case where social and cultural history and definitions are important.
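
The arithmetic behind this is simple enough to sketch. The function name below is made up for illustration, and the 1/8 figure is borrowed from the John Ross example purely as a hypothetical; nothing here is a claim about Warren’s actual ancestor. These are also expected values: recombination makes the amount actually inherited from any one distant ancestor vary around them.

```python
from fractions import Fraction

def expected_indigenous_fraction(generations_back, ancestor_indigenous_fraction):
    """Expected share of a genome tracing to pre-1492 New World ancestors
    through a single ancestor `generations_back` generations ago."""
    # Each generation halves the expected contribution of any one ancestor.
    return Fraction(1, 2 ** generations_back) * ancestor_indigenous_fraction

# A Cherokee ancestor five generations back who was fully indigenous by ancestry.
fully_indigenous = expected_indigenous_fraction(5, Fraction(1))
# The same ancestor, but 1/8 indigenous by ancestry, as John Ross was.
like_john_ross = expected_indigenous_fraction(5, Fraction(1, 8))

print(fully_indigenous, f"{float(fully_indigenous):.2%}")
print(like_john_ross, f"{float(like_john_ross):.2%}")
```

The first case is the familiar 1/32, or about 3%, which is detectable. The second is 1/256, well under half a percent, which is a very different proposition even though the ancestor is equally Cherokee in both cases.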

To be honest this post is a bit trivial. But lots of people read The Washington Post and Slate. As I just explained above, there is a simple reason why Elizabeth Warren could come out 100% European in her ancestry and still be of Cherokee descent. Instead of explaining this, the media has decided to look for people who claim that genetics just can’t answer this question. In the process they garble, mislead, and repeat falsehoods (the sample size for Ancestry.com is obviously wrong to anyone who is familiar with that field, but the journalist is not familiar, so it passed their smell test since they had no grounds for discernment).

This post exists only so that at least there is someone out there correcting the record.

Note: I am a consultant for Gene By Gene and was a developer for their MyOrigins tool. This is one reason I know a lot about DTC genetic companies. But it also means I have a conflict of interest, as I think DTC genomics is useful with the proper caveats.

Addendum: A reader:

This seems, um, contrivedly obtuse. 115 samples per region times 26 regions is a total sample size of 2990, which seems reasonably close to 3000. Going the other way, 3000 / 26 is 115.4, so that will be where the claim of “115 per region” came from. There was no claim of “115 total”; the piece says that the representative of each region is constructed from 115 samples. It’s true that 115 is an average figure and that’s not made clear in the article, but I’m not sure how comforting I should find it that the representative of “Polynesia” is actually constructed from 18 samples rather than 115.

A fair, but inadvertently ignorant, point. Sample sizes of ~20 are actually quite sufficient to generate reference populations. It partially depends on how diverse the populations you are trying to use as a reference are.