The Ancestry Composition Algorithm

Overview

The Ancestry Composition algorithm comprises five distinct steps.

First, we use a computational method to estimate the phasing of your chromosomes, that is, to determine the contribution to your genome by each of your parents. Next, we break up the chromosomes into short windows, and we compare your DNA sequence in each window to the corresponding DNA in our reference datasets. We label your DNA with the ancestry whose reference DNA is most similar, and then we process those assignments computationally to "smooth" them out. We then calibrate the results to ensure that they are accurate and, finally, aggregate them across higher-level population designations. Each step in this process is described in more detail in the following sections.

Step 1: Phasing

Recall wrinkle #2 above. For each customer, we measure a set of genotypes (pairs of alleles). But what we really want is a pair of haplotypes for each chromosome. That is, we want to figure out the series of alleles present on each of your two copies of, for example, chromosome 7: one you received from your mother and one you received from your father. To do so, we first build a very large "phasing reference panel" using data from hundreds of thousands of customers. We then use Eagle (Loh et al., 2016), a statistical phasing method designed to scale to very large cohorts, to phase these individuals jointly. Once we have phased this large collection of customers, we can use the resulting haplotypes to efficiently phase new customers.
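To see why phasing is needed at all, the following minimal sketch lists every pair of haplotypes that is consistent with a short run of unphased genotypes. The allele letters and the helper name consistent_haplotype_pairs are illustrative assumptions. This is not how Eagle works; it only shows the ambiguity that a phasing method has to resolve.

```python
from itertools import product

def consistent_haplotype_pairs(genotypes):
    """Return every unordered pair of haplotypes consistent with unphased genotypes.

    Each genotype is a 2-character string such as "AG": the two alleles
    measured at one marker, with no information about which copy carries which.
    """
    pairs = set()
    # At each marker, either allele could sit on the first copy of the chromosome.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)

# Three markers, two of them heterozygous (hypothetical data).
pairs = consistent_haplotype_pairs(["AG", "CC", "TA"])
print(pairs)
```

With two heterozygous markers there are already two possible phasings, and the count doubles with each additional heterozygous marker, which is why a statistical method and a large reference panel are needed.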

Step 2: Window Classification

After phasing your chromosomes, we segment them into consecutive windows containing ~300 genetic markers each. We measure between 7,400 and 50,000 markers per chromosome, which translates to 24 to 149 windows, depending on the chromosome's length. We consider each window in turn and compare your DNA to the reference datasets to determine which ancestry most closely corresponds to your DNA.
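As a rough illustration of the windowing step, the sketch below cuts a chromosome's markers into consecutive windows. The function name split_into_windows, the fixed window size of exactly 300, and the choice to fold leftover markers into the last window are all assumptions made for the example; the text above only says windows contain about 300 markers.

```python
def split_into_windows(n_markers, window_size=300):
    """Return (start, end) index pairs for consecutive windows of markers.

    Assumption: leftover markers at the end are folded into the last window.
    """
    n_windows = max(1, n_markers // window_size)
    bounds = [(i * window_size, (i + 1) * window_size) for i in range(n_windows)]
    # Extend the final window so that every marker belongs to some window.
    bounds[-1] = (bounds[-1][0], n_markers)
    return bounds

# A chromosome with 7,400 measured markers, the low end of the range above.
windows = split_into_windows(7400)
print(len(windows), windows[0], windows[-1])
```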

There are many ways to assign ancestry to DNA segments based on reference data, and we tried several. The best-performing option was a well-known classification tool called a support vector machine, or SVM. An SVM can "learn" different ancestry classifications based on a set of training examples and then assign new DNA segments to a learned category.

In the case of Ancestry Composition, we train the SVM with reference DNA sequences and tell it which ancestry population those sequences are from. Then, when we look at the DNA from a 23andMe customer with unknown ancestry (like you), we can ask the SVM to classify your DNA for us based on the reference datasets.
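The train-then-classify idea can be sketched with a tiny linear SVM written in plain Python. Everything here is an assumption made for illustration: the 0/1 allele encoding, the four-marker windows, the two populations X and Z, the hyperparameters, and the training method (full-batch subgradient descent on the hinge loss). The production classifier handles many populations and is not described at this level of detail in the text.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, lam=0.01):
    """Fit a binary linear SVM by full-batch subgradient descent on the hinge loss.

    samples: equal-length 0/1 allele vectors; labels: +1 or -1.
    """
    n, d = len(samples), len(samples[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Start from the gradient of the L2 regularizer, then add hinge terms.
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin, so the hinge is active
                for j in range(d):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def classify(window, w, b, names=("X", "Z")):
    """Label a window by which side of the learned boundary it falls on."""
    score = sum(wj * xj for wj, xj in zip(w, window)) + b
    return names[0] if score > 0 else names[1]

# Hypothetical reference windows: three from population X, three from Z.
reference = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
             [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
ancestry = [+1, +1, +1, -1, -1, -1]

w, b = train_linear_svm(reference, ancestry)
print(classify([1, 1, 0, 0], w, b), classify([0, 0, 1, 1], w, b))
```

A multi-population version would typically train one such classifier per population (or per pair of populations) and combine their scores.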

We built Ancestry Composition on SVMs because they performed best among the techniques we tried. SVMs are also very fast, which is critical for a large and growing database.

Step 3: Smoothing

The SVM classifies each window of your genome independently, creating a "first draft" version of your ancestry result. We use another computational process, called the smoother, to smooth this raw SVM output. The smoother uses a version of a well-known mathematical tool called a Hidden Markov Model to correct, or “smooth,” two kinds of errors. Hidden Markov Models are used to analyze sequential data, like biological sequences or recorded speech. As an example, suppose we had three ancestry populations: X, Y, and Z. An example of output from the SVM might look like this:

chromosome 1, parent 1: X - X - X - Z - Z - Z - Y - Z
chromosome 1, parent 2: Z - Z - Z - X - X - X - X - X

The first kind of error the smoother corrects is an unusual assignment in the middle of a run of similar assignments. In the first line above, there's a run of Z's interrupted by a single Y: Z - Z - Z - Y - Z. It's possible that the lone Y was a close call between Y and Z that went the wrong way. If that were the case, the smoother could correct it to Z - Z - Z - Z - Z.
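The lone-Y correction can be reproduced with a small Hidden Markov Model decoded by the Viterbi algorithm. This is a sketch under assumed parameters: the 0.9 chance that neighboring windows share an ancestry and the 0.8 chance that the SVM label is right are made-up numbers, and the production smoother is described only as "a version of" an HMM, so it may differ.

```python
import math

STATES = ("X", "Y", "Z")
P_STAY = 0.9      # assumed chance that adjacent windows share an ancestry
P_CORRECT = 0.8   # assumed chance that the SVM label matches the true ancestry

def log_transition(prev, cur):
    p = P_STAY if prev == cur else (1 - P_STAY) / (len(STATES) - 1)
    return math.log(p)

def log_emission(state, observed):
    p = P_CORRECT if state == observed else (1 - P_CORRECT) / (len(STATES) - 1)
    return math.log(p)

def viterbi(observed):
    """Return the most probable sequence of true ancestries given noisy labels."""
    # best[s] = (log probability, path) of the best path ending in state s
    best = {s: (math.log(1 / len(STATES)) + log_emission(s, observed[0]), [s])
            for s in STATES}
    for obs in observed[1:]:
        new_best = {}
        for cur in STATES:
            score, path = max(
                (best[prev][0] + log_transition(prev, cur), best[prev][1])
                for prev in STATES
            )
            new_best[cur] = (score + log_emission(cur, obs), path + [cur])
        best = new_best
    return max(best.values())[1]

smoothed = viterbi(["Z", "Z", "Z", "Y", "Z"])
print(" - ".join(smoothed))
```

Under these assumed parameters, keeping the Y would require two ancestry changes in a row, which the model considers less likely than a single misclassified window.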

The second kind of error the smoother corrects arises from the phasing step. Phasing algorithms can make phasing mistakes known as switch errors, where they mix up the DNA of one parent with that of the other. The smoother can switch the ancestry assignments between your mother and your father if it detects one of these errors. In this example, there may be a switch error after the third window: the first three windows of the two lines appear to be swapped. If the switch were reversed, then the runs of X's and the runs of Z's would stay together. In our simplified example, the smoother might output something like this:

chromosome 1, parent 1: Z - Z - Z - Z - Z - Z - Z - Z
chromosome 1, parent 2: X - X - X - X - X - X - X - X
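The switch-error idea can be illustrated with a deliberately simple heuristic: try swapping the two copies before each window and keep the swap that leaves the fewest ancestry changes. This is a stand-in chosen for clarity, not the HMM the smoother actually uses, and the function names are hypothetical. It handles only a single switch and does not fix the lone Y.

```python
def count_changes(labels):
    """Number of windows whose ancestry differs from the previous window."""
    return sum(a != b for a, b in zip(labels, labels[1:]))

def undo_switch(hap1, hap2):
    """Swap the first k windows between the two copies, for every k, and keep
    the k that minimizes the total number of ancestry changes.

    Returns (k, new_hap1, new_hap2); k == 0 means no swap was needed.
    """
    best = None
    for k in range(len(hap1)):
        new1 = hap2[:k] + hap1[k:]
        new2 = hap1[:k] + hap2[k:]
        cost = count_changes(new1) + count_changes(new2)
        if best is None or cost < best[0]:
            best = (cost, k, new1, new2)
    return best[1:]

# Raw SVM output from the example above.
hap1 = list("XXXZZZYZ")
hap2 = list("ZZZXXXXX")
k, fixed1, fixed2 = undo_switch(hap1, hap2)
print(k, "".join(fixed1), "".join(fixed2))
```

The best swap covers the first three windows, after which one copy is almost entirely Z and the other entirely X; the remaining Y is the kind of isolated error the first correction addresses.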

This example illustrates the purpose of the smoother. But with real data the picture is much messier, and the answers are rarely so clean. So instead of assigning a single ancestry to each window like we did in this example, the smoother estimates the probabilities of each Ancestry Composition population matching each window of DNA. The following picture shows a concrete example:

[Figure: Example plot of Ancestry Composition assignment probabilities]

This is the output of the smoother analysis of one copy of chromosome 2. Starting on the left, there is a short run of pink, then a wider run of green, then another run of pink. In this chart, pink is the color for Sub-Saharan African ancestry, and green is the color for Native American. The y-axis runs from 0 to 100 percent, and it shows the probability that the DNA in that region of the chromosome comes from each Ancestry Composition population. These pink and green regions fill the entire vertical space of the graph, which means that we are 100 percent confident that the DNA in those regions has Sub-Saharan African and Native American genetic ancestry, respectively.

The next region to the right, between positions 50 and 100 on the x-axis, is a stretch of multi-colored blue. The thickest strip at the bottom is dark teal, which is the color for British & Irish. This segment of DNA has somewhere between a 50 percent and a 60 percent chance of reflecting British & Irish ancestry. The other shades of blue show that the same DNA segment also has a chance of reflecting Italian, Iberian, or French & German ancestry. If you think back to the haplogroup example above, this result makes sense: it is normal for a DNA marker to match reference DNA from lots of places, even if it matches some places better than others. In this example, the result shows that this DNA segment matches reference DNA from all over Europe. We can very confidently conclude that this stretch of DNA reflects European ancestry, but the evidence isn't strong enough to assign it to one specific region of Europe with high confidence.
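Per-window probabilities like the ones plotted in this section are what the forward-backward algorithm produces for a Hidden Markov Model. The sketch below is one way to compute them for the toy X/Y/Z setting; the transition and emission values are assumed for illustration, and the text does not say that the production smoother computes its probabilities in exactly this way.

```python
STATES = ("X", "Y", "Z")
P_STAY = 0.9      # assumed chance that adjacent windows share an ancestry
P_CORRECT = 0.8   # assumed chance that the SVM label matches the true ancestry

def transition(prev, cur):
    return P_STAY if prev == cur else (1 - P_STAY) / (len(STATES) - 1)

def emission(state, observed):
    return P_CORRECT if state == observed else (1 - P_CORRECT) / (len(STATES) - 1)

def posteriors(observed):
    """Forward-backward: probability of each ancestry in each window."""
    # Forward pass: probability of the labels so far, ending in each state.
    fwd = [{s: emission(s, observed[0]) / len(STATES) for s in STATES}]
    for obs in observed[1:]:
        prev = fwd[-1]
        fwd.append({c: emission(c, obs) * sum(prev[p] * transition(p, c) for p in STATES)
                    for c in STATES})
    # Backward pass: probability of the remaining labels, starting from each state.
    bwd = [{s: 1.0 for s in STATES}]
    for obs in reversed(observed[1:]):
        nxt = bwd[0]
        bwd.insert(0, {p: sum(transition(p, c) * emission(c, obs) * nxt[c] for c in STATES)
                       for p in STATES})
    # Combine and normalize so each window's probabilities sum to 1.
    result = []
    for f, b in zip(fwd, bwd):
        total = sum(f[s] * b[s] for s in STATES)
        result.append({s: f[s] * b[s] / total for s in STATES})
    return result

probs = posteriors(list("ZZZYZ"))
for i, window in enumerate(probs, start=1):
    print(i, {s: round(p, 3) for s, p in window.items()})
```

In the fourth window, where the SVM said Y, most of the probability still goes to Z, but Y keeps a visible share. That residual uncertainty is what the stacked colors in the plot represent.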

Step 4: Re-calibration

This plot shows a lot of information, but how do we know it's correct? We use a calibration step to correct for systematic bias.

First, we ran some tests to establish whether we needed to correct for any bias. To do so, we simulated a large set of admixed individuals. Because we simulated them ourselves, we knew the "true" ancestry of each part of their genome. Then we ran these simulated individuals through the entire Ancestry Composition pipeline, and we compared their results with their true ancestries.

We found that most of the reference populations were already fairly well calibrated, which means the ancestry probabilities we estimated closely matched the empirical results. But there were a few populations, in particular the Scandinavian and Balkan reference populations, that required some adjustment.

To correct these probability estimates, we developed a re-calibration step that adjusts the ancestry probabilities produced by the smoother so that, for each population, the estimated probability matches how often that ancestry actually turned out to be correct in the simulations.
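What "well calibrated" means can be made concrete with a small check: group windows by their predicted probability and compare that prediction with how often the simulated truth agreed. The data below are invented for the example and the function name calibration_table is hypothetical. The sketch shows only the diagnostic, not the re-calibration adjustment itself.

```python
def calibration_table(predicted, correct, n_bins=5):
    """Bin predictions by probability and compare the mean predicted
    probability in each bin with the fraction that were actually correct.

    Returns rows of (mean predicted, observed fraction, count).
    """
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(predicted, correct):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, ok))
    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            observed = sum(ok for _, ok in members) / len(members)
            table.append((round(mean_p, 2), round(observed, 2), len(members)))
    return table

# Hypothetical simulated windows: predicted probability of one ancestry, and
# whether the simulated "true" ancestry matched (1) or not (0).
predicted = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5, 0.1, 0.1]
correct = [1, 1, 1, 0, 1, 0, 0, 0]
for row in calibration_table(predicted, correct):
    print(row)
```

In this toy data the 0.9 bin is right only 75 percent of the time, so that population would be over-confident and its probabilities would need to be adjusted downward.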

Step 5: Aggregation & Reporting

The last step is to summarize the results and display them in your Chromosome Painting. The way we do this is to apply a threshold to the probability plot as in this figure:

[Figure: Applying a threshold to Ancestry Composition assignment probabilities]

The horizontal line in this image indicates a 70 percent confidence threshold, which we will use for this example. You can view your own Chromosome Painting at different confidence thresholds ranging from 50 percent (speculative) to 90 percent (conservative).

We look across the entire chromosome and ask whether any ancestry has an estimated probability exceeding the specified threshold (in this case 70 percent). In this example, with the exception of the blue European stretch, the ancestry estimates exceed 70 percent over the majority of the chromosome. Each region contributes to your overall Ancestry Composition in proportion to its size. For example, the green Native American segment near the end of this plot makes up about 0.26 percent of the entire genome. Even though there is some probability that the segment comes from a different population, the Native American probability exceeds the 70 percent threshold, and so we add 0.26 percent Native American to the overall Ancestry Composition at this threshold.
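The thresholding arithmetic can be sketched as follows. The 0.26 percent segment size and the 70 percent threshold come from the example above; the individual probabilities, the second segment's size, the "Other" label, and the function name are assumptions made for illustration.

```python
def composition_at_threshold(segments, threshold=0.7):
    """Add up genome percentages of segments whose top ancestry clears the threshold.

    segments: list of (percent_of_genome, {ancestry: probability}).
    Segments with no ancestry above the threshold are tallied separately.
    """
    totals = {}
    for percent, probs in segments:
        ancestry, p = max(probs.items(), key=lambda item: item[1])
        label = ancestry if p > threshold else "below threshold"
        totals[label] = totals.get(label, 0.0) + percent
    return totals

segments = [
    # Segment size from the text; probabilities are hypothetical.
    (0.26, {"Native American": 0.85, "Other": 0.15}),
    # Hypothetical European stretch where no single ancestry reaches 70 percent.
    (0.50, {"British & Irish": 0.55, "French & German": 0.20,
            "Italian": 0.15, "Iberian": 0.10}),
]
totals = composition_at_threshold(segments)
print(totals)
```

Segments that land in the "below threshold" bucket are not discarded; they are candidates for a broader label.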

In the case of the European segment, no single ancestry exceeds the 70 percent threshold, so we don't assign that DNA to any fine-grained ancestries. Instead, we refer to our hierarchy of ancestries. There is a Broadly Northern European ancestry that includes four fine-level ancestries: British & Irish, Scandinavian, Finnish, and French & German. If, when we add up the contributions of each of these subgroups, the total contribution toward Broadly Northern European exceeds the 70 percent threshold, then we will report the region as Broadly Northern European.

In this example, the Broadly Northern European reference populations still don't exceed the 70 percent threshold, but the combined probabilities of all the European populations do. So this region is assigned Broadly European ancestry.

We use broad Ancestry Composition categories to avoid making assumptions about your ancestry when your DNA matches several different country-level populations. In regions where no ancestry — including the broad ancestries — exceeds the specified threshold, we report Unassigned ancestry. You can see the entire ancestry hierarchy in your Ancestry Composition report by clicking "See all tested populations."
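The roll-up through the hierarchy can be sketched as a search from the most specific label to the broadest. Only the membership of Broadly Northern European is taken from the text above; the membership of Broadly European, the probabilities, and the function name assign are assumptions for illustration, and the real hierarchy has more levels and populations.

```python
# Ordered from narrow to broad. The Broadly European membership is assumed.
HIERARCHY = [
    ("Broadly Northern European",
     {"British & Irish", "Scandinavian", "Finnish", "French & German"}),
    ("Broadly European",
     {"British & Irish", "Scandinavian", "Finnish", "French & German",
      "Italian", "Iberian"}),
]

def assign(probs, threshold=0.7):
    """Return the most specific label whose total probability exceeds the threshold."""
    ancestry, p = max(probs.items(), key=lambda item: item[1])
    if p > threshold:
        return ancestry
    for broad_label, members in HIERARCHY:
        if sum(probs.get(m, 0.0) for m in members) > threshold:
            return broad_label
    return "Unassigned"

# Hypothetical probabilities for one window in the European stretch.
window = {"British & Irish": 0.55, "French & German": 0.10,
          "Italian": 0.20, "Iberian": 0.15}
for threshold in (0.5, 0.6, 0.7):
    print(threshold, assign(window, threshold))
```

The same window gets a more specific label at a speculative threshold and a broader one at a conservative threshold, which is why the Chromosome Painting changes as the confidence setting is adjusted.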