The known limitations of genotyping arrays

Before getting into the details of this argument, it’s worth remembering the basic goals of most genetic association studies.

In general, if you’re running an association study, you are interested in identifying genetic variants that influence risk of a disease, and you have DNA from thousands of people with the disease and thousands of people without it. The goal is to scan across the entire 3.3 billion bases of the human genome to find sites that differ across these two groups.

The logic of genotyping arrays is as follows: obviously it’s not cost-effective to sequence all 3.3 billion bases in thousands of people, since most of the genome is just going to be identical in everyone. Instead, what you might want to do is choose a set of maybe 500,000 to 2M sites (well less than 0.1% of the genome) and measure just those. If you choose those sites well (for example, by focusing on sites that you know vary across people), you can get a lot of the benefits of looking at the whole genome for a fraction of the cost.
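The arithmetic behind "well less than 0.1%" is easy to check. A quick sketch, using the genome size and site counts quoted above:

```python
# Fraction of the genome a genotyping array actually measures, using the
# site counts and genome size quoted in the text above.
GENOME_BASES = 3.3e9  # bases in the human genome, as cited above

for n_sites in (500_000, 2_000_000):
    frac = n_sites / GENOME_BASES
    print(f"{n_sites:>9,} sites -> {frac:.4%} of the genome")
```

Even the densest arrays in this range assay around 0.06% of the genome, which is what makes the approach so inexpensive relative to sequencing everything.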

The counterargument to this logic is that, presumably, the reason you're doing your study is that you don't know ahead of time which sites to look at. Though there are ways to help mitigate this problem, it ultimately leads to the key limitations of genotyping arrays:

1. Since you have to know ahead of time that a genetic variant exists before you can measure it, it is difficult or impossible to identify the rare variants that have turned out to be important for a number of diseases and traits.

2. Array design is necessarily biased towards better-studied populations; for example, arrays based on genetic variants discovered in European populations don't perform well in African populations.

The well-known solution to all of these problems is to use a technology where you don't have to decide ahead of time which genetic variants to look at, and the natural choice is some variant of whole genome sequencing. At Gencove we've developed our own ultra-low-coverage sequencing assay, and as the schematic below shows, even very low-coverage genome sequencing generates a more complete look at the genome than a genotyping array.

A comparison of the data generated by a genotyping array and ultra-low-coverage sequencing. The x-axis represents 6,000 bases of the human genome, and the y-axis represents 600 people, 300 assayed with a genotyping array and 300 with sequencing. Black represents the positions measured with the array, while blue represents positions measured with sequencing (see inset).

The conventional wisdom is that there is a tradeoff between genotyping arrays and sequencing: genotyping arrays let you inexpensively cover all of the known variation in the genome, while sequencing lets you identify new variants at a higher cost. But this conventional wisdom is wrong.

Sequencing approaches outperform genotyping arrays even at common variation

Genotyping arrays were not designed to measure rare or population-specific variants, so it's hardly a surprise that they perform poorly on them. What might be surprising is that many genotyping arrays don't profile known variation particularly well either.

When we were developing the ultra-low-coverage sequencing assay that we use at Gencove, we performed a number of simulations comparing the power of imputation-based genetic studies using genotyping arrays to those using ultra-low-coverage sequencing [2].

This situation is exactly the one genotyping arrays are designed for. Based on results published a few years ago by my colleague Bogdan Pasaniuc, we expected the performance of ultra-low-coverage sequencing to be worse than that of genotyping arrays, but to an acceptable degree given the added benefits of sequencing in other contexts.

Instead, the results looked like those in the figures below: for example, in a Nigerian population, 0.2x coverage sequencing outperforms the Illumina CoreExome and Global Screening Array chips across the entire allele frequency spectrum. Importantly, this is not expensive: a quick back-of-the-envelope calculation [3] shows that the sequencing costs of 0.2x coverage will soon be under $10 per sample (though other costs then start to dominate).
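The back-of-the-envelope calculation is simple enough to write down. The per-gigabase price below is an illustrative assumption, not a quoted vendor figure:

```python
# Back-of-the-envelope reagent cost of low-coverage sequencing.
# PRICE_PER_GB is a hypothetical round number for illustration only.

GENOME_SIZE_GB = 3.1   # approximate haploid human genome size, in gigabases
PRICE_PER_GB = 10.0    # assumed sequencing reagent cost, $/Gb (hypothetical)

def sequencing_cost(coverage, price_per_gb=PRICE_PER_GB,
                    genome_size_gb=GENOME_SIZE_GB):
    """Reagent-only cost of sequencing one sample at the given mean coverage."""
    return coverage * genome_size_gb * price_per_gb

cost = sequencing_cost(0.2)
print(f"0.2x sequencing: ~${cost:.2f} per sample")  # roughly $6 at these assumptions
```

At these assumed prices, 0.2x coverage means generating only ~0.6 Gb of data per sample, which is why the reagent cost ends up in the single digits.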

Comparison of average genotype accuracy after imputation in an African population, using genotyping arrays and ultra-low-coverage sequencing. On the x-axis are genetic variants in bins of different minor allele frequencies, and on the y-axis is the average r² after imputation to the true genotypes.

Comparison of average genotype accuracy after imputation in a European population, using genotyping arrays and ultra-low-coverage sequencing. On the x-axis are genetic variants in bins of different minor allele frequencies, and on the y-axis is the average r² after imputation to the true genotypes.
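The accuracy metric in both figures can be sketched in a few lines: for each variant, compute the squared Pearson correlation (r²) between imputed genotype dosages and true genotypes, then average within minor allele frequency (MAF) bins. The genotype data below are made up purely for illustration:

```python
# Sketch of the per-variant imputation accuracy metric (r^2), averaged
# within minor-allele-frequency bins. Data here are synthetic.
import numpy as np

def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes (coded 0/1/2)
    and imputed dosages (continuous values in [0, 2]) at one variant."""
    r = np.corrcoef(true_genotypes, imputed_dosages)[0, 1]
    return r ** 2

def mean_r2_by_maf_bin(variants, bin_edges):
    """Average r^2 within MAF bins.

    variants: iterable of (maf, true_genotypes, imputed_dosages) tuples.
    bin_edges: increasing MAF bin boundaries, e.g. [0.0, 0.01, 0.05, 0.5].
    """
    per_bin = {}
    for maf, true_g, dosage in variants:
        i = int(np.searchsorted(bin_edges, maf, side="right")) - 1
        if 0 <= i < len(bin_edges) - 1:
            per_bin.setdefault(i, []).append(imputation_r2(true_g, dosage))
    return {i: float(np.mean(v)) for i, v in per_bin.items()}

# Toy example: one perfectly imputed and one noisily imputed variant.
rng = np.random.default_rng(0)
true1 = np.array([0, 1, 2, 0, 1, 2, 1, 0], dtype=float)
good = true1.copy()                          # perfect imputation
noisy = true1 + rng.normal(0, 0.8, size=8)   # noisy imputation
result = mean_r2_by_maf_bin(
    [(0.04, true1, good), (0.30, true1, noisy)],
    bin_edges=[0.0, 0.05, 0.5],
)
print(result)
```

Averaging r² within frequency bins, as in the figures, separates performance on common variants from performance on the rare variants where arrays struggle most.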

In our comparisons we used the genotyping arrays with the fewest markers. Recently, a paper from the Zeggini lab at the Sanger Institute appeared on bioRxiv, in which the authors compare higher-coverage sequencing (about 1x) with denser genotyping arrays.

The take-home messages from this paper are in line with our calculations:

1. Sequencing increases power compared to genotyping arrays for the discovery of associations between genetic variants and traits. In their application:

Of the 54 association signals arising from genome-wide association analysis of 1x [whole genome sequencing] variants with 25 haematological traits, only 57% are recapitulated by the imputed [array] results in the same samples.

2. Sequencing, the technology with more power, is less expensive than genotyping arrays:

As of January 2017, 1x WGS on the HiSeq 4000 platform was approximately half of the cost of a dense GWAS array (e.g. Illumina Infinium Omni 2.5Exome-8 array) [and] 1.5 times the cost of a sparser chip such as the Illumina HumanCoreExome array
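The quoted relationships imply something worth making explicit: if 1x WGS costs half as much as a dense array and 1.5 times as much as a sparse one, the dense array must cost roughly three times the sparse one. A quick check, with prices in arbitrary units:

```python
# Sanity check on the quoted relative costs. Prices are symbolic
# (sparse chip = 1 unit), not actual dollar figures.
sparse_array = 1.0                  # sparse chip, e.g. HumanCoreExome
wgs_1x = 1.5 * sparse_array         # "1.5 times the cost of a sparser chip"
dense_array = wgs_1x / 0.5          # WGS is "approximately half of the cost"
print(dense_array / sparse_array)   # -> 3.0
```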

For the purposes of discovering new genetic associations to traits and diseases, there actually is no tradeoff: sequencing approaches are more powerful and less expensive.

What next?

The switch from genotyping technologies to sequencing technologies was inevitable, but methods for low-coverage and ultra-low-coverage sequencing have accelerated the timescale for this switch considerably. How might one plan for large-scale human genomics studies going forward?