To inform the design of genomic studies in Africa, we addressed the following questions: (1) How well do current genotype arrays perform in African populations using existing reference panels for imputation? (2) Can these genotype arrays and reference panels identify and fine-map association signals in populations across Africa? (3) Can we improve imputation accuracy in African populations using a new African reference panel? and (4) What are the most cost-effective designs for large-scale GWAS in Africa?

The 1000 Genomes Project phase I integrated panel provided reasonably accurate imputation into the Illumina Omni 2.5M array in all populations (Supplementary Note 10). However, imputation accuracy was lower among Sotho, Zulu and Afro-Asiatic populations, possibly reflecting poor representation of some African haplotypes (including Khoe-San haplotypes) within the 1000 Genomes Project panel. These findings suggest that improvements in imputation accuracy across diverse population groups may require larger and more diverse reference panels.

We assessed the reproducibility and potential for fine-mapping association signals within Africa and globally at several disease susceptibility loci (Supplementary Methods, Supplementary Table 8 and Extended Data Fig. 9). Current genotype arrays and imputation panels identified the relevant association signals at most loci across populations in SSA, demonstrating that association signals are reproducible across populations in SSA (Extended Data Fig. 9 and Supplementary Figs 7–18). African populations are likely to provide better fine-mapping resolution around the causal locus (Supplementary Table 8). We highlight one example here: the sickle-cell anaemia locus (HBB)⁴⁷, which is under positive selection owing to the protection the sickle-cell trait confers against severe malaria. This locus showed marked heterogeneity in association signals across populations, reflecting different linkage disequilibrium patterns and allele frequencies among populations in SSA (Supplementary Figs 9 and 10). This pattern is probably the result of independent selection sweeps at this locus in different parts of Africa, leading to differences in the hitchhiking rare haplotypes that attained high frequencies among different populations⁴⁸. This suggests that these signatures are recent and arose during or after the Bantu expansion, consistent with the hypothesis that the advent of agriculture and increased malaria transmission may have resulted in increased selection pressure⁴⁹. However, in contrast to previous reports⁴⁷, we show that association signals even at such highly differentiated loci can be captured with dense genotype data using existing reference panels for imputation, despite individual population groups not being fully represented in these panels. This suggests that, instead of large-scale population-specific sequencing across Africa, what is needed is a broad sequencing approach targeted at capturing widespread haplotype diversity.

To assess the utility of a larger and more diverse African reference panel for imputation, we generated a panel integrating the 1000 Genomes Project phase I and AGVP WGS panels (Supplementary Methods and Supplementary Note 9). Using this integrated panel, we observed marked improvements in imputation accuracy across the whole allele frequency spectrum in specific populations poorly represented by the 1000 Genomes Project panel (Fig. 3 and Supplementary Note 11). These findings suggest that even common haplotypes in some SSA populations may not be sufficiently captured by existing panels, limiting our power to examine associations of common variants with disease. Importantly, given the specificity of the improvement in imputation accuracy, we infer that targeted sequencing of divergent populations representing a broad spectrum of haplotypes across Africa, including HG and North/East African haplotypes, is likely to be a more efficient strategy than widespread population sequencing for improving imputation accuracy and establishing a practicable GWAS framework in Africa.

Figure 3: Improvement in imputation accuracy with the AGVP WGS panel. The substantial improvement in imputation accuracy in some populations (Sotho), compared to minimal improvement in others (Igbo), with the addition of the AGVP WGS reference panel to the 1000 Genomes Project phase I reference panel (‘merged’) suggests poor representation of some haplotypes (for example, Khoe-San haplotypes in Sotho) in the 1000 Genomes Project reference panel alone (‘1000’). r² is the squared correlation between imputed and genotyped data, obtained by masking each genotyped variant during imputation. MAF, minor allele frequency.
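The accuracy metric in Fig. 3 can be illustrated with a minimal sketch, assuming it is the squared Pearson correlation between the masked true genotypes (coded 0, 1 or 2 copies of the alternate allele) and the imputed allele dosages at one variant. The function name and the example values are hypothetical and are not taken from the study.

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between masked true genotypes and imputed dosages.

    Both inputs hold one value per individual for a single variant:
    true genotypes coded 0/1/2, imputed dosages as floats in [0, 2].
    """
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")

    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n

    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)

    # r2 is undefined when either vector has no variation (monomorphic site).
    if var_t == 0 or var_d == 0:
        return float("nan")
    return (cov * cov) / (var_t * var_d)


# Hypothetical example: four individuals at one masked variant.
example_r2 = imputation_r2([0, 1, 2, 0], [0.0, 1.0, 1.0, 1.0])
```

Averaging this quantity over variants within MAF bins would give a curve of the kind plotted in Fig. 3, under the same assumption.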

We compared the utility of existing chip designs (2.5M Illumina) and ultralow-coverage WGS designs (0.5×, 1×, 2× coverage) to determine the optimal design for African GWAS. Sensitivity for common variation was >90% at all sequencing depths (Supplementary Note 12). Examining the effective sample size for a fixed budget⁵⁰, we found that the effective sample size was greater for all ultralow-coverage WGS and chip array designs than for 4× WGS. When computational costs were accounted for (Supplementary Note 12), the HumanOmni2.5M array provided the greatest effective sample size, supporting the development and large-scale use of efficient genotype arrays in Africa, where these have been underutilized.
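One common way to reason about effective sample size under a fixed budget is to scale the number of samples the budget affords by the mean imputation r² of the design. The sketch below assumes that approximation, which is not necessarily the calculation used in Supplementary Note 12. The function name, design labels, per-sample costs and r² values are hypothetical placeholders, not figures from this study.

```python
def effective_sample_size(budget, cost_per_sample, mean_r2):
    """Approximate effective sample size as N * r2 for a fixed budget.

    Assumption: power scales with N * r2, where N is the number of samples
    the budget affords and r2 is the mean genotype accuracy of the design.
    """
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    if not 0.0 <= mean_r2 <= 1.0:
        raise ValueError("mean_r2 must lie in [0, 1]")
    n_samples = int(budget // cost_per_sample)  # whole samples only
    return n_samples * mean_r2


# Hypothetical designs: (cost per sample, mean r2). Illustrative values only.
designs = {
    "design_A": (100.0, 0.90),
    "design_B": (150.0, 0.85),
    "design_C": (400.0, 0.95),
}
budget = 1_000_000.0

# Rank designs from largest to smallest effective sample size.
ranking = sorted(
    ((name, effective_sample_size(budget, cost, r2))
     for name, (cost, r2) in designs.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

Any additional per-sample cost, such as the computational cost of imputation or variant calling, could be folded into `cost_per_sample` before ranking.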

We therefore sought to evaluate a potential chip design to tag common variation across a wider range of African populations (Supplementary Note 13). Importantly, we show that an array with one million genetic variants could capture >80% of common variation (minor allele frequency >5%) across the genome (Extended Data Fig. 10). These analyses suggest that designing a pan-African genotype array to effectively capture common genetic variation across Africa is feasible, and could greatly facilitate large-scale genomic studies in Africa.
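The notion of an array "capturing" common variation can be sketched as the fraction of common variants that are well tagged by at least one array variant. The example below assumes a tagging threshold of r² ≥ 0.8, which is a widely used convention but is not stated in the text above. The function name, record fields and toy data are hypothetical.

```python
def common_variant_coverage(variants, maf_threshold=0.05, r2_threshold=0.8):
    """Fraction of common variants tagged by an array.

    `variants` is a list of dicts with:
      "maf"     - minor allele frequency of the variant
      "best_r2" - highest LD r2 between the variant and any array variant
                  (1.0 if the variant is itself on the array)
    The r2_threshold default is an assumed convention, not the study's value.
    """
    common = [v for v in variants if v["maf"] > maf_threshold]
    if not common:
        return float("nan")  # no common variants to evaluate
    captured = sum(1 for v in common if v["best_r2"] >= r2_threshold)
    return captured / len(common)


# Toy data: five variants, one of them rare (MAF <= 5%) and so excluded.
toy_variants = [
    {"maf": 0.30, "best_r2": 1.00},
    {"maf": 0.12, "best_r2": 0.85},
    {"maf": 0.08, "best_r2": 0.60},
    {"maf": 0.45, "best_r2": 0.92},
    {"maf": 0.01, "best_r2": 0.20},
]
toy_coverage = common_variant_coverage(toy_variants)
```

Computing this fraction separately in each population would show how evenly a candidate array design covers different groups.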