Population Structure

We applied a principal component analysis (PCA) to investigate the population structure of the new populations genotyped in this study from the Sudanese region (Supplementary Fig. S1a). PC1 (3.56% of the variation) follows a North-South cline and separates populations inhabiting the region between the Nile River and the Red Sea (Nubians and Arabs along the Nile, Beja and Ethiopians along the coast) from Darfurians and Nuba of South-West Sudan and Nilotes of South Sudan. Copts form a separate group close to the North-East populations but in a more outlying position: they represent the extreme of the northern genetic component. PC2 (0.7%) separates the nomadic Fulani from the other populations.

Next, we combined our new populations (140 K data set) with previously studied populations of special interest for this analysis: Qatar [12], Egypt [13] and three sub-Saharan populations (Luhya, Yoruba and Maasai) from the 1000 Genomes Project [14], to have external references both north and south of the Sudanese region. This new data set contains 14,343 SNPs (14 K data set). Although the number of SNPs in this second set is small, it is enough to differentiate components in the African genetic landscape [15]. Fig. 2 shows a PCA of this extended data set, in which East African populations are distinct from both sub-Saharan and North African populations. PC1 (6.08%) separates populations from North Africa and the Middle East from those of sub-Saharan Africa (Fig. 2a). Copts are closer to North African and Middle Eastern populations but remain a separate cluster when PC2 is considered. PC2 (1.46%), together with PC1, separates the two homogeneous clusters of North-East and South-West populations: Nubians, Arabs, Beja and Ethiopians on the one hand and Nuba, Darfurians and Nilotes on the other. PC2 also separates all Sudanese and Ethiopian populations from the rest. PC3 (0.56%) differentiates West African populations (Fulani and Yoruba) from sub-Saharan East African populations (Maasai) (Fig. 2b). Both PCAs, although based on data sets with different numbers of SNPs, preserve the topology of the populations. As expected, with a low number of SNPs we observe higher intra-population variation (Supplementary Fig. S1b).
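
As a rough illustration of how a PCA assigns each component a share of the variation (such as the percentages quoted for PC1 to PC3), the sketch below computes the first principal component of a small genotype matrix and the fraction of total variance it explains. It is a minimal pure-Python sketch and not the software used in this study: the function name `first_pc`, the 0/1/2 allele-count coding, the absence of per-SNP scaling and the use of power iteration are all assumptions made for illustration.

```python
def first_pc(genotypes, n_iter=200):
    """Return (scores, explained_fraction) for the first principal component.

    genotypes: list of individuals, each a list of allele counts (0, 1 or 2).
    Assumption: SNPs are only mean-centred, not scaled by allele frequency.
    """
    n, m = len(genotypes), len(genotypes[0])

    # Centre every SNP (column) on its mean across individuals.
    col_means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - col_means[j] for j in range(m)] for row in genotypes]

    # n x n matrix of inner products between centred individuals.
    k = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
         for i in range(n)]
    trace = sum(k[i][i] for i in range(n))  # total variance (up to a constant)

    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        w = [sum(k[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]

    # Rayleigh quotient gives the top eigenvalue for the unit vector v.
    eigval = sum(v[i] * sum(k[i][j] * v[j] for j in range(n)) for i in range(n))
    return v, eigval / trace


# Toy data: two groups of three individuals that differ at four of five SNPs.
toy = [
    [2, 2, 0, 0, 1], [2, 2, 0, 0, 0], [2, 1, 0, 0, 1],
    [0, 0, 2, 2, 1], [0, 0, 2, 2, 0], [0, 0, 2, 1, 1],
]
scores, fraction = first_pc(toy)
```

In this toy example the two groups fall on opposite sides of PC1, which is the same kind of separation the plots in Fig. 2 display for real populations. Dedicated population-genetics tools typically also scale each SNP by its allele frequency, which this sketch omits.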

Figure 2 Principal component analysis of the populations from the Sudanese region in the context of the African continent. Plot shows a) PC1 and PC2 and b) PC2 and PC3 and the variation explained by them. Sudanese populations cluster in four groups according to their geographic location, with PC1 representing a north-east to south-west axis in East Africa. Populations not genotyped in this study are shown with grey filled symbols. MKK = Maasai from Kinyawa, Kenya; LWK = Luhya from Webuye, Kenya; YRI = Yoruba from Ibadan, Nigeria.

To test whether these particular sets of Immunochip SNPs (140 K and 14 K data sets) can recover population structure, we extracted 1000 Genomes data from world-wide populations and observed that the genetic structure between them is maintained across the different SNP sets used (Supplementary Fig. S2). In addition, the effect of ascertainment bias in the Immunochip was assessed using a subset of presumably neutral SNPs (SNPs located in intergenic regions) (Supplementary Fig. S3). No strong effect of ascertainment bias was observed. Thus, our inferences of population structure seem robust to the number of SNPs and to the particularities of the SNP sets used.

The pairwise FST statistic, a measure of global population differentiation, confirmed the PCA clustering (Supplementary Table S2, Supplementary Fig. S5). Geographically close populations had low average FST values, even though population-specific characteristics were emphasized by excluding population outliers (Supplementary Fig. S4). The lowest average FST (0.003) was found both for the pair Arabs and Nubians, located in the Nile River Valley, and for the pair Beja and Ethiopians, located on the coast. Among North-East populations, Nubians had the highest FST values when compared with Beja and Ethiopians (average FST of 0.006 and 0.007, respectively). South-West populations showed higher differentiation among themselves than North-East populations did. All comparisons between North-East and South-West populations had a high FST (between 0.044 and 0.054). Copts, who show strong individual heterogeneity, are more similar to Arabs (FST = 0.019) than to any other East African population. Copts and South-West populations are the most distant from each other (FST > 0.1). Fulani had on average lower FST values when compared with South-West populations (Nuba, Darfurians and Nilotes) than with North-East populations (Nubians, Arabs, Beja and Ethiopians). These values reveal a complex situation beyond the simple main differentiation between North Africa and sub-Saharan Africa.
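
To make the pairwise FST comparison concrete, the sketch below computes FST between two populations from per-SNP allele frequencies. The text does not state which estimator was used, so the choice here (the Hudson estimator, combined across SNPs as a ratio of sums) is an assumption, as are the function name `hudson_fst` and the toy frequencies.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson FST between two populations, as a ratio of sums over SNPs.

    p1, p2: per-SNP allele frequencies in population 1 and population 2.
    n1, n2: number of sampled chromosomes in each population (assumed > 1).
    """
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample sizes.
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles drawn from different populations differ.
        den += a * (1 - b) + b * (1 - a)
    return num / den if den > 0 else float("nan")


# Toy frequencies (hypothetical): a close pair and a distant pair.
close_pair = hudson_fst([0.50, 0.30], [0.52, 0.31], n1=36, n2=36)
distant_pair = hudson_fst([0.10, 0.20], [0.80, 0.90], n1=36, n2=36)
```

With nearly identical frequencies the estimate can be slightly negative, because the sample-size correction is subtracted from a squared difference that is close to zero; such values are usually read as no detectable differentiation.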

To test the hypothesis that geographically close populations are genetically similar, we performed a Mantel test to determine to what extent geographic and genetic distances (as pairwise FST) between populations are correlated. We found a significant positive correlation between genetic and geographic distance (r = 0.5105, p-value < 0.0001).
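
The logic of a Mantel test can be sketched as follows: correlate the off-diagonal entries of the two distance matrices, then permute the population labels of one matrix to build a null distribution for that correlation. This is a minimal sketch under stated assumptions (Pearson correlation, one-sided test for a positive correlation, hypothetical function names and toy distances) and not the implementation used in this study.

```python
import random
from statistics import mean


def _upper(mat, order):
    """Upper-triangle entries of mat after relabelling populations by `order`."""
    n = len(order)
    return [mat[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(geo, gen, n_perm=9999, seed=0):
    """Return (r, p) for a one-sided Mantel test of positive correlation."""
    rng = random.Random(seed)
    identity = list(range(len(geo)))
    x = _upper(geo, identity)
    r_obs = _pearson(x, _upper(gen, identity))
    hits = 0
    for _ in range(n_perm):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of gen together
        if _pearson(x, _upper(gen, order)) >= r_obs:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return r_obs, (hits + 1) / (n_perm + 1)


# Toy example: six hypothetical populations placed along a line, with
# genetic distance exactly proportional to geographic distance.
positions = [0, 1, 3, 7, 12, 20]
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[0.01 * d for d in row] for row in geo]
r, p = mantel(geo, gen, n_perm=999)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent observations; shuffling individual entries would break the matrix structure and give a misleading null distribution.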

Population Admixture

To infer the ancestral populations of the East African individuals, we ran ADMIXTURE from k = 2 to k = 10 in the 14 populations (the analysis for the internal nine populations is presented in Supplementary Fig. S7, S10). We analysed the results from k = 2 to k = 5, as higher numbers of ancestral components do not have a clear origin. A complex pattern of admixture is observed in East African populations (Fig. 3). At k = 2, we already detect different ancestries in the Sudanese populations. Copts show a common ancestry with North African and Middle Eastern populations (dark blue), whereas the South-West cluster (Darfurians, Nuba and Nilotes) shares an ancestry component (light blue) with sub-Saharan samples. The North-East cluster (Beja, Ethiopians, Arabs and Nubians) shows both components, although the main component (~70%) is the one detected in North Africa and the Middle East (Fig. 3).

Figure 3 ADMIXTURE results for the 14 populations. A random subset of 18 individuals from each population was selected to avoid sample size bias. Columns represent individuals, where the size of each colour segment represents the proportion of ancestry from each cluster. Although k = 3 is the statistically supported model, here we show the results from k = 2 through k = 5 as they explain several ancestral components: North African/Middle Eastern (dark blue), Sub-Saharan (light blue), Coptic (dark green), Nilo-Saharan (light green) and Fulani (pink). MKK = Maasai from Kinyawa, Kenya; LWK = Luhya from Webuye, Kenya; YRI = Yoruba from Ibadan, Nigeria.

At k = 3 (the best statistically supported model, see Supplementary Fig. S8b), a new component (light green) appears, well differentiated from other sub-Saharan or North African and Middle Eastern populations. This component defines South-West Sudanese populations (Nuba and Darfurians) and Nilotes of South Sudan and is different from the main sub-Saharan component seen in Yoruba and Luhya. This Nilo-Saharan component, which is also found at a lower percentage in the North-East cluster and in Maasai, is considered further in the Discussion.

Copts share the same main ancestral component as North African and Middle Eastern populations (dark blue), supporting a common origin with Egypt (or other North African/Middle Eastern populations). They are known to be the most ancient population of Egypt, and at k = 4 (Fig. 3) they show their own component (dark green), different from that of the current Egyptian population, which is closer to the Arabic population of Qatar.

The case of the Fulani is noteworthy: they feature more Sudanese ancestry (>45%) than North African (<40%) or sub-Saharan (<15%) ancestry, and at k = 5 they show their own component (Fig. 3). They have a high variance in individual ancestry components, suggesting a recent admixture event in this population.

To formally test the results of the admixture analysis, we applied the three-population test (f3 statistics) [16]. We used all possible pairs of populations as surrogates of the ancestral populations of each ethno-linguistic group. All populations that have a complex pattern of admixture (Fig. 3) showed statistically significant results (Z-score < −4, p-value < 3.2 × 10⁻⁵): those of the North-East cluster (Beja, Ethiopians, Arabs and Nubians) and Fulani. The four North-East populations (Table 2) may be explained as admixture products of an ancestral North African population (similar to Copts) and an ancestral South-West population (Nuba, although in one case Darfurians give a better fit). These four populations had an intermediate position between Copts and South-West Sudanese populations in both the PC and admixture analyses.

Table 2 Three-population test. Here we show the combinations of source populations that give the most negative f3 statistic (Z-score < −4, p-value < 3.2 × 10⁻⁵) for each target population (α_L is the lower bound and α_U is the upper bound of α, where α is the admixture proportion by which the target population was formed from the ancestral population of source population 1).

Fulani, who are known to have West African ancestry, have a negative f3 with Copts and Yoruba as source populations (Table 2). As they have a complex history, high levels of admixture with different populations and high individual variance, this three-population phylogeny seems too simple to explain their population history. None of the South-West populations (Darfurians, Nuba and Nilotes) appear as admixed in the three-population test. This result fits the ADMIXTURE analysis (Fig. 3 and Supplementary Fig. S10) and confirms a specific ancestral component for these populations.
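
The three-population test used above can be sketched as follows. For a target population C and two source populations A and B, f3(C; A, B) averages (c − a)(c − b) over SNPs, where a, b and c are allele frequencies; a significantly negative value indicates that C is admixed between populations related to A and B. The sketch below is an illustration under stated assumptions and not the software used in this study: the function names, the finite-sample correction for the target, the contiguous block jackknife and the toy frequencies are all hypothetical choices.

```python
import math
import random
from statistics import mean


def f3_test(target, src1, src2, n_target, n_blocks=10):
    """Return (f3, z) for target population C with sources A and B.

    target, src1, src2: per-SNP allele frequencies, in genomic order.
    n_target: number of sampled chromosomes in the target (assumed > 1).
    """
    # Per-SNP statistic with a correction for sampling noise in the target.
    vals = [(c - a) * (c - b) - c * (1 - c) / (n_target - 1)
            for c, a, b in zip(target, src1, src2)]
    f3 = mean(vals)

    # Delete-one block jackknife over contiguous, roughly equal blocks.
    m = len(vals)
    bounds = [k * m // n_blocks for k in range(n_blocks + 1)]
    total = sum(vals)
    leave_one_out = []
    for k in range(n_blocks):
        block = vals[bounds[k]:bounds[k + 1]]
        leave_one_out.append((total - sum(block)) / (m - len(block)))
    centre = mean(leave_one_out)
    var = (n_blocks - 1) / n_blocks * sum((t - centre) ** 2 for t in leave_one_out)
    se = var ** 0.5
    return f3, (f3 / se if se > 0 else float("nan"))


def one_sided_p(z):
    """Lower-tail p-value of a standard normal Z-score."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


# Toy data: the target is an exact 50/50 mixture of two divergent sources.
rng = random.Random(1)
src_a = [rng.uniform(0.05, 0.30) for _ in range(200)]
src_b = [rng.uniform(0.70, 0.95) for _ in range(200)]
mixed = [(a + b) / 2 for a, b in zip(src_a, src_b)]
f3_value, z_score = f3_test(mixed, src_a, src_b, n_target=36)
```

The helper `one_sided_p` shows why the two thresholds in the text go together: a Z-score of −4 corresponds to a one-sided normal p-value of about 3.2 × 10⁻⁵.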

Low genetic distance between populations for genes involved in infectious diseases

We studied the effects of infectious pressures on the genetic make-up of populations in East Africa by calculating genetic distances (as FST) between populations using the genetic variation in genes involved in defence against different agents. From the genes genotyped in the Immunochip, we selected those associated with resistance/susceptibility to malaria [17] (Supplementary Table S5), those related to host defence against bacteria [18] (Supplementary Table S6) and those related to host defence against fungi (Supplementary Table S7). For every pair of populations, the mean FST of those genes was compared with the mean FST of a set of randomly selected SNPs from genic regions with the same sample size and similar MAF, using a permutation test (10,000 permutations). All pairwise comparisons showed that the mean FST of malaria-related genes was significantly lower than the mean of the sampling distribution (Fig. 4). This suggests that all these populations have experienced strong selective pressure in the same direction in genes related to malaria resistance. In the case of antibacterial host defence genes, all comparisons except that between Copts and the North-East populations had a mean FST significantly lower than the sampling distribution mean (Fig. 5). For the genes encoding proteins important for antifungal defence, only three comparisons showed a mean FST lower than the sampling distribution: Copts compared with South-West populations, Copts compared with Fulani, and North-East populations compared with South-West populations (Fig. 6).
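
The comparison described above can be sketched as a resampling test: draw many random sets of genic SNPs that match the candidate SNPs in number and in MAF, compute the mean FST of each set, and ask how often a random set has a mean as low as the candidate genes. The sketch below is a minimal illustration, not the procedure used in this study: matching by MAF bin, sampling with replacement, the bin count, the function names and the toy data are all assumptions.

```python
import random
from collections import defaultdict
from statistics import mean


def maf_bin(maf, n_bins=10):
    """Index of the MAF bin (MAF ranges from 0 to 0.5)."""
    return min(int(maf * 2 * n_bins), n_bins - 1)


def lower_tail_test(candidate, background, n_resamples=10_000, seed=0):
    """Return (observed_mean_fst, p) for a one-sided lower-tail test.

    candidate, background: lists of (maf, fst) tuples, one per SNP.
    """
    rng = random.Random(seed)

    # Group background FST values by MAF bin.
    pools = defaultdict(list)
    for maf, fst in background:
        pools[maf_bin(maf)].append(fst)

    bins = [maf_bin(maf) for maf, _ in candidate]
    missing = {b for b in bins if b not in pools}
    if missing:
        raise ValueError(f"no background SNPs in MAF bins {sorted(missing)}")

    observed = mean(fst for _, fst in candidate)
    hits = 0
    for _ in range(n_resamples):
        # One MAF-matched background SNP per candidate SNP.
        resampled = mean(rng.choice(pools[b]) for b in bins)
        if resampled <= observed:
            hits += 1
    return observed, (hits + 1) / (n_resamples + 1)


# Toy data: background FST spread over 0 to 0.1, candidate SNPs all near 0.005.
rng = random.Random(42)
background = [(rng.uniform(0.01, 0.5), rng.uniform(0.0, 0.1)) for _ in range(500)]
candidate = [(0.05 + 0.02 * i, 0.005) for i in range(20)]
observed, p_value = lower_tail_test(candidate, background, n_resamples=2000)
```

Matching on MAF matters because the attainable range of FST depends on allele frequency, so comparing candidate genes with an unmatched background could produce a difference that reflects frequency alone.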

Figure 4 Genes associated with resistance/susceptibility to malaria. Sampling distribution of the sample mean pairwise FST between populations. Average FST value of genes associated with resistance/susceptibility to malaria (♦) is significantly lower than the mean FST score of the sampling distribution in all pairwise comparisons. COP = Copts; NOR = Beja, Ethiopians, Arabs and Nubians; SOU = Darfurians, Nuba and Nilotes; FUL = Fulani. The sampling distribution is drawn from the mean FST value of subsets of randomly selected genic SNPs with a sample size equal to the number of common SNPs between populations in the selected genes (n) and with similar MAF (10,000 permutations).

Figure 5 Anti-bacterial host defence related genes. Sampling distribution of the sample mean pairwise FST between populations. All pairwise comparisons, except COP vs. NOR, have an average FST value of anti-bacterial host defence related genes (♦) that is significantly lower than the mean FST score of the sampling distribution. COP = Copts; NOR = Beja, Ethiopians, Arabs and Nubians; SOU = Darfurians, Nuba and Nilotes; FUL = Fulani. The sampling distribution is drawn from the mean FST value of subsets of randomly selected genic SNPs with a sample size equal to the number of common SNPs between populations in the selected genes (n) and with similar MAF (10,000 permutations).

Figure 6 Anti-fungal host defence related genes. Sampling distribution of the sample mean pairwise FST between populations. Only three pairwise comparisons have an average FST value of anti-fungal host defence related genes (♦) that is significantly lower than the mean FST score of the sampling distribution: COP vs. SOU, COP vs. FUL and NOR vs. SOU. COP = Copts; NOR = Beja, Ethiopians, Arabs and Nubians; SOU = Darfurians, Nuba and Nilotes; FUL = Fulani. The sampling distribution is drawn from the mean FST value of subsets of randomly selected genic SNPs with a sample size equal to the number of common SNPs between populations in the selected genes (n) and with similar MAF (10,000 permutations).

We tested whether the specific SNPs present in the Immunochip for genes related to infectious diseases are a representative sample of all the SNPs of those genes using 1000 Genomes data of African populations (Supplementary Table S8). Results show that the SNPs present in the Immunochip for the genes of interest can be considered as a representative sample of all the SNPs in those genes.