Genome-wide scans for recent positive selection in humans have yielded insight into the mechanisms underlying the extensive phenotypic diversity in our species, but have focused on a limited number of populations. Here, we present an analysis of recent selection in a global sample of 53 populations, using genotype data from the Human Genome Diversity-CEPH Panel. We refine the geographic distributions of known selective sweeps, and find extensive overlap between these distributions for populations in the same continental region but limited overlap between populations outside these groupings. We present several examples of previously unrecognized candidate targets of selection, including signals at a number of genes in the NRG–ERBB4 developmental pathway in non-African populations. Analysis of recently identified genes involved in complex diseases suggests that there has been selection on loci involved in susceptibility to type II diabetes. Finally, we search for local adaptation between geographically close populations, and highlight several examples.

The ability to identify the molecular signature of natural selection provides a powerful tool for identifying loci that have contributed to adaptation. Recently, a number of analytical techniques have been developed to identify signals of recent positive selection on a genome-wide scale and applied to polymorphism data from several human populations (Kelley et al. 2006; Voight et al. 2006; Wang et al. 2006; Sabeti et al. 2007; Williamson et al. 2007; Barreiro et al. 2008). Several loci important for human adaptation have been identified or confirmed in these scans: notable examples include a number of genes involved in skin pigmentation (Voight et al. 2006; Sabeti et al. 2007; Williamson et al. 2007); EDAR, involved in hair morphology (Kelley et al. 2006; Fujimoto et al. 2008; Mou et al. 2008); and LCT, at which variants under selection contribute to lactase persistence (Bersaglieri et al. 2004).

The populations studied to date, however, represent a limited sample of human diversity. Most of these studies have relied on either the HapMap (Frazer et al. 2007) or Perlegen (Hinds et al. 2005) datasets, each of which includes samples from only a few populations: one European, one African (or African-American), and one or two East Asian populations. Since selective pressures such as diet, climate, and pathogen load vary greatly across the globe, even on relatively small scales, understanding the genetic response to this environmental variation requires higher geographic resolution in the sampling of human diversity (Prugnolle et al. 2005; Perry et al. 2007; Hancock et al. 2008). In this paper, we present results from a series of genome-wide scans for natural selection using single nucleotide polymorphism (SNP) genotype data from the Human Genome Diversity-CEPH Panel (HGDP), a data set containing 938 individuals from 53 populations typed on the Illumina 650Y platform (Li et al. 2008).

Our goals here were twofold. First, we sought to employ data from the 53 populations of the HGDP to better understand the geographic patterns of selected haplotypes. We find extensive sharing of putative selection signals between genetically similar populations, and limited sharing between genetically distant ones. In particular, Europe, the Middle East, and Central Asia show strikingly similar patterns of putative selection signals.

Second, we sought to identify novel candidate loci that have experienced recent positive selection and relate these signals to phenotypic variation. We identify several novel strong candidates for selection, including C21orf34, a gene of unknown function, and several genes in the NRG–ERBB4 developmental pathway. Interpretation of previous scans for selection has been limited by the relative paucity of information about the genetics of natural variation in humans. Recent genome-wide association studies, however, are beginning to fill this void, and many loci have been identified at which variation influences phenotypes (McCarthy et al. 2008). We have used this information as a guide in the interpretation of our scans for selection. In general, we find limited overlap between the results of genome-wide association studies and our scan for selection, with some notable exceptions, particularly in pigmentation and type II diabetes.

Results

After quality control and removal of related individuals, the HGDP data consist of 657,143 SNPs typed on 938 individuals in 53 populations. For some analyses, each population was treated individually, but for others we found it more powerful to group populations together and increase sample sizes. For these latter analyses, we divided the individuals into eight groups, most of which represent broad geographic regions: Bantu-speaking populations, Biaka Pygmies, Europeans, Middle Easterners, South Asians, East Asians, Oceanians, and Native Americans. These groups were chosen to provide reasonably homogeneous sets of populations for analysis, as judged by clustering at randomly chosen loci (Rosenberg et al. 2002; Li et al. 2008). The Mbuti Pygmies and San were dropped from these groups because their large divergence from other African populations means that we might lose power by grouping them with the other Africans, and their small sample sizes mean that we would have low power in treating them on their own.

Our analyses focus primarily on two haplotype-based tests: iHS (Voight et al. 2006) and XP-EHH (Sabeti et al. 2007). These tests were chosen because previous power analyses suggest they are largely complementary—iHS has good power to detect selective sweeps at moderate frequency (~50%–80%), but low power to detect sweeps that have reached high frequency (>80%) or fixation; in contrast, XP-EHH is most powerful for selective sweeps above 80% frequency (Voight et al. 2006; Sabeti et al. 2007). Some analyses presented here also use F ST , a measure of population differentiation which has power to detect selection on standing variation as well as on new selected sites (Innan and Kim 2008), or the CLR test of the allele frequency spectrum (Nielsen et al. 2005; Williamson et al. 2007), an alternative to XP-EHH for detecting high-frequency selective sweeps. Throughout this paper, the “P-values” presented will be empirical P-values; that is, a low P-value indicates that a locus is an outlier with respect to the rest of the genome (Teshima et al. 2006). We find this approach useful because P-values based on an explicit demographic model are unreliable when there is uncertainty in the demographic parameters (as is the case for humans). However, we note that loci detected as being under selection using this approach may be an unrepresentative sample of all truly selected loci; in particular, selection on standing variation and recessive loci are likely to be underrepresented (Teshima et al. 2006).
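
The empirical P-value idea described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used for the analyses: the function name and the example scores are hypothetical, and it assumes that larger scores are more extreme.

```python
from bisect import bisect_left


def empirical_p_values(scores):
    """Return, for each score, the fraction of all scores that are >= it.

    A low value means the locus is an outlier relative to the rest of the
    genome; no demographic model is involved.
    """
    ordered = sorted(scores)
    n = len(ordered)
    # bisect_left counts the scores strictly below s, so n minus that
    # count is the number of scores >= s (ties included).
    return [(n - bisect_left(ordered, s)) / n for s in scores]


# Hypothetical test statistics for five loci (not real data).
example_scores = [0.5, 3.1, 1.2, 0.9, 2.0]
example_p = empirical_p_values(example_scores)
```

With only five loci the smallest attainable value is 0.2; on a genome-wide scan the same ranking yields the small empirical P-values quoted in the text.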

Assessment of power

The HGDP data present a number of challenges for the detection of selection. First, the data consist largely of tag SNPs selected to maximize coverage of the HapMap populations (Eberle et al. 2007). The allele frequencies and linkage disequilibrium patterns at these SNPs differ from the genome as a whole. Second, the populations of the HGDP have different demographic histories and sample sizes, which may affect power to detect selection.

The selection of tag SNPs from the HapMap is expected to reduce coverage in regions of the genome that show strong evidence of selective sweeps (and thus contain extensive LD) in the HapMap populations. We find this is indeed the case: of the genomic regions with the strongest iHS signals in the HapMap data, about 25% of 200-kb regions contain <20 SNPs on the Illumina chip. This is significantly less than the genome-wide average of about 40 SNPs per 200 kb overall on the Illumina chip (P = 8 × 10−4, P = 9 × 10−3, and P = 2 × 10−6 for regions identified as under selection in the HapMap European, Asian, and Bantu samples, respectively; one-sided t-test) and far fewer than the average 180 SNPs/200 kb in the HapMap. This indicates that power may be reduced in the HGDP for confirming selective sweeps already identified in the HapMap, although it should not affect power to detect novel selection signals in other populations.

To further explore the power to detect selection in this panel, we performed simulations under a simple, three-population model of human demography based on the HapMap (Schaffner et al. 2005) and we approximated the Illumina SNP ascertainment scheme (see Methods; Supplemental Fig. 1). These simulations were designed to guide intuition about the impact of a few chosen parameters on power, rather than to represent a formal null model. One important feature of the demographic model used here is the presence of two population bottlenecks in the non-African populations, with the second bottleneck being stronger in the East Asian population. This demographic model provides a good fit to several aspects of the data for the HapMap populations (data not shown). We use this model here because it is likely to be a good approximation to the demographies of many of the HGDP populations, and because fitting a demographic model to the 53 populations of the HGDP presents significant challenges and no such model is currently available.

As previously reported (Voight et al. 2006; Sabeti et al. 2007), we find that the fraction of extreme iHS scores in a genomic region is a more powerful statistic than the maximum score, while the reverse is true of XP-EHH (data not shown). As noted above, iHS has moderate power to detect a selective sweep that has reached intermediate frequency and little power to detect a sweep near fixation, while XP-EHH is more powerful to detect selective sweeps at or near fixation. Neither test has appreciable power to detect a selective sweep that has not yet reached a frequency >30%. We saw an important effect of demography in these simulations. The power to detect selection is highest in the “African” demography, intermediate in the “European” demography, and lowest in the “East Asian” demography (Supplemental Fig. 2). Although Oceania and America were not explicitly included in the simulations, this pattern suggests that power is low for both tests in these regions, which have experienced more recent and severe bottlenecks (Conrad et al. 2006). This is consistent with the observation that nonequilibrium demographies can inflate haplotype-based test statistics (Macpherson et al. 2008).

We also investigated the impact of sample size on power. For iHS, the loss of power incurred by decreasing sample size is modest until a threshold of ~40 chromosomes, while XP-EHH maintains power with as few as 20 chromosomes, as long as the reference population is of a fixed sample size (Supplemental Fig. 3). Since many HGDP populations contain around 10 individuals, power may be gained for iHS by grouping together genetically similar populations.

Overview of genomic regions with selection signals

To identify genomic regions that may have been targets of recent selection, we calculated XP-EHH and iHS on each broad population grouping mentioned above and on each individual population. To facilitate comparisons of genomic regions across populations, we then split the genome into nonoverlapping segments of 200 kb and computed, in each segment, the maximum XP-EHH score and the fraction of extreme (|iHS| > 2) iHS scores. The choice of 200 kb as a window size was motivated by the desire to have a sufficient number of SNPs in a window while maintaining a size on the scale of the signal generated by selective sweeps (~0.3–0.5 cM) (Voight et al. 2006). Other window sizes, and the use of a sliding window, gave qualitatively similar results (Supplemental Figs. 6, 7). In each window, we converted the test statistic to an empirical P-value, taking into account the number of SNPs in the window (see Methods). In Figure 1, we show the 10 most extreme windows of the genome from each geographic region for these statistics. The complete lists of regions with empirical P < 0.01 for each geographic region are in Supplemental Figures 13–28.
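
The windowing procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analyses: the function names and example SNPs are hypothetical, a single chromosome is assumed, and the bin width of 20 SNPs used to account for SNP density is an assumption (the exact binning is described in the Methods).

```python
from collections import defaultdict

WINDOW_BP = 200_000  # nonoverlapping window size used in the text


def window_summaries(snps):
    """snps: iterable of (position_bp, ihs, xpehh) on one chromosome.

    Returns {window_index: (n_snps, frac_extreme_ihs, max_xpehh)}.
    """
    by_window = defaultdict(list)
    for pos, ihs, xpehh in snps:
        by_window[pos // WINDOW_BP].append((ihs, xpehh))
    result = {}
    for w, vals in by_window.items():
        n = len(vals)
        # Fraction of SNPs in the window with |iHS| > 2.
        frac_extreme = sum(1 for ihs, _ in vals if abs(ihs) > 2) / n
        result[w] = (n, frac_extreme, max(x for _, x in vals))
    return result


def empirical_p_by_snp_count(stats, bin_width=20):
    """stats: {window: (n_snps, statistic)}.

    Each window is compared only with windows of similar SNP count
    (bin_width is an assumed value). The P-value is the fraction of
    windows in the same bin whose statistic is >= this window's.
    """
    bins = defaultdict(list)
    for n, value in stats.values():
        bins[n // bin_width].append(value)
    pvals = {}
    for w, (n, value) in stats.items():
        peers = bins[n // bin_width]
        pvals[w] = sum(1 for v in peers if v >= value) / len(peers)
    return pvals


# Hypothetical SNPs: (position, iHS, XP-EHH). Not real data.
example_snps = [
    (10_000, 2.5, 1.0),
    (150_000, -0.3, 3.2),
    (199_999, -2.1, 0.4),
    (200_000, 0.1, 0.2),
    (390_000, 1.9, -0.5),
]
summaries = window_summaries(example_snps)
xpehh_p = empirical_p_by_snp_count(
    {w: (n, mx) for w, (n, _, mx) in summaries.items()}
)
```

The same ranking function can be applied to the fraction of extreme iHS scores by passing that column instead of the maximum XP-EHH.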

Figure 1. Top 10 iHS (A) and XP-EHH (B) signals by population cluster. Each row is a 200-kb genomic window, each column is a geographic region, and each cell is colored according to the position of the window in the empirical distribution of scores for that region. Plotted are the most extreme 10 windows for each geographic region. Gray cells in A are windows that have fewer than 20 SNPs for which iHS was calculated (see Methods). To the right of each row is a list of genes that fall in the window. Windows where the genes are in red are discussed in the text. Note that interpretation of the overlap in XP-EHH signals is complicated by the need for a reference population; see the main text.

A number of interesting patterns emerge from Figure 1. First, there is extensive sharing of extreme iHS and XP-EHH signals between Europe, the Middle East, and Central Asia, while overlap between other regions is much more limited. In fact, 44% of the genomic segments in the 1% tail of iHS in Europe fall in the 5% tail for both the Middle East and Central Asia (89% are shared between Europe and at least one of these two), while only 12% of European signals are present in East Asia by the same criterion. Second, XP-EHH signals seem to be shared on a larger geographic scale than iHS signals. However, the fact that XP-EHH needs a reference population makes overlap hard to interpret; in particular, the lack of overlap between African and non-African groups could be a consequence of use of a reference. To address this, we compared the overlaps in selection signals as judged by the CLR test, which, like XP-EHH, has power to detect high frequency and fixed selective sweeps, but does not rely on a reference population (Williamson et al. 2007). XP-EHH and the CLR test tend to identify the same regions as putative targets of selection and, as with iHS and XP-EHH, signals from the CLR test tend to be shared between Europe, the Middle East, and Central Asia, while sharing between African and non-African populations is very limited (Supplemental Fig. 8). Another concern is that these patterns of overlap may be influenced by the way populations were grouped together for analysis. However, this is not the case; the patterns of overlap hold as well when analysis is performed in each population individually, and both iHS and XP-EHH signals are generally shared between populations in a geographic region (Supplemental Figs. 4, 5).
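
The sharing criterion used above (windows in the 1% tail of one region that also fall in the 5% tail of another) can be expressed compactly. The sketch below is illustrative only; the function names, window labels, and P-values are hypothetical.

```python
def tail_windows(pvals, cutoff):
    """Return the set of windows whose empirical P-value is <= cutoff."""
    return {w for w, p in pvals.items() if p <= cutoff}


def sharing_fraction(focal, other, focal_cutoff=0.01, other_cutoff=0.05):
    """Fraction of windows in the focal region's tail that also fall in
    the (more lenient) tail of the other region."""
    focal_tail = tail_windows(focal, focal_cutoff)
    if not focal_tail:
        return float("nan")
    shared = focal_tail & tail_windows(other, other_cutoff)
    return len(shared) / len(focal_tail)


# Hypothetical per-window empirical P-values for two regions.
europe = {"w1": 0.002, "w2": 0.008, "w3": 0.3, "w4": 0.009, "w5": 0.01}
middle_east = {"w1": 0.04, "w2": 0.2, "w3": 0.01, "w4": 0.001, "w5": 0.5}
shared_fraction = sharing_fraction(europe, middle_east)
```

Using a more lenient cutoff in the second region reduces the chance that a shared signal is missed simply because it falls just outside a strict threshold.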

We also asked how our scans in these pooled populations related to those from the HapMap. We calculated both iHS and XP-EHH on the Phase II HapMap samples and performed the same procedure as above. We found considerable overlap: For iHS, 51% of the windows in the 1% tail in Europe fall in the 5% tail in HapMap Europeans, 63% of signals in East Asia overlap those from the HapMap Asian population by the same criteria, and 38% of signals in the Bantu overlap those identified in the HapMap Yoruba. For XP-EHH, the corresponding figures are 69%, 89%, and 41%. While this is extensive overlap, it is far from complete. One important reason for the incomplete overlap is the vastly different SNP density between the HapMap (which contains 3.1 million SNPs) and the HGDP, especially in regions with strong selection signals in the HapMap, as noted above. Other reasons for incomplete overlap include the presence of some population-specific partial sweeps and incomplete power to detect selection.

Genes and phenotypes under selection

The data in the HGDP permit the use of multiple types of population genetic evidence in evaluating the case for selection on a particular locus. These types of evidence include haplotype structure (Hudson et al. 1994; Sabeti et al. 2002), linkage disequilibrium (Kim and Nielsen 2004; Jensen et al. 2007), the allele frequency spectrum (Smith and Haigh 1974; Tajima 1989), and population differentiation (Lewontin and Krakauer 1973). As an example of how these data allow for the identification of novel selection candidates, we present in Figure 2 the case for selection at C21orf34, a locus of unknown function on chromosome 21.

Figure 2. Evidence for selection in a region containing part of the gene C21orf34. (A) Haplotype plots in a 500-kb region on chromosome 21 surrounding the locus. Each row represents a haplotype, and each column a SNP. Rows are colored the same if and only if the underlying sequence is identical (some low-frequency SNPs are excluded). For full details on the generation of these plots, see Conrad et al. (2006). (B) Heterozygosity in the same region. Lines show heterozygosity calculated in a sliding window of three SNPs across the region in different populations. Black arrows at the top of the plot represent the positions of SNPs with F ST > 0.6 (i.e., in the 0.01% tail of worldwide F ST ). (C) A pie chart of the worldwide distribution of a SNP that tags the red haplotype in A (rs2823850). (Red) The derived allele frequency; (blue) the ancestral allele frequency.

This locus contains some of the most extreme XP-EHH scores in the genome in non-African populations, including the most extreme score in Europe (Fig. 1B). Visualization of the haplotypes in the region (Fig. 2A) revealed a striking lack of diversity in an ~200-kb region in non-African populations; this was confirmed by using a sliding window of heterozygosity (Fig. 2B). If this reduction in diversity and strong haplotype structure had been driven by positive selection in non-African populations, the region should contain SNPs with large population differentiation between African and non-African populations (Slatkin and Wiehe 1998). This is indeed the case: There are a number of SNPs in the region with extreme F ST , including the SNP with the highest F ST in the HGDP data set, which almost perfectly differentiates African from non-African individuals (Fig. 2C). In sum, these observations suggest that a haplotype in this region swept to near fixation at some point since the out-of-Africa migration. This region of reduced diversity in non-Africans includes the terminal three exons of C21orf34, a gene that is expressed in many tissues (Gardiner et al. 2002), as well as three microRNA genes (mir-99a, let-7c, and mir-125-b2). Currently, there are no known SNPs in any of these potential functional targets.
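
A sliding window of heterozygosity of the kind used to confirm the reduction in diversity can be sketched as follows. This is an illustration under assumptions: per-SNP heterozygosity is taken to be the expected value 2p(1 − p) without a sample-size correction, the window is three consecutive SNPs, and the function names and allele frequencies are hypothetical.

```python
def expected_heterozygosity(freq):
    """Expected heterozygosity at a biallelic SNP with allele frequency freq."""
    return 2.0 * freq * (1.0 - freq)


def sliding_heterozygosity(freqs, window=3):
    """Mean expected heterozygosity in each run of `window` consecutive SNPs."""
    het = [expected_heterozygosity(p) for p in freqs]
    return [
        sum(het[i:i + window]) / window
        for i in range(len(het) - window + 1)
    ]


# Hypothetical allele frequencies along a region; the middle SNPs are
# fixed (frequency 0 or 1), mimicking a local loss of diversity.
example_freqs = [0.5, 0.5, 0.0, 1.0, 0.0, 0.5]
het_profile = sliding_heterozygosity(example_freqs)
```

A run of windows with heterozygosity at or near zero, flanked by higher values, is the pattern expected around a haplotype that has swept to near fixation.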

We next turned our attention from the top signals to specific candidate genes. We took two approaches to identifying selection signals of interest. First, we assembled lists of polymorphisms associated with potentially evolutionarily-relevant traits, and tested whether regions surrounding these polymorphisms show more population differentiation than random regions of the genome (see below, Methods). Second, we considered all genomic windows in the 1% tail of iHS or XP-EHH to be candidates for containing a selective sweep, and all genes within 50 kb of a window as candidates for the target of selection. We then manually examined these regions for genes of interest. In the following sections, we present the results from this analysis.

Pigmentation

Several genes involved in pigmentation have been targets of recent positive selection in non-African populations, including SLC24A5 (Lamason et al. 2005), KITLG (Miller et al. 2007), and SLC45A2 (Norton et al. 2007). Indeed, two of these (KITLG and SLC24A5) appear in the list of the most extreme haplotype patterns in the genome in Figure 1 and Supplemental Figure 8: SLC24A5 has one of the most extreme iHS and XP-EHH signals in Europe, the Middle East and Central Asia (referred to hereafter as West Eurasia) and KITLG has one of the most extreme XP-EHH scores in all non-African populations.

To assess more comprehensively the extent of natural selection in pigmentation genes, we compiled a list of genes currently known to contribute to natural variation in humans from recent GWA studies (Stokowski et al. 2007; Han et al. 2008; Sulem et al. 2007, 2008). Around each pigmentation-associated SNP, we defined a window of 100 kb, and took the maximum F ST across SNPs for pairwise comparisons of all continental regions. We used this approach rather than taking F ST directly at the association signals because the Illumina chip does not contain most of the SNPs with association signals and, even where it does, the associated SNP may simply be tagging the true causal variant. For comparison, we randomly sampled 10,000 SNPs from the Illumina chip and performed the same procedure. The results for six pairwise comparisons are presented in Figure 3.
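
The window-based F ST statistic and its null distribution can be sketched as below. This is a simplified illustration, not the analysis code: it assumes a single chromosome with sorted SNP positions, takes the 100-kb window to be centered on the SNP (50 kb on each side), and uses hypothetical function names and values.

```python
import bisect
import random


def max_fst_near(position, positions, fsts, half_window=50_000):
    """Maximum F_ST among SNPs within half_window bp of `position`.

    positions must be sorted; fsts is the parallel list of per-SNP F_ST.
    Returns None if no typed SNP falls in the window.
    """
    lo = bisect.bisect_left(positions, position - half_window)
    hi = bisect.bisect_right(positions, position + half_window)
    return max(fsts[lo:hi]) if hi > lo else None


def null_max_fst(positions, fsts, n_draws=10_000, seed=0):
    """Null distribution: the same statistic around randomly drawn typed SNPs."""
    rng = random.Random(seed)
    return [
        max_fst_near(rng.choice(positions), positions, fsts)
        for _ in range(n_draws)
    ]


# Hypothetical SNP positions (bp) and pairwise F_ST values.
example_positions = [10_000, 40_000, 90_000, 200_000, 260_000]
example_fsts = [0.1, 0.6, 0.2, 0.05, 0.3]

# An associated SNP that is not itself on the chip can still be scored.
near_associated = max_fst_near(60_000, example_positions, example_fsts)
null_values = null_max_fst(example_positions, example_fsts, n_draws=1_000)
```

Because the statistic is a maximum over a window, it can pick up differentiation at a tag SNP even when the associated or causal variant is not typed.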

Figure 3. F ST around loci involved in natural variation in pigmentation. For each SNP found to be associated with pigmentation in a genome-wide scan, we plot the maximum pairwise F ST between geographic regions in a 100-kb window surrounding the SNP in the HGDP data, as well as a histogram of the null distribution calculated by finding the maximum F ST in 100-kb windows surrounding each of 10,000 random SNPs. The dotted lines show the position beyond which 5% of the random SNPs fall, and the solid lines the position beyond which 1% of the random SNPs fall. Gene names that are starred fall in the 5% tail of at least one comparison, and those with two stars fall in the 1% tail of at least one comparison. Letters are positioned along the y-axis to improve readability. The key in the bottom right panel applies to all panels.

Overall, regions of the genome associated with pigmentation tend to have higher F ST between Africa and Europe, and between Europe and East Asia, than random regions of the genome (Africa-Europe P = 3 × 10−4, Europe-East Asia P = 1 × 10−4; one-sided Mann-Whitney test). These regions do not show unusually high differentiation between East Asia and Africa (P = 0.51) despite the fact that East Asians have evolved lighter skin since the out-of-Africa expansion. This reflects the fact that most pigmentation genes have been discovered in European or African-American samples and that the evolution of light pigmentation in East Asia seems to have occurred largely via separate genes (Norton et al. 2007). Turning to individual genes, all three of the previously well-supported targets of selection fall well into the 1% tail of at least one comparison. With equally extreme differentiation are OCA2 and TYRP1, previously reported to be candidates for selection based on haplotype structure alone (Voight et al. 2006). These two loci strongly differentiate Europe from Central Asia, in contrast to the overall trend of West Eurasia showing similar patterns of selection (Fig. 1).
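
The one-sided Mann-Whitney comparison of trait-associated windows against random windows can be sketched in plain Python. This is an illustrative implementation using the normal approximation with a tie correction and no continuity correction, so its P-values may differ slightly from those of standard statistical packages; the function name and the example values are hypothetical.

```python
import math
from itertools import groupby


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney test that values in x tend to exceed those in y.

    Returns (U, p), where U counts the (x, y) pairs with x > y (ties count
    one half) and p comes from the normal approximation.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    seen = 0
    for _, group in groupby(pooled, key=lambda item: item[0]):
        members = list(group)
        t = len(members)
        # Tied values share the average of ranks seen+1 .. seen+t.
        midrank = seen + (t + 1) / 2.0
        rank_sum_x += midrank * sum(1 for _, label in members if label == 0)
        tie_term += t ** 3 - t
        seen += t
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0
    z = (u - mean) / math.sqrt(var)
    # Upper-tail probability of a standard normal deviate.
    return u, 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical maximum-F_ST values: trait-associated vs. random windows.
associated = [0.6, 0.7, 0.8]
random_windows = [0.1, 0.2, 0.3, 0.4]
u_stat, p_value = mann_whitney_greater(associated, random_windows)
```

A rank-based test is a natural choice here because the distribution of window maxima is skewed and the comparison of interest is only whether associated windows tend to rank higher than random ones.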

We also noticed that other candidate pigmentation loci, although not identified to date by genome-wide scans for pigmentation variation, fall among the top haplotype-based signals of selection. In particular, MLPH shows a strong XP-EHH signal in non-African populations, and RGS19 shows a strong iHS and XP-EHH signal in Bantu populations. MLPH is known to influence pigmentation in mouse (Matesic et al. 2001), dog (Drogemuller et al. 2007), cat (Ishida et al. 2006), and chicken (Vaez et al. 2008), and RGS19 was recently shown to influence pigmentation in mouse (McGowan et al. 2008). These loci have not been identified in genome-wide association studies to date, but we note that there have been no genome-wide admixture mapping studies of pigmentation; this type of study will be necessary to confirm the role of these genes, if any, in between-population variation in pigmentation.

The above results confirm that using the maximum F ST in a window around an associated SNP is a relatively sensitive measure for detecting selection, even when the causal SNP may not be present in the data. With this tool in hand, we turn to other phenotypes where the role of selection is unknown.

Disease susceptibility and other quantitative phenotypes

It has been hypothesized that alleles involved in common disease could often be targets of selection (Neel 1962; Di Rienzo and Hudson 2005; Nielsen et al. 2007; Hancock et al. 2008). However, studies of SNPs associated with complex disease have found no evidence that they are significantly more differentiated among populations than random SNPs in the genome (Lohmueller et al. 2006; Myles et al. 2008). Recent genome-wide association studies, along with the genome-wide SNP data from the HGDP, permit a more comprehensive test of this hypothesis. We compiled lists of SNPs associated with several common diseases and quantitative traits for which many associated loci are known (Crohn's disease, type I and II diabetes, height, and lipid levels) from published genome-wide association studies (Scott et al. 2007; Todd et al. 2007; Gudbjartsson et al. 2008; Lettre et al. 2008; Unoki et al. 2008; Weedon et al. 2008; Willer et al. 2008; Yasuda et al. 2008; Zeggini et al. 2008). We applied the above method used for pigmentation to test whether loci associated with any of these other phenotypes are enriched for SNPs with high F ST .

Loci involved in lipid levels (Supplemental Fig. 9), susceptibility to Crohn's disease (Supplemental Fig. 10), height (Supplemental Fig. 11), and susceptibility to type I diabetes (Fig. 4) show little evidence of being subject to selection. However, there are a few notable exceptions to this. One such exception is a nonsynonymous SNP (rs3184504) in SH2B3, identified as a risk factor for both type I diabetes and celiac disease (Todd et al. 2007; Hunt et al. 2008). The region appears in the 1% tail of iHS in Europe, and the iHS score on the individual SNP is −2.02 (empirical P = 0.02). This region is also an outlier in the F ST comparison between Europe and East Asia (Fig. 4). Interestingly, the risk allele appears on the sweeping haplotype, suggesting that risk for autoimmune disease may have increased as a byproduct of natural selection in some populations.

Figure 4. F ST around loci involved in natural variation in diabetes susceptibility. For each SNP associated with either type I or type II diabetes we plot the maximum pairwise F ST between geographic regions in a 100-kb window surrounding the SNP in the HGDP data, as well as a histogram of the null distribution calculated by finding the maximum F ST in 100-kb windows surrounding each of 10,000 random SNPs. The dotted lines show the position beyond which 5% of the random SNPs fall, and the solid lines the position beyond which 1% of the random SNPs fall. Gene names that are starred fall in the 5% tail of at least one comparison, and those with two stars fall in the 1% tail of at least one comparison. Letters are positioned along the y-axis to improve readability. The key in the bottom panel of each column applies to the entire column.

Risk of type II diabetes has been hypothesized to be a target of natural selection in humans due to the effect of the disease on metabolism and energy production (Neel 1962). Indeed, the locus with the strongest impact on disease susceptibility in Europeans, TCF7L2, shows impressive differences in allele frequencies between Africa and East Asia (Fig. 4; Helgason et al. 2007). Overall, we show in Figure 4 that regions of the genome harboring SNPs associated with type II diabetes significantly differentiate Europeans and East Asians from Africans (Europe-Africa P = 0.006, East Asia-Africa P = 0.02, one-sided Mann-Whitney test). There are a number of regions that contain strong outliers in at least one comparison, including TCF7L2, TSPAN8, JAZF1, and ADAMTS9. Other associated regions also show XP-EHH signals well into the 1% tail: these are THADA in East Asia (maximum XP-EHH at rs12474030 of 3.7, empirical P = 1 × 10−4) and an intergenic region on chromosome 11 in Europeans (maximum XP-EHH at rs16936071 of 3.6, empirical P = 2 × 10−4). We note, however, that though these type II diabetes-associated regions are more differentiated than random regions of the genome, the associated SNPs themselves often are not (as in Myles et al. [2008], at a subset of the SNPs considered here). We return briefly to this point in the Discussion.

NRG–ERBB4 pathway

Among the top selection candidates shown in Figure 1, we noticed that two—ERBB4 and NRG3—are, in fact, binding partners (Zhang et al. 1997). Although these two genes are large, and thus contain a number of tested windows, they both are outliers with respect to the rest of the genome even after a conservative Bonferroni correction for the number of windows (empirical P = 0.001 and P = 0.006 in the Middle East for ERBB4 and NRG3, respectively). Further inspection of genes in the NRG–ERBB4 pathway (Kanehisa et al. 2008) revealed a striking alignment of selection signals (Fig. 5A). ERBB4 shows extreme iHS signals in all non-African populations (Fig. 5B,C), NRG3 shows extreme iHS signals in West Eurasian populations, and two other binding partners of ERBB4—NRG1 and NRG2—fall well into the 1% tail of iHS scores in East Asia (Fig. 5A). Further, ADAM17, the gene encoding the enzyme that converts NRG1 to its active form (Mei and Xiong 2008), falls in a region that contains some of the most extreme XP-EHH scores in East Asia (maximum value of XP-EHH in the region of 4.2 at rs2709591, empirical P = 2 × 10−5).
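
The Bonferroni correction for large genes mentioned above can be illustrated with a short sketch. It assumes that the correction multiplies the smallest window-level empirical P-value in a gene by the number of windows the gene spans; the function name and the example values are hypothetical.

```python
def bonferroni_gene_p(window_pvalues):
    """Gene-level P-value: the smallest window P-value multiplied by the
    number of windows tested in the gene, capped at 1."""
    return min(1.0, min(window_pvalues) * len(window_pvalues))


# Hypothetical empirical P-values for the four windows spanned by one gene.
example_windows = [0.2, 0.0005, 0.04, 0.6]
gene_p = bonferroni_gene_p(example_windows)
```

The correction is conservative because neighboring windows are correlated through linkage disequilibrium, so the effective number of independent tests is smaller than the number of windows.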

Figure 5. Selection signals in the NRG–ERBB4 pathway. (A) A schematic of the NRG–ERBB4 pathway, drawn from interactions reported in KEGG (Kanehisa et al. 2008) and Mei and Xiong (2008). Each oval represents a gene, and the colored circles denote the geographic regions that have significant selection signals (empirical scores in the top 5% of the distribution). We excluded Oceania and the Americas from this plot since selection scans are expected to have low power in these regions. For ADAM17, the selection statistic is XP-EHH; for the others it is iHS. (B) Haplotype plots at the putative selected region in ERBB4. (C) Worldwide allele frequencies of a SNP that tags the red haplotype in B (rs1505353). (Red) The derived allele; (blue) the ancestral allele.

The NRG–ERBB4 signaling pathway is well-studied and known to be involved in the development of a number of tissues, including heart, neural, and mammary tissue (Gassmann et al. 1995; Tidcombe et al. 2003). Variants in genes in this pathway have been associated with risk of schizophrenia and various psychiatric phenotypes (Stefansson et al. 2002; Hall et al. 2006; Mei and Xiong 2008). We suggest that an unidentified phenotype affected by this pathway has experienced strong recent selection in non-African populations.

Local adaptation

A significant advantage of these data over previous scans for selection is that they allow for the detection of selection on much smaller geographic scales. To identify differential selection between closely related populations, we chose to use F ST rather than haplotype-based methods, as haplotype-based signals are largely shared between geographically close populations (Supplemental Figs. 4, 5) and we speculated that selection on a very local scale may lead to only modest allele frequency changes. We manually examined the 100 most strongly differentiated SNPs in selected pairs of populations for evidence of local adaptation (such that all the SNPs mentioned in the following paragraphs fall in the 0.05% tail of the comparison being discussed). We note that though alleles underlying local adaptation may often be population-specific and thus not included on the Illumina chip, a selective sweep should often affect differentiation at nearby tag SNPs (though the magnitude of this effect depends on the levels of migration and the selection coefficient, among other factors; Santiago and Caballero 2005).
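
Ranking SNPs by pairwise differentiation can be sketched as follows. The estimator here is a simple heterozygosity-based F ST computed from two allele frequencies with equal weighting and no sample-size correction; it is an assumption chosen for brevity and not necessarily the estimator used in the analyses. The function names, SNP labels, and frequencies are hypothetical.

```python
import heapq


def pairwise_fst(p1, p2):
    """Simple F_ST between two populations from allele frequencies:
    (H_T - H_S) / H_T, with equal population weights."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # monomorphic in both populations
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def top_differentiated(freqs_a, freqs_b, n=100):
    """Return the n SNPs with the highest pairwise F_ST as (fst, snp) pairs,
    considering only SNPs typed in both populations."""
    scored = (
        (pairwise_fst(freqs_a[snp], freqs_b[snp]), snp)
        for snp in freqs_a
        if snp in freqs_b
    )
    return heapq.nlargest(n, scored)


# Hypothetical allele frequencies in two populations.
pop_a = {"snpA": 0.9, "snpB": 0.5, "snpC": 0.1}
pop_b = {"snpA": 0.1, "snpB": 0.5, "snpC": 0.2}
top_two = top_differentiated(pop_a, pop_b, n=2)
```

On a chip of roughly 650,000 SNPs, keeping the top 100 corresponds to a tail of a little under 0.02%, which is consistent with the 0.05% bound quoted above.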

Within Africa, we compared the Yoruba to each of the Pygmy populations, and the two Pygmy populations (Mbuti and Biaka) to each other, hypothesizing that the loci involved in reduced stature in Pygmies should be specific to that group (and be detectable by differentiation at nearby tag SNPs). Notably, genes involved in variation in height in Europeans are not enriched for SNPs that strongly differentiate Pygmy from Bantu populations, suggesting that variation in these genes is not responsible for the divergence in phenotype between these two groups (Supplemental Fig. 12). However, among the 100 most differentiated SNPs between the Yoruba and Biaka are two SNPs in genes in the insulin growth factor signaling system: one is a SNP (rs6917747) in an intron of IGF2R, the receptor for the well-studied growth factor IGF2 (F ST = 0.54, empirical P = 1 × 10−4). Knockouts of this gene in the mouse lead to fetal overgrowth (Lau et al. 1994). A second SNP (rs9429187) in the gene PIK3R3, which acts downstream of IGF1R (Dey et al. 1998) also differentiates the Biaka and Yoruba (F ST = 0.6, empirical P = 1 × 10−4). Considering the defects in the responsiveness of Pygmy cells to IGF1 (Hattori et al. 1996; Jain et al. 1998) without any known causal polymorphism (Bowcock and Sartorelli 1990), we consider both of these genes to be strong candidates for harboring a polymorphism that leads to decreased body size in some Pygmy populations (although we note that the IGF2R polymorphism appears to be specific to Biaka and absent from Mbuti).

Within Western Eurasia, we compared the French, Palestinian, and Balochi populations, as they have the largest sample sizes in their regions. The most extreme SNP (rs4833103) identified in the French–Balochi and French–Palestinian comparisons (French–Balochi F ST = 0.69, French–Palestinian F ST = 0.6) falls in a cluster of Toll-like receptor genes. A nonsynonymous SNP in TLR6 (rs5743810), a gene involved in the recognition of bacterial pathogens (Ozinsky et al. 2000), is among the highly differentiated SNPs in this cluster (Fig. 6A). This region was previously identified by Todd et al. (2007) as containing SNPs that strongly differentiate populations within Europe; those investigators also noted that the region shows no haplotype-based signature of selection. Also among the most differentiated loci in these comparisons are SNPs in SLC45A2, mentioned above as a locus involved in pigmentation, and a cluster of SNPs in SLC25A13, the gene responsible for type II citrullinemia, a Mendelian disorder of the urea cycle (Kobayashi et al. 1999).

Figure 6. Worldwide allele frequencies of two nonsynonymous SNPs showing evidence of local adaptation. (A) Frequencies of rs5743810 in TLR6; (B) frequencies of rs12421620 in DPP3. (Red) The frequency of the derived allele; (blue) the frequency of the ancestral allele.

Within East Asia, we chose populations in an attempt to identify SNPs that, like TLR6 and LCT in Europe, show clinal variation in allele frequencies. We examined the tails of the F ST distribution between the Dai (a southern Chinese population), Oroqen (a northern Chinese population), and the Han. No SNP showed as striking a pattern as TLR6 in Europe. However, these comparisons identified a number of SNPs in genes related to immunity—a cluster of SNPs in the HLA region (maximum F ST of 0.69 at rs1737078) differentiate the Oroqen from the Dai, and SNPs in a cluster of interleukin receptors (maximum F ST of 0.68 at rs279545) differentiate the Oroqen from the Han.