Abstract High-altitude hypoxia (reduced inspired oxygen tension due to decreased barometric pressure) exerts severe physiological stress on the human body. Two high-altitude regions where humans have lived for millennia are the Andean Altiplano and the Tibetan Plateau. Populations living in these regions exhibit unique circulatory, respiratory, and hematological adaptations to life at high altitude. Although these responses have been well characterized physiologically, their underlying genetic basis remains unknown. We performed a genome scan to identify genes showing evidence of adaptation to hypoxia. We looked across each chromosome to identify genomic regions with previously unknown function with respect to altitude phenotypes. In addition, groups of genes functioning in oxygen metabolism and sensing were examined to test the hypothesis that particular pathways have been involved in genetic adaptation to altitude. Applying four population genetic statistics commonly used for detecting signatures of natural selection, we identified selection-nominated candidate genes and gene regions in these two populations (Andeans and Tibetans) separately. The Tibetan and Andean patterns of genetic adaptation are largely distinct from one another, with both populations showing evidence of positive natural selection in different genes or gene regions. Interestingly, one gene previously known to be important in cellular oxygen sensing, EGLN1 (also known as PHD2), shows evidence of positive selection in both Tibetans and Andeans. However, the pattern of variation for this gene differs between the two populations. Our results indicate that several key HIF-regulatory and targeted genes are responsible for adaptation to high altitude in Andeans and Tibetans, and several different chromosomal regions are implicated in the putative response to selection. These data suggest a genetic role in high-altitude adaption and provide a basis for future genotype/phenotype association studies necessary to confirm the role of selection-nominated candidate genes and gene regions in adaptation to altitude.

Author Summary High-altitude hypoxia is caused by decreased barometric pressure at high altitude, and results in severe physiological stress to the human body. Three human populations have resided at high altitude for millennia including Andeans on the Andean Altiplano, Tibetans on the Himalayan plateau, and Ethiopian highlanders on the Semian Plateau. Each of these populations exhibits a unique suite of physiological changes to the decreased oxygen available at altitude. However, we are just beginning to understand the genetic changes responsible for the observed physiology. The aim of the current study was to identify gene regions that may be involved in adaptation to high altitude in both Andeans and Tibetans. Genomic regions showing evidence of recent positive selection were identified in these two high-altitude human groups separately. We found compelling evidence of positive selection in HIF pathway genes, in the globin cluster located on chromosome 11, and in several chromosomal regions for Andeans and Tibetans. Our results suggest that key HIF regulatory and targeted genes are responsible for adaptation to altitude and implicate several distinct chromosomal regions. The candidate genes and gene regions identified in Andeans and Tibetans are largely distinct from one another. However, one HIF pathway gene, EGLN1, shows evidence of directional selection in both high-altitude populations.

Citation: Bigham A, Bauchet M, Pinto D, Mao X, Akey JM, Mei R, et al. (2010) Identifying Signatures of Natural Selection in Tibetan and Andean Populations Using Dense Genome Scan Data. PLoS Genet 6(9): e1001116. https://doi.org/10.1371/journal.pgen.1001116 Editor: David J. Begun, University of California Davis, United States of America Received: January 4, 2010; Accepted: August 9, 2010; Published: September 9, 2010 Copyright: © 2010 Bigham et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This work was supported by the National Science Foundation (http://nsf.gov/) [grant number 0622337 to AWB, Graduate Research Fellowship to AWB and MJW and BNS-8919645 to LGM]; the Wenner-Gren Foundation (http://www.wennergren.org/) [grant number 7538 to AWB]; the National Institutes of Health (http://www.nih.gov/) [grant numbers HLBI 079647, TW001188 to LGM and HG-002154 to MDS]; American Heart Association (http://www.americanheart.org/) [Predoctoral Fellowship to CGJ]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction As human populations migrated across the globe, they encountered numerous environments each with unique ecological conditions. These colonizers responded to the niche-specific environmental pressures both culturally and biologically. One such newly encountered environment was high altitude. High-altitude regions of the earth lie above 2,500 meters (m) sea level. The extreme environmental conditions experienced at high altitude challenge the ability of humans to live and reproduce, i.e., adapt and/or acclimatize. Some of the environmental hardships at high altitude include but are not limited to decreased ambient oxygen tension, increased solar radiation, extreme diurnal ranges in temperature, arid climate, and poor soil quality. Behavioral or cultural modifications buffer many of these factors. However, low ambient oxygen tension, caused by decreased barometric pressure and commonly referred to as high-altitude hypoxia, cannot readily be overcome by cultural buffers. Rather, physiological acclimatization and/or genetic adaptation is required for populations residing at altitude to overcome this environmental stress. The Tibetan Plateau and the Andean Altiplano are two high-altitude regions where human populations have resided for millennia (Figure 1). According to archaeological data, they were first populated approximately 25,000 and 11,000 years ago, respectively [1], [2]. Today, the populations indigenous to these high-altitude zones possess unique suites of physiological characteristics with respect to one another and with respect to low-altitude populations (for review see [3]). Researchers have sought to understand the physiological differences between high- and low-altitude populations and whether such differences are the result of acclimatization or adaptation [4]. Several studies have shown that Tibetan populations exhibit lower than expected hemoglobin concentrations [5]–[7]. This is in contrast to Andean populations and to high-altitude sojourners who show elevated hemoglobin concentrations [5]. Other important differences concern the extent to which these lifelong high-altitude residents exhibit a blunted ventilatory response to acute hypoxia, their degree of protection from altitude-associated fetal growth restriction as well as their susceptibility to chronic mountain sickness and hypoxic pulmonary hypertension [4], [8], [9]. Related research has explored the heritability of specific altitude phenotypes such as arterial oxygen saturation and hemoglobin concentration [5], [10], [11]. One heritability study concluded that a major autosomal dominant locus exists for high oxygen saturation, where Tibetan women carrying this high oxygen saturation allele had a higher offspring survival rate than women possessing the low oxygen saturation allele [10]. Research of this nature documents the role of local adaptation, not simply acclimatization, to the high-altitude environment. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 1. The geography of human adaptation to high altitude. Geographic locations where humans have adapted to life at high-altitude are indicated in grey and include the Andean Altiplano of South America, the Tibetan Plateau of Central Asia, and the Semien Plateau of Ethiopia. Only populations from the Andean Altiplano and the Tibetan Plateau were considered here. Inset: Map locations of the four Native American population samples including Peruvian Quechua, Bolivian Aymara, Nahua, Mixtec, and Tlapanec speakers from Guerrero, Mexico, and Maya from the Yucatan Peninsula, Mexico. https://doi.org/10.1371/journal.pgen.1001116.g001 Even though the physiology of these populations has been well studied [3], [4], [12], [13], very little research has been devoted to the identification of the genes responsible for the observable physiological differences [14]. It is challenging to speculate from existing data what patterns of genetic variation underlie adaptation to altitude in Andeans and Tibetans. Studies identifying genetic variants associated with particular physiological phenotypes exhibit varying results. The lactase persistence phenotype is one example. Here functionally similar changes in the same gene, Lactase (LCT), have evolved independently in African and European populations to produce the same phenotypic outcome [15]. Genes affecting skin pigmentation show a different pattern from that observed for lactase persistence: separate genes are responsible for light-skinned phenotypes in East Asians and Europeans [16], [17]. Another well-studied human adaptive trait is malarial resistance. Like the pattern observed for skin pigmentation, multiple genes confer adaptive resistance to malaria in different populations [18]–[20] and like lactase persistence, particular mutations, namely the sickle cell S allele have recurred. Therefore, it is unclear whether Tibetan and Andean populations should be expected to show similarities or differences in genes or functionally different changes in the same genes that are responsible for their distinct high-altitude phenotypes. Moreover, it is possible that changes in different genes, but genes that are part of the same biochemical pathway, are responsible for the observable phenotypic differences between these two groups. For example, genes belonging to the hypoxia inducible transcription factor (HIF) pathway - important in embryogenesis, development, and homeostasis - and the renin-angiotensin system (RAS) are involved in oxygen sensing and metabolism. Variants identified in genes from these two pathways are known to affect particular altitude phenotypes [21]–[24]. For instance, the angiotensin converting enzyme (ACE) insertion-deletion (I/D) polymorphism has been significantly associated with the resting and exercise SaO 2 among Quechua [23]. Additionally, genes in the alpha and beta globin gene family are involved in hemoglobin production. Accordingly, since such pathways or systems has been hypothesized to help regulate physiological responses to hypobaric hypoxia, they may be enriched for genes showing evidence of recent positive selection in high-altitude populations. The goal of this study was to identify candidate genes for high-altitude adaptation based on signatures of positive selection in Andeans and Tibetans. Previously, we analyzed data from ∼500,000 SNPs to search for signatures of positive directional selection in Andeans [25]. This was the first such study of its kind. Here we increase the number of assayed SNPs to 906,600, and expand the populations to include a second high-altitude human group, Tibetans. We identified selection-nominated candidate genes and gene regions by two methods. First, we looked across each chromosome for extended regions of statistical significance for a given test statistic to identify candidate gene regions with previously unknown physiological functions with respect to altitude phenotypes. Second, we targeted genes that are members of biochemical pathways with known physiological responses to hypoxia including the HIF pathway, the RAS, and the globin family of genes [26]. In these two high-altitude groups we employed four test statistics commonly used to detect positive directional selection. Each test statistic possesses varying degrees of efficacy depending on the allelic background of the populations considered, the strength of selection, the type of variation natural selection acted upon (e.g., new mutations or standing variation), and the length of time elapsed since the start of the selective event. By comparing and contrasting two human populations who have adapted to life at altitude, we hope to better characterize the genetic mechanisms responsible for adaptation to high-altitude hypoxia, contribute to the understanding of the genetic and evolutionary architecture of adaptation to altitude, and provide deeper insight into the genes responsible for human phenotypic diversity.

Discussion In this study, we identified genomic regions showing evidence of recent positive selection in two high-altitude human populations, Andeans and Tibetans, using dense multilocus SNP genotype data. Putative natural selection candidate loci were detected in particular pathways with hypothesized roles in high-altitude adaptation as well as chromosomal regions with previously unknown involvement in altitude phenotypes. Four tests based on different characteristics of the data were used in our analysis: LSBL, lnRH, the standardized difference of D, and the WGLRH test. It is worthwhile to review some important issues related to the characteristics of each of these statistics and their application to these data. LSBL is based on Wright's F ST (see Materials and Methods) and summarizes the change in SNP allele frequency across three populations to identify loci displaying large frequency differences between populations. The value of LSBL will be contingent upon the initial frequencies of the alleles in the populations, the time since the populations diverged, and the strength of the selective event. Fixation of alternative alleles will result in the maximum LSBL. Such a scenario can occur regardless of whether selection operated on standing variation or on a new mutation. However, a maximum LSBL need not be observed to infer evidence of selection. In fact, intermediate LSBL values can be observed for loci under selection. LnRH summarizes the reduction in heterozygosity in one population relative to another population. Thus, it identifies directional selection in the population with a loss of heterozygosity. Interestingly, lnRH can miss a signal of directional selection if the change in allele frequency is symmetrical about the midpoint in heterozygosity (changing from 40% to 60% allele frequency for example). Tajima's D uses the site frequency spectrum to detect departures from neutrality. In this instance, it was used to identify regions of the genome that have an excess of rare variants to detect positive selection. Lastly, the WGLRH test looks at patterns of LD to identify genomic regions that exhibit longer than expected LD given their frequency in the population. However, the WGLRH test only considers derived alleles with respect to an outgroup (e.g., chimpanzee) whose frequencies have risen to >0.85 in the populations under consideration. The problem with only considering haplotypes with high derived allele frequencies is that natural selection could also act to select the ancestral allele and these signatures would not be detected using the WGLRH test. The strength of selection, time since selection began, as well as the recombination background of the selected region will all affect the signal obtained when applying this statistic to a dataset. Given the aspects of genetic variation summarized by these statistics, it is not expected that the results of each of these tests will be significant. Rather, these statistical tests should be considered as complementary tests that can be useful for the identification of regions under positive selection. Commonly, a traditional parametric-model approach is taken in screens for natural selection where the level of variability at candidate loci is compared to either simulated distributions or theoretical expectations [40], [41], [42]. However, recent research has shown that the empirical distribution may be better than a simulated or theoretical distribution because the latter two approaches may be confounded by underlying demographic assumptions [43]. By comparing individual SNPs to the genome-wide empirical distribution for each test statistic, the results of this study were not confounded by demography. Yet, the empirical distribution approach is not without problems. For example, consider a population where much of the allele frequency change across the genome was the result of genetic drift. In this scenario, the entire LSBL distribution would be shifted to the right. Higher levels of variance in LSBL and higher genome average LSBL levels could result in overlooking outliers that resulted from positive selection. The chromosomal regions showing either extended significant regions for the standardized difference of D and LSBL or lnRH are excellent candidates for further study. Of the 52 regions identified, the four consecutive regions on chromosome 12 in Andeans and the single region on chromosome 2 containing the HIF pathway candidate gene EPAS1 in Tibetans are of keen interest. When considering the genes encoded by the chromosome 12 region in Andeans, none stand out as obvious candidates for high-altitude adaptation. Further characterization of this chromosomal region will be necessary to elucidate the genetic variants responsible for the observed pattern. In Tibetans, the hypergeometric region containing EPAS1 is a strong candidate for adaptation to altitude given the known biological function of EPAS1 in oxygen sensing. Future genotype-phenotype correlation studies should focus their attention on this gene in Tibetans. In addition to the 52 chromosomal regions, the candidate regions identified by the WGLRH test are also excellent candidates for further study. Of these candidate regions, one Andean region contained a HIF pathway candidate gene, TH. Given the divergence time of Andeans and Tibetans, their independent adaptation to high altitude, and their unique physiological adaptations to altitude, we find the overlap in the EGLN1 signal to be of keen interest. EGLN1 is part of the HIF pathway wherein cellular oxygen homeostasis is regulated by HIF-1. This heterodimer is composed of an α subunit and a β subunit. The β subunit is constitutively expressed whereas the α subunit is transcriptionally controlled by cellular O 2 concentration [44], [45]. HIF-1α is a basic-helix-loop-helix protein encoded by the gene HIF1A [44], [46]. This protein mediates transcriptional responses to hypoxia in nearly 100 genes to control cellular oxygen supply and maintain cell viability during periods of low oxygen concentration. EGLN1, along with EGLN2 (PHD1) and EGLN3 (PHD3), is a molecular oxygen sensor that regulates the HIF transcriptional pathway [47]. In normoxia, EGLN1 hydroxylates HIF-1α's oxygen dependent degradation domain which targets this protein for breakdown by the E3 ubiquitin ligase complex [48]–[50]. Decreases in oxygen tension lead to a reduction in prolyl hydroxylation of HIF-1α by EGLN1, thus increasing HIF levels and permitting HIF1 to continually target downstream genes to maintain cellular oxygen homeostasis [45]. Thus, EGLN1 plays a critical role in cellular oxygen sensing. Our results suggest that adaptation has occurred independently at this gene in these two highland groups, although it is difficult to discern if selection operated on shared standing variation or new mutations. In addition to EGLN1, the HIF pathway genes exhibiting the most compelling evidence of positive directional selection in Andeans are PRKAA1 and NOS2A. PRKAA is a heterotrimeric enzyme belonging to the ancient 5′-AMP-activated protein kinase gene family involved in regulation of cellular ATP [reviewed in (49)]. PRKAA1 functions as a cellular energy sensor under ATP-deprived conditions such as those experienced in hypoxia, thus suggesting a biologically-plausible role for the PRKAA1 (AMPKa1)-mTOR pathway in metabolic responses to hypoxic environments. Also, it has been demonstrated that the PRKAA1 gene product is essential for hypoxia-inducible factor-1 (HIF-1) transcriptional activity. HIF-1 trans-activates multiple genes in the HIF pathway that are important for oxygen delivery [50]. In addition, HIF-1 is critical for both embryonic vascularization and development. Therefore, if a genetic variant in PRKAA1 contributes to the differential survival of babies at altitude, one would expect that the selection coefficient for this allele to be strong. NOS2A, in combination with additional nitric oxide synthase isoforms, synthesizes nitric oxide (NO) from arginine and oxygen. NO is a signaling molecule with myriad physiological functions throughout the body. Important with regard to high-altitude adaptation, this gene is responsible for the production of NO, formerly known as endothelium-derived relaxing factor (EDRF), in endothelial and other cell types [51]. NO, in combination with a cascade of additional circulating substances prompts arterial smooth muscle relaxation, vasodilation and increased blood flow [52]. Erzurum et al. [51] have shown that NO production is increased in Tibetans resident at 4,200 m compared to sea-level controls. Moore and co-workers have shown that increased blood flow to the uteroplacental circulation is an especially important factor in protecting Tibetan as well as Andean high-altitude residents from altitude-associated reductions in fetal growth [52]–[54]. On balance, these studies suggest that vascular factors, not simply hematological or ventilatory systems, are critical for altitude adaptation in Tibetan and Andean populations. Here, we show preliminary evidence of positive selection in NOS2A in the Andean population, but do not show compelling evidence of positive selection in Tibetans. Both PRKAA1 and NOS2A were identified in a previous study as selection-nominated candidate genes in Andeans using a subset of the data [25]. A single gene, EPAS1, exhibited a strong signature of recent positive selection in the Tibetan population. This was evident from the HIF pathway candidate gene analysis as well as the chromosomal scan. EPAS1 is a HIF regulatory gene encoding a transcription factor that induces downstream genes when cellular oxygen levels decrease. Recently, this gene has been implicated in high-altitude pulmonary edema (HAPE) in Tibetan populations, although the results are not conclusive [55]. Further research elucidating the genotype-phenotype relationship between this gene and corresponding high-altitude phenotypes will be an important step in understanding the functional significance of EPAS1 variation. In summary, we performed a genome scan on high- and low-altitude human populations to identify selection-nominated candidate genes and gene regions in two long-resident high-altitude populations, Andeans and Tibetans. Several chromosomal regions show evidence of positive directional selection. These regions are unique to either Andeans or Tibetans, suggesting a lack of evolutionary convergence between these two highland populations. However, evidence of convergent evolution between Andeans and Tibetans is suggested based on the signal detected for the HIF regulatory gene EGLN1. In addition to EGLN1, a second HIF regulatory gene, EPAS1, as well as two HIF targeted genes, PRKAA1 and NOS2A, have been indentified as selection-nominated candidate genes in Tibetans (EPAS1) or Andeans (PRKAA1, NOS2A). PRKAA1 and NOS2A play major roles in physiological processes essential to human reproductive success [56]. Thus, in addition to demonstrating the likely targets of natural selection and the operation of evolutionary processes, genome studies also have the clear potential for elucidating key pathways responsible for major causes of human morbidity and mortality. Based on the findings of this study, it will be important to confirm the results with genotype-phenotype association studies that link genotype to a specific high-altitude phenotype.

Materials and Methods Populations and Genome-Wide Data High-density multilocus SNP genotype for 347 individuals were generated using the Genome-Wide Human SNP Array 6.0 by Affymetrix Inc. (Santa Clara, CA). This array includes 1.8 million genetic markers consisting of 906,600 SNPs located throughout the genome (SNPs: 869,225 on autosomes, 37,000 on the X-chromosome including 478 on the pseudo-autosomal region, 257 on the Y-chromosomes and 119 mitochondrial DNA SNPs). These genetic markers are typed on two arrays, named for the restriction enzymes used in the complexity reduction step of the reaction, the NSP array and the STY array. In total, 905,747 autosomal and X chromosome SNPs were analyzed. The Y-chromosome and mitochondrial DNA SNPs were not considered. In addition to SNPs, common copy number polymorphisms (CNPs) and rare number variants (CNVs) were included in this analysis. For this analysis, we considered common copy number polymorphisms (CNPs) as the subset of copy number variants that segregate in greater than 3% of the population, whereas CNVs that were found in less than 3% of the population were considered as rare CNVs. All CNPs and CNVs were called using Birdsuite (Cambridge, MA), and McCarrol et al.'s [57] high-resolution map of CNPs was used to define these loci (that is, McCarrol et al.'s [57] CNP map was derived using the same array resolution for the Affymetrix SNP 6.0 microarray as in our study). Next, CNVs were verified using two additional algorithms for CNV detection. They include the Affymetrix Genotyping Console (Santa Clara, CA) and a hidden markov model (HMM) from Partek Genomics Suite (St. Louis, MO). Stringent CNV calls were defined as CNVs that were detected by two or more algorithms. Therefore, in order for a CNV to be included in our analysis, it must have been detected by two of the three algorithms. We assayed two high-altitude human populations, Andeans and Tibetans, as well as closely related low-altitude control populations to identify selection-nominated candidate genes or gene regions. These two populations were analyzed separately and the results from the independent analyses compared. The Andean sample was composed of 49 individuals belonging to two high-altitude populations and included 25 individuals of largely Aymara ancestry collected in La Paz (3,600 m), Bolivia, and 24 Peruvian Quechua from Cerro de Pasco (4,338 m), Peru [54], [58], with the Peruvian and Bolivian populations being sampled by two of us (LGM and TDB) and the Tibetans by one of us (LGM). The Tibetan sample consisted of 49 ethnic Tibetans from three counties within the Tibetan Autonomous Region of China, and can be broken down into 22, 20, and seven individuals from Nachu County (4,400 m), Shannan County (3700 m), and Linchi County (3,000 m), respectively [59]. The low-altitude control samples included four Mesoamerican populations including 25 Maya from the Yucatan Peninsula of Mexico (10 m), and 14 Mesoamericans composed of 2 Nahua, 7 Mixtec, and 5 Tlapanec speakers from Guerrero, Mexico (1,600 m). All Mesoamerican individuals were from populations that are not known to have lived at high altitude. In addition, we analyzed two HapMap Project (www.hapmap.org) populations consisting of 60 individuals from the United States of northern and western European ancestry (CEU), and 90 East Asians from Beijing (55 m), China and Tokyo (8 m), Japan. The population samples typed using the Affymetrix Inc. (Santa Clara, CA) Genome-Wide Human SNP Array 6.0 are listed in Table S8. Participants provided informed, written consent according to the guidelines approved by the Institutional Review Board at Penn State University. Genetic ancestry estimates were calculated for all of the Indigenous American individuals (both Andeans and Mesoamericans) using a panel of ancestry-informative markers (AIMs) that is useful for distinguishing between West African, Northern European, and Indigenous American populations [60], [61]. All the individuals included in this study show high levels of Indigenous American genetic ancestry (>90%) and lower components of West African and Northern European ancestry (<10%). No AIMs are currently available to distinguish Tibetan ancestry from Han Chinese ancestry. However, all Tibetan samples were collected from individuals living at least 20 kilometers from the nearest town to ensure minimal admixture with Han Chinese. All assayed samples were included in the SNP analysis, but for the CNV analysis, 16 samples, eight Tibetan, two Andean, and six Mesoamerican, did not pass the quality control filters and were removed from further CNV study. Population Structure Analysis The EIGENSOFT package [29] was used for PC analysis, with default parameters, and nsnpldregress = 0. The “snpweightoutname” option was implemented to obtain the influence that each SNP weighs on the PC1 that separated the Europeans and Indigenous Americans (Figure S1A). We then trimmed the tails of the distribution of weights to obtain reduced SNP sets that bear less influence of SNPs informative for European/Indigenous American differences. For example, at a threshold of 0.9 (240,969 SNPs) the gap between Europeans and Indigenous Americans is partially reduced (in fact this axis switches from PC1 to PC2). At threshold 0.8 (Figure 1A) and below (data not shown), the European cluster is positioned in the midst of a consistent pattern of Indigenous American clusters. Admixture estimates were performed with the maximum likelihood method implemented in frappe [30] that considers each individual's genetic makeup to be composed of K ancestral populations that sum to 1. All frappe runs were performed until the convergence criterion was met (less than 1 point of likelihood increase between each step). Frappe includes a stochastic aspect that can cause the results with the same input parameters to differ across runs. We have presented frappe results at the highest K-clustering that were consistent across multiple runs. For Figure 2C, we chose K = 7 (equivalent to K = 5 because we only mapped the non-European, non-African samples), as it was the highest K giving consistent clustering across 3/4 of the runs (the 4th run showed a split between Peruvian Quechua and Bolivian Aymara). For Figure 2D, we chose K = 3 because it was extremely consistent across 4 runs, whereas K = 4 separate the Lahu in 3/4 of the cases (the 4th run showing internal variation within Tibet). Frappe clustering for values of K other than those presented in Figures 2C and 2D are shown in Figures S2 and S3 (patterns beyond the chosen K are not the same across runs, although they appear recurrently in different forms across runs (not shown)). Related individuals were removed based on IBD PI_HAT values (PLINK software) [62]. Two Maya samples appeared to cluster among the HGDP-CEPH individuals (HGDP00863 and HGDP00872). Five pairs of individuals (1 Bolivian and 4 Maya, pairs) appeared strongly related (PI_HAT>0.5) and one person of each pair was removed in subsequent structure analyses. Similar criteria lead to the removal of three Tibetan individuals. In order to avoid inter-population biases in missing data, we defined a threshold of at least three genotypes called per population sample. We also cleaned the merged dataset of SNPs with global minor allele frequency <1%. Merging our samples with HGDP-CEPH samples was performed with the PLINK software v1.05 [62], [63]. Genome-Wide Analysis of Signatures of Positive Selection In order to identify signatures of positive selection specific to Andeans and Tibetans, the Mesoamerican sample and the East Asian sample were used as low-altitude control populations for the Andeans while the European and East Asian samples were used as low-altitude control populations for the Tibetan population. Candidate positive selection loci were identified in Andean and Tibetan populations by applying four tests of natural selection. They include the locus-specific branch length (LSBL), the log of the ratio of heterozygosities (lnRH), the standardized difference of D, and whole genome long range haplotype (WGLRH) tests [31], [32], [33], [34]. LSBL, lnRH and the standardized difference of D were computed by implementing Perl scripts written specifically for this data set. The LSBL was computed for each SNP in the data set individually whereas an overlapping sliding windows approach was taken to calculate lnRH and the standardized difference of D for each window. A window size of 100 kb was selected based on the genome coverage and the marker density of the Affymetrix SNP Array 6.0. Statistical significance for each of the LSBL, lnRH, and the standardized difference of D statistics was determined by using its respective genome-wide empirical distribution generated by these data independently for the autosomal chromosome SNPs and X-chromosome SNPs. Separate analysis of X-chromosome markers was required given the smaller effective population size and the higher degree of natural selection observed and expected for markers on this chromosome. The empirical p-value for LSBL, lnRH, and the standardized difference of D was calculated by using the following equation: Those loci with P E values falling in the top (LSBL) or bottom (lnRH and the standardized difference of D) 5% of the empirical distribution for the autosomal chromosomes or the X chromosome were considered statistically significant (α = 0.05). The WGLRH test was computed using the algorithm developed by Zhang et al. [32]. This test is based on the observation that loci in linkage disequilibrium (LD) with the functional SNP will be swept to fixation or near fixation during the selective event, resulting in haplotypes with high population frequencies coupled with long range LD. For this test, significance was assessed by comparing the relative extended haplotype homozygosity (REHH) of a specific core haplotype to the gamma distribution. The false discovery rate approach was then applied to correct for multiple tests [64]. In our study, LSBL was used to describe the relationship between the relevant populations at each locus by apportioning the genetic diversity into three population branches for each of the two population triangulations of interest. The population triangulations include 1) Andean, Mesoamerican and East Asian and 2) Tibetan, East Asian and European. To calculate LSBL, we computed Wright's F ST using Weir and Cockerham's equation at every SNP position for each two-way population comparison (i.e., East Asian to Mesoamerican, East Asian to Andean, and Mesoamerican to Andean for the Andean triangulation) [65], [66]. Next, the pairwise F ST values were used to calculate the LSBL as previously described [33], again at each SNP. With three contrasting populations, the LSBL statistic is mathematically equivalent to the population-specific F ST introduced by Weir and colleagues in 1984 [66]. LSBL values falling in the upper 5% tail of the empirical distribution for the Andean population or the Tibetan population are suggestive of positive natural selection in that particular population. The natural log of the ratio of the heterozygosity between two populations of interest, or lnRH, was used to summarize the extent to which population-specific loss in genetic diversity characterizes SNP regions and candidate genes in the two high-altitude populations under consideration [67], [68]. This statistic was calculated for each two-way population comparison using an overlapping sliding window size of 100,000 base pairs (bp) and moving in 25,000 bp increments along a chromosome. For the high-altitude populations, we focused our search on the Andean-Mesoamerican lnRH values for the Andean panel or the Tibetan-East Asian lnRH values for the Tibetan. Statistically significant negative lnRH values for each of these comparisons indicate regions where there has been a reduction in variation in the population of interest that did not occur in the closely related comparison population. Regions of the genome with negative Tajima's D values are a hallmark of positive selection. However negative values of D can also result from demographic events, specifically the recovery from a population bottleneck. As for the lnRH analysis, Tajima's D was calculated for each population using an overlapping sliding window size of 100 kb with a 25 kb offset. We used a modification of the Tajima's D statistic, standardized Tajima's D to compare Tajima's D across windows. This statistic is similar to the iHs statistic [69] and is calculated using the following equation: Where D i is the Tajima's D calculated for a sliding window in a given population panel (Andean, Mesoamerican, or East Asian), μ is the mean Tajima's D for all windows, and SD is the standard deviation of Tajima's D for all windows. Using this statistic, we identified significantly negative windows in Andean and Tibetan populations. However, because we are interested in the identification of regions of the genome that have been subject to recent positive selection in Andeans and Tibetans, we compared Tajima's D in Andeans vs. Mesoamericans, as well as Tibetans vs. East Asians. To do so, we used the standardized difference of D to summarize the difference of Tajima's D between two populations using the equation: Here, D i A is Tajima's D computed for a given sliding window in population A, D i B is Tajima's D computed for a given sliding window in population B, μ is the mean Tajima's D for all windows, and SD is the standard deviation of Tajima's D for all windows. Again, Tajima's D was calculated for each population using an overlapping sliding window size of 100 kb with a 25 kb offset. Therefore, we first identified negative Tajima's D values in Andeans or Tibetans using the normalized Tajima's D statistic. We then applied the standardized difference of D metric to identify windows that were significantly different between high- and low-altitude human groups. The final test used to infer positive selection was the WGLRH test of Zhang et al. [32]. This test first calculates the REHH for each core haplotype in the data set and identifies core haplotypes with longer than expected ranges of linkage disequilibrium (LD) given their frequency in the population. A gamma distribution is then estimated using maximum likelihood methods against which the REHH of each core haplotype is tested to determine if its respective p-value is suggestive of recent, positive selection. This test then considers the ancestral state of the alleles, determined by a closely related outgroup, to identify SNPs where the derived allele has risen to extremely high frequencies (>0.85). For this data set, the ancestral state for all SNPs available in the chimpanzee sequence was retrieved using the UCSC genome browser. In total, the ancestral states for 846,032 SNPS on the autosomes and X chromosome were obtained. Lastly, the WGLRH test applies a false discovery rate approach to control for false positives and identifies significant extended haplotypes using the gamma distribution. To identify CNPs that may be involved in adaptation to high altitude in Andeans or Tibetans, we compared population frequencies for each biallelic CNP using pairwise F ST calculated in the same manner as described previously for SNPs [66]. We also extended the methods of the SNP LSBL analysis to the biallelic CNPs to identify along which branch the greatest changes in allele frequency has occurred using F ST to calculate LSBL. Andean-Mesoamerican and Tibetan-East Asian pairwise F ST and LSBL values falling within the top 1% of the empirical distribution were identified as statistically significant (data not shown). Next, we implemented a similar technique as that described by Redon to calculate REHH [37]. Each biallelic CNP was treated as a SNP located at the boundary of the CNV window, and SNPs falling 500 kb upstream and downstream of the respective boundary were used to calculate REHH in SWEEP (http://www.broad.mit.edu/mpg/sweep/resources.html). SNPs located within CNPs were excluded. Lastly, we considered rare CNVs identified in Andeans or Tibetans that overlapped specifically with the selection-nominated candidate genes identified from our SNP analysis.

Acknowledgments The authors would like to thank the people of high altitude for participating in this study, three anonymous reviewers for insightful comments on the manuscript, and Lynn Bemis for her contributions to the initial planning of this study.

Author Contributions Conceived and designed the experiments: AB LGM MDS. Performed the experiments: RM. Analyzed the data: AB MB DP DLH. Contributed reagents/materials/analysis tools: XM JMA RM SWS CGJ MJW TB EJP LGM MDS. Wrote the paper: AB MB LGM MDS.