Study subjects

Demographic, phenotypic and genetic data were collected from 69 high-altitude native Sherpa residing in two villages at 3,800 m in the Khumbu region of Nepal during the summer of 2010 (Supplementary Table 7). The participants were healthy non-smoking men and women, born and raised above 3,000 m altitude, 17–54 years of age, who had not travelled to areas <2,400 m or >5,400 m in the previous 6 months, had normal body mass index, lung function and blood pressure, were not anaemic, did not have fever and were not pregnant. All study participants provided written informed consent. This study was approved by the IRBs at Case Western Reserve University and University of Chicago, by the Oxford Tropical Research Ethics Committee and by the Nepal Health Research Council.

Phenotypic data collection

Standing height without shoes (GPM Anthropometer, Stuttgart, Switzerland) and weight in light clothing (Pelouze Mechanical Shipping Scale, P114S, Bridgeview, Illinois, USA) were measured to the nearest mm and pound, respectively, according to standard protocol43. Blood haemoglobin concentration was measured in duplicate using the cyanmethaemoglobin technique (Hemocue, Angelholm, Switzerland) immediately after venipuncture.

Genotype data

All the Sherpa samples were genotyped using Illumina HumanOmni1-Quad beadchip in the Genomics Core Facility at Case Western Reserve University. We removed SNPs with call rate <95%, minor allele frequency <5% or with strand-ambiguity. Genetic relatives were inferred from a random set of 2,000 autosomal SNPs using Relpair v2.0.1 (ref. 44). Twenty related individuals with closer relationships than first cousins were excluded in subsequent analyses except for the genotype–phenotype association test. We also obtained genotype data of 96 Tibetans from three previous studies, each including 30–35 individuals (Supplementary Table 16)3,4,12. As all three data sets are from different regions of the Tibetan plateau and from different genotyping platforms (Illumina Human1M-Duo v3 for Lhasa Tibetans12, Illumina Human610-Quad for Yunnan Tibetans3 and Affymetrix Genome-wide Human SNP 6.0 for Qinghai Tibetans4), most of the analyses were conducted on the three Tibetan samples separately. To maximize the overlap between data sets, we imputed the Sherpa and the three Tibetan genotype data sets using IMPUTE2 (ref. 45) with the HapMap3 data set (Supplementary Table 17)13 as a reference. Each of four data sets was imputed separately. For each individual and SNP, we called a genotype if it had posterior probability ⩾0.9, otherwise, we treated it as missing data. We excluded SNPs with call rate <96.5% in the Sherpa or in any of the three Tibetan samples (⩾3 missing genotypes in the Sherpa or ⩾2 missing genotypes in Tibetans). This process yielded 543,555 SNPs overlapping between the Sherpa, Tibetans and HapMap3 data sets (‘HM3’; n=1,165); before imputation, 80,819 overlapping SNPs were identified. To analyse Sherpa and Tibetan genetic variation in a broader context, we overlapped this data set with each of three additional data sets. First, we overlapped it with data from HGDP14 and two Siberian populations, Naukan Yup’ik and maritime Chukchee16, to obtain 50,464 SNPs with genotype call rate ⩾96.5% for all populations in the data set (‘HM3–HGDP’; n=2,138). Second, we overlapped the HM3 data set with HGDP individuals genotyped on Affymetrix Axiom Genome-wide Human Origins 1 array, described by Patterson et al.9 We obtained 58,756 SNPs (‘HM3-HumanOrigin’; n=2,107) after excluding strand-ambiguous or low-quality SNPs (⩾2 missing genotypes in any of the 53 HGDP populations). Last, we overlapped the HM3 data set with the genotype data of 14 Asian populations (Thai, Vietnamese, Cambodian, Iban, Buryat, Kyrgyzstani, Nepalese, Pakistani, Andhra Pradesh Brahmins, Andhra Pradesh Madiga, Andhra Pradesh Mala, Tamil Nadu Brahmins, Tamil Nadu Dalit and Irulas) from Xing et al.15 We obtained 62,541 SNPs after excluding strand-ambiguous or low-quality (⩾2 missing genotypes in any of the 14 Asian populations) SNPs (‘HM3-Asian’; n=1,418).

Analyses of admixture

We used EIGENSOFT 4.2 (ref. 17) to perform PCA. For unsupervised clustering analysis, we used ADMIXTURE v1.22 (ref. 18) with fivefold cross validation to find the optimal number of clusters. When using the HM3-Asian data set for ADMIXTURE analysis, we generated LD-trimmed SNP sets by removing one SNP from each pair of SNPs with r2>0.2 in 50 SNP blocks using PLINK v1.07 (ref. 46). The D-test and 3-population test were performed as implemented in ADMIXTOOLS v1.1 (ref. 9). We used ALDER7 to date the admixture event using admixture LD decay. To reduce the noise introduced by genetic drift specific to the HA-proxy sample and CHD, we used SNP loadings of a principal component (PC) representing the Sherpa–Tibetan axis of genetic variation as a weight vector instead of allele frequency difference between two reference populations. To obtain this weight vector, we ran a PCA with the Sherpa, Tibetans and HapMap3 CHD (Chinese in Metropolitan Denver, Colorado), CHB (Han Chinese in Beijing, China), JPT (Japanese in Tokyo, Japan) and GIH (Gujarati Indians in Houston, Texas) (Supplementary Fig. 14). SNP loadings of the second PC were used as a weight vector. A bin size of 0.25 centiMorgan (cM) was used for all analyses. When fitting the exponential curve, we excluded SNP pairs within distance bins of ⩽0.5 cM to remove those in LD in the ancestral populations. Significance of the estimates was assessed by block jackknifing of one chromosome at a time.

Whole-genome sequence analysis

Two Sherpa males with 100% high-altitude ancestry component were chosen for 100-bp paired-end whole-genome sequencing using Illumina HiSeq 2000. We followed a standard Illumina TruSeq DNA sample preparation protocol for paired-end sequencing to construct sequencing libraries. Each sample was tagged with a unique adapter sequence. We mixed an equal amount of library DNA from two samples and sequenced the mixed library in a flow cell. After separating raw reads using adapter sequences, we mapped reads onto the human genome reference sequence (hg19) using BWA47. We further adjusted the alignment by conducting local realignment around indels and base quality recalibration steps using Genome Analysis Tool Kit48. After removing duplicates using Picard, we called consensus genome sequences of each individual using Samtools v0.1.19 (ref. 49)49. We applied the Pairwise Sequentially Markovian Coalescent (PSMC) model10 to autosomal consensus sequence of each sample separately, using the following options: −N15−t15−r5−p ‘2*2+50*1+4+6’. To compare them with low-altitude East Asians, we downloaded the aligned sequence reads (in BAM file format) for a Han Chinese and a Dai individual from Meyer et al.11 These sequence data were generated using essentially the same protocol as for our sequence data: sequencing libraries were prepared following the standard Illumina TruSeq DNA sample preparation protocol for paired-end sequencing of 101 bp reads and 200–400 bp insert size, sequencing was performed on an Illumina HiSeq 2000, reads were mapped to the human genome reference sequence (hg19) using BWA47, and the coverage was 27.7 × and 28.3 × for a Han and a Dai individual, respectively. In addition, to minimize any bias introduced by differences in post-alignment processing, we processed the Han and Dai reads in the same way as we did for the Sherpa samples. To convert the estimates of population parameters into the effective population size (N e ) and time in year, we used autosomal neutral mutation rate μ Auto =1.25 × 10−8 per base-pair per generation and 25 years per generation50. We also applied the PSMC model to composite diploid X chromosome sequences, composed of two haploid X chromosome sequences from two males. For this analysis, we called the genotypes of X chromosome sequences using the same pipeline applied to the autosomal sequences and called a heterozygote if the haploid genotype calls of two individuals are different at a site. Sites with a missing genotype in either of two individuals were coded as missing data. We ran the PSMC with the options −N25−t15−r5−p ‘6+2*4+3+13*2+3+2*4+6’. Time bins were sliced in a coarser way than those for the autosomal data, considering the lower resolution of X chromosome data. We adjusted the neutral mutation rate for X chromosome (μ X ) by applying the ratio of male-to-female mutation rate α=2 (ref. 51) and the formula μ X =μ Auto × [2(2+α)]/[3(1+α)]=1.11 × 10−8 (ref. 10) All raw sequences generated in this study have been deposited into the Sequence Read Archive (NCBI) with the accession numbers SRS520217 and SRS520218.

Local ancestry estimation

We estimated local ancestry across the genome of three Tibetan samples using HAPMIX24, using the 21 HA-proxy and the 21 CHD individuals with the highest proportion of the low-altitude component as the reference populations (Fig. 1b). Phased genotypes of CHD individuals were obtained from the HapMap3 website. The Sherpa genotype data were phased by using fastPHASE52. We estimated the local ancestry of each of the three Tibetan samples separately, using 50% admixture proportion, 80 generations since admixture and population recombination parameter ρ=600 for both reference populations. To summarize the local ancestry estimates across the three samples, we first individually centred local ancestry estimates by subtracting individual mean local ancestry (MLA). Then, we averaged local ancestry of each SNP across all individuals and standardized these mean values of local ancestry. We repeated this step for each of three Tibetan samples separately. To guard against the effect of genotyping error in the estimated local ancestry, we additionally ran HAPMIX with only half of the SNPs in Tibetans (odd- and even-numbered SNPs, respectively). False local ancestry peaks inferred from genotyping errors at a SNP will disappear in only one of these two sets. Therefore, we defined a penalty for the MLA value of each SNP, where the penalty is the absolute difference between MLAs from the odd-numbered and even-numbered SNP sets. We adjusted the estimated MLA by

So, this adjustment reduces the peak height toward zero in proportion to the difference between odd/even SNP sets. If this adjustment changes the sign of MLA, we set MLA to zero. After adjustment, we standardized MLA (Fig. 4 and Supplementary Fig. 10).

Gene set enrichment analysis

We tested for an enrichment of genes involved in the response to hypoxia in the regions with excess high-altitude ancestry across the Tibetan genome. We focused on the Reactome pathway gene set ‘Cellular response to hypoxia’ (25 genes) and its subset ‘Oxygen-dependent proline hydroxylation of HIF alpha’ (18 genes; Supplementary Table 11)25. To determine whether there was an excess of SNPs with high high-altitude ancestry in the gene sets, we calculated top 0.5, 1.0 or 5.0 percentile of the high-altitude local ancestry of all SNPs within 10 kb of all genes except for the hypoxia genes. Then, we calculated the proportion of SNPs with higher high-altitude local ancestry than the above percentiles within 10 kb of genes in the gene set of interest. We repeated this process 1,000 times by bootstrapping the whole genome and counted the number of bootstrap replicates for which the proportion of the top high-altitude ancestry SNPs in the hypoxia genes is higher than the corresponding percentile. We repeated the enrichment analysis after removing EGLN1 and EPAS1 genes from the gene set.

PBS analysis

To detect SNPs showing high divergence of allele frequency in the Sherpa and Tibetans from low-altitude East Asians, we calculated the PBS of each SNP in the HA-proxy and each of three Tibetan samples. To maximize the amount of the genome covered by SNPs, we used a lenient SNP filtering by allowing up to 10% missing genotypes for the HA-proxy and for each of three Tibetan samples, resulting in 879,434 SNPs (‘extended HM3’). We also merged all three Tibetan samples to increase accuracy in allele frequency estimates. We chose HapMap3 CHD and CEU (Utah residents with Northern and Western European ancestry from the CEPH collection) to represent a low-altitude East Asian population and the outgroup, respectively. We followed Weir and Cockerham53 to calculate pair-wise F ST and Yi et al.5 to calculate PBS. We retrieved SNPs with the highest rank around EGLN1 and EPAS1 genes (within 300 kb) in Tibetans and checked their ranks in the HA-proxy. All analyses were performed using R v2.15.1 (ref. 54).

MR analysis

To identify SNPs with extreme frequency divergence in Tibetans while taking into account their history of admixture, we used a MR analysis described in Alkorta-Aranburu et al.26 Briefly, we modelled Tibetan allele frequencies in the extended HM3 data set as a linear combination of those of the 11 HapMap3 populations (Supplementary Table 17) and the HA-proxy. Alleles were polarized based on the global allele frequency. We obtained the standardized and squared residuals (‘MR score’) from the above MR model. We took the rank of each SNP, divided it by the total number of the SNPs and minus log 10 transformed to get the transformed rank.

Genetic association with haemoglobin levels in the Sherpa

For the EPAS1 gene region on chromosome 2, SNP genotypes in a 5 Mb region were imputed using IMPUTE2 (ref. 45) with 1,000 Genomes phase 1 integrated variant set55 as a reference panel. SNPs with information metric ⩾0.9 were chosen for downstream analysis. Twenty-six SNPs passed this threshold among those reported to be associated with haemoglobin concentration in two Tibetan cohorts3. For the HYOU1/HMBS gene region on chromosome 11, we selected the 64 genotyped SNPs with high-altitude ancestry ⩾3.7 s.d. (Supplementary Fig. 13). We tested for a genetic association between mean posterior genotype of these SNPs and haemoglobin concentration in all 69 Sherpa individuals. We took relatedness among samples into account through a LMM scheme using GEMMA56 (Supplementary Tables 9 and 13). Genetic relatedness between individuals was estimated using the genome-wide genotype data. We included sex as a covariate. One thousand permutations of concentration across individuals were performed to test whether the results were significantly enriched for low P-values (Supplementary Tables 10 and 14).

Mapping of admixed populations on phylogenetic trees

We used MixMapper8 and TreeMix20 to infer the ancestral populations contributing to the Tibetan gene pool and the proportion of admixture in a single step. We ran MixMapper with a scaffold tree of five populations (the HA-proxy, HapMap3 YRI, CEU, CHD and JPT) using the HM3 data set. These five populations were chosen to cover major branches of human populations and to reduce computational load. We also ran TreeMix, with each of the three Tibetan populations and the above five populations. We allowed three admixture events for each of these six-population sets. In both MixMapper and TreeMix analyses, we performed 500 bootstrap replicates with 50 SNPs as a block for bootstrap sampling.

Archaic human admixture in the Sherpa and Tibetans

We ran PCA for Chimpanzee, Neanderthal and Denisovan genotype data (HM3-HumanOrigin data set) and obtained the projection of modern human individuals on these PCs. Population means of PC1 and PC2 were plotted for the 53 HGDP populations, the HA-proxy and three Tibetan samples. We also performed the D-test. The D statistic was calculated for each of ((H 1 , H 2 ), (A, Chimpanzee)) quartets with the HM3-HumanOrigin data set: H 1 =the 53 HGDP populations; H 2 =the HA-proxy/Tibetans; A=Neanderthal or Denisovan.

Estimating the selection coefficient

Given the sharing of adaptive variants in the EGLN1 and EPAS1 gene regions between Tibetans and the HA-proxy, we estimated selection coefficients of these variants in the HA-proxy because its demography is simpler than that of Tibetans. Here we applied a simple deterministic model of selective sweep with additive genetic effects, using the following formula: