Overall sequencing results

We obtained a total of 2.93×109 sequencing reads out of which 1.11×109 reads (37.9%) mapped uniquely to the human reference genome (hg18). Paired-end reads covered 96% of the reference genome with small variation among chromosomes (max=97.74% for chr22, min=89.61% for chrY) (Supplementary Fig. S1a and Supplementary Table S1). The average depth of coverage for non-redundant reads excluding duplicates is 7.6-fold, ranging from 4.76-fold for chromosome X to 10.8-fold for chromosome Y (Supplementary Fig. S1b). More than 65% of the genome is covered at least five-fold. Regions with more than ×100 coverage were excluded from analysis, as these most likely represent PCR duplication artefacts.

Deamination-based damage, as the most common modification in ancient DNA5,6, potentially complicates correct identification of base substitutions that result from evolutionary processes. The sequencing of a 4,000-year-old palaeo-eskimo's DNA sequence demonstrated that exclusion of postmortem, damage-based changes had little impact on the high-throughput sequencing result5. However, to estimate the extent of miscoding lesions (that is, postmortem changes in DNA), we determined the transition and transversion ratios (Supplementary Table S2 and Supplementary Table S3) and found them to be largely consistent with ratios observed for other genomes recently analysed by high-throughput sequencing.

One limitation to the whole-human genome sequencing has been the ability to provide broad access to the underlying sequence information in a useable format such that individuals can readily query the genome for their own specific research needs. To enable full access to the Iceman's genome to a broader scientific community, we have generated a genome browser (http://IcemanGenome.net)8. This browser contains all the aligned sequencing reads to the Human Genome Reference 19, provides a basic level of annotation including individual single-nucleotide polymorphisms (SNPs), insertions and deletions, and gene identification. We have also incorporated a query engine for more complicated analysis routines, such as the comparison of the Iceman's genome to other contemporary genomes (Methods).

Contamination controls

Assessment of the percentage of miscoding lesions and potential human contamination was carried out through analysis of the mtDNA. As the mitochondrial genome is naturally at a very high ratio as compared with the nuclear genome (×1160 coverage versus ×7.6 in this specific case), it provides a good target to identify low-frequency variants that can be caused by contamination. Additionally, the Iceman's mitochondrial genome had previously been sequenced in a different laboratory using a different sample3, thus serving as an ideal independent comparison. Allele ratios were determined for the previously noted positions at which the Iceman differs from the Cambridge Reference Sequence (rCRS, NC_012920). For each read not consistent with the known Iceman genome, the specific exchange type was examined as to whether it is a typical deamination product or whether it constitutes a known human mitochondrial variant (Supplementary Fig. S2, Supplementary Table S4). The average rate of concordance with the known Iceman genome is over 94%, the percentage of exchanges corresponding to miscoding lesions is under 5% and the rate of exchanges corresponding to other known human sequences is <2.5%. While the latter may be caused by modern contamination (either human or other), it may also suggest heteroplasmy of certain positions of the Iceman's mitochondrial genome. Although this phenomenon was thought to be rare, recent studies have shown that heteroplasmy is not uncommon in healthy mitochondrial cells6. This analysis suggests a maximum potential contamination level of 2.5%, an amount that would have little impact on the allele calls for the nuclear genome.

The next step comprised the analysis of DNA damage patterns. As sequences derived from ancient DNA molecules show characteristic nucleotide composition patterns, those can be used to distinguish them from contaminants of modern DNA fragments. In particular, postmortem deamination of cytosine to uracil leads to an increased rate of observed C to T transitions, particularly near the ends of the DNA fragments7. Furthermore, postmortem DNA fragmentation occurs at an increased rate at the 3′-end of purine nucleotides due to depurination, leading to a characteristic enrichment in purines at the base preceding the sequencing read7.

In order to determine whether the Iceman DNA shows these characteristic patterns, we analysed a random subsample of the aligned sequencing reads for their nucleotide composition and misincorporation patterns. From each of the SAM files containing the aligned read data of the three sequencing slides, we randomly sampled one out of every 1,000 reads, and combined them into an initial analysis file containing a total of 1,156,960 reads. To simplify the subsequent nucleotide substitution analysis, we removed reads that showed insertions or deletions with respect to the reference sequence, as well as reads that mapped closer than 10 bp from the edges of the respective chromosome. This final read set contained 1,148,786 reads, and all subsequent results are based on this set. For each read, we then retrieved the corresponding genomic sequence from the human genome hg18 build, extended to include sequence up to 10 bp upstream (downstream) for reads that map to the forward (reverse) strand.

The average nucleotide composition around the ends of the analysed reads shows the typical pattern expected for ancient DNA fragmentation (Fig. 2a and b). A sharp increase in the frequency of purines (A and G) is observed at the first base upstream of the forward reads, accompanied by a complementary increase of pyrimidines (C and T) just downstream of the reverse reads. Additionally, substitution rates for each of the possible nucleotide mismatches along the sequencing read also show the expected pattern. While substitution rates are generally low (<0.25%) for most mismatch types, C to T transitions show an increased overall rate, in particular towards the end of the reads (up to 3.5% at the first nucleotide). As above, the increase in C to T transitions for the forward reads are accompanied by the complementary increase in G to A transitions observed for the reverse reads. Overall, these results are comparable to other ancient human genomes recently published, such as the Saqqaq5 and the Australian Aboriginal genome8,9, and therefore provide additional indication that the sequences are indeed ancient.

Figure 2: Analysis of DNA damage patterns. (a) Average nucleotide composition around the ends of the analysed reads. (b) Observed rate of C to T and G to A transitions. Full size image

SNP analysis

To gain first insight into genetic traits and into physiological and pathological characteristics of the Iceman, we identified 2.2. million SNPs in the Iceman's genome sequence of which 1.7 million SNPs have previously been reported in dbSNP, with ∼900,000 and 800,000 positions being present in a heterozygous and a homozygous state, respectively (Supplementary Table S5). As the 1,000 Genomes Project has demonstrated, using higher sequence coverage should result in the detection of ∼3.5 million SNPs per individual; however, due to the limited amount of ancient DNA, lower sequencing coverage was generated, resulting in the detection of fewer SNPs. As shown in Supplementary Table S5, we found a concordance rate of ∼92% for homozygous SNPs but only of 55% for heterozygous SNPs indicative of an undercalling of heterozygous SNPs as expected for 7.6-fold coverage.

We compared the number and frequency of homozygous to heterozygous SNPs known from dbSNP on each chromosome. The highest frequency of homozygous SNPs among autosomes was detected for chromosome 13, where 29,400 of the 43,680 SNPs were shown to be homozygous (67.3%). We detected heterozygous SNPs in regions with high similarity to other genomic regions and pseudoautosomal regions. However, the overall highest rate of homozygous SNPs, namely 96.8% (16,080 of 16,606 SNPs), was detected in the X chromosome. As the Iceman was a male, these results further demonstrate the low potential degree of contamination, consistent with the results derived from the mtDNA analysis.

We furthermore analysed SNPs, including variants in specific genomic regions, of potential clinical or functional relevance. In the 5,300-year-old mummy, our analysis did not identify variations in human accelerated regions10, which are apparently functional sections of the genome showing human-specific evolution over the last few million years.

SNPs of potential clinical relevance with additional direct assessment of further SNPs of phenotypic interest (for example, lactase persistence, pigmentation and blood group) are summarized in Supplementary Table S6. Not all listed SNPs necessarily infer a phenotypic consequence or a high predisposition risk. A number of variants supported by morphological or radiological data (such as risk factors for atherosclerosis), or variants that are likely to lead to a specific phenotype (such as lactase nonpersistence) have been resequenced using the Sanger method. These are listed together with individual coverage depth in Table 1. Sanger sequencing, carried out using extract from a separate sample, confirmed the next-generation sequencing results in all cases where amplification was successfully carried out, further underlining the low degree of contamination. Additionally, some of the characteristic SNPs of the Iceman's mitochondrial genome were screened using mitochondrial primers to confirm authenticity (Table 2).

Table 1 Verification of nuclear SNPs. Full size table

Table 2 Mitochondrial primer sequences. Full size table

As a next step we analysed the genetic ancestry of the Iceman. As the Iceman's mtDNA haplotype has not yet been detected among thousands of sampled contemporary individuals, it has been difficult to assess his genetic ancestry with high accuracy. The first analysis was to determine if the Iceman's autosomal DNA shows an affinity to any specific population or if he remains an outlier among contemporary samples. We intersected genotype calls of greater than ×6 coverage from his genome with the population reference sample consisting of more than 1,300 Europeans genotyped for SNPs on the Affymetrix 500K array11,12, 125 individuals from seven North African populations ranging from Egypt to Morocco on the Affymetrix 6.0 array and 20 Qatari samples from the Arabian Peninsula13. When plotting the Iceman's genotype along the first two major axes of variation in Principal Component space (PC1 versus PC2), PC1 is driven by a north-to-south gradient differentiating North Africans from Europeans, and PC2 aligns individuals along north-to-south gradient within Europe (Fig. 3a). The Iceman clusters nearest to southern European samples, suggesting no greater genetic affinity with the North African or Middle Eastern components of variation than present day southern Europeans (Fig. 3a). When considering only European populations, however, we observe that the Iceman clusters closest with five outlier contemporary samples from south-western Europe. In particular, the Iceman abuts the Italian samples originating from geographically isolated regions such as Sardinia (Fig. 3b). Analysis of a larger set of samples, including Sardinians from HGDP across a smaller subset of SNPs, further supports the clustering of the Iceman with samples from Sardinia based on autosomal SNPs (Fig. 3f).

Figure 3: Autosomal and Y-chromosome evidence of the Iceman's origin. (a) Principal component analysis (PCA) of the Iceman genome and SNP array data for European, North African and Near Eastern populations. Population samples were as follows: Europe S (for example, Italy), Europe SW (Spain/Portugal), Europe W (for example, France), Europe C (for example, Germany), Europe NW (for example, UK), Europe NNE (for example, Sweden), Europe SE (for example, Greece), Europe ESE (Cyprus/Turkey), North Africa (Algeria, Libya, Egypt, Morocco, Sahara and Tunisia) and Near East (Qatar). PCA was performed using 123,425 SNPs. (b) PCA projecting the Iceman into genetic clusters within Europe. PCA was performed using 132,981 SNPs after intersecting the Iceman genome with 1,387 European samples from the PopRes dataset. Samples clustering close to the Iceman were of Sardinian origin. (c) The phylogenetic relationship of Y-chromosome haplogroup G2a4 within haplogroup G was constructed using markers ascertained in the Iceman as listed in Supplementary Table S7. Unboxed and boxed marker labels indicate observed ancestral and derived allele status, respectively. (d) Spatial frequency distribution of G2a4-L91 chromosomes in Europe. Appropriate sample locations of 7,797 individuals from 30 populations are indicated with red dots. (e) Spatial distribution of L91 frequency in various regions of Corsica and Sardinia. Marker L91 was genotyped by HaeIII RFLP analysis, which identifies a G to C transversion at position 250 within the 447-bp PCR fragment amplified using primers (F) 5′-ctttgccattcatgcaaagg-3′and (R) 5′-gtgagagtgctcagccagtc-3′. The frequency data were converted to spatial-frequency maps using Surfer software (version 7, Golden Software, Inc.), following the Kriging procedure. (f) PCA of the Iceman genome combined with SNP array data for 1,387 European samples from PopRes and 28 Sardinians from HGDP. PCA was performed using 28,003 SNPs. Full size image

A potential reason for this may be an error in genotype calls due to low-coverage sequencing. To evaluate this hypothesis, we repeated the principal component analysis (PCA) analysis including data from five HapMap CEU individuals with high-confidence genotype calls based on multiple array platforms that were also sequenced as part of the 1,000 Genomes Project. To establish controls for the PCA analysis and to address any issues that may arise from the low sequencing coverage of the Iceman, we randomly chose five CEU individuals from the Pilot 1 data release, which had a similar (or lower) coverage to the Iceman, and were sequenced using the same SOLiD technology. The 1,000G CEU sample genotypes were inferred using a single-sample procedure similar to that used for the Iceman (Supplementary Methods). SNP array genotype data for the same individuals were obtained from HapMap release 28. The merged HapMap/1,000 Genomes/PopRes/Iceman dataset contained 125,729 SNPs. The 1,000 Genomes control samples were projected onto the PC space inferred from the rest of the dataset using both the low-coverage SNP calls and the high-confidence HapMap genotypes. The low-coverage 1,000 Genome control samples fall closer to the middle of the PCA plot, near central Europe, when compared with their corresponding genotype HapMap data (Supplementary Fig. S3a). The 1,000 Genomes control samples illustrate how low-sequence-coverage genomes tend to project towards the averaged centre of the distribution due to the missing genotypes. Thus, the Iceman's position along with the 1,000 Genomes control samples are shifted towards the origin as those missing genotype values are set to 0 after the normalization step in the PCA. When using a ×12 threshold (though with fewer loci) the placement of the Iceman shifts from the European continent closer to the Sardinian sample group, as there are no missing genotypes, reflecting his true position more accurately. (Supplementary Figs S3b–d show placement of the Iceman on PopRes PCA based on varying degrees of coverage and thresholding for missing genotypes.)

To determine the Iceman's paternal ancestry, his Y-chromosome haplogroup allocation was assessed according to the hierarchical order of markers organized by the International Society of Genetic Genealogy14. The detection of four phylogenetically equivalent SNPs15 and an independent verification of the rs2032636 (M201) SNP by Sanger sequencing of a PCR amplicon obtained using genomic DNA isolated from a 2007 muscle biopsy unequivocally assigns the Iceman to haplogroup G, specifically subgroup G2a (Supplementary Table S7). While haplogroup G displays its highest frequency in the Caucasus16, it is also present at ∼11% in present day Italy17. Genetic analysis of ancient DNA from 5,000-year-old skeletons from a burial cave site in southern France were mostly assigned to haplogroup G2a-P15 (ref. 18). Haplogroup G2a3 was also reported in an ancient DNA sample of an early Neolithic individual from Saxony-Anhalt, Germany19. We used the phylogenetic relationships of the ancestral and derived allelic states for 13 other detected SNPs within the G hierarchy (Supplementary Table S7, Fig. 3c) to place the Iceman into the G2a4-L91 lineage that is notably divergent from G2a3-U8 lineages typical of modern continental Europeans. Knowledge about the phylogeography of haplogroup G2a4 is unreported. We addressed this issue here by analysing the G2a4-defining L91 SNP in 7,797 chromosomes from 30 regions across Europe. Fig. 3d shows the spatial frequency distribution of G2a4 throughout Europe. The highest frequencies (25 and 9%) occur in southern Corsica and northern Sardinia, respectively, (Fig. 3e) while in mainland Europe the frequencies do not reach 1%.

We also performed a detailed analysis of the Iceman's phenotype and the diseases he may have suffered from. One trait associated with the beginning of agriculture in Europe is lactase persistence (the ability to digest milk after early childhood), which is commonly associated with a polymorphism in the MCM6 gene (−13,910*T)20. Palaeogenetic analyses from various prehistoric sites failed to detect the derived allele in any of the tested Neolithic samples, indicating that lactase persistence was rare in the Neolithic and, due to the substantial selective advantage conferred by this trait, gained in frequency over the next millennia and was widespread in Central Europe by the Middle Ages21. Comparisons between genotype and phenotype (diagnostic methane–hydrogen breath test) in South European individuals have shown that the homozygous ancestral C allele causes clinical symptoms in over 85% of cases22. Although a small number of genetically non-lactase-persistent individuals show no malabsorption problems, this may not least depend on the age variation of the study group: the onset age of lactose malabsorption differs between populations, sometimes not becoming manifest before the 30th year of life. As the Iceman's genome displays the homozygous ancestral allele at this site (coverage 14-fold, independently replicated by PCR), he was in all probability lactose intolerant as an adult.

Computer tomography scans of the Iceman recently revealed major calcification in carotid arteries, distal aorta and right iliac artery as strong signs for a generalized atherosclerotic disease1 (Fig. 4). As his lifestyle inferred by radiological and stable isotope data did not entail major environmental cardiovascular risk factors, we looked for genetic risk factors, specifically for SNPs linked with cardiovascular disease in genome-wide association studies. The Iceman was homozygous for the minor allele (GG) of rs10757274, which has been repeatedly confirmed as a major risk locus for CHD. This genotype shows an up to 40% increased risk (confidence interval=1.19–1.60 in the Copenhagen City Heart Study and 1.09–1.52 in the Atherosclerosis Risk in Communities Study) for development of a clinically manifest CHD in different ethnicities, independent of classical risk factors23. This SNP was furthermore identified as a risk locus for ischaemic stroke (odds ratio=1.59 (1.20–2.11), P=0.004)24,25 and sudden cardiac death (meta-analysis of six independent cohort studies: odds ratio=1.21 (1.04–1.40), P=0.01)26. We also identified a homozygous minor allele of rs2383206 (GG), which is another major CHD and ischaemic risk SNP with a hazard ratio of 1.26 (1.07–1.48)23 and 1.30 (1.06–1.58)27, respectively. In combination with rs10757274 the risk for developing CHD almost doubles28. In addition, the Iceman's genome harbours endothelin receptor type B heterozygote variant rs5351, which independently increases the risk for atherosclerosis in men29. In addition, we found SNPs in three genes that have been associated with CHD, namely VDR, TBX5 and BDKRB1. While we detected the SNP rs2228570, located in a start codon of the gene VDR, for genes TBX and BDKRB1 we found novel mutations in the respective stop codons (Supplementary Table S8).

Figure 4: CT image of abdomen and coronal reconstruction. The aorta is visible as a strip-shaped structure to the right of the vertebral column. Two calcifications constituting atherotic plaques are visible at the left contour of the aortic bifurcation1. Full size image

Several SNPs in the OCA2 and HERC2 genes identified as being associated with iris colour were analysed. The most strongly associated variant, rs12913832, shows the homozygous A allele in the Iceman's genome (coverage depth ×31), which is associated with brown eye colour in over 80% of cases even when regarded alone30. Branicki et al.31 defined a haplotype of five SNPs as predictor: rs4778138, rs4778241, rs7495174, rs12913832 and rs916977. The Iceman's haplotype (for which TTTTC and TTTTT were addressed together as the last SNP is a clear heterozygote) in the Branicki study encompassed 46 individuals, 8 of which were blue eyed, the residual 38 having a non-blue eye colour. The haplotype defined by Sturm et al.30 narrows the field down further: individuals who shared the Iceman's A-AAA haplotype had blue eyes in under 5% of cases, 40% having green or more often hazel eyes, while 55.7% of individuals had brown eyes.

Further, phenotypically relevant variants include a homozygous deletion (coverage depth ×7) at rs8176719, a T allele at rs505922 and lack of significant deletions in the RHD gene, which are characteristic of blood group O Rh-positive carriers32,33.

Metagenomic analysis

Metagenomic analysis was performed using the BlastN algorithm. Despite the presence of an exceptionally high number of human reads (77.6%), only a small proportion of the assignable reads derive from the bacterial domain (0.84%). Most prevalent bacterial species were found to be in the phylum Firmicutes, mostly the spore-forming Clostridia (72%). Within the Spirochaetes phylum, 45.7% of the reads (0.16% of the total bacterial hits) were assigned to sequences of the pathogen B. burgdorferi, which is known to cause Lyme disease in humans (Fig. 5). To confirm this result of the metagenomic approach, the entire number of reads was mapped against the B. burgdorferi reference genome and the strain-specific B. burgdorferi plasmids. One million sequence reads covered about 60% of the B. burgdorferi genome including a high representation of the bacterial episome, which was covered up to 67% at least once. While several reads were specific for Borrelia, the 60% coverage still represents an upper boundary as reads may map to other species besides Borrelia. The coverage of selected Borrelia genome and plasmid sequences for our NGS data is presented in Supplementary Table S9. Remarkably, B. burgdorferi sensu stricto infections have been linked to vascular calcifications34,35. In summary, our data point to the earliest documented case of a B. burgdorferi infection in mankind. To our knowledge, no other case report about borreliosis is available for ancient or historic specimens.