Factors influencing the number of detected NUMTs

To detect polymorphic NUMTs, genomes were scanned for read pairs mapping both to chromosomal DNA and mtDNA as described in [8]. A total of 221 genomes from Indonesia and Oceania, from three studies, the Indonesian Genome Diversity Project (IGDP) [28], Simons Genome Diversity Project (SGDP) [29] and Vernot et al. [26] were analysed. A summary of the results for each study is shown in Table 1. In total, 1222 distinct NUMTs not annotated in the human reference genome were found (Additional file 2), with 1197 not detected in the 1000 Genomes Project (1000 GP) [30] and Human Genome Diversity Project (HGDP) samples analysed previously by Dayama et al. [8]. The 25 NUMTs already detected by Dayama et al. [8] were all present in samples from Europe, East Asia, America and Sub-Saharan Africa. The majority of NUMTs found in one study were not found in either of the other studies. These differences in NUMT patterns, even between populations within Oceania, supports previous observations of ongoing NUMT insertions in humans [4]. On average, 16.3 NUMTs were detected in each sample from the three high-coverage studies, compared to the average of 1.5 per sample found in the low to medium-coverage 1000 GP dataset. In addition, the ratio of distinct NUMTs to the total number found is very low in the 1000 GP dataset compared to others. Higher coverage thus seems to strongly increase the detection rate of this approach.

Table 1 Summary of discovered NUMTs in the different studies analysed; results for the 1000 GP dataset are taken from Dayama et al. [8] Full size table

We further analysed differences in NUMT diversity between different studies and larger geographical regions, using rarefaction analysis. Figure 2a shows a comparison of rarefaction curves for the different studies. For the 1000 GP dataset the rarefaction curves saturate at a low level, compared to the curves for the high-coverage studies, which do not reach saturation. Coverage therefore has a significant impact on the discovery of NUMTs. We also investigated NUMT diversity in different geographic regions, controlling for coverage by focusing on samples from the SGDP study. NUMT diversity here is higher in sub-Saharan Africa than in Oceania and South Asia (Fig. 2b). Additionally, we downsampled high coverage genomes to lower coverages and screened them for NUMTs. This analysis showed a moderate correlation between coverage and the amount of NUMTs detected (r2=0.35,p≤2.2e−19, Additional file 1: Figure S1), suggesting that many NUMTs may be missed in genomes with a coverage below 10.

Fig. 2 Comparison of rarefaction curves. Rarefaction curves plotted as the mean out of 100 repetitions. a Results from Island South East Asia (ISEA) and Oceania (continuous lines) are compared with worldwide 1000 GP data (dashed lines). b Comparison of different geographic regions within the Simons Genome Diversity Project (SGDP) dataset Full size image

Ancestral and archaic NUMTs in Indonesian and Oceanian genomes

A sequence for each NUMT was reconstructed from the supporting reads. Those with a total sequence length below 20 base pairs (bp) were discarded. This cutoff was chosen to keep sequences that might contain enough phylogenetic information for further classification, but still filter out sequences that can not be classified. In most cases, one single sequence was generated. For some, multiple fragments were obtained and concatenated. These NUMTs were too long to be covered completely by short split reads or did not have sufficient read coverage in all positions. From one side of an insertion, a split read pair could span a part of the NUMT with a length of the sum of the mitochondrial read length, plus the insert size and the length of the clipped half of the chromosomal read. With an average read length of 150 bp and a median insert size up to 400 bp in the analysed samples, no single fragment longer than 1250 bp would be expected. The length of the fragments generated ranged between 26 and 774 bp (median: 70 bp), with the majority being shorter than 100 bp (Additional file 1: Figure S2a). For the high-coverage genomes used in our study, NUMTs with an overall mean coverage below 5x might result from sequencing artefacts. Sequences generated from such a low coverage are also more likely to be influenced by sequencing errors and therefore were discarded. As the coverage distribution for the NUMTs (up to 59x, median 14x, Additional file 1: Figure S2b) was not different from the rest of the genome sequence, filtering for excessive coverage was not required. GC-content for the NUMTs varied between 0.27 and 0.64 (median: 0.46) (Additional file 1: Figure S2c), similar to the the GC-content for the mtDNA genome (0.45), so no filtering based on GC-content was applied.

After filtering, a total of 2041 assembled fragments were obtained from all samples combined. These fragments belong to 172 distinct NUMTs. For each of these fragments the phylogeny was analysed by building a tree with the corresponding mtDNA fragments from various humans and hominins, using chimpanzee as an outgroup (Fig. 3).

Fig. 3 Phylogenetic trees for NUMTs. Maximum likelihood trees for putative ancestral NUMTs (a, b, c) and putative Denisovan NUMTs (d, e) in relationship with other hominin mtDNA sequences using distances based on nucleotide substitution rates. Ancestral NUMTs form a sister clade to at least all modern humans and are present in 1000 Genomes Project (1000 GP) samples from around the world, except for (c). Denisovan NUMTs are more similar to Denisovan mtDNA than to modern human mtDNA and are not present in 1000 GP samples. Bootstrap values over 50 are indicated at branch locations Full size image

The rooted trees were used to infer the origins of the NUMTs according to where their sequences fell within the hominin mtDNA phylogeny. To be able to classify a NUMT as either archaic, modern human or ancestral, at least some phylogenetic information is required, but the partial mtDNA sequences of some of the NUMTs are too short or conserved for accurate placement on the tree. Therefore only those trees for which either Denisovans, Neanderthals, modern humans, or Neanderthals and modern humans formed a monophyletic clade were considered for further analysis. A tree with a monophyletic clade of Neanderthals and Denisovans was also allowed as they both might represent the ancestral state for a region where modern humans show derived alleles. Additionally, trees placing the NUMT outside of all humans were used to classify it as ancestral to all humans. As this study aimed to detect archaic NUMT insertions, only those classified as not arising from modern humans were further analysed. To exclude that these are still part of modern human variation, pairwise nucleotide distances within and between modern humans, Neanderthals and Denisovans were calculated as in Fig. 4. For 98 NUMTs no useful tree could be generated. Either the alignments did not contain more than four distinct sequences or the trees did not contain any reasonable clades, as the sequences were too short or from mtDNA regions that were too conserved to contain enough phylogenetic information.

Fig. 4 Pairwise nucleotide distances between NUMTs and mtDNA. Pairwise nucleotide distances vs. frequency (in logarithmic scale) within and between 97 modern humans (MHU, purple), 17 Neanderthals (NEA, blue), four Denisovans (DEN, green) and a specific NUMT (NUM, empty bars) for two ancestral NUMTs a, b and one Denisovan NUMT c Full size image

Most of the remaining NUMTs could be classified as modern human mtDNA according to their trees. For five NUMTs, the trees suggest an origin other than modern human mtDNA. Out of these five, two were also found by Dayama et al. [8] and are present in genomes from sub-Saharan Africa, Europe, Asia and America. The corresponding names from Dayama et al. [8] are given in brackets below.

Putative ancestral NUMTs

For NUMT 1_5966 (Poly_NumtS_67) the 58 bp sequence is identical with all corresponding Neanderthal and Denisovan mtDNA sequences, but differs at only one position from most modern humans (Fig. 3a). Therefore it cannot be confidently classified as archaic mtDNA. Taking into account its presence in worldwide modern human genomes, this NUMT was likely inserted in an ancestor of all modern humans, possibly even before the split with archaic humans, and therefore might resemble the ancestral state of this mtDNA region (Additional file 1: Figure S3). NUMT 2_3389 (Poly_NumtS_1239) forms a sister clade to all modern humans and Neanderthals (Fig. 3b). The 246 bp sequence inferred here is identical to that obtained by Dayama et al. [8] through Sanger sequencing. It is equally distant to modern humans, Neanderthals and Denisovans, but does not completely fall outside of modern human variation (Fig. 4a). Based on the comparison with an inferred ancestral human mtDNA, its age of insertion is estimated to be around 720,000 years ago, although this comparison does not allow a precise estimation [8]. This estimation falls into the proposed time range of the population split between archaic and modern humans of between 550,000 and 765,000 years ago [31] (Additional file 1: Figure S3). The presence of these two NUMTs in genomes from various populations around the world, including sub-Saharan Africa, suggests an insertion before the worldwide expansion of modern humans. NUMT hs37d5_2745 is 446 bp and was detected in the decoy sequences. These sequences are found in several de novo assemblies of the human genome, but are not present in the hg19 reference [32]. A BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi) for a 240 bp region flanking this NUMT in the National Center for Biotechnology Information database (https://www.ncbi.nlm.nih.gov/) showed the best hit for a human Bacterial Artificial Chromosome (BAC) clone from chromosome 4 (AC124864.3). The exact location of this BAC clone on chromosome 4 is not given, therefore the exact insertion site of the NUMT can not be localised. Its sequence forms a sister clade to all modern humans (Fig. 3c), but it does not fall outside of their variation (Fig. 4). The alignment contains only 15 positions with more than one allele in hominins. For one of these positions, the NUMT shares the ancestral state with chimpanzee and all archaic humans, whereas all modern humans share the derived allele (Additional file 1: Figure S4). This could indicate that the NUMT originated from an ancestral extinct or unsampled mtDNA lineage, but could also be due to a convergent mutation in the NUMT. Therefore it cannot be classified confidentially as ancestral.

Putative Denisovan NUMTs

The sequences for two NUMTs were identical to Denisovan mtDNA. NUMT 4_1787 was detected in five samples from west Indonesian populations speaking Austronesian languages (Additional file 1: Table S1) and is identical to the mtDNA sequence of all Denisovan individuals (Fig. 3d). However, the sequence obtained is only 43 bp long and differs from most other humans at just one position; thus, it cannot be confidently identified as Denisovan mtDNA.

NUMT 3_1384 is present in 15 samples from eastern Indonesia and New Guinea (Additional file 1: Table S1). A sequence of 251 bp was generated, which is identical to two Denisovan mtDNAs. It forms a clade with Denisovans and Sima de los Huesos, distinct from all other humans (Fig. 3e) and falls outside of all modern human and Neanderthal variation (Fig. 4c). The alignment contains 13 variable positions within hominins (Additional file 3). For five of these positions, Denisovans and the NUMT share an allele which differs from all modern humans. This suggests that it originated from Denisovan mtDNA rather than from mtDNA of a modern human or an ancestor of Denisovans and modern humans (Additional file 1: Figure S3). The phase of this NUMT was inferred in each sample using five phased genotypes (Additional file 1: Table S2).

A Denisovan NUMT introgressed as nuclear DNA

To determine if the archaic NUMT introgressed within Denisovan nuclear DNA as explained in Fig. 1, the flanking regions were analysed for Denisovan ancestry using phased genotype data. For NUMT 3_1384, alleles shared between modern humans and Denisovans were identified in the flanking region. These alleles could be shared due to an introgression event, incomplete lineage sorting or homoplasy. As Denisovan ancestry is low or absent in populations outside of Oceania [23, 27], any Denisovan alleles in these populations presumably reflect the latter two cases, and hence can be used to distinguish them from true introgression. Figure 5a shows the frequencies of alleles in the SGDP and 1000 GP dataset shared between a Denisovan and an Oceanian sample containing NUMT 3_1384. Shared alleles with low frequencies in these worldwide datasets are abundant before and around the NUMT, suggesting the presence of an introgressed haplotype. Shared alleles further away from the insertion point mostly show higher frequencies in the worldwide datasets, suggesting that these alleles are either shared due to incomplete lineage sorting or homoplasy. Similar distributions of low and high frequency shared alleles were observed for all haplotypes containing NUMT 3_1384, indicating the end of that putatively introgressed haplotype around position chr3:13851000. In contrast, Fig. 5b shows the frequencies of shared alleles in the same region for a sample from sub-Saharan Africa, containing very few low-frequency shared alleles.

Fig. 5 Worldwide frequencies of alleles shared with the Denisovan genome. Worldwide frequencies of alleles shared with the Denisovan genome in the region flanking NUMT 3_1384 in (a) Oceanian sample UV925, and (b) the same region in the sub-Saharan African sample NA19309. The alleles are within 20 kbp before or after the insertion site (vertical dashed line) and on the same phase. Frequencies were calculated in the 1000 Genomes Project (1000 GP) (black) and Simons Genome Diversity Project (SGDP) (grey) datasets. X-axis intervals are not linear, but indicate positions of shared alleles. High-frequency shared alleles reflect either homoplasy or incomplete lineage sorting; the greater abundance of low-frequency alleles shared with the Denisovan genome close to the insertion point (grey area) in the Oceanian sample vs. the sub-Saharan African sample suggests a Denisovan-introgressed haplotype in the Oceanian Full size image

For the region around the insertion site of NUMT 3_1384, match ratios with the Denisovan genome were calculated as described above. Introgressed haplotypes are expected to show a higher match ratio than non-introgressed haplotypes. Due to the absence of Denisovan ancestry in European and sub-Saharan African populations, these populations can be used to estimate the expected distribution of match ratios for the haplotype of interest due to incomplete lineage sorting or chance; values higher than expected from this distribution suggest Denisovan ancestry. On average, 53 phased sites contained a non-reference allele, and the distribution of match ratios is shown in Fig. 6 (min = 0, Q1 = 0.13, mean = 0.25, Q3 = 0.35, max = 0.83,). Match ratios for haplotypes containing NUMT 3_1384 range from 0.59 to 0.83 with the exception of sample S_Papuan-5 (0.48). Except for this sample, they represent the highest 0.4% match ratios together with eight other haplotypes from two Europeans, two Southeast Asians, three Oceanians and one Indonesian. The regions flanking NUMT 3_1384 show higher similarity with the Denisovan genome than the same regions in all other analysed samples, suggesting a Denisovan origin of these flanking regions. The exception for sample S_Papuan-5 may reflect depletion of Denisovan ancestry through recent recombination around the insertion site.

Fig. 6 Match ratios of modern human genomes with the Denisovan Genome. Distributions of match ratios between modern humans and Denisovans for a 20 kbp region around NUMT 3_1384 in all samples from the 1000 GP, SGDP, IGDP and Verot et al. [26]. Match ratios were calculated for each phase individually counting shared non-reference alleles. Black bars represent haplotypes containing the NUMT insertion, grey bars represent haplotypes without the NUMT insertion Full size image

For samples containing a heterozygous Denisovan NUMT insertion, the average match ratio of haplotypes containing the insertion (mean = 0.74) was significantly higher (one tailed paired t-test, p = 2∗10−9) than the average match ratio of haplotypes lacking the insertion (mean = 0.25). This further suggests that the NUMT insertion is part of a haplotype introgressed from Denisovans, whereas haplotypes for the same region but lacking the NUMT are not derived from Denisovans.

The reference genome influences the detection of archaic NUMTs

NUMTs originating from Denisovan mtDNA might not be detected in genomes mapped against modern human mtDNA due to their diverged sequences. To test the effect of this potential reference bias, we remapped ten samples from Vernot et al. [26] (Additional file 1: Table S3) against a reference genome containing Denisovan mtDNA instead of modern human mtDNA. No additional NUMTs were detected. On average only 1.9 NUMTs were detected per sample, many fewer than the 16 NUMTs per sample in the original mapping files (Table 1). Calculating the coverage along the Denisovan reference mtDNA reveals gaps where only few or no modern human reads could be aligned (Additional file 1: Figure S5). Many modern human reads containing mtDNA sequences could not be aligned to the Denisovan reference, which reduces the number of detectable NUMTs. This also suggests that some Denisovan NUMTs might not be detected due to the use of a modern human reference for all analysed samples; however the regions that differ greatly between Denisovan and modern human mtDNA are short enough that reliable detection of NUMTs involving only those regions would be difficult anyway.

Short read length complicates the identification of archaic NUMTs

Read pairs originating from NUMTs longer than 600 bp might not overlap the insertion site. Such read pairs will not be detected by our approach, but might contain important phylogenetic information. Therefore we analysed reads overlapping diagnostic mtDNA positions for different hominin lineages from Meyer at al. [31]. For the Neanderthal and Denisovan lineages, we could not detect more than five reads supporting the same diagnostic allele. For the Sima de los Huesos lineage, five reads supported a diagnostic allele at position 13391 in sample UV1134. An average of seven diagnostic positions for the Denisovan-Sima de los Huesos lineage were supported by 5-47 reads in each sample. This range of coverage is closer to the range of the nuclear chromosomal coverage of up to 50 x than to the mitochondrial coverage of around 1000 x.

For some of these positions, the sequence generated for the surrounding region falls outside of modern human variation and shows a phylogeny more similar to Denisovans and Sima de los Huesos than to modern humans (Fig. 7). This phylogeny, and the fact that the coverage is similar to the chromosomal coverage, suggests that these reads might originate from NUMTs that introgressed from an archaic hominin. They might belong to a detected NUMT which is too long to be covered by split reads, and the end of the NUMT might consist of conserved mtDNA regions that therefore do not distinguish this NUMT from NUMTs arising from other hominins.