We analyzed (i) genome-wide DNA data extracted from teeth of the skull (MCE1356, Fig. 2b , Supplementary Fig. S1 ) and from tissue samples of eight belugas and eight narwhals sampled in Disko Bay, West Greenland, and (ii) stable carbon and nitrogen isotopic compositions of bone collagen from MCE1356 and a reference panel of 18 belugas and 18 narwhal skulls from West Greenland (Fig. 1c ). Tissue samples (skin) were stored in 96% ethanol. The samples were collected by scientists from the Greenland Institute of Natural Resources under the general permit for biological sampling of the Inuit from the Greenland Government and exported to Denmark under CITES permit IM 0401-897/04, IM 0721-199/08, IM 0330-819/09 and 116.2006. Sample information is detailed in Supplementary Table S2 .

Unlike the tissue samples, old bone and teeth samples have relatively low DNA concentrations, which necessitates different extraction and library build protocols. Approximately 0.5 g of bone powder from five teeth and one bone shard from specimen MCE1356 was drilled using a hand-held Dremel. DNA was extracted from the tooth/bone powder in a dedicated ancient DNA clean laboratory at the Natural History Museum of Denmark, University of Copenhagen, using the extraction buffer described in 11 with the addition of a 30 minute predigest stage 12 . Instead of using Zymo-Spin V columns (Zymo Research), the extraction buffer was concentrated using Amicon Ultra 30 kDa Centrifugal Filter Units and further concentrated and cleaned using Qiagen Minelute tubes. The purified extracts were then built into Illumina libraries following the protocol described by 13 . We used qPCR to check that the library build was successful, to select which libraries to sequence, and to calculate the appropriate number of PCR cycles required to sufficiently amplify each library without causing overamplification. In total, four libraries were amplified with unique 6 bp indexes, and screened for endogenous content on the Illumina MiSeq platform using 250 bp SE sequencing. The best libraries were then re-sequenced on the Illumina HiSeq 2500 platform using 80 bp SE technology.

DNA from tissue samples was extracted using the Qiagen Blood and Tissue Kit following the manufacturer’s protocol. The DNA was fragmented using the Covaris M220 Focused-ultrasonicator to create ~350–550 base pair (bp) fragment lengths. Libraries were built from the fragmented DNA extracts using Illumina NeoPrep following the NeoPrep Library Prep System Guide applying default settings. PCR amplification, quantification, and normalization were all carried out by the NeoPrep Library Prep System. The libraries were screened for size distribution on an Agilent 2100 Bioanalyzer and pooled in equimolar ratios before sequencing on an Illumina HiSeq 2500 with 80 bp SE technology.

All mapping and DNA damage analyses were performed within the Paleomix pipeline 1.2.12 14 . Reads were trimmed with AdapterRemoval 2.2.0 15 using default settings except minimum read length which was set to 25 bp. Reads were inspected using FastQC and aligned with BWA 16 applying the Backtrack algorithm and disabling the starting seed length. If reads mapped to multiple positions or had mapping quality scores (MAPQ score from BWA) less than 30, they were removed using SAMtools 17 . Sequence duplicates were removed using MarkDuplicates from Picard (available from: http://broadinstitute.github.io/picard ) and the final alignment was realigned around indels using GATK 18 . Deamination of cytosine to uracil in specimen MCE1356 was assessed using the output from mapDamage v2.0.6 19 . MapDamage results did not show any clear signal of deamination in the sequences (Supplementary Fig. S3 ).

To determine the maternal lineage of MCE1356, DNA sequencing reads were mapped to the beluga (GenBank accession: KY444734) and narwhal (GenBank accession: NC_005279) mitochondrial reference genomes and mean coverage was compared. Reads from the eight beluga and eight narwhal samples that comprised the reference panel, were mapped to their respective mitochondrial reference genomes. We constructed mitochondrial consensus sequences of MCE1356 and the 16 reference panel samples with regions covered by more than five reads using BEDtools 20 , SAMtools 17 and GATK 18 . We created two sequence alignments using ClustalW 21 , applying default settings, which included the 16 reference panel samples and either the version of MCE1356 mapped to the beluga mitochondrial reference genome or the the version of MCE1356 mapped to the narwhal mitochondrial reference genome. We used the two alignments to construct median-joining haplotype networks 22 using Popart 1.7 23 (available from: http://popart.otago.ac.nz ), excluding any sites with indels or missing data. Subsequently, both post-mapping coverage and the two haplotype networks were used to determine the species of MCE1356′s maternal lineage.

Nuclear DNA analyses

We mapped the DNA sequencing reads from all samples to the killer whale (Orcinus orca) reference genome (GCA_000331955.2). A high-quality beluga whale genome was recently published24, but mapping to one of the two potential parental species could create biases in the analyses. Hence we mapped the reads to the killer whale genome, as it is well assembled and killer whales are still relatively closely related to belugas and narwhals (divergence time of 12 MYA)25, yet distant enough to minimize the risk of introgression that would complicate our analyses.

For all further filtering we used ANGSD v0.92326, a software package that uses genotype likelihoods instead of called genotypes, which is particularly useful when analysing low-coverage NGS data. We used the SAMtools17 method implemented in ANGSD to estimate genotype likelihoods, and inferred major and minor alleles directly from the genotype likelihoods using a maximum likelihood approach as described in27.

To visualise the mapped data, we plotted the read depth distribution from all individuals, excluding sites with more than two alleles and sites with a Phred score below 25. Visualising the data from all individuals combined enabled us to estimate the mean read depth of 4.14x and identify sites with elevated read depth. Such sites were more likely to come from paralogs and repetitive regions of the genome. The dataset was visually inspected and further filtered, excluding all sites with read depth greater than nine (6.9% of sites). In ANGSD, SNPs were called based on their allele frequencies. The minor allele frequency (MAF) was estimated from the genotype likelihoods and a likelihood ratio test was used to determine if the MAF was different from zero. If the p value from the likelihood ratio test was <1e-4 the site was considered polymorphic and retained in the dataset. Applying this SNP filter meant that no sites with less than four reads were retained, as sites covered by fewer reads could not obtain p values this low. These filters were applied to a data set including all 17 samples, and a dataset without MCE1356.

To determine whether MCE1356 was a beluga/narwhal hybrid, we further filtered the dataset of 17 individuals, excluding sites with no reads in MCE1356. At this point the mean read depth of MCE1356 was only 1.15x, so in order to ensure that we were not analyzing multicopy loci, we excluded sites covered by more than one read in MCE1356.

Our aim was to compare the alleles found in MCE1356 to the alleles in the reference panel, so we estimated the probability of assigning the allele found in MCE1356 to the wrong parental species given different reference panel minimum unique read depths and allele frequencies. Unique read depth was defined as the number of reads covering a specific site, where all reads came from a unique individual. This probability P was calculated as in Equation 1:1 \documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$P={f}_{(ps)\,}^{\,ur{d}_{(psp)}}\times \,(1-{f}_{(ps)})$$\end{document} P=f(ps)urd(psp)×(1−f(ps)) Where f (ps) is the allele frequency in the parental species, and urd (psp) is the species specific unique read depth in the reference panel.

The parental species allele frequency giving the highest probability f (ps-max) could be described as in Equation 2:2 \documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$${f}_{(ps-max)}=(\frac{ur{d}_{(psp)}}{ur{d}_{(psp)}+1})$$\end{document} f(ps−max)=(urd(psp)urd(psp)+1)

By inserting Equation 2 into equation 1, the maximum probability P (max) of assigning the allele found in MCE1356 to the wrong parental species was calculated as in Equation 3:3 \documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$${P}_{(max)}={(\frac{ur{d}_{(psp)}}{ur{d}_{(psp)}+1})}^{ur{d}_{(psp)}}\times (1-(\,\frac{ur{d}_{(psp)}}{ur{d}_{(psp)}+1}))$$\end{document} P(max)=(urd(psp)urd(psp)+1)urd(psp)×(1−(urd(psp)urd(psp)+1))

Results revealed that with a unique read depth of two, three and four in each parental species, the maximum probability of assigning the allele found in MCE1356 to the wrong parental species was 0.148, 0.105 and 0.082, respectively. These maximum values should not be interpreted as error rates, as they would only be obtained if all sites were variable within the parental species, and all sites had a MAF of exactly (1 − (urd/urd + 1)). Furthermore, assuming that the MAF distributions in belugas and narwhals were similar, erroneous assignment of alleles found in MCE1356 would affect both species equally, and therefore have a minimal influence on inferences of hybridization. A benefit of using the unique read depth in equations 2 and 3 was that it combined the number of individuals and number of reads in the estimation of the probability of assigning the allele found in MCE1356 to the wrong parental species. Subsequently, the read depth distribution, MAF (with fixed major and minor allele), and number of individuals with data in each site were calculated separately for each parental species.

Three datasets were constructed, which besides MCE1356 included parental species panels with minimum unique read depths of two, three and four. The number of sites retained in the three datasets was 61,105, 11,764 and 360, respectively. As a compromise between maximizing the number of sites and minimizing the wrong assignment of alleles found in MCE1356, we chose to use the dataset with one read in MCE1356 and minimum unique read depths of three in each parental species. That dataset included 11,764 sites, which were used in subsequent analyses.

Summary statistics were performed on the dataset with 17 individuals, including number of sites that were (i) fixed for different alleles in the beluga and narwhal species panels; (ii) polymorphic in belugas, but not in narwhals; (iii) polymorphic in narwhals, but not in belugas; (iv) polymorphic in both belugas and narwhals. The sites that are estimated to be fixed between the two parental species panels will be enriched for markers that are highly differentiated, i.e. have large allele frequency differences, between the two parental species. These markers, although not necessarily fixed differences between the two parental species, are still highly informative for ancestry in MCE1356.

We used the genotype likelihoods from the dataset without MCE1356 to verify that the belugas and narwhals in the reference panel were not themselves recently admixed individuals. We estimated their individual admixture coefficients while specifying two populations (K = 2) using NgsAdmix28. One hundred runs were performed and mean and standard deviation of the admixture coefficients were used for subsequent interpretation. To confirm that the filters had not revealed previously hidden admixed genetic profiles in the reference panel, this analysis was performed both before and after the unique read depth filters were applied.

We analysed the admixture proportions of MCE1356 using fastNGSadmix29, applying 1,000 bootstraps. fastNGSadmix uses allele frequencies of reference populations/species and the genotype likelihoods of a single individual to estimate its admixture proportions. The software has proven robust with NGS data with coverage as low as 0.00015x29, and was therefore ideal for our study.

We further estimated the hybrid status of MCE1356 by investigating sites fixed for different alleles in the beluga panel and the narwhal panel (9,178 sites), and comparing the observed proportion of reads matching the beluga-specific allele and the narwhal-specific allele in MCE1356 to the expected proportions under different hybridization scenarios. To determine how well seven different hybridization scenarios matched the observed data, we computed a Pearson’s Chi-square goodness of fit statistic. The test statistic is computed as in Equation 4:4 \documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$T=\Sigma (\frac{{(Oi-Ei)}^{2}}{Ei})$$\end{document} T=Σ((Oi−Ei)2Ei) where O i and E i are the observed and expected counts of alleles derived from parental species i, respectively. Under the null hypothesis where the chosen scenario corresponds well to the observed data, the test statistic T follows a central χ2 distribution. Thus, the scenario that corresponds best with the observed data would lead to the lowest test statistic.

To further investigate the seven different hybridization scenarios, we computed the likelihood of the observed alleles at sites that were fixed for different alleles in the parental populations. Specifically, we computed the likelihood of the observed alleles in the hybrid MCE1356, under a binomial model for the inheritance of the alleles from the parental species, further assuming independence between markers. We would like to note, however, that the violation of the independence assumption still leads to unbiased estimates. Assuming that the hybrid MCE1356 is composed of a proportion b of beluga ancestry and (1-b) of narwhal ancestry, we can write the likelihood of b as a product of the likelihood at k independent sites, given the number of reads n ib that match the beluga allele at site i and n in that match narwhal ancestry, as in Equation 5.5 \documentclass[12pt]{minimal}\usepackage{amsmath}\usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy}\usepackage{mathrsfs}\usepackage{upgreek}\setlength{\oddsidemargin}{-69pt}\begin{document}$$L(b|{n}_{b},{n}_{n})={\prod }_{i=1}^{k}L(b|{n}_{ib},{n}_{in})={\prod }_{i=1}^{k}P({n}_{ib},{n}_{in}|b)={\prod }_{i=1}^{k}(\begin{array}{c}{n}_{ib}+{n}_{in}\\ {n}_{ib}\end{array}){b}^{{n}_{ib}}{(1-b)}^{{n}_{in}}$$\end{document} L(b|nb,nn)=∏i=1kL(b|nib,nin)=∏i=1kP(nib,nin|b)=∏i=1k(nib+ninnib)bnib(1−b)nin

We computed the log-likelihood of b across its whole range, from 0 to 1, and compared the log-likelihoods across the seven different hybridization scenarios.