Contamination estimates and SNP discovery

We sequenced DNA extracted from the inside of the gourd using Illumina technology (Supplementary Fig. S2 and Supplementary Methods). Due to the inefficiency of the sample, several lanes were used in different platforms resulting in a total 2.5-fold genome wide effective coverage and 7.3-fold exome coverage (Supplementary Tables S1–S2). About 24% of the sequences were human, 46% corresponded to Pseudomonas (a group of bacteria comprising plant pathogens), 27% to unknown origin and 1.3% to other organisms, including fungi (0.47%) and of course, Cucurbitaceae (0.1%) (Fig. 2) most likely derived from the gourd itself. The filtered human reads displayed the characteristic DNA damage patterns seen at the end of ancient sequences, although the signal was lower in comparison to that seen in much older samples (Supplementary Fig. S3).

Figure 2 Metagenomics analysis and species composition of the gourd's sample. Full size image

Considering the highly heterogeneous use of a handkerchief and the gourd itself, we anticipated the presence of human contaminants in the sample. To estimate this, we made use of the higher coverage obtained from the mitochondria (~140×, mtDNA genome coverage). Different haplotypes were reconstructed from the mtDNA reads (Supplementary Table S3) with a predominant (73%) N1b1a2 haplotype, concordant with what had been previously determined by polymerase chain reaction (PCR)7; however, three additional haplotypes (H1a, J1c2c2 and K*), not shared with people involved in the experimental procedures were present in decreasing ratios (13%, 9.6% and 1.9%, respectively) adding up to a total upper limit of contamination of ~24%.

To obtain nuclear contamination estimates we followed a previously developed procedure11 that takes advantage that only one X chromosome is expected to be present in a male individual (and thus, a single allele at each site). The contamination tests are based on the analysis of known polymorphic positions (present at 1000 Genomes Project Phase 1 data1) with low coverage (4× to 10×) and the comparison of their mismatch rate to adjacent sites (e.g., less than 5 bases apart). If the sample is contaminated, the number of mismatches at the known polymorphic positions will tend to be significantly higher than in adjacent sites. We obtained a 19.4% (CI 95% = 0.179–0.200) value of potential nuclear DNA contamination (Supplementary Table S4). We also checked the diagnostic haplogroup positions at the Y-chromosome finding a similar value of 17% (CI95% = 0.063–0.277) of contamination (8 out of 47 reads) (Supplementary Table S5). Considering uncertainties associated to sampling and low coverage, it is likely that contamination is equally prevalent at the nuclear and at the mitochondrial genome.

The significant degree of contamination should emerge as heterogeneities in the resulting genotypes. However, we can safely assume that in regions with significant coverage, the 17–24% of background contamination will consistently emerge as the minor allele. Therefore in subsequent analyses we systematically assessed the major alleles.

To remove contaminants by selecting the major allele, we developed a procedure of eliminating SNPs with a substantial skewed allele imbalance (Supplementary Tables S6–S7, Supplementary Fig. S4). Thus, we finally obtained two sets of single-nucleotide polymorphisms (SNPs): a genome wide set of SNPs with >3× coverage (mean coverage = 5×) and a second set of SNPs with >9× coverage (mean coverage = 12×) restricted to the higher coverage exome data. In both sets, we used an allele imbalance removal model accounting for a contamination ratio of up to 30% (Supplementary Methods). While contamination can still remain problematic a lower coverages, we have estimated that in the case of 12× coverage the remaining contamination, according to a Bernoulli statistical model and conditional probabilities, range between 0.25% and 0.39% (Supplementary Methods).

The second set has a higher confidence due to the increased coverage; nevertheless we have calculated a genotype concordance between both datasets of 98%. Obviously, the conservative procedure selected had the side effect of removing real heterozygotes from the gourd's genome. Subsequent Sanger sequencing of 9 sites confirmed the undercalling of heterozygous SNPs (false negative rate of 33.3%) (Supplementary Table S8). Combining the two datasets in a non-redundant set of SNPs, we characterized a total of 1,208,005 SNPs contained in dbSNP, from which 314,549 (26.0%) were reported as heterozygous. Only in the exome, we reported 36,928 SNPs from which 43.75% are heterozygous. The overall Ti/Tv ratio is 1.91 (Supplementary Table S6).

We finally reasoned that our power to make functional interpretations from alleles at individual level could still be compromised; therefore, we applied a final filtering strategy, based on known linkage disequilibrium (LD) patterns in European populations. It is expected that contaminants (we known that there are at least three within the sample) will generate mixed haplotypes, thus breaking the LD blocks on our genome as compared to present-day Europeans. We checked if a random sample of haplotypes found in 60 CEU (Northern and Western European Ancestry) individuals from the 1000 Genomes Project pilot phase12 were shared with our genome (Supplementary Methods). We found 17.898 haplotypes (involving ~250,000 SNPs) composed by the same linked variants, ranging in number from 3 to 851 SNPs and showing an average similarity to the described CEU haplotypes over 93.5% (Supplementary Fig. S5). Moreover, the haplotypes in the gourd's genome tend to be among those found in higher frequencies in modern Europeans (average frequency over 60%) (Supplementary Fig. S6). These observations suggest that despite the contamination of the sample it is still possible to obtain reliable information on the prevalent genome present in the gourd.

Ancestry inference

According to the historical records, the genealogical track of the king Louis XVI was highly admixed, with a major Central European contribution (Supplementary Table S9). His 16 great-great-grandparents were from present-day Germany (N = 8), Austria (N = 1), Poland (N = 4), Italy/France -House of Savoy- (N = 1) and France (N = 2, one corresponding to Louis, Grand Dauphin of France (1661–1711)) (Supplementary Table S9).

We first determined the gourd's blood paternal ancestry, using diagnostic SNPs in the Y-chromosome, as defined by ISOGG (2014). Several derived SNPs define a G2a2a haplogroup (Supplementary Table S5); again, this result is concordant with that previously found with the AmpFlSTR Identifiler kit7.

An identity-by-descent (IBD) tract sharing analysis showed very few IBD tracts with extant Europeans, likely due to gaps in the genomic coverage and undercalling of heterozygous sites. The sequenced individual shared the highest number of fragments with a French individual, but shared the longest tract with a Belgium individual (Supplementary Table S10).

We also performed a principal component analysis (PCA) on the >9× SNP exome dataset, using 236 individuals from the 1000 Genomes Project1 for comparison (16,635 SNPs). To minimize differences between our low coverage genome and the reference genotypes, we randomly sampled a single haploid allele from each individual prior to PCA analysis13. The gourd's individual can be found close to the CEU distribution, slightly displaced towards the Iberian samples and Tuscan samples from Northern Italy (Fig. 3). Subsequently, to place the gourd's genome in a more precise European context, we performed a genome wide PCA using the European populations of POPRES14. Because only 3% of our exome data intersect with POPRES markers, we utilized all SNPs passing our filters genome-wide in this analysis -SNP dataset increased to about 100,000. We found the sequenced individual clusters with Northern Italian individuals (Fig. 4a) which is not what we could expect owing the known ancestry of Louis XVI. This data set is obviously more influenced by the contamination of the sample, due to the less efficient allele imbalance removal associated to the low coverage. To explore if Louis XVI ancestry at the level of his great-great grandparents was compatible with the PCA results, we have randomly generated 30 composite genomes with the proportions of his known ancestry and generated a PCA with POPRES European populations. This simulation showed that the expected position of the king in the PCA is among present day Central European populations (Germany and Poland) (Fig. 4b).

Figure 3 Principal component analysis (PCA) of the gourd's genome SNP data at >9× coverage and compared to European individuals of the 1000 Genomes Project. The analysis was based on 47 randomly selected individuals per population (FINS, CEU, GBR, TSI), except for the Iberians, where all six available individuals were included. A total of 32,701 SNPs were included after filtering for heterozygous positions in order to account for the background contamination. Full size image

Figure 4 (a) PCA of the merged POPRES dataset and the gourd's genome data. For POPRES individuals, abbreviations and colors for each sample follow conventions of a previously published analysis24 in which two letter codes represent population identifiers. The black dot represents the individual data from the gourd's genome. A total of 101,107 SNPs (>3× coverage) were included. (b) PCA of the merged POPRES and 30 randomly composite genomes generated from the known ancestry of Louis XVI at the great-great grandparents generation. For POPRES individuals, abbreviations and colors for each sample follow conventions of a previously published analysis24 in which two letter codes represent population identifiers. Composite individuals are represented by uppercase C. A total of 196,526 SNPs were included. Full size image

In summary, the results of these analyses do not support the royal identity of the sequenced genome. However, given Louis XVI's complex great-great-grandparental ancestry, the low coverage and the aggressive filtering to remove minor alleles from the current dataset, we consider it could still be possible, although implausible, that the gourd's blood could be that of the French king. Alternatively, it could be that one of the so-called contaminants detected at mitochondrial level may in fact correspond to the king's blood; however, we do not have any evidence (e.g., Bourbon Y-chromosome SNPs among our reads) that could support this assumption.

Functional exome assessment and phenotypical traits

We performed a functional characterization of the variants found in his genome, including a catalogue of features unique to this individual. A total of 3,850 genetic variants not described in dbSNP were found in the exome (Supplementary Table S11). In the European 1000 Genomes, this figure is on average 16,059 (min = 15,549, max = 16,526); the difference can be attributed to the low coverage and the contamination removal filtering we applied to the gourd's genome. Despite of this, we still found 13 stop codons and 8 non-synonymous changes (all in heterozygosity) within genes that are associated with known human Mendelian diseases, as well as 7 stop codons and 10 non-synonymous changes in genes that are not, but occurred as homozygote for the alternative allele (Supplementary Tables S12–S15). None of these variants could be easily related to any known phenotype -including height- in the king.

We then reviewed different contemporary historical sources, including Marie-Antoinette's correspondence and transcribed conversations from testimonies and relatives (Supplementary Methods) to characterize physical and behavioural traits of Louis XVI. We subsequently contrasted the sequenced genome with several phenotypical features that the king displayed (blue eyes, tall height, obesity), as well as some possible disorders such as type-2 diabetes. The eye colour has an already well established genetic background15,16 and tools such as IrisPlex17 can yield an accurate prediction of eye colour based on the six most-informative single nucleotide polymorphisms (SNPs).

The remaining phenotypes have complex genetic backgrounds that are currently being investigated from genome-wide association studies (GWAS). Due to the gaps and reduction of heterozygotes associated to the low sequence coverage and the background contamination, we don't have information for all the published SNPs; however, we took advantage of linkage disequilibrium (LD) patterns to recapture SNPs not genotyped and also to confirm those present in the genomic draft.

The conservative final set of SNPs after selection and filtering (Table 1) did not allow us to screen for type-2 diabetes and obesity since virtually no data was available to evaluate the risk susceptibility of the individual's genome. Therefore, we continued the analysis on two of the traits, eye color and height.

Table 1 Number of SNPs considered for the analysis for each complex phenotype, selected on the grounds of historical accounts about the king Louis XVI Full size table

The king is depicted with blue eyes in all his portraits, despite the fact that his parents (Louis, Dauphin of France and Maria Josepha of Saxony) were brown-eyed. In the previous study, we retrieved by PCR the critical rs12913832*C SNP at the HERC2/OCA2 locus, finding the nucleotide substitution associated to blue eyes in heterozygosity. To further explore the gourd's genome phenotype we determined the allelic state at five additional loci and predicted the eye color with Iris-Plex17 (Supplementary Table S16). The fact that most of the remaining SNPs are associated with brown eyes resulted in a low probability for blue (2.4%) or intermediate (8.3%) eye color.

Complex traits such as height are more difficult to assess from genetic data alone. However, we could evaluate 23 high-quality SNPs influencing this trait (Table 2). We assumed an additive multiple regression linear model for the increase in centimeters that each allele adds to the final individual's height (based on the Beta-coefficients compiled from literature). To put in context the gain in height inferred for the sequenced individuals, we calculated the distribution of centimeter gain that each combination of height-increasing alleles would confer over an undetermined basal height in Europeans (performing 100,000 resamplings from CEU allele frequencies under Hardy-Weinberg assumption). The allelic combination found in the gourd's genome generates an absolute increase of 7.22 cm. Given that the average increase in the population is 9.05 cm (SD = 1.03), this individual is located in the 3.9 bottom percentile. We repeated the analysis using only the genome-wide significant SNPs (P < 5 × 10−7 and <5 × 10−8 in GWAS Catalog), but the individual was placed at the 8.37 and 9.78 bottom percentiles, respectively. Similar results were obtained when, rather than assuming HW equilibrium, resamplings were made from the genotypic frequencies in extant CEU samples. In this case, the absolute gain in height for this individual is 5.94 cm, which placed the gourd's individual in the 4.71 bottom percentile (CEU average 6.52 cm; SD = 0.77). The additional search for possible mutations in genes associated to cases of gigantism produced no results.