Sampling and extraction

All laboratory procedures were performed in the dedicated ancient DNA facilities of the Max Planck Institute for the Science of Human History in Jena, Germany.

Teeth from nine individuals (one tooth from each), buried in the Mikhaylovka II tombs of the Samara region in Russia, were sectioned in the cementoenamel junction using a coping saw and 50–100 mg of dental pulp was removed from each tooth using a dental drill.

Extraction of 50–60 mg of dental pulp from each tooth sample was performed using a previously described protocol; optimised for the recovery of short DNA fragments, most typical of ancient DNA53. An initial lysis step was performed over a 12–16 h incubation of the dental pulp powder in 1 ml of extraction buffer (0.45 M EDTA, pH 8.0, and 0.25 mg ml−1 proteinase K) at 37 °C. Following extraction, DNA was bound to a silica membrane using a binding buffer containing guanidine hydrochloride (protocol previously described in ref. 53) and purified in combination with the High Pure Viral Nucleic Acid Large Volume Kit (Roche). DNA was eluted in 100 μl of TET (10 mM Tris-HCl, 1 mM EDTA, pH 8.0 and 0.05% Tween20). One extraction blank and one positive extraction control (previously assessed cave-bearing specimen) were taken along for the extraction slot.

Illumina library preparation and sequencing

To screen all samples for the presence of Y. pestis and human endogenous DNA, 10 μl of each extract was converted into double-stranded Illumina NGS libraries, using a previously described protocol54, without initial uracil-DNA-glycosylase (UDG) treatment55. A positive control (cave-bearing specimen) and a negative library control (H 2 O) were taken along for the experiment. A total of 1 μl from each library was subsequently quantified using IS7/IS8 primers. A combination of two unique indexes (8 bp length of each index sequence) also containing the universal IS5/IS6 priming sites was assigned to each sample for subsequent multiplex sequencing56. The libraries were then indexed through a ten-cycle amplification reaction using the Pfu Turbo Cx Hotstart DNA Polymerase (Agilent). Indexed PCR products were purified using a Qiagen MinElute kit (Qiagen), eluted in TET (10 mM Tris-HCl, 1 mM EDTA, pH 8.0 and 0.05% Tween20) and then qPCR quantified using IS5/IS6 primers, to assess the efficiency of the indexing reaction. After this, indexed libraries were amplified for different amounts of cycles, to achieve a total of 1013 DNA copies per reaction in order to avoid polymerase saturation and heteroduplex formation. PCR products were again purified using a Qiagen MinElute kit (Qiagen) and eluted in TET (10 mM Tris-HCl, 1 mM EDTA, pH 8.0 and 0.05% Tween20). The concentration (ng μl−1) of the indexed and amplified libraries was then measured using a 4200 Agilent Tape Station Instrument (Agilent). Finally, all samples were diluted and pooled at equimolar ratios to achieve a final 10 nM pool that would serve as template for sequencing.

In silico screening for Y. pestis reads

The sample pool was single-read sequenced on a HiSeq 4000 platform using a 1 × 76+8+8 cycles chemistry kit according to the manufacturer's protocol, to produce between 5,969,436 and 8,215,620 raw demultiplexed reads per sample. Pre-processing of reads was performed using the automated pipeline EAGER v1.9257 to clip adaptors (using ClipAndMerge) and to filter reads for sequencing quality (minimum base quality 20) and length (keeping all reads ≥30 bp). Mapping was performed using BWA58 implemented in EAGER to Y. pestis CO92 (NC_003143.1), using a –n parameter of 0.01, a –l seedlength of 16 and subsequently using SAMtools to filter for reads with a mapping quality (–q) of 37. The MarkDuplicates tool in Picard (1.140, http://broadinstitute.github.io/picard/) was used to remove duplicates.

In addition, the Megan ALignment Tool (MALT)27 was used to assess the metagenomic composition of the samples, as well as a screening tool for the identification of Y. pestis. All bacterial genomes available at GenBank were used as a reference database for the programme (NCBI RefSeq, December 2015). Pre-processed reads were used as input for MALT (version 0.3.6), and the parameters were set to 85 for the minimum percent identity (--minPercentIdentity), 0.01 for the minimum support parameter (--minSupport), using a top percent value of 1 (--topPercent) and the semi-global alignment mode. All the remaining parameters were set to default. The results were viewed in MEGAN659. Putatively positive Y. pestis samples were evaluated by comparing the amount of reads mapping to Y. pestis CO92 (NC_003143.1) to the reads assigned by MALT on the Y. pestis and Y. pseudotuberculosis complex nodes (Supplementary Table 1).

In-solution Y. pestis capture and deep-shotgun sequencing

Rich double-stranded DNA libraries were prepared for in-solution capture and deep-shotgun sequencing of putatively positive Y. pestis samples, using 50 μl of extract (or 2 × 25 μl of extract), according to a previously described protocol54, with an initial partial-UDG treatment step60, where UDG in combination with endonuclease VIII (USER enzyme, New England Biolabs) were used to remove all deaminated cytosines (uracils) with the exception of terminal uracil nucleotides that lack 5′ phosphate. Double-indexing and subsequent library amplification steps were carried out as mentioned in the previous section “Illumina library preparation and sequencing”. At this stage, the sample RT5 was diluted to 10 nM for deep-shotgun sequencing on a HiSeq 4000 platform using a 1 × 76+8+8 cycles chemistry kit. In addition, 1–2 μg of samples RT5 and RT6 were in-solution captured as described previously3, where a combination of the following Y. pestis and Y. pseudotuberculosis genomes were used as templates for probe design: CO92 chromosome (NC_003143.1), CO92 plasmid pMT1 (NC_003134.1), CO92 plasmid pCD1 (NC_003131.1), KIM 10 chromosome (NC_004088.1), Pestoides F chromosome (NC_009381.1) and Y. pseudotuberculosis IP32953 chromosome (NC_006155.1). Samples were captured in separate wells of a 96-well plate, whereas extraction and library blanks (data not shown) with non-overlapping index combination were pooled and captured in a single well. Sequencing was performed on a HiSeq 4000 platform using both single-end (1 × 76+8+8 cycles) as well as paired-end (2 × 76+8+8 cycles) chemistry kits.

Y. pestis read authentication and genome reconstruction

Sequencing resulted in up to 1,140,960,213 raw reads per sample. Adaptor trimming of raw, demultiplexed, reads was performed using leeHom61. Subsequently, length and quality-filtering steps were performed in EAGER, as mentioned in the previous section “In silico screening for Y. pestis reads”. After pre-processing, captured paired-end and single-end reads from the same individuals were merged into a single file for mapping. BWA58 integrated in EAGER was used for mapping against the Y. pestis CO92 reference (NC_003143.1)62 using the following parameters: –n 0.1, –l 32, and subsequently SAMtools was used to filter for reads with mapping quality of 37 (–q 37). Mean coverages were estimated using QualiMap v.2.2.163 and DNA deamination profiles typical of aDNA were calculated using MapDamage 2.064. For genome reconstruction, and for downstream SNP calling, the same pipeline was followed with a single alteration: after adaptor trimming, reads were inputted into EAGER and 2 bp were trimmed from each end using ClipAndMerge prior to length filtering and mapping to eliminate post-mortem damage that might affect downstream SNP calling.

Read-length comparison of capture and shotgun Y. pestis reads

Two datasets derived from the same individual (RT5), sequenced using the 1 × 76+8+8 cycles kit parameters, were used to compare the read-length distributions of shotgun-sequenced reads and captured reads. For this analysis, datasets were limited to the same genomic coverage (~9-fold), to ensure uniform comparison and avoid any biases that might arise from unequal coverage. Reads shorter or equal to 74 bp were considered for the analysis, to avoid the incorporation of reads that still contain traces of adaptor, or are longer than 76 bp. Box-plot comparisons and Student's t-test were calculated using R version 3.4.165.

Y. pestis SNP calling and phylogenetic analysis

For SNP calling, we used the UnifiedGenotyper of the Genome Analysis Toolkit (GATK)66. The newly produced RT5 shotgun-sequenced and captured genomes were analysed alongside 177 previously published Y. pestis genomes (179 in total), including one previously published historical strain from the Plague of Justinian6, nine genomes from the second plague pandemic4,8,36, eight previously published LNBA genomes2,3 and a global dataset of 159 modern Y. pestis genomes (Supplementary Data 2)11,12,14,15,47,62,67,68,69,70. A Y. pseudotuberculosis strain (IP32953)18 was used as an outgroup. A vcf file was produced for every sample using the “EMIT_ALL_SITES” in GATK66, which generated a call for all positions in the reference genome. In addition, the custom Java programme MultiVCFAnalyzer v0.8571 (https://github.com/alexherbig/MultiVCFAnalyzer) was used to produce a combined SNP table, including all variable positions across our dataset, with the exclusion of previously defined noncore regions and homoplasies, as well as repeat regions, tRNAs, rRNAs and tmRNAs11,12. In addition, we used ClonalFrameML72 to identify additional homoplasies or recombinant regions in our dataset by using a full-genome alignment and a RAxML73 ML SNP tree as input, as well as the -em and the -ignore_incomplete_sites options for running the programme. Through this analysis, we identified seven additional regions (resulting in 26 SNPs) and two homoplastic SNPs (28 SNPs in total), which were also excluded from the comparative SNP analysis (Supplementary Table 11). For the remaining data, SNPs were filtered according to the following criteria: (1) homozygous SNPs and reference alleles were called when covered at least three-fold with a minimum genotyping quality of 30, (2) in cases of heterozygous positions, a SNP or reference base was called when supported by at least 90% of the reads covering the respective position and (3) if none of the criteria could be fulfilled, a “N” was inserted in the respective position. A total of 3821 SNP positions were called in the current dataset.

From the resulting SNP alignment, two ML phylogenetic trees were inferred with RAxML (version 8.2.9)73, using the generalised time-reversible (GTR) substitution model with gamma-distributed rates (six rate categories). The first included all data. For the second tree reconstruction, all columns with missing data were excluded (complete deletion), which resulted in a total of 1054 SNP positions to be considered for the phylogeny. A total of 1000 bootstrap replicates were carried out to estimate the topology support of each tree.

In addition, the phylogenetic positioning of RT6 (present study) and RISE3972 was manually explored using the following methods:

For RT6: To check whether RT5 and RT6 form the same phylogenetic branch, SNPs specific to the RT5 genome ( n = 5, Supplementary Table 7) were assessed for their presence in RT6. RT6 possesses four out of five SNP positions covered at least one-fold. All positions covered encompass identical alleles to RT5.

For RISE397: The SNP table produced by MultiVCFAnalyzer was filtered for all diagnostic positions leading from the root of the tree towards the RT5 node (n = 46 SNP positions). In addition, the SNP table was independently filtered for all positions leading from the RT5 node to the Justinianic node (branch represented by Justinian 2148 strain) (n = 23 SNP positions), for which RT5 appears to have ancestral alleles. Missing data (N’s) were excluded from this SNP analysis. The state of all alleles in RISE397 was then manually inspected using the Integrative Genomics Viewer74. The reads covering all respective positions were visually authenticated by assessing whether they include terminal substitutions that could be explained by aDNA damage (Supplementary Data 3, 4).

Divergence date and demographic analyses

TempEst v1.575 was used to assess for the presence of temporal signal in the dataset. Inclusion of the 14C or archaeological dates for all ancient isolates resulted in a 0.6 correlation coefficient, which permitted the proceeding with molecular dating analysis. The software package BEAST v1.839 was used to estimate the divergence time of Y. pestis lineages, using the coalescent constant size40 and the coalescent Bayesian skyline38 models. The SNP alignment including only Y. pestis strains was used as input, after removal of all missing data (complete deletion). The tip dates of all modern Y. pestis strains were set to 0 years before present (BP). The ages of all ancient and historical Y. pestis genomes were estimated with their prior uniform distributions based either on the two-sigma (95.4%) 14C date interval2,3,6 or the archaeological dates4,8,36 in years BP, as follows: RISE509 (4836–4625, median: 4729), RK1.001 (4828–4622, median: 4720), GEN72 (4833–4592, median: 4721), Gyvakarai1 (4571–4422, median: 4485), Kunila2 (4524–4290, median: 4427), 1343UnTal (4346–4098, median: 4203), 6Post (3957–3832, median: 3873), RT5 (3868–3704, median: 3789), RISE505 (3694–3575, median: 3635), London Black Death (602–604, median: 603), observance (228–230, median: 229), Bolgar 2370 (550–588, median: 569) and Justinian 2148 (1382–1524, median: 1453). In addition, we used MEGA776 to test whether there is an equal evolutionary rate across our phylogeny. The strict clock rate was rejected, and therefore we applied a lognormal relaxed clock77 for all dating analyses, along with the GTR model of nucleotide substitution (six gamma categories), as previously performed2,3. An ML phylogeny was reconstructed using RaxML73 and was used as a starting tree. The LNBA lineage and the rest of Y. pestis strains were constrained to be two separate monophyletic groups in BEAUti v1.839. A single chain of 300,000,000 states was run for each model setup, sampling every 10,000 states. In addition, we estimated MLE using PS/SS sampling41 as part of the same set-up in BEAUTi to assess which of the two demographic models is best fit for our data. We run this analysis for an extra 300,000,000 states divided between 100 steps (3,000,000 states each), using an alpha parameter of 0.3. After run completion, the molecular dating results were viewed in Tracer v1.6 (http://tree.bio.ed.ac.uk/software/tracer/) to ensure that all expected sample sizes were above 200. TreeAnnotator was used to produce a maximum clade credibility tree with a 10% burn-in (excluding the first 3000 trees), which resulted in processing of 27,001 trees for each analysis with a Jeffreys prior distribution (1.0) for the population sizes. In addition, for the coalescent skyline analysis, we used 20 as the dimension for the population and group sizes. Once the chain was complete, we used LogCombiner39 to resample MCMC states at lower frequency (every 300,000) with a 10% burn-in, and the resultant .log and .tree files were used as input for the skyline plot construction in Tracer v1.6 (http://tree.bio.ed.ac.uk/software/tracer/).

Analysis of virulence factors

As map-quality filtering may influence read mappability in certain chromosomal and plasmid regions, we used a mapping quality filter of 0 (–q parameter) to evaluate the presence or absence of chromosomal and plasmid virulence-associated genes in RT5 compared to previously published modern and LNBA Y. pestis genomes48. Bedtools78 were used to calculate the percentage of gene covered across each region, and a heatmap was plotted using the ggplot279 package of R version 3.4.165.

In addition, the virulence-associated genes flhD, PDE-2, PDE-3, ureD and rcsA, which are known to have become inactivated in Y. pestis by either mutation or single-nucleotide insertions/deletions19, were instead manually explored using IGV74. Gene flhD, associated with flagellar biosynthesis and whose silencing contributes to immune evasion, is inactivated by a frameshift caused by a T insertion, present at position 1,892,659 in CO9246. PDE-2, a phosphodiesterase gene contributor in biofilm degradation is inactivated by a T insertion in a six-T stretch at position 1,434,044 in CO92. In addition, PDE-3, also part of the same biofilm-degradation mechanism, is affected by two mutations, a C > T change (also called the PDE-pe’ allele19), and a nonsense G > A substitution, which are, respectively, shown at positions 3,944,166 and 3,944,534 in Y. pseudotuberculosis IP3295318. The urease enzyme, ureD, that causes toxicity in fleas, is inactivated in Y. pestis by a G insertion in a six-G stretch, shown at position 2,997,296 in CO92. Finally, the rcsA gene, a component of the Rcs system that functions as an inhibitor to biofilm formation, is known to have become inactivated in Y. pestis by a 30 bp internal duplication, previously described in the strain KIM (NC_004088.1)19.

In addition, as certain Y. pestis strains present in the closely related 0.PE4 “microtus” lineage have been previously associated to a reduced pathogenicity, genes associated to this attenuated phenotype48 were explored in RT5, in relation to 0.PE4. The following regions were integrated into the presence/absence heatmap analysis described previously, as they are absent in microtus: YPO1986 to YPO1987, YPO2096 to YPO2135, YPO2469, YPO2487 to YPO2489 and YPO3046 to YPO3047 (region annotations are given as they appear in CO92) (Fig. 3). Additional genes, which appear to have been disrupted by IS elements, or by mutations/InDels (Supplementary Table 9), were visually inspected in IGV (Supplementary Fig. 10)74.

Data availability

Raw sequencing data of the deep-sequenced RT5 and RT6 isolates have been deposited into the European Nucleotide Archive under project accession number PRJEB24296. Other data supporting the findings of the study are available in this article and its Supplementary Information files, or from the corresponding authors upon request.