Sampling and DNA extraction

DNA was extracted48 from a total of 81 ancient human samples (teeth and bones) from the Eastern Baltic region, ranging from the Mesolithic Kunda culture to the Late Bronze Age (Supplementary Table 8). From Scandinavia (Sweden), we sampled 22 human remains from Mesolithic, early TRB and LN contexts. Two samples from north-western Russia were associated with Mesolithic contexts. The samples and their archaeological context are described in Supplementary Note 1 and presented in a tabular overview with sequencing results in Supplementary Data 1.

Sampling was performed in the cleanroom facilities at the Institute for Archeological Sciences in Tübingen for the Eastern Baltic material, at the Australian Centre for Ancient DNA for the Popovo sample, at the cleanroom facilities of the Max Planck Institute for the Science of Human History, Jena, for the samples from Olsund and Uzhni/Yuzhny Oleni Ostrov, and in the ancient DNA laboratory of the Archaeological Research Laboratory, Stockholm, for the remaining Swedish material. The human remains were treated with ultraviolet (UV) light from all sides for 10 min to reduce surface DNA contamination. Teeth were sawed transversally at the border of root and crown before sampling dentine powder from the inside of the crown with a sterile dentistry drill. Bone powder was taken from the inner parts of the bones with a sterile dentistry drill after removing the surface layer of the bone.

Between 30 and 200 mg of powder was used for each DNA extraction (Supplementary Data 1, column M1). The extraction was performed following a silica-column-based protocol optimized for the recovery of small ancient DNA molecules48 with use of the High Pure Viral Nucleic Acid Large Volume Kit (Roche). Extraction buffer (0.45 M EDTA, pH 8.0 (Life Technologies), 0.25 mg/ml Proteinase K (Sigma-Aldrich)) was added to the bone powder aliquot, and rotated overnight at 37 °C. The powder was then pelleted by centrifugation at 14,000 rpm. The supernatant was added to 10 ml binding buffer (5 M GuHCl (Sigma-Aldrich), 40% Isopropanol (Merck)) with 400 μg sodium acetate, pH 5.5 (Sigma-Aldrich) and mixed. The mixture was then transferred to the High Pure Extender Assembly funnel with a purification column attached and contained in a 50-ml Falcon tube. The tube was then spun at 1500 rpm for at least 8 min with slow acceleration until the binding buffer had mostly passed the purification column. Then the column was transferred into a new collection tube and the liquid remaining in the funnel was transferred to the column that was then centrifuged at 14,000 rpm. This was followed by a wash step consisting of adding 450 μl of wash buffer (supplied with the High Pure Viral Nucleic Acid Large Volume Kit) to the column and spinning it at 8000 rpm for 1 min, the wash step is repeated and then followed by two dry spins at 14,000 rpm for 1 min. The DNA was eluted in a fresh siliconized Eppendorf tube in two elution steps of 50 μl TET (1 mM EDTA, 10 mM Tris-HCl, pH 8.0 (AppliChem), 0.05% Tween-20 (Sigma-Aldrich)) centrifuged for 1 min at 14,000 rpm, resulting in 100 µl of DNA extract for each sample. Negative controls were taken along for each extraction setup.

Library preparation and targeted enrichment of human mtDNA

Double-stranded next-generation sequencing libraries were prepared from a 20-µl aliquot of extract following a protocol established for ancient DNA49. Negative controls were taken along for each library preparation setup. First, a blunt-ending step was performed by adding the template to a mix of 1× NEB buffer 2 (NEB), 100 µM dNTP mix (Thermo Scientific), 0.8 mg/ml BSA (NEB), 0.4 U/µl T4 Polynucleotide Kinase (NEB), 0.024 U/µl T4 Polymerase (NEB) and 1 mM ATP (NEB) and incubating at 15 °C for 15 min, then for 15 min at 25 °C, followed by purification with the MinElute kit (QIAGEN) and elution in 18 µl of TET. The following adapter ligation was performed by adding 1× Quick Ligase Buffer (NEB), 250 nM Illumina Adapters (Sigma-Aldrich) and 0.125 U/µl Quick Ligase (NEB) for a final reaction volume of 40 µl. The mix was incubated for 20 min at room temperature after which another MinElute purification was performed and the DNA is eluted in 20 µl TET. The following fill-in step consisted of adding the 20 µl DNA to 1× Isothermal Buffer (NEB), 125 nM dNTP mix (Thermo Scientific) and 0.4 U/µl Bst Polymerase 2.0 (NEB) for a final reaction volume of 40. The mix was incubated for 20 min at 37 °C and for 20 min at 80 °C. After the fill-in step, libraries were quantified via qPCR to ensure that the reactions were efficient. Some DNA extracts showed evidence of inhibition of enzymatic reactions, possibly due to the presence of humic acids or chemicals (glue or hardener) used for bone treatment50. To overcome inhibition, library preparation was repeated for samples that had a low DNA yield after initial library preparation or had abnormal extracts (e.g. dark colouring, floating particles, etc.) using 10-fold less extract as template to dilute the potential inhibiting factors.

Libraries were then barcoded in a PCR-reaction using primers containing sample-specific index sequence combinations51 and limiting the amount of template molecules to 2e+10 per reaction (0.2 mM of the library-specific P5 and P7 primers, 1× Buffer Pfu Turbo (Agilent), 0.25 mM dNTP mix, 0.3 mg/µl BSA, 0.025 U/µl Pfu Turbo (Agilent) for a total reaction volume of 100 µl). The amplification took place in a modern DNA lab with an initial denaturation of 2 min at 95 °C, then 10 cycles of: 30 s at 95 °C, 30 s at 58 °C, 1 min at 72 °C; followed by elongation for 10 min at 72 °C.

Libraries were enriched for human mitochondrial DNA using a bead-based hybridization protocol52, pooling at most five different sample libraries into one capture pool.

Sequencing for screening

Libraries and mtDNA-enriched library pools were quantified on an Agilent 2100 Bioanalyzer DNA 1000 chip and pooled at equimolar concentrations. Libraries not enriched for human mtDNA were shotgun sequenced to determine the percentage of endogenous human DNA in every DNA library and assign the genetic sex of the individuals53. Libraries enriched for mtDNA were sequenced separately to allow for reconstruction of the mitochondrial genome of each individual and estimation of modern mitochondrial contamination. Library pools were sequenced according to the manufacturer’s protocols on an Illumina HiSeq2500 at the department of Medical Genetics at the University of Tübingen for 2 × 100 cycles to a depth of ~1.5 million reads per sample.

Data processing for screening

After demultiplexing, resulting sequencing reads were processed using a computational pipeline developed for aDNA54 that merges paired-end reads (default parameters) and mapping of reads against a user-specified reference genome. Between 326 and 10,039,616 shotgun sequenced reads (Supplementary Data 1, column N) went into mapping with BWA (v0.6.1)55 against UCSC genome browser’s human genome reference GRCh37/hg19. For mtDNA capture, data between 375 and 7,454,704 reads (Supplementary Data 1, column Q) went into mapping against the human mtDNA reference rCRS56 using the circular mapper implemented in the pipeline54. The low number of reads mapping for Spiginas3 and Motala313 indicated a failure of reagents during library preparation.

The proportion of endogenous human DNA in shotgun sequencing ranged from 0.00% to 59.6% (Supplementary Data 1, column O). Genetic sex could confidently be determined for 55 individuals53 (Supplementary Data 1, column U).

The mtDNA reconstruction and contamination estimation was performed by an iterative likelihood-based approach, taking into account that the consensus mtDNA sequence should be reconstructed from molecules that originate from a single individual and that show characteristics of aDNA58. Complete mitochondrial genomes (covered at least 85%) could be reconstructed for 61 individuals and less than 5% mitochondrial contamination (Supplementary Data 1, column S). For these, the percentage of deamination at the molecule ends exceeded 20%, a characteristic of authentic ancient DNA58 (Supplementary Data 1, column T).

The three extracts produced for the sample Olsund did not undergo the screening procedure; the mtDNA haplogroup and mtDNA contamination reported for this sample was determined from the nuclear capture data, see below.

Nuclear capture and sequencing for genome-wide data

Forty-one samples (including two previously studied north-western Russian samples25) were chosen for nuclear capture or deep shotgun sequencing. Uracil–DNA–glycosylase treated (UDG-half) libraries59 were prepared out of the DNA extracts of these samples by adding the extract to a reaction of total volume 60 µl with 1× Buffer Tango (Thermo Fisher Scientific), 100 µM dNTPs, 1 mM ATP and 0.06 U/µl USER enzyme (NEB) and incubating for 30 min at 37 °C. The reaction was then inhibited by adding 0.12 U/µl UGI (NEB).

These libraries were then barcoded with sample-specific index sequence combinations60, subsequently amplified with Herculase II Fusion (Agilent) and enriched using an in-solution hybridization protocol24 for a targeted set of ~1.2 million nuclear SNPs (1240k SNP set)2,4.

Enriched libraries from the Eastern Baltic and Swedish samples were paired-end sequenced on a NextSeq500 at the department of Medical Genetics at the University of Tübingen using 2 × 75 bp cycles and a HiSeq4000 at the IKMB in Kiel, using 2 × 150 bp cycles, and single-end sequenced on a HiSeq4000 for 75 cycles at the Max Planck Institute for the Science of Human History in Jena. The UDG-treated library of UzOO77 was processed at the Max Planck Institute for the Science of Human History in Jena, Germany, and was sequenced there on a HiSeq4000 for 2 × 75 cycles, and the UDG-treated library for Popovo2 was processed at Harvard Medical School, Boston, USA, and sequenced here on a NextSeq500 for 2 × 75 cycles.

Additionally, the non-UDG-treated screening library of Gyvakarai1 was paired-end sequenced on two lanes of a HiSeq4000 for 2 × 75 cycles, and on a full run of a NextSeq500 for 2 × 75 cycles. The screening library for Kunila2 was paired-end sequenced deeper on 80% of one lane of a HiSeq4000 for 2 × 100 cycles. Additionally, 40 μl of DNA extract of Kunila2 was converted into a UDG-treated library, and pair-end sequenced on one lane of a HiSeq4000 for 2 × 75 cycles. The three UDG-half libraries for Olsund were single-end sequenced on a HiSeq4000 for 75 cycles.

Furthermore, DNA was extracted from the dense petrous portion of individual MotalaAA and converted into a UDG-half library which was shotgun single-end sequenced on a HiSeq4000 for 75 cycles. Sequencing strategies and facilities are summarized in Supplementary Data 1, column AE.

After demultiplexing, resulting sequence data were further processed using EAGER54. This included mapping with BWA (v0.6.1)55 against UCSC genome browser’s human genome reference GRCh37/hg19, and removing duplicate reads with same orientation and start and end positions. To avoid an excess of remaining C-to-T and G-to-A transitions at the ends of the reads, two bases of the ends of each read were clipped for each sample except for the non-UDG-treated data for Gyvakarai1, where 10 bases from each end were clipped.

For each of the targeted 1240k SNP positions, a read was chosen at random to represent this position using the genotype caller pileupcaller (https://github.com/stschiff/sequenceTools).

Quality control of genome-wide data

The samples that were covered at <10,000 SNPs of the 1240k SNP set were excluded from further analyses. We evaluated the authenticity of the samples by observing typical patterns of deamination towards read ends (Supplementary Data 1), estimating heterozygosity on the mtDNA with schmutzi58 and heterozygosity on the X chromosome in male samples with ANGSD61 (Supplementary Data 1), and evaluating the ratio of the reads mapping to X and Y which showed no outliers (see below).

We observe that all our individuals predating the LN appear genetically distinct from any modern-day population that could have contaminated them, and that female samples cluster together with their male counterparts from the same archaeological cultures (Fig. 2), which gives indirect support to the authenticity of our data.

We excluded Saxtorp5158 from our analysis due to its high degree of contamination on the mtDNA, and Saxtorp389 as it showed unusual ancestry for its dating and archaeological context, consistent with modern European contamination.

Using the software READ62, we determined individuals Kretuonas2 and Kretuonas5 to be identical twins, consistent with a value of >0.5 for f 3 (Kretuonas2, Kretuonas5; Mbuti). We do not include the lower coverage sample Kretuonas5 in ADMIXTURE analysis and other analyses that cluster the individuals into one population, thereby mitigating bias resulting from a defined population consisting of closely related individuals.

We merged our data of Kunila2 with previously published data of the same individual22 after confirming the identity with outgroup f 3 and READ.

Sex assignment

Genetic sex of the 41 selected samples was assigned using shotgun and SNP capture data by calculating the ratio of average X chromosomal and Y chromosomal coverage to average autosomal coverage at the targeted SNPs (X and Y rate, respectively). Samples with an X rate between 0.65 and 1 and a Y rate between 0 and 0.15 were assigned female and those with an X rate between 0.35 and 0.55 and a Y rate between 0.4 and 0.7 were assigned male, supporting the informative value of the Y rate over the X rate using this method (Supplementary Fig. 9), as demonstrated previously5.

Population genetic analyses

Due to the nature of ancient DNA research, no pre-determination of sample size by statistical methods was carried out and there was no randomization of experiments or blinding of investigators to allocation during experiments and outcome assessment.

Reference datasets for ancient populations are taken from the publicly available dataset used in refs. 5,6,21,22,] (which includes genotypes from samples published earlier in refs. 2,3,4,9,64, among others), as well as genotyping data from worldwide modern populations (Human Origins or HO dataset) published in the same publications and provided by the David Reich lab6. When analyzing only ancient samples, we make use of the 1,196,358 SNPs targeted by the 1240k SNP capture, using the genotypes of deep shotgun sequenced modern Mbuti as the outgroup. For analyses involving modern populations, we restrict to the intersection of 597,503 SNPs between the 1240k SNP set and the HO dataset.

PCA was performed with smartpca in the EIGENSOFT package64 by constructing the eigenvectors from modern West Eurasian populations (Supplementary Fig. 7) and projecting ancient individuals on these eigenvectors (Fig. 2a, Supplementary Fig. 8).

Admixture analysis (Fig. 2b) was carried out with ADMIXTURE on 3784 modern and 378 ancient individuals for ancestral clusters k = 2 to k = 16 with 100 bootstrap replicates. The SNP dataset was pruned for linkage disequilibrium with PLINK using the parameters --indep-pairwise 200 25 0.5. We considered the cross-validation (CV) error and report in Fig. 2b and Supplementary Fig. 4 the results of k = 11, where the CV error levels out at a minimum.

To quantify population affinities and admixtures suggested in the PCA and ADMIXTURE analysis, we carried out f-statistics using the programs qp3Pop and qpDstat in the ADMIXTOOLS suite (https://github.com/DReichLab) for f 3 - and f 4 -statistics, respectively. f 3 -statistics of the form f 3 (X,Y; Outgroup) measure the amount of shared genetic drift of populations X and Y after their divergence from an outgroup. Admixture f 3 -statistics of the form f 3 (Test;X,Y) indicate when significantly negative that population Test is intermediate in allele frequencies between populations X and Y and could be considered an admixed population. This test was performed with parameter inbreeding:YES and cannot be done for populations with less than two individuals. D-statistics of the form D(X,Y; Test, Outgroup) show if population Test is symmetrically related to X and Y or shares an excess of alleles with either of the two. Results are only reported for statistics based on more than 10,000 SNPs.

To formally test for the number of source populations and the proportion of ancestry these contributed to our studied populations, we used the qpWave and qpAdm programs from ADMIXTOOLS. These programs implement the methodology of using regression of f 4 -statistics of a Reference population with various outgroups to relate its ancestry to a Test population2,6. With qpWave, we identified potential source populations for our population under study by testing if a set of Left populations (the Test population under study and its potential proximate source Reference populations) is consistent with being descended from n waves of admixture which have unique relationships to the Right outgroup populations (Mbuti, Papuan, Onge, Han, Karitiana, Mota, Ust Ishim, MA1, Villabruna). This is given when rank n−1 cannot be rejected (p > 0.05), and rejected (i.e. more than n waves of admixture are needed to explain the ancestry of Test and Reference) if rank n−1 can be rejected (p < 0.05).

To estimate admixture proportions, we used qpAdm to model the Test population as a mixture of various source populations postulated from the qpWave test, setting as Left populations the Test and source populations and as the Right populations the various outgroups named above.

Data availability

The sequence data reported in this paper are deposited in the Sequence Read Archive (Accession numbers: SAMN08139261–SAMN08139301) and complete mitochondrial consensus sequences are deposited in GenBank (Accession numbers MG428993–MG429049).