European populations display low genetic differentiation as the result of long-term blending of their ancient founding ancestries. However, it is unclear how the combination of ancient ancestries related to early foragers, Neolithic farmers, and Bronze Age nomadic pastoralists can explain the distribution of genetic variation across Europe. Populations in natural crossroads like the Italian peninsula are expected to recapitulate the continental diversity, but have been systematically understudied. Here, we characterize the ancestry profiles of Italian populations using a genome-wide dataset representative of modern and ancient samples from across Italy, Europe, and the rest of the world. Italian genomes capture several ancient signatures, including a non–steppe contribution derived ultimately from the Caucasus. Differences in ancestry composition, as the result of migration and admixture, have generated in Italy the largest degree of population structure detected so far in the continent, as well as shaping the amount of Neanderthal DNA in modern-day populations.

The geographic location of Italy, enclosed between continental Europe and the Mediterranean Sea, makes the Italian people relevant for the investigation of continent-wide demographic events to complement and enrich the information provided by aDNA studies. To characterize the genetic variability of modern-day European populations and their relationship to early European foragers, Neolithic farmers, and Bronze Age nomadic pastoralists, we investigated the population structure of present-day Italians and other Europeans in terms of their ancestry composition as the result of migration and admixture, both ancient and historical. To do this, we assembled and analyzed a comprehensive genome-wide single nucleotide polymorphism (SNP) dataset composed of 1616 individuals from all 20 Italian administrative regions and more than 140 worldwide reference populations to give a total of 5192 modern-day samples (fig. S1, A and B, and data file S1), to which we added genomic data available for ancient individuals (data file S1).

The arrival of farming in Europe led to admixture between incoming Anatolian farmers and autochthonous hunter-gatherers, a process that generated individuals genetically close to present-day Sardinians ( 4 , 5 ). During the Bronze Age, the dispersal of a population related to the pastoralist nomadic Yamnaya from the Pontic-Caspian steppe further markedly affected the genetic landscape of the continent ( 1 , 6 ). This migration, supported by archeological and genetic data, has also been linked to the spread of the Indo-European languages in Europe and the introduction of several technological innovations in peninsular Eurasia ( 7 ). Genetically, ancient steppe populations have been described as a combination of Eastern and Caucasus hunter-gatherer/Iran Neolithic (EHG and CHG/IN) ancestries ( 4 ). However, the analysis of aDNA from Southern East Europe has identified the existence of additional contributions ultimately from the Caucasus ( 8 , 9 ) and suggests a more complex ancient ancestry composition for Europeans ( 4 ).

Our understanding of the events that shaped European genetic variation has been redefined by the availability of ancient DNA (aDNA). In particular, it has emerged that, in addition to the contributions of early hunter-gatherer populations, major genetic components can be traced back to Neolithic and Bronze Age expansions ( 1 ). Extensive gene flow across the continent over the last few thousand years ( 1 , 2 ) has further contributed to the correlation between geography and genetic variation observed in modern Europe ( 3 ).

RESULTS

Distinctive genetic structure in Italy We initially investigated patterns of genetic differentiation in Italy and surrounding regions by exploring the information embedded in the SNP-based haplotypes of modern samples [full modern dataset (FMD) including 218,725 SNPs]. The phased genome-wide dataset was analyzed using the ChromoPainter (CP) and fineSTRUCTURE (fS) pipeline (see the Supplementary Materials) (10, 11) to generate a tree of groups of individuals with similar “copying vectors” (clusters; Fig. 1A). The fraction of pairs of individuals placed in the same cluster across multiple runs was, on average, 0.95 for Italian clusters and 0.96 across the whole set of clusters (see Materials and Methods and the Supplementary Materials). Non-European clusters were pooled into larger groups in subsequent analyses (see Materials and Methods and the Supplementary Materials). Fig. 1 Genetic structure of the Italian populations. (A) Simplified dendrogram of 3057 Eurasian samples clustered by the fS algorithm using the CP output (complete dendrogram in fig. S1C). Each leaf represents a cluster of individuals with similar copying vectors. Clusters with more than five individuals are labeled in black. Italian clusters are color coded. Gray labels ending in the form <<NAME>>_D refer to clusters containing less than five individuals or individuals of uncertain origin that have been removed in the following analyses. (B) Principal components analysis (PCA) based on the CP chunkcount matrix [colors as in (A)]. The centroid of the individuals belonging to non-Italian clusters is identified by the label for each cluster. The plot was rotated to the left by 90° to highlight the correspondence with the geography of the Italian samples. (C) Pie charts summarizing the relative proportions of inferred fS genetic clusters for all the 20 Italian administrative regions [colors as in (A)]. (D) Between-cluster F ST estimates within European groups. Clusters were generated using only individuals belonging to the population analyzed (see Materials and Methods and the Supplementary Materials). The number of genetic clusters analyzed for each population is reported within brackets. For the comparisons across Europe, the cluster NEurope1 containing almost exclusively Finnish individuals was excluded (F ST estimates for Italian and European clusters are in data file S2). F ST distributions statistically lower than the Italian one are in colors other than green. (E) Estimated effective migration surfaces (EEMS) analysis in Southern Europe. Colors represent the log 10 scale of the effective migration rate, from low (red) to high (yellow). Italian clusters separated into three main groups: Sardinia, Northern (North/Central-North Italy), and Southern Italy (South/Central-South Italy and Sicily); the first two were close to populations originally from Western Europe, while the last was closer to Middle Eastern groups (Fig. 1, A and B; figs. S1D and S2, A to C; and data file S1). These observations were confirmed using a subset of the dataset genotyped for a larger number of SNPs [high-density dataset (HDD) including 591,217 SNPs; see Materials and Methods and the Supplementary Materials; fig. S1D and data file S1]. To highlight the geographic distribution of the identified clusters along the Italian peninsula, we reconstructed the cluster composition of the various administrative regions of Italy by using the best sampling origin information available for each individual in our dataset (Fig. 1C and data file S1). Recent migrants and admixed individuals, as identified on the basis of their copying vectors (fig. S3, data file S2), were removed in subsequent CP/fS analyses (see Materials and Methods and the Supplementary Materials). A sharp north-south division in cluster distribution was detected, the separation between northern and southern areas being shifted north along the peninsula (Fig. 1B) (12). The reported structure dismissed the possibility that the Central Italian populations differentiated from the Northern and Southern Italian groups (Fig. 1A) (13). Individuals from Central Italy were, in fact, assigned mostly to the Southern Italian clusters, except for samples from Tuscany, which grouped instead with the Northern Italian clusters (Fig. 1, A and B) (12). Contrary to previous results, no outliers were detected among the Northern Italian clusters (12). We evaluated the impact of geography in shaping the genomic variability of Italy by testing the correlation between geographic and genetic coordinates, applying a Procrustes analysis (Fig. 1B; Materials and Methods). A significant correlation was observed in our dataset, in agreement with reports for Italy and Europe (3, 13) (correlation in Procrustes rotation, 0.77 and 0.78, P < 0.05 including or excluding Sardinia, respectively). Intracluster variation within Southern and Northern Italian clusters was comparable (data file S2). Sardinian clusters were characterized by a high proportion of genome copied by individuals from the same cluster (self-copying), in agreement with previous indications of drift in Sardinian groups (12, 13). Southern Italian samples showed higher among-cluster F ST values than the Northern Italian ones, but lower TVD (total variation distance) values (data file S2; Materials and Methods) (1, 2). Sardinian clusters showed both high TVD and F ST intercluster variation, combining the effects of drift and variation in ancestry composition. We compared the degree of variation among genetic clusters in Italy with those in several European countries (11, 14–16) and across the whole of Europe (Fig. 1D). Among-cluster variability in Italy was significantly higher than in any other country examined (F ST ; median Italy, 0.004; data file S2; range medians for listed countries, 0.0001 to 0.002) and showed differences comparable with estimates across European clusters (median European clusters, 0.004; Materials and Methods and the Supplementary Materials), as previously suggested using unilinear data (17, 18). The analysis of the migration surfaces [estimated effective migration surfaces (EEMS)] (19) highlighted several barriers to gene flow within and around Italy, but also suggested the existence of migration corridors in the southern part of the Adriatic and Ionian seas and between Sardinia, Corsica, and continental Italy (Fig. 1E and fig. S4) (9).

Multiple ancient ancestries in Italian clusters To further characterize the observed genetic structure, we investigated the ancestry composition of modern clusters. We tested different combinations of ancient putative sources using a “mixture fit” approach [non-negative least square (NNLS) algorithm] (11, 20). We applied this approach to ancient samples using the “unlinked” mode implemented in CP, similar to other routinely performed analyses based on genotype data, such as qpAdm and ADMIXTURE. In addition, data from modern individuals (from the FMD dataset) were harnessed as donor populations (Materials and Methods and Supplementary Materials). Following Lazaridis et al. (8), we performed two separate CP/NNLS analyses, “ultimate” and “proximate,” referring to the least and the most recent putative sources, respectively (Fig. 2 and fig. S5, A to E). In the ultimate analysis, all the Italian clusters were characterized by relatively high amounts of Anatolian Neolithic (AN) contributions, ranging from 56% (SItaly1) to 72% (NItaly4), distributed along a north-south cline (Spearman ρ = 0.52, P < 0.05; Fig. 2, A to C, fig. S5A, and data file S3), with Sardinians showing values above 80%, as previously suggested (1, 21). A closer affinity of Northern Italian than Southern Italian clusters to AN was also supported by D-statistics (fig. S6A). The remaining ancestry was mainly assigned to WHG (western hunter-gatherer), CHG, and EHG. In particular, the first two components were more present in populations from the South of Italy (P < 0.05, Student’s t test), while the latter was higher in Northern Italian clusters (P < 0.05, Student’s t test). These observations suggest the existence of different secondary source contributions to the two edges of the peninsula, with the north affected more by EHG-related populations and the south by CHG-related groups. IN ancestry was detected in Europe only in Southern Italy (Fig. 2 and fig. S5A). Fig. 2 Ancient ancestries in Western Eurasian modern-day clusters and Italian ancient samples. CP/NNLS analysis on all Italian and European clusters using as donors different sets of ancient samples and two modern clusters (NAfrica1, North Africa; EAsia2, East Asia) [full results in fig. S5 (A and B)]. (A) Ultimate sources: AN, Anatolian Neolithic (Bar8); WHG, western hunter-gatherer (Bichon); CHG, Caucasus hunter-gatherer (KK1); EHG, Eastern hunter-gatherer (I0061); IN, Iranian Neolithic (WC1). (B) EHG and (C) CHG ancestry contributions in Western Eurasia, as inferred in (A) and figs. S8A and S5A. (D) Same as in (A), using proximate sources: WHG, western hunter-gatherer (Bichon); EEN, European Early Neolithic (Stuttgart); SBA, Bronze Age from steppe (I0231); ABA, Bronze Age from Anatolia (I2683). (E) SBA and (F) ABA ancestry contributions, as inferred in (D) and fig. S5B. Triangles refer to the location of ancient samples used as sources (data file S1). (G) Ratio of the residuals in the NNLS analysis (see Materials and Methods and the Supplementary Materials) for all the Italian and European clusters when ABA was excluded and included in the set of proximate sources; (H) as in (G), but excluding/including SBA instead of ABA. (I) Ancient Italian and other selected ancient samples projected on the components inferred from modern European individuals. Labels are placed at the centroid of the individuals belonging to the indicated clusters. North-south differences across Italy were also detected in the proximate analysis. When proximate sources were evaluated, significantly higher ABA (Anatolia Bronze Age) and SBA (Steppe Bronze Age) ancestries were detected in Southern and Northern Italy, respectively (Fig. 2, D to F, and fig. S5B; P < 0.05, Student’s t test; P < 0.05, Wilcoxon rank sum test; Supplementary Materials), in line with the results based on the D-statistics (fig. S6, A and B) and mirroring the CHG and EHG patterns, respectively (Fig. 2, A to C). Contrary to previous reports (4), the occurrence of CHG as detected by our CP/NNLS analysis did not mirror the presence of SBA, with several populations testing positive for the latter but not for the former (Fig. 2 and fig. S5, A and B). When we compared this analysis and the one using a different CHG sample (SATP) (5), the two were highly correlated (Spearman ρ = 0.972, P < 0.05; fig. S5F). We therefore speculate that our approach might, in general, underestimate the presence of CHG across the continent; however, we note that even considering this scenario, the excess of Caucasus-related ancestry detected in the south of the European continent, and in Southern Italy in particular, is notable and unexplained by currently proposed models for the peopling of the continent. The different impact of ABA and SBA across Italy was additionally supported by the reduced fit of the NNLS (sum of the squared residuals; Materials and Methods and Supplementary Materials) when the proximate analysis was run excluding one of these two sources. The residuals were almost twice as much for Southern Italians when ABA was not included as a source, while a substantial increase in the residual values was observed for Northern Italians when SBA was removed from the panel of proximate sources (Fig. 2, G and H). The different affinities of Southern and Northern Italians for ABA and SBA were also highlighted in the principal components analysis (PCA) and ADMIXTURE analysis on ancient and modern samples (Fig. 2I and fig. S7). These results were confirmed by the qpAdm analysis, where all the analyzed Italian clusters could be modeled as a combination of ABA, SBA, and European Middle-Neolithic/Chalcolithic populations, their contributions mirroring the pattern observed in the CP/NNLS analysis (fig. S5G and data file S4). Sardinian clusters were consistently modeled as AN + WHG + CHG/IN across runs, with the inclusion of North Africa and SBA when a different number of sources were considered. The qpAdm analyses of Italian HDD clusters generated similar results (Materials and Methods, Supplementary Materials, and data file S4). We noted that in the Balkan peninsula, signatures related to ABA were present but were less evident than in Southern Italy across modern-day populations, possibly being masked by historical contributions from Central Europe (Figs. 2 and 3 and fig. S5B) (2, 21, 22). Overall, SBA and ABA appeared to have different distribution patterns in Italy and to reflect those present in Europe as a whole: more common in North Italy and across the continent in the case of the former, more localized in the south of Europe and Italy in the case of the latter. Similar results were obtained when other Southern European ancient sources replaced ABA in the proximate analysis (see fig. S5, C to E, Materials and Methods, and the Supplementary Materials). Fig. 3 Admixture events inferred by GT. (A) Dates of the events inferred in the GT noItaly analysis on all the Italian clusters (labels as in Fig. 1A and data file S1; full results in fig. S8 and data file S5; see Materials and Methods and the Supplementary Materials); lines encompassed the 95% confidence interval. GT events were distinguished in “one date” (black squares; 1D in data file S5) and “one date multiway” (white squares; 1 MW). (B) Correlation values between copying vectors of first source(s) identified by GT and the best proxy in the noItaly analysis (circles) or the best proxy among Italian clusters (diamonds). (C) Same as in (B), referring to second source(s) copying vectors. Empty symbols refer to additional first (B) and second (C) sources detected in multiway events. African best proxies in (B) for clusters SItaly1 and SItaly2 were plotted on the 0.90 boundary for visualization only, the correlation values being 0.78 and 0.87, respectively. The symbols referring to the best Italian proxies for the African sources identified for clusters SItaly1, SItaly2, Sicily1, and Sardinia3 in (B) are not included as the correlation values are lower than the African ones and below the threshold used in the figure. The colors of the symbols refer to the ancestry to which proxies were assigned (see Materials and Methods and the Supplementary Materials).

Modeling the ancestry composition of ancient Italian samples To obtain temporal insights into the emergence of the differences between Northern and Southern Italy in relation to SBA and ABA ancestries, we performed the same qpAdm analysis on post-Neolithic/Bronze Age Italian individuals (data file S4). Iceman and Remedello, the oldest Italian samples included here [3400 to 2800 Before Common Era (BCE)], were composed of high proportions of AN (74 and 85%, respectively). The Bell Beaker samples of Northern Italy (2200 to 1930 BCE) were modeled as ABA and AN + SBA and WHG. Although ABA estimates in these samples were characterized by large standard errors (SE), the detection of steppe ancestry, at approximately 14%, was more robust. In contrast, Bell Beaker samples from Sicily (2500 to 1900 BCE) were modeled almost exclusively as ABA, with less than 5% SBA (data file S4). Despite the fact that the small number of SNPs and prehistoric individuals tested prevents the formulation of conclusive results, differences in the occurrence of AN ancestry, and possibly also of Bronze Age–related contributions, are suggested to be present between ancient samples from North and South Italy. Differences across ancient Italian samples were also supported by their projections on the PCA of modern-day data (Fig. 2I). Remedello and Iceman clustered with European Early Neolithic samples, together with one of the three Bell Beaker individuals from North Italy, as previously reported (23), and modern-day Sardinians. The other two Bronze Age North Italian samples clustered with modern North Italians, while the Bell Beaker sample from Sicily was projected in between European Early Neolithic, Bronze Age Southern European, and modern-day Southern Italian samples (Fig. 2I). These results suggest a differentiation in ancient ancestry composition between different areas of Italy, dating at least in part back to the Bronze Age.

Historical admixture To investigate the contribution of historical admixture events in shaping the modern distribution of ancient ancestries and the observed population structure, we characterized admixture events for Italian and European populations using GLOBETROTTER (GT) (Fig. 3, fig. S8, and data file S5)(22). We discuss here the results based on the nonlocal (2) (GT “noItaly”) analysis of the FMD (data file S5) as it provides a wider coverage at the population level; concordant results were found when the HDD was similarly analyzed (data file S5). We ran the analysis excluding the Italians as donors to reduce copying between highly similar groups (Fig. 3). The events detected in Italy occurred mostly between 1000 and 2000 years ago (ya) and extended to 2500 ya in the rest of Europe (Fig. 3A and fig. S8). The profile of inferred sources varied across Italy. Clusters from the Caucasus and North-West Europe were identified for many Italian clusters as best proxies for the admixing sources in agreement with previous studies (21), while Middle Eastern and African groups were detected for Southern Italy and Sardinia (Fig. 3, B and C). When Italian clusters were included among putative sources, they were as good as, or better proxies than, clusters from the Caucasus and the Middle East. On the other hand, North-West European and African clusters were mostly confirmed as better proxies than groups from any other area (Fig. 3, B and C), as observed when GT was run including all clusters as donors (“GTall” analysis; data file S5). Overall, these results supported a scenario in which gene flow mostly occurred between Italian and African/other Eurasian populations. SBA and ABA ancestries were detected in Italian and non-Italian best proxies (Figs. 2D and 3 and data file S5), which suggests that part of these ancestries arrived from outside Italy in historical times (21), but also that these components were already present in Italian groups at the time of these admixture events. The timing of the admixture events and the sources involved differ between Northern and Southern Italian clusters, pointing to different admixture histories for the two areas. Episodes of gene flow were also detected in Sardinia, combining signals from both the African continent and North-West Europe. MALDER results for the more recent episodes replicated the admixture pattern identified by GT (fig. S8 and data file S5).