By following the evolution of populations that are initially genetically homogeneous, much can be learned about core biological principles. For example, it allows for detailed studies of the rate of emergence of de novo mutations and their change in frequency due to drift and selection. Unfortunately, in multicellular organisms with generation times of months or years, it is difficult to set up and carry out such experiments over many generations. An alternative is provided by “natural evolution experiments” that started from colonizations or invasions of new habitats by selfing lineages. With limited or missing gene flow from other lineages, new mutations and their effects can be easily detected. North America has been colonized in historic times by the plant Arabidopsis thaliana, and although multiple intercrossing lineages are found today, many of the individuals belong to a single lineage, HPG1. To determine in this lineage the rate of substitutions—the subset of mutations that survived natural selection and drift–, we have sequenced genomes from plants collected between 1863 and 2006. We identified 73 modern and 27 herbarium specimens that belonged to HPG1. Using the estimated substitution rate, we infer that the last common HPG1 ancestor lived in the early 17 th century, when it was most likely introduced by chance from Europe. Mutations in coding regions are depleted in frequency compared to those in other portions of the genome, consistent with purifying selection. Nevertheless, a handful of mutations is found at high frequency in present-day populations. We link these to detectable phenotypic variance in traits of known ecological importance, life history and growth, which could reflect their adaptive value. Our work showcases how, by applying genomics methods to a combination of modern and historic samples from colonizing lineages, we can directly study new mutations and their potential evolutionary relevance.

A consequence of an increasingly interconnected world is the spread of species outside their native range—a phenomenon with potentially dramatic impacts on ecosystem services. Using population genomics, we can robustly infer dynamics of colonization and successful population establishment. We have compared hundred genomes of a single Arabidopsis thaliana lineage in North America, including genomes of contemporary individuals as well as 19 th century herbarium specimens. These differ by an average of about 200 mutations, and calculation of the nuclear evolutionary rate enabled the dating of the initial colonization event to about 400 years ago. We also found mutations associated with differences in traits among modern individuals, suggesting a role of new mutations in recent adaptive evolution.

Funding: This study was supported by the President’s Fund of the Max Planck Society (project “Darwin”) to HAB and by an ERC grant (AdG IMMUNEMESIS) and core funds of the Max Planck Society to DW. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Here, we focus on 100 HPG1 individuals that do not show any evidence of outcrossing with other lineages. We combine genomes from herbarium specimens and live individuals, collectively covering the time span from 1863 to 2006, to infer mutation rates, to date the birth of the HPG1 lineage, and to investigate the evolutionary forces that shape genetic diversity and potentially adaptive trait variation. Our analyses of this lineage serves as a model for future studies of similar colonizing or otherwise recently bottlenecked plant populations, in order to better understand how diversity is generated and to which extent it contributes to adaptation in nature.

The self-fertilizing plant Arabidopsis thaliana is native to Africa and Eurasia [ 21 , 22 ] but has recently colonized N. America, where it likely experienced a strong founder effect [ 23 ]. At nearly half of N. American sites sampled during the 1990s and early 2000s, more than 80% of plants belong to a single haplogroup, HPG1, as inferred from genotyping with 149 intermediate-frequency markers evenly spread throughout the genome [ 23 ]. The HPG1 lineage has been reported from many sites along the East Coast and in the Midwest as well as at a few sites in the West [ 23 ] ( Fig 1 , S1 Table ). The great ubiquity of HPG1 in comparison to any other haplogroup could be due to either some adaptive advantage, or, more parsimoniously, be the result of HPG1 being derived from one of the first arrivals of A. thaliana in the continent.

The analysis of colonizing populations can also contribute to resolving the “genetic paradox of invasion” [ 17 ]. This paradox comes from the observation that colonizing populations can be surprisingly successful and spread very widely and in multiple even when strongly bottlenecked, suggesting some level of adaptation to new environments that goes beyond the exploitation of unoccupied ecological niches [ 17 ]. Much of the work in plant ecology and evolution has focused on evidence that populations can rapidly adapt from standing variation [ 18 ]. In invasive lineages, initial standing variation may originate from incomplete bottlenecks, multiple introductions, or admixture with local relatives [ 19 ]. Much less work has been done with respect to the role of de novo mutations as a solution to the genetic paradox of invasion, although this has been proposed as an alternative explanation for rapid adaptation by colonizing lineages [ 3 , 17 , 20 ].

Colonizing or invasive populations sampled through time [ 1 , 2 ] constitute “natural experiments” where it is possible to study evolutionary processes in action [ 3 ]. Colonizations, which are dramatically increasing in number [ 4 , 5 ], sometimes are characterized by strong bottlenecks and genetic isolation [ 6 , 7 ], and thus greatly facilitate the observation of new mutations and potentially their effects under natural population dynamics and selection [ 8 ]. Colonizations thus offer a complementary approach to other studies of new mutations, which often minimize natural selection, for example in laboratory mutation accumulation experiments [ 9 ] and parent-offspring comparisons [ 10 ]. The study of colonizations is also complementary to the investigation of genetic divergence over long time scales, e.g., between distant species [ 11 ], where the results are largely independent of short-term demographic fluctuations. There is broad interest in understanding how genetic diversity is generated [ 12 ], and how new mutations can provide a path for rapid adaptive evolution [ 13 – 15 ]. Additionally, accurate evolutionary rates permit dating historic population splits, which is fundamental to the study of population history [ 16 ].

Results and discussion

Historic and modern genomes In a self-fertilizing species, a single individual can give rise to an entire lineage of millions of offspring, which then diversify through new mutations and eventually intra-lineage recombination. If self-fertilization is much more common than outcrossing, the founder is likely to have been homozygous throughout almost the entire genome. Because it is so wide spread, HPG1 presents an opportunity to sample many natural populations that have been potentially derived from a common, very recent ancestor with such characteristics. In the best possible case, this would allow for new mutations to be directly observed through time. To test these assumptions and to better understand the evolution of HPG1, we sequenced two different groups of plants. The first group were live descendants of 87 plants that had been collected between 1993 and 2006 (Fig 1; S1 Table), and which had been identified as likely members of the HPG1 lineage with 149 genome-wide markers spaced at roughly 1-Mb-intervals [23]. We aimed for broad geographic representation, with at least two accessions per collection site, where available. The second group comprised 36 herbarium specimens, collected between 1863 and 1993, for which we had no a priori information whether they may or may not belong to the HPG1 lineage, but which were selected from the herbarium records to cover the full historical geographic range and overlap with modern samples when possible (Fig 1). The DNA from the herbarium specimens showed biochemical features typical of ancient DNA (aDNA) from plants, which we have previously described in detail [24]. Such DNA damage included a median fragment length of 60 bp, an excess of C-to-T substitutions of about 2.5% at the first base of sequencing reads and a 1.5 to 1.8 fold enrichment of purines at DNA breakpoints (S1 Fig, S2 Text). To remove aDNA associated damage and produce high-quality genomes, chemically-repaired libraries (see Methods) were later sequenced. These reads were mapped against an HPG1 pseudo-reference genome [25], focusing on single nucleotide polymorphisms (SNPs) because the short sequence reads of herbarium samples preclude accurate calling of structural variants. Genome sequences were of high quality, with herbarium samples covering 96.8–107.2 Mb of the 119 Mb reference, and modern samples covering 108.0–108.3 Mb (S1 Table).

Genetic diversity of HPG1 and delineation from other lineages We visualized the relationships between the sequenced historic and modern plants building a neighbor joining tree of all 123 samples and confirmed that the majority fell within an almost-identical clade, the HPG1 (Fig 2A) [23]. Because any degree of introgression from other non-HPG1 lineages would confound the discovery of new mutations downstream, we removed all divergent samples and built a neighbour joining tree (n = 103 samples), which revealed that the HPG1 samples were very similar to each other, with very little within-population structure (Fig 2B). A parsimony network was used to detect recombinant genomes within this HPG1 clade (Fig 2C), which led us to remove three potential intra-lineage recombinants. Repeating the parsimony network cleared all previously inferred reticulations due to recombinations (Fig 2D). After such stringent filtering, we kept 27 of the 35 herbarium samples, and 73 of the 87 modern samples (S1 Table). These constitute a set of non-admixed, non-recombined and quasi-identical HPG1 individuals. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 2. Relationship among herbarium and modern samples. (A) Neighbor joining tree with all 123 samples (dots) and rooted with the most distant sample. The black clade of almost-identical samples is the HPG1 lineage. Scale line shows the equivalent branch length of over 25,000 nucleotide changes. (B) Neighbor joining tree only with the HPG1 black clade from (A). Colors represent herbarium (blue) and modern individuals (green). Scale line shows the equivalent branch length of 80 nucleotide changes. Note that no outgroup was included. (C, D) Network of samples using the parsimony splits algorithm, before (C) and after (D) removing three intra-HPG1 recombinants (in red). Note that the network algorithm returns in (D) a network devoid of any reticulation, which indicates absence of intra-haplogroup recombination. https://doi.org/10.1371/journal.pgen.1007155.g002 Pairs of HPG1 herbarium genomes differed by 28–207 SNPs genome-wide, pairs of HPG1 modern genomes by 2–259 SNPs, and pairs of historic-modern HPG1 genomes by 56–244 SNPs. That is, whole-genome identity was at least 99.9997% in any pairwise comparison. Of the approximately five to six thousand segregating SNPs in the HPG1 population, the vast majority, about 95% (Supplementary Text 3), have not been reported outside of this lineage [21]. Importantly, the density of SNPs along the genome was low and evenly distributed (typically fewer than 20 SNPs / 100 kb) with no peaks of much higher frequency, which makes us confident that chunks of introgressions from other lineages do not exist in this putatively pure HPG1 set (Fig 4). For comparison, random pairs of A. thaliana accessions from the native range or pairs of non-HPG1 typically differ by about 500 SNPs / 100 kb [21] (see scale in Fig 2A). There were no SNPs in mitochondrial nor chloroplast genomes, which already suggested a recent common origin, and genome-wide nuclear diversity (π = 0.000002, θ W = 0.00001, with 5,013 full informative segregating sites) was two orders of magnitude lower than in the native range of the species (θ W = 0.007) [21] (S1 Table) (Supplementary Text 6). The population recombination parameter was also four orders of magnitude lower (4N e r = ρ = 3.0x10-6 cM bp-1) than in the native range (ρ = 7.5x10-2 cM bp-1) [26] (Supplementary Text 6). While recombination occurs in every generation, regardless of self-fertilization or outcrossing, it is only observable after outcrossing between genetically non-identical individuals. We must stress that because A. thaliana can outcross at rates of several percent per generation [23,27], but because the HPG1 population is genetically so homogeneous, we are mostly “blind” to the consequences of outcrossing in this special case. The lack of “observable recombination” in the genome is important, as it allows for the use of straightforward phylogenetic methods to calculate a mutation rate. The enrichment of low frequency variants in the site frequency spectrum (Tajima’s D = -2.84; species mean = -2.04, [21]) and low levels of polymorphism are consistent with a recent bottleneck followed by population expansion, which usually generates star-like phylogenies (Figs 2 and 3). The obvious explanation is that the strong bottleneck corresponds to a colonization founder event, likely by few closely related individuals or perhaps even a single plant. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 3. Substitution rates. (A) Bayesian phylogenetic analyses employing tip-calibration. A total of 10,000 trees were superimposed as transparent lines, and the most common topology was plotted solidly. Tree branches were calibrated with their corresponding collection dates. (B) Maximum Clade Credibility (MCC) tree summarizing the trees in (A). Note the scale line shows the equivalent branch length of 50 nucleotide changes. The grey transparent bar indicates the 95% Highest Posterior Probability of the root date. (C) Regression between pairwise net genetic and time distances. The slope of the linear regression line corresponds to the genome substitution rate per year. (D) Substitution spectra in HPG1 samples, compared to greenhouse-grown mutation accumulation (MA) lines. (E) Comparison of genome-wide, intergenic, intronic, and genic substitution rates in HPG1 and mutation rates in greenhouse-grown MA lines. Substitution rates for HPG1 were re-scaled to a per generation basis assuming different generation times. Confidence intervals in HPG1 substitution rates were obtained from 95% confidence intervals of the slope from 1,000 bootstraps (S4 Table for actual values). https://doi.org/10.1371/journal.pgen.1007155.g003 Altogether these patterns indicate that the collection of HPG1 plants we investigated constitute a quasi-clonal and quasi-identical set of individual genomes, mostly devoid of observable recombination and population structure, and thus eminently suited for the study of naturally arising de novo mutations.

The genome-wide substitution rate It is important to distinguish between the mutation rate, which is the rate at which genomes change due to DNA damage, faulty repair, gene conversion and replication errors, and substitution rate, which is the rate at which mutations survive and accumulate under the influence of demographic processes and natural selection [28,29]. Under neutral evolution, mutation and substitution rates should be equal [29]. The simple evolutionary history of the HPG1 population enables direct estimates of substitution rates, and the comparison of theses between different genome annotations, as well as with mutation rates from controlled conditions experiments, could reveal the role played by both demographic and selective forces. To estimate the substitution rate in the HPG1 lineage, we used distance- and phylogeny-based methods that take advantage of the known collection dates (Supplementary Text 7). The distance method is independent of recombination and has been previously applied to viruses [30] and humans [31]. The substitution rate is calculated from correlation between differences in collection time in historic-modern sample pairs, and the number of nucleotide differences between those pairs relative to a reference (Fig 3C), scaled to the size of the genome accessible to Illumina sequencing. This method resulted in an estimated rate of 2.11x10-9 substitutions site-1 year-1 (95% bootstrap Confidence Interval [CI]: 1.88–2.33x10-9) using rigorous SNP calling quality thresholds. Relaxing the thresholds for base calling and minimum genotyped rate affects both the number of called SNPs and the length of the interrogated reference sequence [32]. These largely cancelled each other out, and the adjusted estimates were relatively stable, between 2.1–3.2x10-9 substitutions site-1 year-1 (S3 Table, Supplementary Text 3). The second method, a Bayesian phylogenetic approach, uses the collection years for tip-calibration and assumes a relaxed molecular clock. It summarizes thousands of plausible coalescent trees, and it has been extensively used to calculate evolutionary rates in various organisms [33–35]. This method yielded a substitution rate of 4.0x10-9, with confidence ranges overlapping the above estimates (95% Highest Posterior Probability Density [HPPD]: 3.2–4.7x10-9). Based on the similar results obtained with two very different methods, we can confidently say that the substitution rate in the wild populations of HPG1 is between 2 and 5 x10-9 site-1 year-1. To date the colonization of N. America by HPG1 A. thaliana and to improve the description of intra-HPG1 relationships compared to that from a NJ tree, we further used a Bayesian phylogeny. At first sight, the 73 modern samples appeared separated from the herbarium samples (Fig 3B), but the superimposition of thousands of possible trees showed that the apparent separation of samples was less clear near the root (Fig 3A). Long terminal branches reflected that the majority of the variants are singletons, typical of populations that expand after bottlenecks. The mean estimate of the last common HPG1 ancestor, the average tree root, was the year 1597 (HPPD 95%: 1519–1660) (Fig 3A and 3B), and an alternative non-phylogenetic method gave a similar estimate, 1625. Both estimates are older than a previously suggested date in the 19th century, using a laboratory mutation rate estimate and having no information from herbarium samples [25]. Because HPG1 appears to have been the most abundant lineage in N. America since the 1860s, we believe it could have been one of the first, if not the first A. thaliana colonizer that could establish itself in N. America. If that is true, the time of coalescence of the HPG1 diversity could be close to the time of HPG1 introduction to N. America. During the colonial period, many European immigrants settled on the East coast, consistent with N. American A. thaliana lineages being genetically closest to British and coastal West European populations [21]. Coincidently, the oldest herbarium samples (12 out of the 27) were HPG1 and came from the East Coast, and we found a significant correlation between collection date and both latitude and longitude (Fig 1C). This could indicate that after the colonization they moved from the East Coast to the Midwest—the other main area of the distribution that experienced an agricultural expansion in the 19th century [36]. Still, these conclusions need to be treated with caution, since regardless of the robustness of the results and our attempts to sample evenly from available collections, there could be unknown biases in the 19th century herbaria.

Mutation spectra across genome annotations Although for dating divergence events a substitution rate expressed in years is ideal, in order to compare substitution and mutation rates, both need to be expressed per generation. While A. thaliana is an annual plant, seed bank dynamics generate a delay of average generation time at the population scale. A comprehensive study of multiple A. thaliana populations in Scandinavia found that dormant seeds could wait for longer than a year in the seed bank, generating overlapping generations and an delayed average generation time of 1.3 years [37] with a notable variance across populations. Multiplication by the mean generation time led to an adjusted rate of 2.7x10-9 substitutions site-1 generation-1 (95% CI 2.4–3.0x10-9) (Fig 3E). To be able to compare this rate with a reference, we also re-sequenced mutation accumulation (MA) lines in the Col-0 reference background grown under controlled conditions in the greenhouse that had been analyzed before with less advanced short read sequencing technology [38]. From the new re-sequencing data, we obtained an updated rate of 7.1x10-9 mutations site-1 generation-1 (95% CI 6.3–7.9x10-9) (S2 and S3 Tables Supplementary Text 4 and 7). This mutation rate is two- to three-fold higher than the per-generation substitution rate estimate in the wild, but within the same order of magnitude. The same holds for rates in different genome annotations, i.e. genic, intronic and intergenic regions, but the confidence intervals overlapped in many cases (S3 Table). Differences in per-generation rates between laboratory and wild populations could stem from both methodological as well as biological causes. For instance, if the true average generation time was actually over 3 years / generation, the differences would cancel out (Fig 3E). Limitations in mapping structural variation in non-reference samples could lower the substitution rate, which may explain why we calculated an atypically low substitution rate in regions with transposable elements (see Supplementary Text 7.2.1). Environmentally-driven effects that are not yet well understood, such as variable methylation status of cytosines, account for much of the variation in local substitution rates [39], and could increase or decrease the rate (see Supplementary Text 7.2.3, S4 Fig). An alternative evolutionary explanation to the aforementioned laboratory and wild populations’ rates differences is that purifying selection in the wild would slow down the accumulation of mutations by removing deleterious mutations (Fig 3E). This has been observed before and is one of the accepted causes of the discrepancy between the so called long- and short-term substitution rates in a range of organisms [40]. In order to provide evidence for negative purifying selection acting in the wild, we performed three types of analyses involving comparisons across genomic annotations within the HPG1 dataset. Firstly, by calculating contingency tables and computing a Fisher’s exact test, we compared the deviation of expected and observed SNPs between coding regions (more likely under purifying selection), with intergenic regions, intronic regions, and all non-coding regions of genome. All three pairwise comparisons showed a depletion of coding SNPs and an enrichment of intergenic, intronic and non-coding SNPs (odds ratio>2, p<10−16). An obvious explanation is that in genome annotations where a mutation is more likely to be deleterious, i.e. coding regions, the number of observed variants should be lower due to selection having removed them from the population before we could sequence them. Secondly, we studied the Site Frequency Spectrum (SFS) of genetic variants. The rationale was that because purifying natural selection is more efficient at removing intermediate-frequency variants, variants that tend to be deleterious or slightly deleterious should be found at lower frequency than those that only suffer neutral drift [41]. We built contingency tables of coding, intergenic, intronic and non-coding variants segregating above and below the conventional frequency cutoff of 5% to separate low- and intermediate-frequency variants [42]. We found that SNPs in coding regions were more likely to be at low frequency than those in intergenic (odds ratio = 2.34, p = 3.09x10-11), intronic (odds ratio = 1.48, p = 0.02), and all non-coding regions (odds ratio = 2.05, p = 1.29x10-8). We carried out the same analysis using nonsynonymous and synonymous SNPs, which are easily interpretable in terms of the selection regimes under which they evolve. We did not find an enrichment (p = 0.67), perhaps due to an insufficient number of testable mutations (S3 Table). Thirdly, to verify that the full frequency spectrum of coding SNPs was shifted to lower frequencies (i.e. the results were not dependent on the arbitrary 5% frequency cutoff), we used the nonparametric Kolmogorov-Smirnov test for two samples. We found that the cumulative distribution of the site frequency spectrum (CD SFS ) of coding regions is above (i.e., the frequency distribution is overall skewed to lower values) both the intergenic CD SFS (p = 3.25x10-6) and the non-coding regions CD SFS (p = 0.001), but not the intronic CD SFS (p = 0.60) (S5 Fig). As in our previous analysis, the comparison between the nonsynonymous and synonymous CD SFS yielded, likely for similar reasons, no differences (p = 0.53). All in all, these results support that purifying selection is a force shaping to some degree the diversity across the HPG1 genome and might therefore as well contribute to the differences between HPG1 and MA rates.