S. cerevisiae sequenced isolates

The isolates included in this project were carefully selected to be representative of the S. cerevisiae whole species. All the isolate details, including ecological and geographical origins, providers and references, are listed in Supplementary Table 1. We maximized the isolate ecological origins by including both human-associated environments such as wine and sake fermentation, brewing and dairy products, as well as natural environments such as soil, insects, tree exudate and fruit. Geographical origins are also highly diverse and have a worldwide distribution (Supplementary Table 1). In addition to the 918 isolates provided by research laboratories and yeast collections, we included 93 strains sequenced in previous studies6,7,8, to give a total of 1,011 samples analysed in this study. We sought to keep the isolates in their natural state before sequencing to provide a global picture of the ploidy and level of heterozogosity. However, among the 918 selected isolates, 124 were non-natural haploid with the HO gene deleted and the 93 external isolates were genetically manipulated and made homozygous before sequencing.

Sequencing and quality filtering

Yeast cell cultures were grown overnight at 30 °C in 15 ml of YPD medium to early stationary phase. Total genomic DNA was subsequently extracted using MasterPure Yeast DNA Purification Kit and Genomic Illumina HiSeq 2000 sequencing libraries were prepared for 918 strains with an insert size between 300 and 600 bp. Ten libraries were multiplexed per Illumina HiSeq2000 lane and subjected to paired-end sequencing, producing reads of 102 bases.

An in-house quality control process was applied to the reads that passed the Illumina quality filters. Illumina sequencing adapters and primers sequences were removed from the reads and the low-quality nucleotides (Q < 20) were discarded from both ends of the reads. Reads shorter than 30 nucleotides after trimming were removed. These trimming and removal steps were achieved using software designed in-house, based on the FastX package. The last step identifies and discards read pairs that mapped to the phage phiX genome (GenBank code: NC_001422.1) using SOAP51. A total of 3.35 Tb of high-quality genomic sequence was generated with a mean sequencing depth of 232 × per isolate (ranging from 50 × to 1,014 ×, Supplementary Fig. 1c). For the publically available Illumina paired-end reads related to 93 strains (see ‘S. cerevisiae sequenced isolates’), the mean sequencing depth is 169 × (from 20 × to 570 × , Supplementary Fig. 1c).

Reads mapping and variant calling

For each isolate, the reads were mapped to the S. cerevisiae S288C reference genome (version R64-1-1) with Burrows–Wheeler Aligner (BWA v.0.7.4-r385)52, using default parameters. Duplicated reads were marked with Picard-tools (v.1.124) (http://picard.sourceforge.net) and local realignment around indels and variant calling were performed with GATK (v.3.3-0)53. Default parameters were applied except for the realignment step (GATK IndelRealigner), in which the following parameters were set: ‘–maxReadsForConsensuses 500–maxReadsForRealignment 40000–maxConsensuses 60–maxPositionalMoveAllowed 400–entropyThreshold 0.2’. GATK Variant Annotator was run to add allele balance information in the.vcf files.

Ploidy, types of aneuploidy and segmental duplications

The natural ploidy of the 794 natural isolates (see ‘S. cerevisiae sequenced isolates’), as well as their aneuploidy and segmental duplication content were investigated by combining three complementary approaches:

First, measurement of the cell DNA content using high-throughput flow cytometry: DNA content was analysed using a propidium iodide (PI) staining assay. Cells were first pulled out from glycerol stocks in liquid YPD in 96-well plates (30 °C, overnight). Five microlitres of the culture were transferred into 195 μl of fresh YPD and incubated for 8 h at 30 °C. Then, 3 μl were taken and resuspended in 100 μl of cold 70% ethanol. Cells were fixed overnight at 4 °C, washed twice with PBS, resuspended in 100 μl of staining solution (15 μM PI, 100 μg/ml RNase A, 0.1% v/v Triton-X, in PBS) and finally incubated for 3 h at 37 °C in the dark. Ten thousand cells for each sample were analysed on a FACS-Calibur flow cytometer using the HTS module for processing 96-well plates. Cells were excited at 488 nM and fluorescence was collected with a FL2-A filter. The distributions of both FL2-A and FSC-H values have been processed to find the two main density peaks, which correspond to the two cell populations in G1 and G2. The peaks were detected using the densityClust R package after removing the cells reaching the FACS saturation (either FLS-A or FSC-H values equal to 1,000). We categorized the values of FLS-A, which correlate with the DNA quantity, to estimate the ploidy according to the following scheme: strains with G1 cell values between 39 and 181 and G2 values between 148 and 255 were labelled as haploid; strains with G1 cell values between 145 and 265 and G2 values between 295 and 500 were labelled as diploid; strains with G1 cell values between 245 and 355 and G2 values between 500 and 700 were labelled as triploid; strains with G1 cell values between 295 and 500 and G2 values between 700 and 905 were labelled as tetraploid; strains with G1 cell values between 395 and 605 and G2 values over 905 were labelled as over 4n; strains with other combinations of values have been manually evaluated.

Second, study of sequencing coverage: systematic analysis of the coverage depth along the genome was performed with 1-kb non-overlapping sliding windows, which enabled the survey of chromosomal copy number variations as well as segmental duplications. The ratio between the coverage of the aneuploid chromosomes and the rest of the genome was also used to validate the ploidy of isolates.

Third, investigation of the allele balance ratio associated with heterozygous SNPs, as heterozygous sites should fit an expected range of ratios according to the copy number of the chromosome being considered (see ‘SNPs filtering and matrix’).

The precise locations of segmental duplications were manually investigated in the .vcf files (Supplementary Table 16).

SNP filtering and matrix

For each sample, variants were first called with GATK HaplotypeCaller (see ‘Reads mapping and variant calling’). At this stage, isolates with less than 5% of heterozygous sites (average percentage of heterozygous sites detected in a sample of 104 haploid and/or homozygous diploid isolates) were considered as homozygous. The raw files were then post-processed to deal with highly confident variants to be included in our complete SNPs matrix, based on both coverage and allele balance information:

First, a minimal coverage depth of 50 × was required for a SNP to be retained for the 918 isolates that were sequenced in this study; this coverage depth was lowered to 10 × for the other 93 previously sequenced isolates.

Second, for the haploid and homozygous isolates (< 5% of heterozygous sites), the fraction of heterozygous SNPs detected was considered as representing false positives and was therefore filtered out.

Third, for heterozygous isolates, heterozygous sites were filtered according to their allele balance ratio (ABHet). The thresholds for allele balance ratios were determined according to the allelic frequency distribution all over the heterozygous samples at each level of ploidy (from 2n to 5n). A heterozygous site was rejected when its ratio did not fit the expected range according to the copy number of the chromosome being considered (or region, in the case of segmental duplication).

The joint calling method of GATK was run with the cleaned .vcf files to create a complete genotyping matrix (.gvcf format, see ‘Data availability’). This matrix of SNPs included 1,625,809 segregating sites across our 1,011 isolates (Supplementary Table 1).

SnpEff (v.4.1)54 was used to annotate and predict the effect of the variants. Non-synonymous SNPs predicted as deleterious by SIFT (v.5.2.2)17 as well as nonsense mutations were considered as deleterious for protein function. Insertions and deletions were considered to cause frame shifts when their sum produced a number not divisible by three in a single gene.

A sequence representative of each isolate was constructed by inserting these filtered SNPs into the reference sequence with GATK FastaAlternateReferenceMaker (see ‘Data availability’).

Pangenome

De novo genome assembly

We used Abyss software (v.1.5.2)55 with the option ‘-k 64’ to produce the de novo assemblies (see ‘Data availability’). The pre-assembly filtering step was performed with condetri v.2.2 to remove the 6 bases closest to the 5′ end and to discard low-quality 3′-end bases of reads. The resulting assemblies had a median N50 of 136 kb and a median number of contigs of 3,259. The median length of the genome is 12.1 Mb and the median GC content is 38.06 (Supplementary Table 17).

Detection of non-reference material

We set up a custom pipeline to identify non-reference genome material. Each genome was aligned to the reference sequence (S288C, version R64-1-1) using blastn with the following settings: ‘-gapopen 5 -gapextend 5 -penalty -5 -reward 1 -evalue 10 -word_size 11 -no_greedy’.

The CDH and CFH strains were excluded from the identification of non-reference genome material owing to the presence of Staphylococcus epidermis contamination. The sequences aligned with an identity greater than 95% were divided in three categories to be further processed. If the aligned sequence belonged to contigs shorter than 100 bp or if the aligned sequence was up to 200 bp and belonged to a contig with a length that was shorter than the length of the alignment plus 75 bp, the contig was discarded. If the aligned sequence was in the range 200–1,500 bp only the aligned sequence was discarded. If the aligned sequence was longer than 1,500 bp, it was divided into segments of 250 bp. Each sub-sequence was aligned again to the reference and discarded if found with an alignment identity of over 95% on an alignment length of at least 187 bp (75% of the subsequence length). After this step the relative position of the retained sequence has been evaluated. If two or more of them belonged to the same contig and were separated each other by less than 100 bp, the sequence from the starting of the first one to the end of the final one was kept as a whole. In the subsequent step, all the kept sequences from the 1,011 genomes were sorted for length in decreasing order. The set of sequence was then aligned against itself (with the same criteria as the first step) to eliminate repeated elements. When two sub-sequences were found to have an alignment identity of over 95%, the one belonging to the shorter sequence was eliminated. The process led to 12,325 sequences for 9.3 total Mb.

Annotation of non-reference material

To annotate ORFs in dispensable regions, we set up an integrative yeast gene annotation pipeline by combining different existing annotation approaches, which gave rise to an evidence-leveraged final annotation24. We independently ran the three individual components (RATT56, YGAP57 and MAKER58) for gene annotation, and subsequently integrated their results using EVM59. Proteomes of the Saccharomyces species (S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. arboricolus, S. uvarum and S. eubayanus) were retrieved and used in our annotation pipeline to provide protein alignment support for annotated gene models60,61.

Pangenome definition

We compiled the pangenome by adding the 2,245 non-reference ORFs annotated here to the 6,713 genomic reference ORFs listed in the set ‘orf_genomic_all’ from the Saccharomyces Genome Database (SGD) database (updated 13 January 2015, https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/). Three RDN genes were also added (RDN18-1, RDN25-1 and RDN37-1) from the set ‘rna_genomic’ available from the SGD database (https://downloads.yeastgenome.org/sequence/S288C_reference/rna/). We applied a graph-based pipeline to this set of ORFs, to remove duplicate and closely related sequences. This step also removed overlapping ORFs present in the ‘orf_genomic_all’ SGD dataset. A disconnected graph was created in which each node is an ORF and each edge represents an alignment identity of over 95% on at least 75% of the sequence of the smaller ORF in the couple. Each connected subgraph represents a single ORF family. For each of these families a representative has been chosen. The connectivity has been computed for each node. The first choice for the representative was the most central, non-dubious reference ORF, if any of them were present. The second choice was the most central reference ORF; if only non-reference ORFs were present in the family, the most central of these was taken.

This led to a catalogue of 7,796 non-redundant ORFs (see ‘Data availability’), which represent the S. cerevisiae pangenome (Supplementary Table 3). Among the reference ORFs, the number of similar ORFs collapsed for each final ORFs has a wide range (up to 67, which is the cluster of Ty1 elements), although usually this number does not exceed 2. Other large clusters are the Y′ elements one (59 ORFs) and the Ty2 elements (27 ORFs). Out of the 6,713 reference ORFs, 5,679 were not redundant and 1,032 were collapsed into 402 unique ORFs; 89% of these unique ORFs (n = 357) are duplicated ORFs.

Pangenome CNVs

To assess the copy number of each ORF of the pangenome, we mapped the reads from each strain to the pangenomic ORFs with BWA,using default parameters and the option –U 0. The result was then filtered using samtools view with the options –bSq 20 –F 260. The median coverage for each ORF was taken as coverage for the ORF in the specific isolate. The ratio between the values of individual ORFs and the values of genome coverage on the reference of the isolate (as the median of the median coverage for each nuclear chromosome) was considered as the copy number for the haploid genome (see ‘Data availability’).

The mapping was also used as a confirmation step for the presence of the ORFs in each strain, leading to the identification of 4,940 ORFs present in the 1,011 strains of the collection, representing the core genome plus 2,856 ORFs present in different subsets of the population. Fifty-two ORFs were removed because they were present in single strains with low coverage (~10% of the genome wide coverage) and were likely to be contaminations from Escherichia coli and Clavispora lusitaniae. Eighty-nine other ORFs that did not have sufficient coverage were kept in the pangenome, but were not used for subsequent analyses owing to poor mapping. Three of the core ORFs are present but not annotated in the S288C reference and were annotated by our annotation pipeline as 584-snap_masked-1700-AIE_1, 610-snap_masked-2999-BGP_1 and 611-snap_masked-3001-BGP_1.

To evaluate the difference between domesticated clades and wild clades, we normalized the data by calculating the clade copy-number median for each ORF to avoid sample bias. The distributions of medians in the domesticated and wild clades were then compared using the Mann–Whitney–Wilcoxon test (R function wilcox.test) (Supplementary Fig. 28 and Supplementary Tables 18, 19).

Inference of pangenome origin

We constructed a local ORFs database for 57 representative species that deeply probed both closely related Saccharomyces species as well as a highly divergent yeast species62 (Supplementary Table 20). In addition, we added the ORFs of 12 representative S. cerevisiae and S. paradoxus strains with complete genome sequenced by long reads (PacBio)24. For each annotated variable ORF, we first performed a BLASTN search (‘-evalue 1E-6’) against this local ORFs database to find its best hit. ORFs without hits in our local yeast ORFs database underwent a further round of BLAST searching (-evalue 1E-6) against the NCBI non-redundant database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). Based on the sequence identity and query coverage of the top hits, we classified the variable genes into different categories.

d N /d S

For all isolates, sequences of the protein-coding genes were inferred from the filtered SNPs and inserted into the reference sequence with GATK FastaAlternateReferenceMaker. For each gene, the coding sequences were aligned and the ratio of nonsynonymous to synonymous polymorphisms (d N /d S ) was computed with the yn00 program in PAML software63. Median values were used for comparison.

Genomic diversity characterization

Genomic and genetic distances

The 1,544,489 biallelic segregating sites were used to construct a neighbour-joining tree (Fig. 1), using the R packages ape and SNPrelate. The .gvcf matrix was first converted into a .gds file and individual dissimilarities were estimated for each pair of individuals with the snpgdsDiss function. The bionj algorithm was then run on the distance matrix that was obtained.

The genomic content distance (see ‘Data availability’) has been calculated as the number of ORF differences in the pangenome presence/absence profile (that is, the number of ORFs present in only one strain for each pairwise strain comparison) (see ‘Data availability’).

Genetic diversity

As an estimate of the scaled mutation rate, we computed π, the average pairwise nucleotide diversity θ w , the proportion of segregating sites and Tajima’s D value, which represents the difference between π and θ w .

Variscan 2.064 was run (runmode = 12, 10-kb non-overlapping windows) on multiple alignments of the concatenated chromosomes that were representative of the isolates.

Model-based ancestry

Model-based ancestry estimation was performed on the biallelic SNPs using ADMIXTURE v.1.2321 in unsupervised mode.

Principal component analysis

Principal components analysis on the biallelic SNPs was performed using EIGENSOFT v.6.0.1. The ‘-w’ argument was used to calculate the principal components using only a subset of the samples, with the remaining samples then being projected onto the resulting components.

Discriminant analysis of principal components

The matrix of presence/absence of ORFs in the population has been analysed using the discriminant analysis of principal components (DAPC) algorithm implemented in the R package adegenet 2.0.1. DAPC describes clusters by maximizing the between-cluster variance while minimizing the within-cluster variance. The number of components retained for the principal component analysis calculation was 150, accounting for > 88% of total variance. For the subsequent DAPC calculation, the alpha-score indicates 25 as the optimal number of discriminant principal components to be retained. Clustering was performed using the K-means with different number of groups (n = 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50).

Linkage disequilibrium

The Plink package65 was used to compute r2, the correlation coefficient between pairs of loci that stands as a measure of association for linkage disequilibrium. All pairs of polymorphic sites were investigated through .map and .ped files generated with vcftools66, excluding SNPs with a MAF lower than 5%.

We averaged r2 based on the SNP distance (100-bp intervals) over 25-kb regions and calculated the half-length of r2, which is the distance at which linkage disequilibrium decays to half of its maximum value.

Loss of heterozygosity

Heterozygous isolates were investigated for LOH regions with an R script generated in-house (see ‘Data availability’). Regions over 50 kb with less than 10 heterozygous sites per 50 kb were considered to be under LOH (Supplementary Table 7).

Saccharomyces rooted tree

To construct the tree, we used 22 S. cerevisiae isolates representative of the species genetic diversity that were sequenced with Oxford Nanopore technology67. We annotated these 22 assemblies with the pipeline described above. The annotated protein-coding genes were pooled together with the S. cerevisiae reference genome (SGD R64-1-1) and another 18 yeast strains for orthology identification. These 18 other yeast strains included 7 S. cerevisiae strains, 5 S. paradoxus strains and 6 out-group strains from other Saccharomyces yeast species as previously described24. The orthology identification was carried out using Proteinortho (v.5.15)68 and synteny information was considered (the PoFF feature of Proteinortho). This leads to the delineation of 2,018 1-to-1 orthologous groups across all the 41 sampled genomes. For each orthologous group, the protein sequences across the 41 strains were aligned with MUSCLE (v.3.8.1551)69, and the resulting protein alignment was further used to guide the corresponding CDS alignment using PAL2NAL (v.14)70. A concatenated multi-gene matrix was built for the CDS alignment of these 2,018 orthologous groups, which was further partitioned based on codon positions (for example, 1st, 2nd and 3rd codon positions). We used RAxML (v.8.2.6) to build the maximum likelihood tree based on the GTRGAMMA model with 100 rapid bootstraps. As an alternative, we also performed phylogenetic analysis using the consensus tree approach, in which we built individual gene trees for each of the 2,018 orthologous groups using the same method described for the concatenated tree. These individual gene trees were further summarized by ASTRAL (v.4.7.12)71 to produce the ‘species tree’. Both the concatenated tree and the consensus tree were visualized in FigTree (v.1.4.2) (http://tree.bio.ed.ac.uk/software/figtree/).

Phenotyping

Quantitative high-throughput phenotyping was performed using end-point colony growth on solid medium. Strains were pregrown in flat-bottom 96-well microplates containing liquid YPD medium. The replicating ROTOR HDA benchtop robot (Singer Instruments) was used to mix and pin strains onto a solid YPD matrix plate at a density of 384 wells. The matrix plates were incubated overnight at 30 °C to allow sufficient growth and replicated on 36 medium conditions, including YPD 30 °C as the pinning and growth control condition (Supplementary Table 2). Each isolate was present in quadruplicates on the corresponding matrix (interplate replicates) and at two different positions (intraplate replicates). The plates were incubated at 30 °C for 40 h and were scanned at a resolution of 600 dpi and 16-bit grayscale. Quantification of the colony size from plate images was performed using the software package Gitter72. Each value was normalized using the growth ratio between stress media and standard YPD medium at 30 °C (see ‘Data availability’). Pairwise Pearson’s correlations of fitness trait values between replicates were calculated for each condition.

Genome-wide association studies

Mixed-model association analysis was performed using FaST-LMM v.2.0719. We used the normalized phenotypes by replacing the observed value with the corresponding quantile from a standard normal distribution, as FaST-LMM expects normally distributed phenotypes. In this step, we used the markers showing a MAF > 5%. We also filtered missing genotypes as ‘fs’: an arbitrary threshold has been set to exclude all variants present in less than 1,000 individuals for the total matrix.

The command used for association was the following: ‘fastlmmc -bfile $snp -bfileSim $snp -pheno $pheno -out $assoc_file –verboseOutput’.

The mixed model adds a polygenic term to the standard linear regression designed to circumvent the effects of relatedness and population stratification. To quantify the extent of the bulk inflation and the excess false positive rate, we computed the genomic inflation factor, λ, for each condition (Supplementary Fig. 34). This factor is defined as the ratio between the median of the empirically observed distribution of the test statistic on the expected median. For example, the λ for a standard allelic test for association is based on the median (0.456) of the 1-degree-of-freedom χ2 distribution. Under a null model of no association and unlinked variants, the expectation is for the λ to be 1. A λ greater than 1 indicates inflated P values of association, possibly owing to a confounding factor that has not been accounted for.

We estimated a trait-specific P value threshold for each condition by permuting phenotypic values between individuals 100 times. The significance threshold was the 5% quantile (the 5th lowest P value from the permutations). Using this method, variants passing this threshold have a 5% family-wise error rate (Supplementary Fig. 33).

The estimations of genome-wide heritabilities were completed by dividing the genetic variance of the null model by the total variance of the null model (genetic variance and residual variance), computed using FaST-LMM (Supplementary Fig. 6). The values reported here are based on the quantile-normalized phenotypes. To compute the variance explained by our significantly associated markers, we included them in the covariance matrix with the ‘-bfileSim’ option and performed the same calculation again (Fig. 6).

Reporting Summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this paper.

Code availability

The 1002 Yeast Genome website (http://1002genomes.u-strasbg.fr/files/) provides access to ‘-scripts.tar.gz’, which contains the Perl and R scripts used to (1) extract all the non-reference material from a large set of assemblies; (2) collapse ORFs that are more similar than a specific threshold; and (3) detect regions of LOH.

Data availability

All the strains are available on request except for 11 isolates, which cannot be distributed (see Supplementary Table 1).

The Illumina reads are available in the Sequence Read Archive under accession number ERP014555.

The 1002 Yeast Genome website (http://1002genomes.u-strasbg.fr/files/) provides access to:

-1011Matrix.gvcf.gz: all SNPs and indels called at the population level (.gvcf format).

-1011GWASMatrix.tar.gz: the matrix used for GWAS, which contains all biallelic positions known for 1,000 isolates or more with MAF > 5% as well as CNVs (encoded 0/1/2 for absence/0.5–1 copy/multiple copies) (.bed,.bim and.fam formats).

-1011DistanceMatrixBasedOnSNPs.tab.gz: for each pair of strains, the value is the percentage, based on SNPs, of non-identical bases. Heterozygous differences were half-weighted compared to the homozygous differences.

-1011DistanceMatrixBasedOnORFs.tab.gz: for each couple of strains, the value is the number of ORFs that are present in only one out of the two isolates.

-1011Assemblies.tar.gz: de novo assemblies of the 1,011 isolates (.fasta format).

-allReferenceGenesWithSNPsAndIndelsInferred.tar.gz: sequences of the genes found in the reference genome in which SNPs and indels have been automatically inferred for each isolate.

-allORFs_pangenome.fasta.gz: sequences of the 7,796 pangenomic ORFs (.fasta format).

-genesMatrix_PresenceAbsence.tab.gz: pattern of presence and/or absence of pangenomic ORFs for each isolate, in which the presence of an ORF is marked as 1 and its absence is marked as 0.

-genesMatrix_CopyNumber.tab.gz: estimated copy number for each pangenomic ORF, per isolate. Values are given for the haploid genome, so that non-integer values can be found (different copy number on homologous chromosomes).

-genesMatrix_Frameshift.tab.gz: for each isolate, the presence or absence (indicated by 1 or 0, respectively) of homozygous frameshift is reported in each gene, based on the number of bases affected by indels.

-phenoMatrix_35ConditionsNormalizedByYPD.tab.gz: growth ratio between 35 stress conditions and standard YPD medium at 30 °C, for 971 isolates.

All other data are available from the corresponding authors upon reasonable request.