Plant material

The cultivar ‘Camarosa’ was selected because of its importance to the industry; historically, it has been one of the most widely grown short-day varieties worldwide, and it remains an important genotype in breeding programs. The haploid genome size (~813.4 Mb) was estimated through flow cytometry with four technical replicates at the Flow Cytometry Core at Benaroya Research Institute at Virginia Mason (Supplementary Dataset 3).

Genomic sequencing

High-molecular-weight genomic DNA was isolated from young leaf tissue, after a 72-h dark treatment, through a modified nuclei-preparation method75,76, and the quality was verified through pulsed-field gel electrophoresis. A total of five PacBio 20-kb libraries were generated with a SMRTbell Template Prep Kit (PacBio) and were sequenced with 67 SMRT cells on the PacBio RSII platform at the UC Davis DNA Sequencing Facility. A total of 67 Gb (~82.4×) of PacBio sequence data was generated with an N50 read length of 17,699 bp (Supplementary Table 3). DNA fragments longer than 50 kb were used to construct a 10X Gemcode library with a Chromium instrument (10X Genomics) and sequenced on a HiSeqX system (Ilumina) with paired-end, 150-bp reads at the HudsonAlpha Institute for Biotechnology. A total of ~95 Gb (~117× fold coverage) of 10X Chromium library data was sequenced (Supplementary Table 1). Finally, five size-selected Illumina genomic libraries ranging from 470 bp to 10 kb were constructed (Supplementary Table 1). The ~470-bp and ~800-bp libraries were made with a Illumina TruSeq DNA PCR-free Sample Preparation V2 Kit. The two ~470-bp libraries were designed to produce ‘overlapping libraries’ after sequencing with paired-end, 265-bp reads on an Illumina Hiseq2500 system, producing ‘stitched’ reads of approximately 265 bp to 520 bp in length. To increase sequence diversity and depth, we constructed three separate mate-pair (MP) libraries with jumps of 2–5 kb, 5–7 kb, and 7–10 kb, with an Illumina Nextera Mate-Pair Sample Preparation Kit. The 800-bp library was sequenced on an Illumina HiSeq2500 system with paired-end, 160-bp reads, and the MP libraries were sequenced on an Illumina HiSeq4000 system with paired-end, 150-bp reads. A total of ~370 Gb (~455× fold coverage) of additional Illumina sequencing data was generated (Supplementary Table 1). Illumina library construction and sequencing were conducted at the Roy J. Carver Biotechnology Center, University of Illinois at Urbana-Champaign.

Genome assembly

The genome was assembled with the DeNovoMAGIC software platform (NRGene), a DeBruijn-graph-based assembler designed for higher polyploid, heterozygous and/or repetitive genomes32,77. The Chromium 10X data were used to phase haplotypes and support scaffold validation and further elongation of the phased scaffolds. Dovetail HiC libraries were prepared as described previously78 and sequenced on an Illumina HiSeqX system with paired-end, 150-bp reads to ~401× sequence depth of the genome (Supplementary Fig. 2). The initial de novo assembly, raw genomic reads, and Dovetail HiC library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity-ligation data to scaffold genome assemblies to chromosome-length pseudomolecules79. After HiRise scaffolding, the sequences were gap filled with PacBio reads with PBJelly33. Gaps filled with PacBio sequences were polished with Pilon (v 1.22)80 with Illumina paired-end data. Illumina reads were quality-trimmed with Trimmomatic81 and aligned to the draft contigs with bowtie2 (v 2.3.0)82 with default parameters. Parameters for Pilon were modified as follows: --flank 7, --K 49, and --mindepth 20. Pilon was run recursively three times, and there were minimal corrections in the third round, thus supporting accurate indel correction. A published genetic map34 and syntenic analyses against the F. vesca37 genomes with SynMap within CoGe83 were used to identify any assembly errors and haplotype variants, and to assign homoeologous chromosomes sets. Additional assembly details and results are summarized in the supplementary information.

Tissue collection, RNA library preparation, and sequencing

Plant tissue samples (flower before anthesis, flower at anthesis, leaf collected during the day and at night, leaves treated with methyl jasmonate (30 min, 4 h, and 24 h after treatment), runner, and salt-treated and untreated roots) were collected from Fragaria × ananassa cultivar ‘Camarosa’ grown in a growth chamber and immediately flash frozen in liquid nitrogen. Leaf tissues were also collected from wild diploid species grown in a growth chamber for phylogenetic analyses (Supplementary Table 7). Total RNA was isolated with a KingFisher Pure RNA Plant Kit (Thermo Fisher) and quantified with a Qubit 3 fluorometer (Thermo Fisher). RNA libraries were prepared with the KAPA mRNA HyperPrep Kit protocol (KAPA Biosystems). All samples were submitted to the Michigan State University Research Technology Support Facility Genomics core and sequenced with paired-end, 150-bp reads on an Illumina HiSeq 4000 system.

Transcriptome assembly and translation

Reads were cleaned with Trimmomatic v 0.32 (ref. 81) with adaptor trimming for TruSeq3 paired-end reads with a 1-bp mismatch, a palindrome clip threshold of 30, and a simple clip threshold of 10. Reads were then filtered on the basis of an average phred score calculated from a sliding window of 10 bp with a minimum threshold of 20 (Supplementary Dataset 4). The quality of trimmed reads was assessed afterward with FastQC84. Genome-guided and de novo transcriptome assemblies were generated with Trinity v 2.2.0 (ref. 85) for the genome annotation/expression and phylogenetic analyses, respectively. For genome annotation and expression analyses, reads were aligned to the Fragaria × ananassa cultivar ‘Camarosa’ genome with STAR v 2.5.3a86 with default options, except for --alignIntronMax, which was set to 10000. For genome annotation, the coordinate-sorted BAM output files from STAR were used for the genome-guided transcriptome assembly, and name-sorted SAM files were used for gene expression analysis (HTSeq in section 3). For the diploid species libraries used in the phylogenetic analyses, because transcriptome libraries were generated with a stranded method, the ‘SS_lib_type’ parameter with ‘RF’ option was used in the assembly. In addition, reads were normalized to a maximum read coverage of 100 with ‘normalize_max_read_cov’ in Trinity. The normalization option, which decreases the quantity of input reads for highly expressed genes, was used to improve assembly efficiency87. For homoeolog expression bias (HEB) analyses (described in the section below), counts of uniquely mapping reads were generated with HTSeq v 0.6.1 (ref. 88) with default options of htseq-count, except for feature type, which was set to ‘gene’ for all RNA-seq datasets of ‘Camarosa’. The fragments per kilobase per million reads mapped (FPKM) values were derived with the standard formula for FPKM = (read count/’per million’ scaling factor)/gene length in kilobases. For phylogenetic analysis, according to McKain et al.89, reads were aligned to the assembled transcripts with bowtie v 1.1.0 (ref. 90), and transcript abundance was estimated with RSEM v 1.2.29 (ref. 91) through the align_and_estimate_abundance.pl script packaged with Trinity. Transcripts were filtered by FPKM, an output from the aforementioned Perl script, with a minimum threshold of 1.0% of fragments per isoform mapped, as implemented in the filter_fasta_by_rsem_values.pl script. Filtered transcripts were BLASTed against the Fragaria vesca v 2.01 coding sequences with TBLASTX with a minimum e value of 1 × 10–10. The RefTrans package (see URLs) was used to translate assembled transcripts by filtering BLAST hits to identify the best hit with at least 75% bidirectional overlap between the transcript and F. vesca coding sequences. Best hits were used to guide translations with GeneWise (Wise2 v 2.2.0)92. The longest translations were used in downstream analyses.

Gene annotation

The genome was annotated with the MAKER-P annotation pipeline36. Protein sequences (Araport11 and UniprotKB plant database), expressed sequence tags (NCBI), and ten mRNA-seq datasets (described below) and additional RNA-seq data for Fragaria × ananassa downloaded from NCBI-SRA (BioProject PRJNA394190; red ripening fruit) were used as evidence during annotation. The RNA-seq datasets were assembled into transcripts through the StringTie genome-guided approach93. A custom repeat library (‘Repeat annotation’ section below) and MAKER repeat library94 were used for genome masking. Ab initio gene prediction was performed with the gene predictors SNAP95 and Augustus96, which were previously iteratively trained for F. vesca37. During annotation, gene models with annotation edit distance <1.0 were included in the MAKER gene set and scanned for the presence of protein domains. The predicted gene models were further filtered to remove those with TE-related domains. Briefly, the protein-coding genes were searched (BLASTp, e = 10–10) against a transposase database from a previous study36, and if more than 50% of gene length aligned to the transposases, the gene was removed from the gene set. However, if 60% or more of the amino acid matches were due to only three individual amino acids, the alignment was considered to be caused by low complexity and was excluded. In addition, to assess whether core plant genes were annotated, the gene set was searched against the BUSCO v 2 (ref. 35) plant dataset (embryophyta_odb9). lncRNAs, including long intergenic noncoding RNAs, antisense overlapping transcripts, and sense overlapping transcripts, were identified with the Evolinc lncRNA-discovery pipeline (v 1.5.1)97. Transcripts with fewer than three reads per base pair were discarded. Putative lncRNAs with similarity (BLASTn e value <1 × 1010) to known TEs or rFAM’s catalog (v 13.0)98 of housekeeping RNAs were removed.

Repeat annotation

The Fragaria × ananassa genome was searched for LTR-RTs with LTRharvest99 with parameters ‘-minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1 -similar 85 -vic 10 -seed 20 -seqids yes’ and LTR_finder100 with parameters ‘-D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.9’. The identified LTR-RT candidates were filtered with LTR_retriever101 with default parameters. Miniature inverted TEs (MITEs) were identified with MITE-Hunter102. Candidate MITEs were manually checked for TSD and TIR, which were used for superfamily classification. Those with ambiguous TSD and TIR were classified as unknowns. The Fragaria × ananassa genome was then masked with both MITE and LTR libraries through Repeatmasker103 (see URLs), and other repetitive elements were identified with Repeatmodeler104 (see URLs). The repeats were then grouped into two categories: sequences of known identity and sequences of unknown identity. The latter were then searched against the transposase database, and if they had a match, they were included in the TE library. The library was further filtered with ProtExcluder36 and an in-house Perl script to exclude gene fragments. The final TE library was used to annotate the Fragaria × ananassa genome with RepeatMasker103 with parameters ‘-q -no_is -norna -nolow -div 40’. Annotation results were summarized with the ‘famcoverage.pl’ script from the LTR-retriever package101.

Organellar genome annotation

The chloroplast genome was annotated with Verdant, a web-based software suite specifically designed for plant chloroplast genomes105. Automated annotation of protein-coding genes, tRNAs, and rRNAs was completed with annoBTD (see URLs). Five Rosaceae plastomes in the Verdant database were selected as a reference for annotation, including the Fragaria vesca ‘Hawaii 4’ chloroplast genome37. The previously identified ORFs were BLASTed against the reference genomes with TBLASTX106 with an e-value cutoff of 0.1 and a cutoff of 50% identity between references and high-scoring segment pairs. The best reference for each ORF was used for annotation. An optimized BLASTN106 was used to identify and annotate tRNAs and rRNAs on the basis of reference genomes. The best-scoring references were used to annotate the RNA. Finally, the boundaries of each feature was identified on the basis of the sequence and positional information for the orthologous features from the five reference chloroplast genomes (Supplementary Fig. 5). The mitochondrial genome was annotated with the webserver for Mitofy (see URLs), a program designed to annotate the genes and tRNAs in the mitochondrial genomes of seed plants107. Mitofy uses NCBI-BLASTX to annotated genes on the basis of databases of 41 protein-coding genes and uses NCBI-BLASTN and tRNAscan-SE108 to annotate tRNAs and rRNAs on the basis of databases of 27 tRNAs and 3 rRNAs found in seed-plant mitochondrial plant genomes. The annotated plastid and mitochondrial genomes have been deposited in Dryad (see URLs).

Synteny and comparative genomics

The ‘Camarosa’ and F. vesca37 genomes were aligned in CoGe’s SynMap program with LAST83. The maximum distance between two matches was set to 20 genes, and the minimum number of aligned pairs was set to ten genes. Neighboring syntenic blocks were merged with ‘Quota Align Merge’109, with the maximum distance between two blocks set to 40 genes. Syntenic depth was calculated with ‘Quota Align’, and the ratio of coverage depth for F. vesca to F. ananassa gene was set to 1:4. Tandemly duplicated genes were identified and filtered from CoGe outputs with a max distance of ten genes. Fractionation bias was then calculated, with the maximum query chromosomes set to 28 and the maximum target chromosomes set to seven. The analyses can be regenerated with CoGe (see URLs). The two genomes were also aligned with MUMmer v 3.2 (ref. 110) to identify homoeologous exchanges (Supplementary Table 10) with parameters (nucmer --maxmatch -l 80 -c 200) and visualized with dotPlotly (see URLs).

Phylogenetic analyses

Translated transcriptomes and whole-genome protein-coding genes for Fragaria × ananassa, F. vesca v 2.01, A. thaliana TAIR10 (ref. 111), and Malus domestica v 1.0 (ref. 112) (Phytozome v 12)113 were orthogrouped with Orthofinder v 0.3 (ref. 114) with Diamond v 0.8.36 (ref. 115) for similarity searches. Orthogroups were filtered so that a minimum of five unique accessions were present. Coding sequences and amino acid translations were separated into orthogroup-specific FASTA files. Amino acid sequences were aligned with MAFFT v 7.215 (ref. 116) with the ‘auto’ parameter, and PAL2NAL v 14 (ref. 117) was used under default parameters to create a codon alignment from MAFFT-aligned amino acids. Codon alignments were filtered by removal of alignment columns with 90% or more gaps and transcripts with unaligned lengths less than 30% of the alignment length, with scripts provided with McKain et al.89. Orthogroup trees were reconstructed with RAxML v 8.0.6 with 500 bootstrap replicates under the GTR + gamma evolutionary model. All 108,087 protein-coding genes from the F. x ananassa ‘Camarosa’ genome were used in the initial orthogrouping. After the filtering of orthogroups with fewer than five taxa, 51,737 ‘Camarosa’ genes remained in 8,405 gene trees. A total of 19,302 unique loci identified in large syntenic blocks forming 18,839 paralogous pairs were used to assess the evolutionary history of the subgenomes. Outgroups were chosen from either A. thaliana or M. domestica, with preference given to A. thaliana as an outgroup. To assess the evolutionary history of octoploid strawberry’s subgenomes, a novel tree-searching algorithm was developed called ‘phylogenetic identification of subgenomes’ (PhyDS; see URLs). The only parameters needed for PhyDS are a list of taxa, if any, to ignore in the gene trees and a minimum bootstrap value to set the threshold for acceptable subtrees. In this analysis, only genes from the ‘Camarosa’ genome were ignored (that is, PhyDS did not stop when it encountered an Fxa gene other than a sister paralog) to identify each of the diploid progenitors of octoploid strawberry. Results from varying bootstrap support cutoffs are provided. These homoeologs were than mapped back to each of the assembled chromosomes and, on the basis of their relative frequencies, used to assign each chromosome to a diploid progenitor species (Supplementary Table 8).

Gene expression analyses

HEB was assessed with the likelihood-ratio tests described in ref. 23, by analysis of the anther, root, and leaf transcriptome data. This test consists of a set of three nested hypotheses. The null hypothesis, H 0 , is that the homoeologs are expressed at equal levels after normalization for gene length and sequencing depth. The first alternative hypothesis, H 1 , is that one of the homoeologs is more highly expressed in all tissues, such that the difference can be explained by a single scaling factor. The second alternative hypothesis, H 2 , is that the homoeologs are expressed unequally and inconsistently across the three tissues. Homoeolog pairs for which H 0 can be rejected for H 1 , but H 1 cannot be rejected for H 2 , are therefore cases in which one of the homoeologs appears to be up- or downregulated consistently throughout the organism. For the first test, the Benjamini–Hochberg118 correction for multiple testing was applied. For the second test, because the question was being unable to reject a hypothesis, no correction was made. Both tests used a 1% significance level. Pairwise genomic alignments, described above, were used to identify homoeologs for each of the subgenomes, retained duplicate genes from tandem duplications, and orthologous genes to A. thaliana111, on the basis of ortholog assignments in F. vesca37. Thes complete list of Fragaria–Arabidopsis orthologs was then filtered to genes with functional data in the AraGEM Arabidopsis metabolic72,119 and STRING global protein interaction network120. These gene lists were used to investigate subgenome- and pathway-level-specific expression in fruit with an available transcriptome dataset in NCBI-SRA (BioProject PRJNA394190) (Supplementary Dataset 2).

Analysis of disease-resistance-gene familie

NBS-LRR genes were detected with HMMER v 3.1 (ref. 121) with default settings, by searching the protein sequences of the Fragaria × ananassa genome against the raw hidden Markov model for the NB-ARC-domain family downloaded from Pfam (family ID PF00931)122. Only genes identified by both HMMER and BLAST were used for subsequent analysis. TIR subdomains were detected with PfamScan on default settings by searching the identified NB-ARC genes against the Pfam-A hidden Markov model. The 423 Fxa NB-ARC-domain-containing proteins were batch-searched in the NCBI Conserved Domain Database (see URLs)123 and Pfam database (see URLs). Results from the CD database were used to assign the gene models that contained CC, TIR, RPW8, or ‘other’ (none of the three established N-terminal domains); gene models were further mapped onto the assembled octoploid genome to assign positions (Supplementary Fig. 12). The CD results were then filtered to remove established R-gene domains (CC, TIR, RPW8, LRR, and NB-ARC), thus resulting in a list of potential integrated domains (Supplementary Dataset 1). Eight Fxa proteins with predicted Sec7/ADP-ribosylation-factor and G-nucleotide-exchange-factor domains were aligned by ClustalW and FastME 2.0 (ref. 124), and their illustrated domain organization is displayed in Supplementary Fig. 13. The full protein sequences of the 423 Fxa NB-ARC-domain-containing proteins were aligned with MUSCLE v 3.8.31 (ref. 125) under default settings. This alignment was trimmed with trimAl v 1.4.rev22 build 2015-05-21 (ref. 126) under default settings. An unrooted maximum-likelihood tree was constructed with RAxML v 8.2.11 (ref. 127) with the PROTGAMMA substitution model. The tree was visualized with the APE package v 4.1 (ref. 128) in R v 3.3.3 (ref. 129) (see URLs).

Statistical analysis

The comparison of homoeolog-expression abundance between the dominant subgenome and the three submissive subgenomes was carried out with a likelihood-ratio test and combined with Benjamini–Hochberg correction for multiple testing with a 1% significance level. The Kolmogorov–Smirnov test was used to determine which subgenome had the lowest-overall TE densities near genes. The χ2 test, with three degrees of freedom, was used to analyze the subgenome bias of disease-resistance genes. Bootstrapping, with 500 replicates under the GTR + gamma evolutionary model, was used to assess node support in trees generated by phylogenetic analyses.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.