Genome sequencing

The 454 (single read, and 8-, 12- and 20-kb mate pairs) genomic libraries were prepared following the manufacturer’s protocol (GS FLX Titanium Library Preparation Kit, Roche Diagnostic, USA) using genomic DNA from a single homozygous doubled haploid YY male from the Swanson River (Alaska) clonal line29,30 (Supplementary Methods). Libraries were quantified and evaluated using a 2100 bioanalyzer (Agilent Technologies, USA). Each library was sequenced using Pico Titer Plates on a 454 GSFlx instrument with Titanium or long read chemistry (Roche Diagnostic, USA). Genomic Illumina libraries were constructed according to the Illumina standard procedure for shearing of genomic DNA, end repair and adaptor ligation. The enrichment PCR was performed using Platinum Pfx DNA polymerase (Invitrogen). Amplified library fragments were size-selected to 300–600 bp on a 3% agarose gel. Each library was sequenced using 76 or 100 base-length read chemistry in paired-end and single-read flow cells on the Illumina GA IIx/HiSeq2000 (Illumina, USA).

Genome assembly

Public Sanger BAC-end sequences (BES)31 and Roche/454 reads (Supplementary Table 5) were assembled together with Newbler (version MapAsmResearch-04/19/2010-patch-08/17/2010). The total size of the resulting assembly was 1.9 Gb with a scaffold N50 of 384 kb (half of the assembly is contained in 1,014 scaffolds longer than 384 kb, Supplementary Table 6). Sequence quality of scaffolds from the Newbler assembly was improved by automatic error corrections with Solexa/Illumina reads32 (70-fold genome coverage), which have a different bias in error type compared to 454 reads (Supplementary Methods). The genome sequence and annotation can be obtained and viewed at https://www.genoscope.cns.fr/trout/

Genome annotation

Repeated regions of the assembly (37.8%) were masked against: (i) a collection of 634 motifs that we characterized using RepeatMasker ( http://www.repeatmasker.org), (ii) low complexity sequences using DUST33, (iii) tandem repeats using Tandem Repeat Finder34, (iv) teleost repeats from RepBase35, and (v) simple repeats using Repeat Masker. In addition, we integrated predictions of repeated motif from RepeatScout36 in the final gene prediction models (Supplementary Methods).

To refine exon/intron junction locations, 305,000 teleost protein sequences from Uniprot37 and Ensembl38 were aligned on the genome sequence using the BLAT algorithm39 to first select the best match (plus matches greater than 0.8X best matches) and each matched protein was then realigned using Genewise40 on the same trout genomic region. 93% of these teleost proteins matched at 41,300 different genomic loci in the rainbow trout genome assembly.

For building gene models, rainbow trout GenBank mRNA sequences were aligned onto the genome assembly using BLAT39 and est2genome41 resulting in 93% of mapping of these 421,414 mRNA sequences. Only the best matches with at least 90% of nucleotide identity were kept. On average, similarity level was 97.8% and half of these alignments supported splicing evidence, with an average of 2.5 exons per mRNA. We also used publicly available rainbow trout Roche 454 EST sequences available in SRA (accession number SRX007396) that were assembled using Newbler, and aligned using blat and est2genome with the same setting used for mRNAs. A total of 97% of these cDNA contigs were mapped on the rainbow trout assembly at 45,600 different genomic loci. In addition, we generated Illumina reads of different tissue transcriptomes (see below) that were also used to predict exon/intron structure on the genome assembly using gMorse42. Using all these resources we predicted 69,676 transcripts with an average size of 4.8 Kb (median size of 2.1 Kb), and an average exon number of 6.7 (median=4). Overall, 7.7% of the assembly is targeted by a transcriptional signal.

Final gene models were built using Gaze43 leading to 55,735 gene models with an average of 6 exons per gene (median=4). At the genome level, coding bases cover 3% of the assembly. Because 3,088 exons were overlapping gaps in the assembly, we inserted in-frame introns to avoid a long stretch of N letters in the corresponding protein sequences. We also tagged 585 genes that still contained transposable elements despite repeated cleaning procedures. In summary, the final gene set can be categorized into 4 classes of decreasing confidence level: (i) 46,585 protein-coding gene models with supporting protein evidence from other vertebrates (Supplementary Table 7), (ii) 6,789 genes lacking protein evidence without any assembly gap and with a transcriptional signal deduced from cDNA, (iii) 1,451 genes lacking protein evidence, without any assembly gap, and without a transcriptional signal deduced from cDNA, and (iv) 890 genes lacking protein evidence which overlap assembly gaps.

Sequence anchoring using genetic and physical maps

Correspondence between linkage groups and chromosomes was done according to Phillips et al.44 A sequential use of data from linkage and physical maps was applied to anchor the sequence assembly to chromosomes. The first anchoring step was performed using markers from a consensus linkage map14. The sequence assignment was then expanded using BES information from the trout physical map45,46,47, and markers from a RAD based linkage map48. Using this linkage and physical map information, 4,413 sequences were assigned to 898 distinct loci on the genetic map and locally ordered representing a total sequence length of 1,023,288,475 bp, that is, 48% of the total assembly, and 54% of total length of the scaffolds (Supplementary Table 8 and Supplementary Methods).

Rainbow trout transposable elements

Annotation of transposable elements was done using BES of O. mykiss and Salmo salar, 60 completely sequenced rainbow trout BAC, cDNA Unigene library and the rainbow trout genome sequence. Classification of TEs was based on Wicker’s classification49. A database of TEs was built combining both manual and automatic annotation (Supplementary Methods). Transposable elements account for ∼38% of the rainbow trout genome sequence (Supplementary Table 9). To evaluate the age of TE copies, Kimura distances were calculated based on the alignment (consensus from the TE library versus copy in the genome) generated by RepeatMasker. The Kimura calculation uses the rates of transitions and transversions. Those rates are then transformed in Kimura distances using the formula K=−1/2 ln(1–2p–q)–1/4 ln(1–2q), where ‘p’ is the proportion of site with transitions and ‘q’ the proportion of site with transversions. Using Kimura distances, we estimated the relative age of the different TE families in the genome of the rainbow trout (Supplementary Fig. 5). It appears that two or three main bursts of transposition occurred in the genome. The most ancient one is mainly due to a high activity of Tc-Mariner families (Kimura value 41). In the second (around Kimura value 12), an increase of all families and particularly CR1 is highlighted. Finally, the last one (Kimura value 8) shows a new burst of Tc-Mariner activity.

Rainbow trout WGDs and comparative genome analyses

As a starting point for comparative genome analyses, we integrated predicted trout genes in vertebrate gene families based on Ensembl version 66 (February 2012)50. The 46,585 predicted trout proteins were compared against 13,264 gene families from 14 representative vertebrate species comprising mammals, birds and fish (Supplementary Fig. 6). Trout genes were included in 8,739 vertebrate gene trees (Supplementary Table 7). By comparison, other genes from other vertebrate genomes are included in 7,131 (takifugu) to 9,453 (Human) gene families, suggesting that annotated trout genes cover the vast majority of vertebrate gene families. A dedicated Genomicus server ( http://www.genomicus.biologie.ens.fr/genomicus-trout-01.01/) provides access to trout genes and their phylogenetic trees, as well as syntenic relationships with other genomes (Supplementary Fig. 7).

DCS blocks are defined as runs of genes in a non-salmonid (that is, non-duplicated by the Ss4R event) genome that are distributed on two different chromosomes (or non-anchored scaffolds) in the rainbow trout genome; the exact gene order does not need to be conserved. We systematically compared the gene locations in rainbow trout with those of medaka, stickleback, tetraodon and takifugu using ad-hoc scripts to identify pairs of regions in the rainbow trout genome that are syntenic with single regions in non-salmonid species, and that correspond to DCS blocks. Pairs of paralogous trout genes on two different chromosomes (or non-anchored scaffolds) that belong to a DCS block are most likely duplicates originating from the Ss4R WGD event and are called ohnologues; there were 6,733 pairs of ohnologues. Genes that are inserted in a DCS block based on synteny with a non-salmonid species, but have no paralogous gene on the other chromosome or scaffold, are most likely former Ss4R duplicates in which one of the duplicated genes was lost, and are called singletons. Each pair of duplicated regions within a DCS block is descended from a single ancestral region in the pre-duplication genome. The organization of these ancestral regions into an ancestral chromosome was deduced from the synteny relationships with non-salmonid genomes using a clustering method implemented in Walktrap51. The Ts3R-duplicated regions in the ancestral karyotype were obtained by orthology with the Ts3R-duplicated regions in the medaka genome, which were themselves deduced from the DCS blocks between the medaka and chicken genomes obtained as described above. DCS blocks can be very short, as they are dependent on assembly continuity and scaffold anchoring. Fine-scale analysis of duplicated regions and genes was restricted to 915 scaffolds that could be paired into 569 DCS blocks for at least part of their lengths, and that share at least 4 ohnologous genes. The longest scaffold in these DCS blocks is 5,466,130 bp long and the shortest is 25,207 bp long. These 915 scaffolds contain a total of 171 miRNAs and 13,352 genes (29% of the trout genome), of which 8,624 are ohnologues and 4,728 are singletons. These scaffolds were aligned using LastZ52, resulting in 85,050 local alignments with a mean identity of 86.7%.

To better understand the fate of inactivated gene copies, protein sequences predicted from a given gene model were also aligned to their paralogous region using exonerate53 with the ‘—model protein2genome’ option (Supplementary Methods). Rates of gene loss since the Ts3R WGD were calculated by linear extrapolation.

Rates of molecular evolution

Atlantic salmon (S. salar) coding mRNA sequences (12,062 sequences)54 were translated into protein sequences. Blastp reciprocal best hits between these salmon and trout proteins were aligned with MUSCLE55,56, and rates of silent substitution (dS values) of the corresponding coding sequences were calculated using the Yang and Nielsen method in PAML4.4 (ref. 57). Ohnologous gene sequences from fish genomes were obtained from Ensembl Treebest gene trees and DCS analyses. MUSCLE alignment of protein sequences followed by PAML4 analysis of the coding sequences (CDS) was used to compute the dS and dN values for pairs of ohnologous sequences originating from the Ts3R WGD in stickleback, tetraodon, medaka and zebrafish, and for trout Ss4R ohnologues. Trout Ts3R ohnologues represent a special case, because each copy (for example, A and B) was further duplicated by the Ss4R, and thus may be represented by one (if the other Ss4R ohnologue was lost) or two sequences (for example, A1, A2 and B1, B2). A given Ss4R ohnologue was always aligned separately to each Ss4R duplicate copy stemming from the Ts3R (for example, A1 to B1 and B2), when they existed. Alignments were then concatenated to compute dS values, which thus represents an average dS (resp. dN) value for the Ts3R duplication for a given family of ohnologues. When Ss4R ohnologues existed in more than two copies, because of subsequent local duplication (for example, A1, A2, A3), we aligned each possible combination of pairs using MUSCLE55,56 (for example, A1–B1, A2–B1, A3–B1, etc.) and then concatenated alignments as before (for example, A1–B1 with A2–B1), in all possible combinations of two concatenated alignments, each leading to a dS (resp. dN) value. The smallest dS value among all alignments was considered the most conservative and retained for further analysis, together with the corresponding dN. The rate of selective constraints on orthologues and ohnologues was calculated with PAML4.4 (ω=dN/dS) using the method of Yang and Nielsen58. A linear extrapolation from the dS comparison was used to infer the timing of the Ss4R.

Transcriptome sequencing and data analysis

Tissues for transcriptome analyses were obtained from a homozygous clonal 1-year-old female sampled 3 weeks after spawning. These doubled haploid females were first produced after gynogenetic reproduction of standard females plus inhibition of first embryonic cleavage59, and further reproduced by a second round of gynogenesis (inhibition of second meiosis)60. Homozygous clonal lines were further maintained during every generation by single within-line pair mating between one female and one hormonally sex-reversed male. Tissues (liver, brain, heart, skin, ovary, white and red muscle, anterior and posterior kidney, pituitary gland, stomach, gills) were collected and stored in liquid nitrogen until RNA extraction. Total RNA was extracted using Tri-reagent (Sigma, St-Louis, USA) at a ratio of 100 mg of tissue per ml of reagent according to the manufacturer’s instructions. RNA-Seq Illumina Libraries were prepared (Supplementary Methods) and sequenced using 101 base-lengths read chemistry on an Illumina GAIIx sequencer (Illumina, USA). In order to compare the expression levels of ohnologous genes, we restricted the analysis to the parts of the coding regions that can be confidently aligned using MUSCLE56 between the two genes, as non-alignable or low-quality alignment regions may result from errors in the automatic annotation process. We retained regions of the alignment where the majority of codons contain at most 1 nucleotide change, and masked all other codons with Ns. We mapped RNA-seq reads to these alignable regions using BWA61 with stringent mapping parameters (maximum number of mismatches allowed –aln 2). Mapped reads were counted using SAMtools62, with a minimum alignment quality value (–q 30) to discard ambiguous mapping reads. The numbers of mapped reads were then normalized for each gene across all tissues using DESeq63. As the alignable regions of both ohnologues are of the same length by construction, no additional normalization for length was necessary to compare expression levels within each ohnologue pair. Correlations between the expression levels of ohnologues were performed using Pearson’s correlation and paired Student’s t-tests in R on log2-transformed data. Log2-transformed expression profiles of rainbow Ss4R ohnologues were also analysed using supervised clustering methods. Hierarchical clustering was processed using centroid linkage clustering with Pearson’s uncentred correlation as similarity metric on data that were normalized and median-centred using the Cluster program64. Expression levels were normalized and centred independently for each Ss4R ohnologue pair to compare expression profiles (Fig. 4a) and normalized and centred across both ohnologues to highlight differences in relative levels of expression between both ohnologous genes (Fig. 4b). Results (colorized matrix) of hierarchical clustering analyses were visualized using the Java TreeView program65.

miRnome sequencing and data analysis

miRNA sequencing was performed from several adult tissues (brain, muscle, gills, intestine, heart, liver, pituitary, skin, leucocytes, kidney, reproductive tissue, intestine, stomach and spleen) and whole embryos (NCBI BioProject ID No. PRJNA227065). Total RNA was extracted as described for transcriptome sequencing. Because of the high egg yolk content of vitellogenic ovaries, ovarian samples were subsequently purified using a NucleoSpin miRNA kit (Macherey Nagel, Germany). RNA integrity was checked using an RNA 6000 Nano chip (Agilent). Small RNA libraries were constructed according to the Illumina Small RNA v1.5 sample preparation guide (Supplementary Methods) and were sequenced with a 36-bp chemistry on an Illumina HiSeq-2000 sequencer. A total of 3,484,155,614 reads were generated from the 38 sequenced libraries. Low-quality sequences were filtered and adaptors removed from raw small RNA sequence data using the FastQC and Cutadapt programs66, respectively. Intra- and inter-condition redundancy was eliminated and annotations of known miRNAs were performed as follows: reads were blasted to miRBase database67, Rfam database68, rRNA-silva database69 and tRNA database70. Reads with hits in miRBase and Rfam but not in rRNA or tRNA database were kept as potential miRNAs. We only kept reads with more than 1,000 hits (among all conditions) to build strong loci. Reads were subsequently aligned to the trout genome with the following filters: exact match or maximum 95% suboptimal best hit with a maximum of 15 localizations within the genome sequence. miRNA-5p/miRNA-3p loci were built as follows: sequences belong to the same locus if there is no base difference among sequences and if miRNA-5p and miRNA-3p are at least 30 nucleotides distant from each other. This allowed the identification of 495 miRNA loci corresponding to 84 different families and 164 mature sequences (Supplementary Data 1).

Gene ontology analyses

GO analyses were performed in two steps. Statistically enriched functional annotations were obtained in the sample set using a random sampling procedure (10,000 iterations, custom Perl script) with corrected false discovery rate for multiple testing (Benjamini–Hochberg FDR correction, with a 10% FDR threshold). The exact enrichment P-values for GO terms detected as significant through the random sampling procedure were then calculated using Fisher’s exact test in R (Supplementary Methods).