Genome assembly and annotation

A total of 98.22 Gb raw reads were generated by sequencing eight paired-end libraries with insert sizes ranging from 170 to 40 kb (Additional file 1: Tables S1 and S2) using the Illumina HiSeq 2000 platform, resulting in ~96X coverage of the k-mer estimated genome size of sunfish (see Methods and Additional file 1: Tables S2 and S3). The reads were assembled using SOAPdenovo [8] to generate an assembly spanning 642 Mb of an estimated 730 Mb genome size (see Methods), with a contig N50 length of 20 kb and a scaffold N50 length of 9 Mb (see Additional file 1: Tables S3 and S4). The sunfish genome comprises approximately 11 % repetitive sequences (transposable elements, tandem repeats and simple-sequence repeats; see Additional file 1: Tables S5 and S6), which is comparable to the repeat content of the fugu genome (Fig. 1 and Additional file 1: Tables S5 and S6). Using homology-based and de novo annotation methods, we predicted 19,605 protein-coding genes in the sunfish assembly (Fig. 1; see Methods). Around 95 % of the predicted sunfish protein sequences show similarity to protein sequences in public databases. Using a genome-wide set of 1690 one-to-one ray-finned fish orthologues (identified using a combination Ensembl Biomart data and InParanoid analysis) in sunfish and seven other ray-finned fishes (fugu, Tetraodon, stickleback, medaka, tilapia, zebrafish and spotted gar), we reconstructed a phylogenetic tree and estimated the divergence times of various fish lineages using MCMCtree [9] (see Methods). Our analysis (Fig. 1) suggests that sunfish (Molidae) and pufferfishes (Tetraodontidae) separated approximately 68 million years ago (mya; confidence interval 60.8 to 80.8 mya), which corroborates the results of other recent studies that are based on smaller datasets [10, 11].

Fig. 1 Divergence times and genome statistics of representative ray-finned fishes. a Divergence times of representative ray-finned fishes estimated using the topology obtained from the phylogenomic analysis (see Methods). The blue bars on ancestral nodes indicate the 95 % confidence intervals of divergence time estimates (MYA, million years ago). Numbers on each node represent bootstrap support values. b Genome statistics and (c) distribution of different types of orthologues in representative ray-finned fishes. The repetitive content of sunfish, fugu, Tetraodon, medaka, zebrafish, tilapia and stickleback were estimated in the present study (see Methods) whereas that for spotted gar is from Braasch et al. [79] Full size image

Population size history

We identified approximately 489,800 heterozygous single nucleotide polymorphisms (SNPs) in the genome assembly of the sunfish and estimated the heterozygosity to be 0.78 × 10−3, which is lower than other marine fishes such as Atlantic cod (2.09 × 10−3) and stickleback (1.43 × 10−3) [12]. Based on the identified heterozygous sites, we ran the pairwise sequentially Markovian coalescent (PSMC) model [13] to infer the historical changes in the effective population size (N e ) of sunfish. The PSMC analysis suggests that there was an increasing trend of N e from ~3 to ~0.9 mya (Additional file 1: Figure S1). Around 2.15 mya, a large asteroid (more than 1 km in diameter) is thought to have fallen into the Southern Ocean in the Eltanin Fault zone and generated a super-tsunami that resulted in a large-scale marine extinction [14]. This event might have released more habitats that enabled sunfish to expand its population size. Around 0.9 mya, the N e stopped expanding and began to decline slightly, which could be related to the mid-Pleistocene climate transition (MPT, ~1.2-0.55 mya, Additional file 1: Figure S1) [15]. The MPT period was accompanied by the extinction of many marine species such as Stilostomellidae and Pleurostomellidae [16]. We also found a N e peak around 150 thousand years ago (kya), followed by a rapid decrease of N e . However, the bootstrap support for these estimates is rather weak (Additional file 1: Figure S1).

Positively selected and fast-evolving genes

Using a set of 10,660 one-to-one teleost homologues (determined by reciprocal best BLASTP hit with an E-value cutoff of 1e-5) from five teleost species (sunfish, fugu, Tetraodon, medaka and zebrafish), we conducted positive selection analyses (see Methods for details). We identified a set of 1067 genes that are evolving noticeably faster in the sunfish lineage compared with other branches (branch model, Additional file 2). In addition, using the branch-site model, we identified 1117 genes that contain positively selected sites specifically in sunfish (Additional file 3). We examined genes involved in the growth pathway and found several fast-evolving genes and genes containing positively selected sites in the GH/IGF1 axis. Previous studies have shown that this axis has a crucial role in regulating the growth of the fish body [17, 18]. Given the massive body size and extraordinary growth rate of the sunfish, we analyzed these growth-related genes in more detail. The GH/IGF1 axis comprises several components - the insulin-like growth factors (IGFs), IGF receptors (IGFRs), IGF-binding proteins (IGFBPs), growth hormone (GH), growth hormone receptor (GHR) and the insulin receptor (INSR). GH is released by the pituitary whereas IGF-1 is released by the liver as a result of GH stimulation. GH and IGF-1 exert their effects through GHR, IGFR and INSR to modify cell growth and proliferation (Fig. 2). The overall result is increased growth and decreased differentiation or suppression of apoptosis [19, 20]. Using the branch models in Phylogenetic Analysis by Maximum Likelihood (PAML) [21], we found multiple genes that are evolving at a different rate from the rest of the tree. Among these, several genes in the GH/IGF1 axis (ghr1, igf1ra, ifg1rb, grb2, akt3, irs2a and jak2a) were found to be evolving rapidly in the sunfish lineage (dN/dS values higher than the background) (Fig. 2 and Additional file 1: Table S7).

Fig. 2 Fast-evolving and positively selected genes in the GH/IGF1 axis. a Schematic representation of GH/IGF-1 signalling, adapted from [80]. Arrows denote the direction of signal transduction, whereas the grey ellipse represents an enzyme or a cytokine. Purple stars indicate genes exhibiting elevated dN/dS (fast evolution), whereas the yellow stars indicate positive selection. b 3D structure of IGF1Ra and (c) IGF1Rb monomers as predicted by the SWISS-MODEL Workspace [81]. The bar above the 3D structure indicates the structural domains. Red lines in the bar and the red atom balls in the 3D structure represent the positively selected sites. Domains in the 3D structure are coloured according to the colour scheme on the bar. The pink-surface model in the centre of the IGF1R structure is IGF-1. Akt, protein kinase B; CR, furin-like cysteine rich region (PF00757); FnIII, fibronectin type III domain (PF00041); GH, growth hormone; GHR, growth hormone receptor; GRB2, Growth factor receptor-bound protein 2; IGF-1, Insulin-like growth factor 1; IGF1R, Insulin-like growth factor 1 receptor; INSR, Insulin receptor; IRS, Insulin receptor substrate; JAK2, Janus kinase 2; MAPK, Mitogen-activated protein kinase; PI3K, phosphoinositide 3-kinase; RL, receptor L domain (PF01030); SHC1, SHC-transforming protein 1; STAT5, signal transducer and activator of transcription 5 Full size image

We found that both copies of igf1r (igf1ra and igf1rb) contain positively selected sites in the sunfish (Fig. 2 and Additional file 1: Table S7). Interestingly, some of these positively selected sites are located within functional domains of these receptors. For example, positively selected sites within Igf1ra were located within the fibronectin type III domain (FnIII, Pfam identifier PF00041) and receptor L domain (RL, PF01030) (Fig. 2). Previous studies have shown that the FnIII domain is important for ligand binding [22, 23], thus mutations in this domain could change the affinity between IGF1R and its ligand. Although the RL domain does not directly bind ligands, many of the determinants responsible for hormone binding and ligand specificity map to this central site [24]. Conversely, Igf1rb contains three positively selected sites located within the furin-like cysteine-rich region (CR, PF00757) and two sites in the protein tyrosine kinase domain (TK, PF07714) (Fig. 2). These two domains are involved in receptor aggregation [25] and other functions such as enzyme activity, subcellular localization and interactions with other molecules [26]. Interestingly, mutations in the ligand-binding domain and the RL domain of IGF1R can result in growth retardation in humans [27]. Thus, positive selection of sites within these domains of Igf1ra and Igf1rb in the sunfish may have enhanced the function of these genes. We also found evidence for positively selected sites in the insulin receptor gene (insr), which is known to bind to IGF-1 and promote growth [28] (Fig. 2). Mutations in human INSR are known to cause Donohue syndrome, which is characterized by stunted growth [29]. A positively selected site (R457) in the sunfish Insr maps to a mutation (K487E) in the human INSR associated with Donohue syndrome.

Another interesting set of fast-evolving genes/genes with positively selected sites are those related to the ECM. The ECM provides the microenvironment of the cell as well as bulk, shape and strength to tissues such as bone and cartilage. In addition, the ECM also contains components required for the conversion of cartilage to bone and its homeostasis [30, 31]. We found a number of genes related to the ECM that exhibit fast evolution or positive selection (Additional file 1: Table S8). The gene COL2A1 encoding type II collagen, which normally represents approximately 80–90 % of the collagen content of the cartilage matrix [32] (Fig. 3), is present in two copies (col2a1a and col2a1b) and contains positively selected sites in the sunfish (Additional file 1: Table S8). Several ECM-related genes (col11a1a, col11a2, bmp1b, fkbp10b, lepre1, serpinf1 and sp7) also exhibit elevated dN/dS values (Additional file 1: Table S8) but as there are no signs of positive selection, these genes might be under relaxed selection.

Fig. 3 Genes related to bone and cartilage. Schematic diagram showing the extracellular matrix of cartilage (adapted from [82]). The figure illustrates collagens (mostly type II collagen), proteoglycans (primarily aggrecan), and other non-collagenous proteins including link protein (yellow circles) and fibronectin. Stars denote fast evolution or positive selection. COMP, cartilage oligomeric matrix protein Full size image

Genes involved in bone formation

In contrast to other bony fishes (Osteichthyes), the endoskeleton of the sunfish is mainly composed of cartilage [5]. The genetic mechanism underlying this derived phenotype is not known. To look for clues to this mechanism, we analyzed genes known to be involved in bone formation such as those encoding proteoglycans, the bone morphogenetic protein (BMP) signalling pathway, transcription factors, bone differentiation and secretory calcium-binding phosphoproteins (SCPP). However, the sunfish possesses intact orthologues for most of these genes except for some SCPP genes (see Additional file 4). The details of the searches and the genes identified are given below.

Proteoglycan-encoding genes

Genes encoding proteoglycans are important regulators of cartilage and bone formation. We searched for the small leucine-rich proteoglycan (SLRP) gene family clusters: (a) Fmod, Prelp, Optc; (b) Ecm2, Aspn, Omd, Ogn; (c) Dcn, Lum, Kera, Epyc; and (d) Ecm2-like and Bgn. We first ran BLASTP (with default settings) of the human, zebrafish and/or fugu reference proteins against the annotated sunfish proteins. For genes that could not be identified using this method, we proceeded to a TBLASTN of the human and fish proteins against the sunfish genome assembly followed by a BLASTX of the resulting sunfish genomic loci against the NCBI non-redundant (NR) protein database. Using this strategy, we identified orthologues for all the above genes in the sunfish genome on (a) scaffold10.1, (b) scaffold39.1, (c) scaffold20.1, and (d) scaffold77.1, except Optc and Omd. In addition, we identified second copies of Ogn and Fmod located within nine genes of each other on scaffold13.1 (Additional file 5). We BLASTX-searched (with default settings) the sunfish loci of (a) and (b) against the NCBI NR protein database to identify Optc and Omd respectively, but did not identify these genes. Optc and Omd are present in zebrafish and cavefish but are absent in the Percomorphaceae fishes such as fugu, Tetraodon, tilapia and platyfish. Thus, they may not be responsible for the cartilaginous skeleton of the sunfish. We also searched for the lectican-hyaluronan- and proteoglycan-binding link protein (HAPLN) gene family clusters (Hapln2 and Bcan; Vcan and Hapln1; Acan and Hapln3; Ncan and Hapln4) and other non-clustered proteoglycans (Fn1, Lepre1, Tuft1, Podn). Orthologues for all these genes are present in the sunfish (Additional file 4).

The BMP signalling pathway

We identified homologues for Bmp2, Bmp4, Bmp5, Bmp6 and Bmp7; the receptors Bmpr1a, Bmpr1b, and Bmpr2; the Smad family genes (Smad1, Smad4, Smad5, Smad6, Smad7) and the Smurf family genes (Smurf1, Smurf2) in the sunfish genome. For Smad4, we identified up to four copies in the sunfish (Additional file 4).

Transcription factors

We identified homologues for Sp7/osterix, Bapx1/Nkx3-2, Pdlim7, Mitf, Nfatc1, Msx1, Msx2 (the zebrafish homologue is known as msxd), Hand1 and Hand2 in the sunfish genome (Additional file 4). Runx2 is a key transcription factor involved in bone formation. Knockout of Runx2 in the mouse results in the formation of cartilaginous skeleton [33]. The sunfish Runx2 gene is located on scaffold14.1:1,387,125..1,422,518 (SUNFISH_GLEAN_10009694, see Additional file 4; the first 45 amino acids were removed because they are not present in other fish Runx2 proteins). An alignment of Runx2 proteins shows that the sunfish Runx2 is highly conserved (e.g. its DNA-binding domain is perfectly conserved and its central and C-terminal domains also look intact) (Additional file 1: Figure S2).

Bone differentiation genes

We searched for genes responsible for bone differentiation in the sunfish genome. We identified leptin and leptin receptor, Bglap/osteocalcin, Fam20c, Bmp1, osteocrin, osteopotentia homologue, sclerostin, Phospho1 and Phospho2. We did not find Mmp1 in sunfish. Mmp1 may have arisen through tandem duplications in tetrapods based on our observations from the Genomicus database (Ensembl release 76). For osteocrin, we identified a partial prediction (one exon only) on scaffold56.1:442 kb between the genes Gmnc and Iws1, which are linked to osteocrin in the tilapia genome (on LG14). Thus, sunfish seems to have all the important genes involved in bone differentiation in teleost fishes (Additional file 4).

SCPP gene family

The SCPP gene family encodes secretory calcium-binding phosphoproteins that participate in the mineralization of collagenous bone and dentin as well as noncollagenous enamel. The SCPP genes originated from tandem duplication of the secreted protein acidic and rich in cysteine (SPARC)-like 1 gene (Sparcl1) that was itself derived from SPARC around the time a mineralized skeleton arose in vertebrates [34]. SCPP genes encode ECM proteins and are divided into two categories: acidic and proline/glutamine-rich (P/Q-rich). Whereas acidic SCPP genes are involved in the mineralization of collagenous bone and dentin, P/Q-rich SCPP genes participate in the deposition of non-collagenous enamel [34]. The elephant shark, which is a cartilaginous fish (Chondrichthyes) and lacks endochondral bone, does not contain any SCPP genes [35], whereas bony vertebrates contain both categories of SCPP genes although the complement of each category varies between lineages due to lineage-specific expansion and losses [36].

We searched for SCPP genes in the sunfish genome to understand the genetic basis of the cartilaginous skeleton of the sunfish. We filled a sequencing gap in this intergenic region by sequencing a genomic PCR product to obtain the complete sequence for spp1. The orthology of the P/Q-rich SCPP genes in sunfish was verified by generating a Maximum Likelihood phylogenetic tree using sequences from sunfish, fugu, medaka and zebrafish (see Additional file 1: Figure S3). Sunfish contains two acidic SCPP genes (spp1 and scpp1) similar to fugu and zebrafish. However, it has lost two P/Q-rich SCPP genes (fa93e10 and scpp7) that are conserved in the other two teleosts (Fig. 4 and Additional file 4). This conclusion was reached after searching the genomic vicinity of the sunfish Sparcl1, the entire genome assembly and the raw reads of the sunfish genome using zebrafish fa93e10 and scpp7 protein sequences by TBLASTN. fa93e10 was first identified as an expressed sequence tag (EST) clone as part of a screen for genes that are expressed in regenerating zebrafish caudal fins [37]. Whole-mount in situ hybridization experiments in zebrafish suggested that fa93e10 is a growth marker that identifies cycles of growth in fin ray segments [37]. Its frequency of expression in fin rays decreases with the age of the fish in tandem with decreased distal mesenchymal cell proliferation [37, 38]. It is not clear whether the absence of fa93e10 in sunfish is somehow related to its unusual fin morphology. In addition to the complete loss of fa93e10 and scpp7, another P/Q-rich SCPP gene, scpp4, that is intact in fugu and medaka has become a pseudogene in the sunfish due to a single nucleotide insertion. This insertion was confirmed by PCR and sequencing of genomic DNA from two other unrelated specimens of sunfish (GenBank accession numbers KF737069 and KF737070), providing further evidence that scpp4 became nonfunctional in the sunfish lineage after it split from the pufferfish lineage (Fig. 4 and Additional files 1 and 4). However, the functional consequence of the loss of this gene in sunfish is not known. Zebrafish does not contain an orthologue of scpp4 but instead contains several other lineage-specific P/Q-rich SCPP genes (Fig. 4 and Additional file 4). Thus, the genetic basis of the cartilaginous skeleton in the sunfish remains unclear. It is possible that this distinctive phenotype is driven by a regulatory change and is therefore not evident in the bone gene repertoire.

Fig. 4 Scpp genes in the ocean sunfish. Upper panel: Comparison of the sunfish sparcl1 locus with those of zebrafish and fugu showing the missing Scpp genes in sunfish (fa93e10, scpp7 and scpp4). The scpp4 pseudogene in the ocean sunfish is shown as a dotted arrow. Lower panel: Alignment of sunfish, fugu and medaka scpp4 sequences showing the single base insertion in exon 2 of this gene in sunfish resulting in a premature termination codon followed by a frameshift in the rest of the open reading frame. This insertion was confirmed by PCR and sequencing of genomic DNA from two other specimens (GenBank accession numbers KF737069 and KF737070). The termination codon in sunfish is underlined in red. The accession numbers for fugu and medaka scpp4 genes used in the alignment are DQ066525.1 and XM_004065875.2, respectively. chr, chromosome; scaf, scaffold Full size image

Hox genes

Hox genes specify segmental identities along the anterior-posterior axis of developing vertebrate embryos and are also important for the anterio-posterior and proximal-distal patterning of limbs [39, 40]. Loss of Hox gene function can lead to morphological changes [41, 42]. Therefore, we hypothesized that loss of Hox genes could contribute to phenotypic evolution. To determine whether sunfish has uniquely lost any Hox gene(s), we analyzed the Hox gene complement in the sunfish. The sunfish possesses seven Hox gene clusters similar to fugu but contains more Hox genes (47 genes in sunfish compared with 45 in fugu) (Additional file 1: Figure S4). It possesses an intact hoxa7a and hoxb7a that are pseudogenes in fugu, making fugu the only known teleost completely lacking a Hox7 paralogous group member. In addition, sunfish has an intact hoxc1a, which is a pseudogene in fugu but is completely lost in medaka (Additional file 1: Figure S4). Conversely, sunfish lacks hoxd11b, which is present in fugu. However, this gene is also lost in fishes such as medaka and the East African cichlid (Additional file 1: Figure S4) [43]. Thus, it is unclear whether the loss of this gene has any effect on the overall morphology of sunfish. As such, the unusual morphology of the sunfish does not seem to be related to the loss of any specific clustered Hox gene. However, we cannot rule out the possibility of altered expression pattern(s) of Hox genes contributing to this unusual phenotype. In fact, the loss of the pelvic fin in fugu has been shown to be associated with the altered expression pattern of Hoxd9a [44].