Genome assembly statistics

Genome assembly statistics are summarized in Table 3. Reference-free analysis of the quality filtered data using the preqc tool (v. 0.10.13) [26] gave us an estimate of the genome size based on k-mer word frequency of 3.44 Gb, within the range reported size of other chondrichthyans [49, 50]. The assembly consisted of 1,213,000 contigs and 997,976 scaffolds, a contig N50 of 5304 bp, and a scaffold N50 of 5425 bp. We estimated that we had approximately 30-fold redundancy in coverage of the genome. The DNA composition of the assembled contigs was 41.3% G + C. The rather low N50 compared to other recent vertebrate genome projects suggests that the assembly could benefit from more mate-pair and long read sequences, as well as deeper coverage of Illumina sequence to help correct sequence. The assembly incorporated an Illumina mate-pair library of approximately 3 kb. Attempts to construct larger insert mate-pair libraries resulted in failure.

Table 3 Genome and predicted protein statistics. Percentages of total genome size calculated as proportion of assembly size rather than estimated genome size Full size table

Sequence contamination is an issue that has bedeviled whole-genome sequencing projects [51]. We therefore expected to see non-whale shark DNA originating from carryover from previous Illumina runs, and contamination from extrinsic laboratory sources during tissue preparation, the latter especially since the R. typus diet may contain unusually high levels of bacteria [52]. To determine the approximate extent of contamination, we used BLAST to compare the assembly to the highly conserved bacterial 16S gene and found only four contigs with low sequence coverage (5–7 fold redundancy) had greater than 75% matches to the whole gene. Therefore, we concluded that bacterial contamination was present but not a major factor in this project.

Immediately prior to the public release of these data (December 2014) there were only 110 nucleotide sequences in the NCBI database assigned a R. typus taxonomic origin. 109/110 of these sequences could be mapped to the contigs from this project with a threshold match significance BLAST score of 10−5 or lower. The one sequence that did not match was a putative recombination activating protein 2 ortholog (NCBI gid:315,571,864) that turned out to have best matches only to other bony fishes and thus may have been misidentified in its origin.

Predicted proteins

Use of the AUGUSTUS software [27] for de novo gene prediction resulted in 19,384 protein-coding genes predicted on the assembled contigs (available on Figshare [53]). While the largest predicted protein was 4709 amino acids in length, the majority of the proteins were less than 200 amino acids (Fig. 2). Of the predicted proteins, 14,736 (76%) of the proteins had a blastp match in the NCBI nr database. More than 99% of the protein best matches were to eukaryotes (Fig. 3), providing further evidence that prokaryotic contamination in the project was limited. Within the eukaryotes, 82% of the matches were to Chordata, with other fish species that have completed genomes as predominant matches (Fig. 4). The genome with the greater number of best matches (34% of Chordata) was the elephant shark. These results were therefore in line with what would be expected of a novel chondrichthyan genome sequence. Of the predicted proteins, 7038 (36.3%) of the proteins had a blastp match in the KOG database (Table 4).

Fig. 2 Histogram of predicted protein sizes Full size image

Fig. 3 Overview of taxonomy of whale shark protein best matches to the nr database. Figure was constructed from best BLAST matches to the nr database using Krona [31] tool Full size image

Fig. 4 Overview of best matches to the protein database that map to the Chordata taxonomy group Full size image

Table 4 Number of genes associated with general KOG functional categories. Percentages of genes is based on the total number of predicted proteins Full size table

Ortholog analysis

From comparisons of the whale shark genome with ten other fish genomes, we found that there was a ‘core’ set of 1846 ortholog groups with at least one protein member present in each of the eleven genomes, representing a set of highly conserved functions. Of these genes, 155 orthologs were present with exactly one protein member in each of the groups. The phylogeny based on concatenation of these core genes recapitulated the established evolutionary relationship of the species: the cartilaginous fishes R. typus and C. milii form a deep clade as the sister clade to bony fishes (Fig. 5).

Fig. 5 Phylogeny based on alignment of conserved single-copy proteins. Silhouettes are not to scale. Accessions: Petromyzon: GCA_000148955.1, Callorhinchus: GCA_000165045.2, Latimeria: GCA_000225785.1, Danio: GCA_000002035.3, Gadus: GCA_000231765.1, Gasterosteus: GCA_000180675.1, Oryzias: version MEDAKA1 (Ensembl), Oreochromis: GCA_000188235.1, Takifugu: GCA_000180615.2, Tetraodon: GCA_000180735.1. Silhouette credits: Petromyzon by Gareth Monger, CC-BY; Callorhinchus by Tony Ayling, CC-BY-SA; Rhincodon by Scarlet23, vectorized by T. Michael Keesey, CC-BY-SA; Latimeria by Maija Karala, CC-BY-NC-SA; Gadus, Oreochromis, Tetraodon, Gasterosteus by Milton Tan; Danio, Oryzias, Takifugu, no copyright Full size image

The ortholog analysis revealed that there were 865 protein families present in the other genomes that were missing in the whale shark. This number was of the same order as the outgroup lamprey genome (764 missing orthologs) and higher than that seen in the other fishes (the elephant shark genome had only 108 missing protein sequences). Further, there were 543 proteins missing from both the whale shark and lamprey but represented in all the other nine genomes. These absent proteins could be explained by some combination of the draft nature of the sequence data in this project, the preliminary de novo annotation, or the evolutionary divergence of the whale shark and lamprey from the other species. We mapped the orthologs of the missing proteins in the well-annotated zebrafish genome and tested for enrichment of terms in the Gene Ontology or Kyoto Encyclopedia of Genes and Genomes databases using the WebGestalt GSAT analysis tool [48]. We found no specifically enriched terms or pathways in the missing protein set compared to the entire zebrafish proteome. This suggested that the absent genes were not overrepresented in any particular functional category, as might have happened through adaptive gene deletion.

The remaining 4648 predicted proteins with no nr database match tended to be short (mean of 126.5 amino acids, compared to 179 for the protein dataset as a whole), suggesting that many were annotation overcalls, or fragments of proteins disrupted by contig gaps. Several of these proteins are large enough that they are unlikely to be the result of spurious translation (9 were >500 amino acids in length, the largest 1352 amino acids). These could represent novel chondrichthyan genes, although it is also possible that many of the proteins without a best match could be uncultivated microorganisms.

Preliminary comparisons

The only other cartilaginous fish for which a complete genome has been assembled is the elephant shark C. milii [18, 20, 23], which is not an elasmobranch but a member of the Holocephali (also known as ratfishes). There are striking differences between the genomes, most obviously in size. The whale shark genome, at 3.44 Gb, is approximately 3.5× the size of the elephant shark genome at only 950 Mb. The genomes were also diverged at the DNA level. In a discontiguous megablast alignment between the C. milii and whale shark scaffolds, the combined length of matches with an E value of <0.001 was only 42 Mb of the elephant shark genome (71% nucleotide identity). In addition, based on our phylogenetic analysis, the number of estimated substitutions is higher in whale shark than in elephant shark.

Comparisons of cartilaginous fishes such as C. milii and R. typus to other vertebrates can provide some insight into the evolution of jawed vertebrates. Some of the features of the protein set of R. typus recapitulated discoveries made in C. milii. For example, homologs of the human SCP and SIBLING proline-glutamine families of bone-deposition proteins were missing from the whale shark genome based on negative results of BLASTX alignment against the scaffolds, a result also seen in the other cartilaginous fishes [18]. C. milii is reported to have a pseudogenized copy of the important innate immunity protein Toll-like receptor 4 (TLR4), which detects lipopolysaccharide of infecting Gram negative bacteria [18]. We found that the human TLR4 protein had a significant match (BLASTP 1e-45) to a 925 residue protein containing multiple leucine-rich repeat domains and a C-terminal TIR domain (Toll/Interleukin receptor) of the nucleotide-binding TLR2 superfamily. BLAST of this sequence to nr found best hits of this TLR protein were to TLR21 and TLR13. Neither TLR13 nor TLR21 have been previously described in chondrichthyans, with representative taxa including amphibians, mammals, birds, and teleosts [54]. TLR13 and TLR21 have been previously found to be similar, and form a clade within the other TLRs [54]. This whale shark TLR may represent an ancient homolog of these TLRs, and demonstrates these TLRs may have originated in the most recent common ancestor of jawed vertebrates. The whale shark genome will be useful for comparative studies of the origins of jawed vertebrate genes, such as these TLRs.