Sample collection and DNA extraction.

DNA was collected from four species of marine mammals: a killer whale (O. orca), a bottlenose dolphin (T. truncatus), a Pacific walrus (O. rosmarus divergens) and a Florida subspecies of the West Indian manatee (T. manatus latirostris). The female killer whale 'Morgan' stranded on the coast of the Netherlands and was then transferred to the Harderwijk Dolfinarium. A comparison of Morgan's mitochondrial DNA sequence and learned vocal repertoire with a North Atlantic database indicated that she originated from the population of killer whales that forage primarily on the Norwegian spring-spawning stock of Atlantic herring, Clupea harengus21. A 10-ml sample of whole blood was taken and immediately stored in a PAXgene Blood DNA tube and PAXgene Blood RNA tube for DNA and RNA extraction, respectively. Additional biopsy samples from five killer whales feeding on Atlantic herring off the coast of Norway were collected and stored immediately in the preservative RNAlater. RNA was extracted and pooled from homogenized skin biopsies of the five free-ranging killer whales using the Qiagen RNeasy Mini kit and following the manufacturer's guidelines. Blood samples were similarly taken from two walruses from Harderwijk Dolfinarium: from an Alaskan male ('Igor'), with the sample immediately stored in a PAXgene Blood DNA tube for whole-genome sequencing, and from a Wrangel Island female ('Natasja'), with the sample immediately stored in a PAXgene Blood RNA tube for RNA sequencing to aid in annotation of the genome. DNA and RNA were extracted from whole blood using the PAXgene Blood DNA kit and PAXgene Blood RNA kit, respectively, and following the manufacturer's guidelines. Bottlenose dolphin tissue samples were obtained at necropsy from dolphins in the US Navy Marine Mammal Program. Spleen, liver, kidney and skin samples were from female animals, and muscle was from a male animal. Samples were used for cDNA sequencing after preparation with standard methods. Finally, blood samples were collected and DNA was extracted following standard protocols from a female Florida manatee, 'Lorelei', born in captivity and sampled at the Homosassa Springs Wildlife State Park in Homosassa, Florida, USA.

DNA and RNA sequencing and assembly.

Whole-genome shotgun sequences were generated using an Illumina HiSeq platform from DNA libraries for the killer whale, walrus, manatee and bottlenose dolphin. The dolphin genome had previously undergone Sanger sequencing at 2× coverage; library and sequencing protocols have been described previously22. The dolphin assembly was produced by assembling the ∼2.5× Sanger sequencing data with the ∼3.5× Roche 454 FLX fragment data and the ∼30× Illumina HiSeq data. The Sanger sequencing and 454 data were combined with the Atlas assembler, and Atlas-Link and ATLAS GapFill were then used to add the Illumina data, improve the scaffolds and fill in gaps within the scaffolds.

De novo assemblies were produced using methods similar to those applied in the Assemblathon II comparison. An initial assembly was generated using ALLPATHS-LG with default parameters and MIN_CONTIG = 300 on all sequence data except the data for libraries with an insert size of 500 bp. The assembled scaffolds from the initial assembly were further extended using Atlas-Link on the basis of linking information provided by the libraries with insert sizes of 3 kb and 8 kb. ATLAS GapFill was then used to fill gaps within scaffolds by locally assembling the reads associated with each gap. For the killer whale and walrus, respectively, these reads were assembled into draft genomes with contig N50 sizes of 70.3 kb and 90.0 kb and scaffold N50 sizes of 12.7 Mb and 2.6 Mb (Supplementary Table 1). The assemblies of 2,249 Mb and 2,300 Mb each covered approximately 95% of the estimated 2,373 Mb (killer whale) and 2,400 Mb (walrus) of the genomes, respectively. The improved dolphin assembly contig N50 size was 11.9 kb, and the scaffold N50 size was 115 kb. The total assembled size of the genome was 2.33 Gb (2.55 Gb with gaps) and covered ∼95.3% of the genome.
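For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of how it is commonly computed from a list of contig or scaffold lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the assemblies described here.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downward until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

The same function applies to scaffolds; only the input list of lengths changes.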

Sequencing and assembly of the manatee varied slightly from the method used for the other marine mammals: the DNA from the manatee was sequenced to 90× total coverage by Illumina sequencing technology, comprising 45× coverage of libraries with a fragment size of 180 bp, 42× coverage of sheared jumping libraries with a fragment size of 3 kb, 2× coverage of sheared jumping libraries with a fragment size of 6–14 kb and 1× coverage of fosmid jumping libraries23. The sequence was then assembled using ALLPATHS-LG24. The draft assembly was 3.10 Gb in size and was composed of 2.77 Gb of sequence plus the gaps between contigs. The manatee genome assembly had a contig N50 size of 37.8 kb, a scaffold N50 size of 14.4 Mb and quality metrics comparable to those of other Illumina genome assemblies.
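The coverage and assembly-size figures for the manatee can be checked with simple arithmetic. The sketch below uses only values stated in the paragraph above; the variable names are illustrative.

```python
# Per-library coverage for the manatee, as reported in the text (in x-fold).
library_coverage = {
    "180 bp fragment libraries": 45,
    "3 kb sheared jumping libraries": 42,
    "6-14 kb sheared jumping libraries": 2,
    "fosmid jumping libraries": 1,
}
total_coverage = sum(library_coverage.values())

# Draft assembly size and sequence content, as reported in the text (in Gb).
assembly_gb = 3.10
sequence_gb = 2.77
gap_gb = round(assembly_gb - sequence_gb, 2)
gap_percent = round(100 * gap_gb / assembly_gb, 1)

print(f"total coverage: {total_coverage}x")
print(f"gaps: {gap_gb} Gb ({gap_percent}% of the draft assembly)")
```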

Annotation.

The NCBI eukaryotic genome annotation pipeline was used. The first step involved repeat identification and masking with WindowMasker25. Second, proteins, transcripts generated from the RNA sequencing experiments and ESTs, including previously identified sequences from the study organisms or closely related organisms from RefSeq26, were aligned to the genome assembly using BLAST. This step included a 'polishing' stage using the splice site–aware algorithm Splign27 to improve information about splice sites and exon boundaries. Protein and transcript alignments were passed to Gnomon, which uses a hidden Markov model (HMM) tool based on Genscan28 to extend predictions missing a start or stop codon or internal exon(s). Gnomon additionally creates ab initio gene predictions for regions with no evidence of alignment. The final set of annotated features comprised, in order of preference, (i) RefSeq transcripts or genomic sequences and (ii) Gnomon-predicted models. Each genome was additionally masked for repetitive elements using RepeatMasker. The proportion of each genome constituted by repetitive elements is shown in Supplementary Figure 3.

Ortholog identification and alignment.

The latest human (hg19), macaque (rheMac2), marmoset (calJac3), mouse (mm9), rat (rn4), alpaca (vicPac2), cow (bosTau7), dog (canFam2), elephant (loxAfr3), baboon (papAnu2) and opossum (monDom5; used as an outgroup) genome assemblies were obtained from the UCSC Genome Browser. Human-referenced whole-genome alignments were constructed from syntenic pairwise alignments with human ('syntenic nets') or reciprocal best alignments with human, depending on the quality of the assembly, using the UCSC MULTIZ alignment pipeline29,30.

A starting gene set was composed of the human RefSeq, UCSC Known Genes31 and VEGA32 annotations (downloaded from UCSC on 29 July 2013). Transcripts that lacked annotated coding regions (CDSs), that had CDSs of <100 bp in length or that had CDSs whose lengths were not multiples of three were discarded. These transcripts were grouped by same-stranded CDS overlap into genes (transcript clusters). All transcripts were mapped from human to each of the other mammalian species via syntenic alignments and then subjected to a series of filters designed to minimize the impact of annotation errors, sequence quality and changes in gene structure on subsequent analyses. Briefly, each human transcript was required (i) to map to the non-human genome via a single chain of sequence alignments including ≥80% of its CDS; (ii) after mapping to a non-human species, to have ≤10% of its CDS in sequencing gaps or low-quality sequence; (iii) to have no frameshift indels, unless they were compensated for within 15 bases; and (iv) to have no in-frame stop codons and to have all splice sites conserved. To allow for genes that were mostly conserved but whose start or stop codons had shifted, incomplete transcripts with ∼10% of the bases removed from the 5′ and 3′ ends of their CDSs were also considered. The final collection of ortholog sets was obtained by selecting, for each gene, the (complete or incomplete) transcript that successfully mapped to the largest number of marine mammals, with the number of other species used as a secondary criterion. In the case of a tie, the transcript with the greatest total CDS length was selected. This procedure resulted in the annotation of 16,878 genes with at least 2 non-human orthologs, with each gene present on average in ∼3.3 marine species and 4.8 other species (including human).
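The filtering and selection rules above can be summarized in a short sketch. This is not the pipeline used in the study; the class names, field names and function names are hypothetical, and frameshift compensation, stop codons and splice-site conservation are assumed to have been evaluated upstream and are passed in as precomputed values.

```python
from dataclasses import dataclass
from typing import Optional


def keeps_human_transcript(cds_length: Optional[int]) -> bool:
    """Initial filter: annotated CDS, >= 100 bp, length a multiple of three."""
    return cds_length is not None and cds_length >= 100 and cds_length % 3 == 0


@dataclass(frozen=True)
class Mapping:
    """Summary of one human transcript mapped to one non-human genome (hypothetical)."""
    aligned_fraction: float          # fraction of CDS in the single alignment chain
    gap_fraction: float              # fraction of CDS in gaps or low-quality sequence
    uncompensated_frameshifts: int   # frameshift indels not compensated within 15 bases
    inframe_stops: int
    splice_sites_conserved: bool


def passes_mapping_filters(m: Mapping) -> bool:
    """Apply criteria (i)-(iv) from the text to a single mapping."""
    return (
        m.aligned_fraction >= 0.80
        and m.gap_fraction <= 0.10
        and m.uncompensated_frameshifts == 0
        and m.inframe_stops == 0
        and m.splice_sites_conserved
    )


@dataclass(frozen=True)
class Candidate:
    """A transcript of one gene, with the species counts it mapped to (hypothetical)."""
    name: str
    marine_species: int
    other_species: int
    cds_length: int


def pick_representative(candidates):
    """Most marine mammals first, then most other species, then longest CDS."""
    return max(
        candidates,
        key=lambda c: (c.marine_species, c.other_species, c.cds_length),
    )
```

Ordering the selection criteria as a tuple keeps the tie-breaking explicit: a later criterion is consulted only when all earlier ones are equal.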

Testing for positive selection.

To find genes under positive selection, we applied four different branch-site likelihood ratio tests33: for the cetacean clade and branch leading to cetaceans, the walrus lineage, the manatee lineage and a single test for all branches involving the four marine mammals (foreground branches for individual tests are highlighted in Fig. 1). In all tests, we used the reduced parameterization introduced by Kosiol et al.34. P values were estimated assuming a null distribution constituting a 50:50 mixture of a χ² distribution and a point mass at zero, leading to conservative P-value estimates35. The Benjamini-Hochberg method36 was used to correct for multiple testing, and a false discovery rate (FDR) cutoff of 0.1 was used. A comprehensive table of all genes in the study is available to download (see URLs); for each gene, it lists the species in which orthologs were found, the tests for which the ortholog group was used, the resulting P values and whether the gene remained significant at the FDR cutoff after correction for multiple testing. Genes found to be evolving under positive selection are listed in Supplementary Tables 3, 9, 10 and 11.
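The P-value calculation and the multiple-testing correction described above can be sketched as follows. The text does not state the degrees of freedom of the χ² component; one degree of freedom is assumed here, and the function names are illustrative rather than those of any software used in the study.

```python
import math


def mixture_p_value(lrt_statistic: float) -> float:
    """P value under a 50:50 mixture of a point mass at zero and chi-square.

    Assumes one degree of freedom for the chi-square component, for which
    the survival function is erfc(sqrt(x / 2)).
    """
    if lrt_statistic <= 0:
        return 1.0
    chi2_sf = math.erfc(math.sqrt(lrt_statistic / 2.0))
    return 0.5 * chi2_sf


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Step down from the largest P value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical likelihood ratio statistics for four genes.
lrt = [9.2, 4.1, 0.0, 1.3]
p = [mixture_p_value(x) for x in lrt]
q = benjamini_hochberg(p)
significant = [qi <= 0.1 for qi in q]  # FDR cutoff of 0.1, as in the text
```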

Gene Ontology (GO) categories were assigned to orthologous groups according to the human genome reference. Each gene was also assigned to all parental categories in the ontology. We used two different statistical tests to detect categories with an over-representation of positively selected genes. First, Fisher's exact test (considering all genes with P < 0.05 as positively selected) measured the enrichment of a particular GO category for positively selected genes (Supplementary Table 12). A disadvantage of this test is that its results are highly dependent on the cutoff value for positively selected genes. Second, Mann-Whitney U tests were used to measure shifts toward lower P values in a particular GO category (Supplementary Table 13). Unlike Fisher's exact test, the Mann-Whitney U test does not depend on a P-value cutoff; however, its results may also be affected by relaxation of constraint instead of positive selection. The Holm method37 was used to correct for multiple testing.
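As an illustration of the first enrichment test and of the Holm correction, the sketch below computes a one-sided Fisher's exact P value from the hypergeometric distribution. Whether the study used a one-sided or two-sided test is not stated, so the one-sided form is an assumption, as are the function names and the toy counts.

```python
from math import comb


def fisher_enrichment_p(k: int, n_category: int, n_selected: int, n_total: int) -> float:
    """One-sided Fisher's exact test for enrichment.

    Probability of observing at least k positively selected genes in a GO
    category of n_category genes, when n_selected of n_total genes overall
    are positively selected.
    """
    upper = min(n_category, n_selected)
    numerator = sum(
        comb(n_selected, i) * comb(n_total - n_selected, n_category - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_category)


def holm(p_values):
    """Return Holm-adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the rank-th smallest P value by the number of remaining tests.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


# Toy example: 10 genes, 3 positively selected, a category of 4 genes with 2 hits.
p_toy = fisher_enrichment_p(k=2, n_category=4, n_selected=3, n_total=10)
```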

Testing for genomic convergence.

Reconstruction of ancestral sequences was conducted for 16,833 mammalian orthologs using the Codeml program in PAML v4.4 (ref. 38). For each of the three marine mammal groups (cetaceans, manatee and walrus), the extant sequences at each position were compared to the ancestral sequence at the node corresponding to the most recent ancestor. For the two cetaceans, this node was the one shared with cow; for walrus, this node was the one shared with dog; and, for manatee, this node was the one shared with elephant. The ancestral nodes are those at the roots of the red branches in Figure 1. We identified amino acid positions for which changes were inferred to have occurred and further examined those positions that changed in more than one marine mammal group. These changes could have been shared by all three groups or shared by any two of the three groups. Changes were further classified as 'parallel' if they resulted in an identical amino acid state in the present-day species and 'common' if they resulted in non-identical amino acid states in the present-day species. Common changes were hypothesized to be possible indicators of convergent evolution if adaptation to an aquatic lifestyle could be accomplished via multiple different amino acids at the same position. Genes with common and parallel changes were then compared to genes found to be under positive selection, and any overlapping genes between these two sets were inferred to have undergone convergent evolution. The positions of the parallel nonsynonymous amino acid substitutions that were found in positively selected genes are shown in Supplementary Table 14.
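The classification of shared changes can be illustrated with a minimal sketch for a single alignment position. It simplifies the procedure in two ways that are assumptions of this example: each marine mammal group is represented by a single extant amino acid, and groups are compared pairwise. The function name and the example residues are hypothetical.

```python
from itertools import combinations


def classify_site(ancestral, extant):
    """Classify changes at one amino acid position.

    ancestral: group name -> inferred residue at that group's ancestral node
    extant:    group name -> residue observed in the present-day species
    Returns a dict mapping each pair of changed groups to 'parallel'
    (identical derived residue) or 'common' (different derived residues).
    """
    changed = sorted(g for g in extant if extant[g] != ancestral[g])
    result = {}
    for a, b in combinations(changed, 2):
        result[(a, b)] = "parallel" if extant[a] == extant[b] else "common"
    return result


# Hypothetical residues at one position.
ancestral = {"cetaceans": "A", "walrus": "A", "manatee": "S"}
extant = {"cetaceans": "T", "walrus": "T", "manatee": "G"}
print(classify_site(ancestral, extant))
```

Positions at which fewer than two groups changed yield an empty result and would not be examined further.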

URLs.

Multi-genome alignment, ortholog set and likelihood ratio test results, http://compbio.fmph.uniba.sk/suppl/marine-mammals/; NCBI eukaryotic genome annotation pipeline, http://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/; UCSC Genome Browser, http://genome.ucsc.edu/; Baylor College of Medicine Marine Mammal Genome Project, https://www.hgsc.bcm.edu/marine-mammals; Atlas-Link, https://www.hgsc.bcm.edu/software/Atlas-Link; ATLAS GapFill, https://www.hgsc.bcm.edu/software/atlas-gapfill; Gnomon, http://www.ncbi.nlm.nih.gov/genome/guide/gnomon.shtml; RepeatMasker, http://www.repeatmasker.org/.

Accession codes.

The whole-genome shotgun sequences have been deposited in GenBank under the BioProject accessions ANOL00000000, ANOP00000000, AHIN00000000 and ABRN00000000. Sequencing data have been deposited in GenBank under BioProject 170427 corresponding to the Marine Mammal Genomes Project. Sequencing data for the Florida manatee genome have been deposited in GenBank under BioProject PRJNA189960.