DNA isolation and processing for Bionano optical mapping

Komodo dragon whole blood was obtained from one of two individuals housed at Zoo Atlanta (Rinca). Samples from the animals at Zoo Atlanta were collected with the approval of the Zoo’s Scientific Research Committee. High-molecular-weight genomic DNA was extracted for genome mapping. Blood was centrifuged at 2,000 g for 2 min, plasma was removed and the sample was stored at 4 °C. Then, 2.5 μl of blood was embedded in 100 μl agarose gel plugs to give ~7 μg DNA per plug, using the Bio-Rad CHEF Mammalian Genomic DNA Plug Kit (Bio-Rad Laboratories). Plugs were treated with proteinase K overnight at 50 °C. The plugs were then washed, melted and solubilized with GELase (Epicentre). The purified DNA was subjected to 4 h of drop dialysis. DNA concentration was determined using Qubit 2.0 Fluorometer (Life Technologies), and the quality was assessed with pulsed-field gel electrophoresis.

The high-molecular-weight DNA was labelled according to commercial protocols using the IrysPrep Reagent Kit (Bionano Genomics). Specifically, 300 ng of purified genomic DNA was nicked with 7 U nicking endonuclease Nb.BbvCI (NEB) at 37 °C for 2 h in NEB Buffer 2. The nicked DNA was labelled with a fluorescent-dUTP nucleotide analogue using Taq polymerase (NEB) for 1 h at 72 °C. After labelling, the nicks were repaired with Taq ligase (NEB) in the presence of dNTPs. The backbone of fluorescently labelled DNA was stained with DNA stain (Bionano).

Bionano mapping and assembly

Using the Bionano Irys instrument, automated electrophoresis of the labelled DNA occurred in the nanochannel array of an IrysChip (Bionano Genomics), followed by automated imaging of the linearized DNA. The DNA backbone (outlined by YOYO-1 staining) and locations of fluorescent labels along each molecule were detected using the Irys instrument’s software. The length and set of label locations for each DNA molecule defines an individual single-molecule map. Raw Bionano single-molecule maps were de novo assembled into consensus maps using the Bionano IrysSolve assembly pipeline (version 5134) with default settings, with noise values calculated from the 10x Genomics Supernova assembly. The Bionano optical mapping data comprised 80× genome coverage, and the scaffold N50 of the assembly was 1.2 Mb.

DNA processing for 10x Genomics linked-read sequencing

Blood samples from two individuals housed at Zoo Atlanta (Slasher and Rinca) were used. High-molecular-weight genomic DNA extraction, sample indexing and generation of partition barcoded libraries were performed by 10x Genomics according to the Chromium Genome User Guide and as published previously100. Then 1.2 ng of genomic DNA was used as input to the Chromium system.

10x Genomics sequencing and assembly

The 10x Genomics barcoded library was sequenced on the Illumina HiSeq2500. A total of 660 million of the raw reads comprising 57× genome coverage were assembled using the company’s Supernova software (version 1.0) with default parameters. Output fasta files of the Supernova assemblies were generated in pseudohap format, which links phased and unphased regions of the assembly into ‘pseudo-haplotype’ scaffolds. This generated an initial assembly with a scaffold N50 length of 10.2 Mb and a contig N50 length of 95 kb.

Oxford Nanopore sequencing

DNA isolated from Slasher was sequenced to 0.75× coverage on an Oxford Nanopore MinIon sequencer following the manufacturer’s instructions. MinKNOW was used for basecalling and output to FASTQ files.

DNA processing for PacBio sequencing

Komodo dragon whole blood collected in EDTA from an individual housed at Reptilandia zoo, Gran Canaria under institutional approval, stored at −20 °C, was used to extract high-molecular-weight DNA for single-molecule real-time sequencing. Extraction was performed using gravity-flow, anion-exchange tips (Qiagen genomic-tip 100/G kit) to a final DNA concentration of 130 ng μl−1 assessed using a Qubit 2.0 Fluorometer. Size of extracted DNA was determined by a 16 h pulse-field gel electrophoresis, which resolved high-molecular-weight fragments from 15 kb to 85 kb. We constructed a 10 kb and a 20 kb insert library using 20 μg of genomic DNA. The library was then size selected using a Blue Pippin (Sage Science) and resulting double-stranded DNA fragments were capped by hairpin loops at both ends to form a SMRTbell template. Single-molecule SMRTbell templates were then loaded in 150,000 Zero Mode Waveguides SMRT cells of a PacBio RS II sequencing system using paramagnetic beads (MagBeads, PacBio). Sequencing was performed using a total of 29 SMRT cells. We obtained a total of 2,061,804 subread filtered sequences for a total sequence length of 11,907,672,561 bp. Average, maximum and minimum sequence lengths were 5,775 bp, 48,338 bp and 35 bp, respectively. Median sequence length was 4,486 bp. N50 read length was 12,457 bp. In total, PacBio sequencing data represented 6.3× genome sequencing coverage.

Merging datasets into a single assembly

Sequencing and mapping data types were merged together as follows. First, Bionano assembled contigs and the 10x Genomics assembly were combined using Bionano’s hybrid assembly tool with the -B2 -N2 options. SSPACE-LongRead (https://doi.org/10.1186/1471-2105-15-211) was used in series with default parameters to scaffold the hybrid assembly using PacBio reads, Nanopore reads, and unincorporated 10x Genomics Supernova scaffolds/contigs, resulting in the final assembly.

Genome completeness assessment

The BUSCO pipeline version 3.0.2 was used to determine the completeness of the Komodo dragon genome, using the 2,586 gene vertebrata gene set and the Augustus retraining parameters (—long)16. For comparison, BUSCO was also run with the same parameters on all reptile genomes used for comparative analyses (S. crocodilurus, Ophisaurus gracilis, A. carolinensis, P. vitticeps, Python molurus bivittatus, Eublepharis macularius, Gekko japonicus, Pelodiscus sinensis, Chelonia mydas, Chrysemys picta bellii, Alligator sinensis, Alligator mississippiensis, Gavialis gangeticus and Crocodylus porosus) (Supplementary Table 3).

We obtained an assembly-free estimate of the Komodo dragon genome size using an sequencing error corrected k-mer counting method implemented in the PreQC component of the SGA assembler13.

Assignment of scaffolds to chromosomes

Isolation of Komodo dragon chromosome-specific DNA pools was performed as previously described10. Briefly, fibroblast cultivation of a female V. komodoensis were obtained from tissue samples of an early embryo of a captive individual at the Prague National Zoo. Embryos are obtained under the laws of the Czech Republic and of the European Union. Chromosomes obtained by fibroblast cultivation were sorted using a Mo-Flo (Beckman Coulter) cell sorter. Fifteen chromosome pools were sorted in total. Chromosome-specific DNA pools were then amplified and labelled by degenerate oligonucleotide primed PCR and assigned to their respective chromosomes by hybridization of labelled probes to metaphases. V. komodoensis chromosome pools obtained by flow sorting were named according to chromosomes (for example majority of DNA of VKO6/7 belong to chromosomes 6 and 7 of V. komodoensis). V. komodoensis pools for macrochromosomes are each specific for one single pair of chromosomes, except for VKO6/7 and VKO8/7, which contain one specific chromosome pair each (pair 6 and pair 8, respectively), plus a third pair which overlaps between the two of them (pair 7). For microchromosomes, pools VKO9/10, VKO17/18/19, VKO11/12/W and VKO17/18/Z contained more than one chromosome each, while the rest are specific for one single pair of microchromosomes. The W and Z chromosomes are contained in pools VKO11/12/W and VKO17/18/Z, respectively, together with two pairs of other microchromosomes each.

Chromosome-pool-specific genetic material was amplified by GenomePlex Whole Genome Amplification Kit (Sigma) following manufacturer protocols. DNA from all 15 chromosome pools was used to prepare Illumina sequencing libraries, which were independently barcoded and sequenced 125 bp paired-end in a single Illumina Hiseq2500 lane.

Reads obtained from sequencing of flow-sorting-derived chromosome-specific DNA pools were processed with the dopseq pipeline (https://github.com/lca-imcb/dopseq)101,102. Illumina adaptors and whole-genome amplification primers were trimmed off by cutadapt v1.13103. Then, pairs of reads were aligned to the genome assembly of V. komodoensis using bwa mem104. Reads were filtered by mapping quality ≥20 and length ≥20 bp, and aligned reads were merged into positions using pybedtools 0.7.10105,106. Reference genome regions were assigned to specific chromosomes based on distance between positions. Finally, several statistics were calculated for each scaffold. Calculated parameters included: mean pairwise distance between positions on scaffold, mean number of reads per position on scaffold, number of positions on scaffold, position representation ratio (PRR) and P value of PRR. PRR of each scaffold was used to evaluate enrichment of given scaffold on chromosomes. PRR was calculated as ratio of positions on scaffold to positions in genome divided by ratio of scaffold length to genome length. PRRs >1 correspond to enrichment, while PRRs <1 correspond to depletion. As the PRR value is distributed lognormally, we use its logarithmic form for our calculations. To filter out only statistically significant PRR values, we used thresholds of logPRR >0 and its P value ≤0.01. Scaffolds with logPRR >0 were considered enriched in the given sample. If one scaffold was enriched in several samples we chose the highest PRR to assign scaffold as top sample.

We also compared the genome organization at the chromosome level among V. komodoensis, A. carolinensis and G. gallus. We determined homology of each V. komodoensis scaffold to scaffolds of A. carolinensis (AnoCar2.0) and G. gallus (galGal3) genome generating alignment between genomes with LAST107 and subsequently using chaining and netting technique108. Homology to A. carolinensis microchromosomes was determined using scaffold assignments from an Anolis chromosome-specific sequencing project101. For LAST, we used default scoring matrix and parameters of 400 for gap existence cost, 30 for gap extension cost and 4,500 for minimum alignment score. For axtChain we used same distance matrix and default parameters for other chain-net scripts.

RNA sequencing

RNA was extracted from heart tissue obtained from an adult male specimen that died of natural causes at the San Diego Zoo. This was approved by the Institutional Animal Care and Use Committee and Biomaterials Review Group of San Diego Zoo. Trizol reagent was used to extract RNA following the manufacturer’s instructions. Two RNA-seq libraries were produced using a NuGen RNAseq v2 and Ultralow v2 kits from 100 ng total RNA each, and sequenced on an Illumina Nextseq 500 with 150 bp paired-end strand-specific reads.

Genome annotation

RepeatMasker v4.0.7 was used to mask repetitive elements in the Komodo dragon genome using the Squamata repeat database from the RepBase-Dfam combined database as reference109. After masking repetitive elements, protein-coding genes were annotated using the MAKER version 3.01.02110 pipeline, combining protein homology information, assembled transcript evidence, and de novo gene predictions from SNAP and Augustus version 3.3.1111. Protein homology was determined by aligning proteins from 15 reptile species (Supplementary Table 1) to the Komodo dragon genome using exonerate version 2.2.0112. RNA-seq data were aligned to the Komodo dragon genome with STAR version 2.6.0113 and assembled into 900,722 transcripts with Trinity version 2.4.0114. Protein domains were determined using InterProScan version 5.31.70115. Gene annotations from the MAKER pipeline were filtered based on the strength of evidence for each gene, the length of the predicted protein and the presence of protein domains. Clusters of orthologous genes across 15 reptile species were determined with OrthoFinder v2.0.0116. A total of 284,107 proteins were clustered into 16,546 orthologous clusters. In total, 96.4% of Komodo dragon genes were grouped into orthologous clusters. For estimating a species phylogeny only, protein sequences from Mus musculus and G. gallus were added to the orthologous clusters with OrthoFinder. Transfer RNAs were annotated using tRNAscan-SE version 1.3.1117, and other noncoding RNAs were annotated using the Rfam database118 and the Infernal software suite119.

Phylogenetic analysis

A total of 1,394 one-to-one orthologous proteins across 15 reptile species, three birds and four mammals were used to estimate a species phylogeny. The 15 reptile species used were V. komodoensis, S. crocodilurus, O. gracilis, A. carolinensis, P. vitticeps, P. molurus bivittatus, E. macularius, G. japonicus, P. sinensis, C. mydas, C. picta bellii, A. sinensis, A. mississippiensis, G. gangeticus and C. porosus. The three birds used were G. gallus, Meleagris gallopavo and Taeniopygia guttata, and the four mammals were Ornithorhynchus anatinus, M. musculus, Canis familiaris and Homo sapiens. Each set of orthologous proteins were aligned using PRANK v.170427120. Aligned proteins were concatenated into a supermatrix, and a species tree was estimated using IQ-TREE version 1.6.7.1121 with model selection across each partition122 and 10,000 ultrafast bootstrap replicates123.

To identify genes in this dataset of 1,394 orthologues that are evolving in a clock-like manner and thus useful for phylogenomic analyses such as divergence time estimates, we created individual trees for each of the genes in our analysis using the procedure described above but without bootstrapping. We then used SortaDate124 to calculate a number of informative metrics, including root-to-tip variance, tree length and bipartition support relative to the species phylogeny. This dataset can be used to find candidate marker genes for phylogenomic analyses and is available in a Figshare repository at https://doi.org/10.6084/m9.figshare.7967300.

Gene family evolution analysis

Gene family expansion and contraction analyses were performed with CAFE v4.2125 for the squamate reptile lineage, with a constant gene birth and gene death rate assumed across all branches.

V2Rs were first identified in all species by containing the V2R domain InterPro domain (IPR004073)126. To ensure that no V2R genes were missed, all proteins were aligned against a set of representative V2R genes using BLASTp127 with an e-value cutoff of 1 × 10−6 and a bitscore cutoff of 200 or greater. Any genes passing this threshold were added to the set of putative V2R genes. Transmembrane domains were identified in each putative V2R gene with TMHMM v2.0128 and discarded if they did not contain seven transmembrane domains in the C-terminal region. Beginning at the start of the first transmembrane domain, proteins were aligned with MAFFT v7.310 (auto-alignment strategy)129 and trimmed with trimAL (gappyout)130. A gene tree was constructed using IQ-TREE121,122,123 with the JTT+ model of evolution with empirical base frequencies and 10 FreeRate model parameters, and 10,000 bootstrap replicates. Genes were discarded if they failed the IQ-TREE composition test.

Positive selection analysis

We analysed 4,047 genes that were universal and one-to-one across all squamate lineages tested (V. komodoensis, S. crocodilurus, O. gracilis, A. carolinensis, P. vitticeps, P. molurus bivittatus, E. macularius and G. japonicus) to test for positive selection (Supplementary Table 9). An additional 2,013 genes that were universal and one-to-one across a subset of squamate species (V. komodoensis, A. carolinensis, P. molurus bivittatus and G. japonicus) were also analysed (Supplementary Table 9). We excluded multicopy genes from all positive selection analyses to avoid confounding from incorrect paralogy inference. Proteins were aligned using PRANK120 and codon alignments were generated using PAL2NAL131.

Positive selection analyses were performed with the branch-site model aBSREL using the HYPHY framework132,133. For the 4,081 genes that were single-copy across all squamate lineages, the full species phylogeny of squamates was used. For the 2,040 genes that were universal and single-copy across a subset of species, a pruned tree containing only those taxa was used. We discarded genes with unreasonably high dN/dS (defined as the ratio of the rate of nonsynonymous to synonymous substitutions) values across a small proportion of sites, as those were false positives driven by low-quality gene annotation in one or more taxa in the alignment. We used a cutoff of dN/dS of less than 50 across 5% or more of sites, and a P value of less than 0.05 at the Komodo dragon node. Each gene was first tested for positive selection only on the Komodo dragon branch. Genes undergoing positive selection in the Komodo dragon lineage were then tested for positive selection at all nodes in the phylogeny. This resulted in 201 genes being under positive selection in the Komodo dragon lineage (Supplementary Table 10).

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.