Significance Snake venoms are toxic protein cocktails used for prey capture. To investigate the evolution of these complex biological weapon systems, we sequenced the genome of a venomous snake, the king cobra, and assessed the composition of venom gland expressed genes, small RNAs, and secreted venom proteins. We show that regulatory components of the venom secretory system may have evolved from a pancreatic origin and that venom toxin genes were co-opted by distinct genomic mechanisms. After co-option, toxin genes important for prey capture have massively expanded by gene duplication and evolved under positive selection, resulting in protein neofunctionalization. This diverse and dramatic venom-related genomic response seemingly occurs in response to a coevolutionary arms race between venomous snakes and their prey.

Abstract Snakes are limbless predators, and many species use venom to help overpower relatively large, agile prey. Snake venoms are complex protein mixtures encoded by several multilocus gene families that function synergistically to cause incapacitation. To examine venom evolution, we sequenced and interrogated the genome of a venomous snake, the king cobra (Ophiophagus hannah), and compared it, together with our unique transcriptome, microRNA, and proteome datasets from this species, with data from other vertebrates. In contrast to the platypus, the only other venomous vertebrate with a sequenced genome, we find that snake toxin genes evolve through several distinct co-option mechanisms and exhibit surprisingly variable levels of gene duplication and directional selection that correlate with their functional importance in prey capture. The enigmatic accessory venom gland shows a very different pattern of toxin gene expression from the main venom gland and seems to have recruited toxin-like lectin genes repeatedly for new nontoxic functions. In addition, tissue-specific microRNA analyses suggested the co-option of core genetic regulatory components of the venom secretory system from a pancreatic origin. Although the king cobra is limbless, we recovered coding sequences for all Hox genes involved in amniote limb development, with the exception of Hoxd12. Our results provide a unique view of the origin and evolution of snake venom and reveal multiple genome-level adaptive responses to natural selection in this complex biological weapon system. More generally, they provide insight into mechanisms of protein evolution under strong selection.

Snake venom contains biologically active proteins (toxins) encoded by several multilocus gene families that each comprise several distinct isoforms (1, 2). Venom is produced in a postorbital venom gland (3) and associated in elapids (cobras and their relatives) and viperids (vipers and pit vipers) with a small downstream accessory gland of unknown function (Fig. 1). Understanding the origin and evolution of the snake venom system is not only of great intrinsic biological interest (3⇓–5), but is also important for drug discovery (1, 2, 6), understanding vertebrate physiological pathways (7, 8), and addressing public health concerns about the enormous number of snake bites suffered in tropical countries (9, 10).

Fig. 1. The king cobra venom system with venom and accessory gland expression profiles. Pie charts display the normalized percentage abundance of toxin transcripts recovered from each tissue transcriptome. Three-finger toxins are the most abundant toxin family in the venom gland (66.73% of all toxin transcripts and 4.37% in the accessory gland), and they are represented in the genome by at least 21 loci. Lectins are the most abundant toxin family in the accessory gland (42.70% of all toxin transcripts and 0.03% in the venom gland), and they are represented in the genome by at least six loci. Asterisks indicate toxin gene families annotated in the genome. 3FTx, three-finger toxin; AchE, acetylcholinesterase; CRISP, cysteine-rich secretory protein; CVF, cobra venom factor; IGF-like, insulin-like growth factor; kallikrein, kallikrein serine proteases; kunitz, kunitz-type protease inhibitors; LAAO, l-amino acid oxidase; NGF, nerve growth factor; PDE, phosphodiesterase; PLA2, phospholipase A 2 ; PLB, phospholipase-B; SVMP, snake venom metalloproteinase. Drawing made based on a photo by F.J.V.

The birth and death model of gene evolution is the canonical framework used to explain the evolutionary origin of snake venom toxins. Drivers of toxin diversification may include (i) directional selection for toxins that facilitate prey capture, (ii) the need to target a diversity of receptors in different prey, and (iii) the concomitant evolution of venom resistance in some prey as part of an evolutionary arms race (2). The lack of genome sequences for any venomous snake and the consequent dependence on transcriptome data have hampered our understanding of not only the tempo and mode of venom toxin evolution but also, the genomic mechanisms that regulate toxin–gene expression.

To address these issues, we have produced a draft genome of a venomous snake—that of an adult male Indonesian king cobra (Ophiophagus hannah). This iconic species is the longest venomous snake in the world. Native to tropical Asia, it feeds on other snakes, and it is a member of the family Elapidae. We also deep-sequenced transcriptomes and small RNAs of the venom gland, the accessory gland, and a pooled, multitissue archive and characterized the king cobra venom proteome. These unique datasets provide an unprecedented insight into the evolution of venom.

Methods SI Appendix, SI Materials and Methods has additional information relating to the methodologies described below. Tissue Acquisition and Processing. All animal procedures complied with local ethical guidelines. Genome sequencing was undertaken on a blood sample obtained from an adult male king cobra that originated in Bali, Indonesia. Venom was extracted, and 4 d later (to maximize mRNA production), the venom gland, accessory gland, and other tissue samples were sourced from a second Indonesian adult male specimen and stored in RNAlater. Genome Sequencing. We used a whole-genome shotgun sequencing strategy and Illumina sequencing technology. Genomic DNA was isolated from blood using the Qiagen Blood and Tissue DNeasyKit and paired-end libraries prepared from 5 µg isolated gDNA using the Illumina Paired-End Sequencing Sample Prep Kit. Either a 200- or 500-bp band was cut from the gel (library PE200 or PE500, respectively) (SI Appendix, Table S1). Similarly, mate pair libraries were prepared from 10 µg isolated gDNA using the Illumina Mate Pair 2–5 Kb Sample Prep Kit and bands from 2 to 15 Kbp cut from the gel (MP2K, MP7K, MP10K, and MP15K libraries) (SI Appendix, Table S1). After circularization, shearing, isolation of biotinylated fragments, and amplification, the 400- to 600-bp fraction of the resulting fragments was isolated from the gel. Genomic libraries were paired-end sequenced with a read length of 36–151 nt on an Illumina GAIIx instrument. Genome Assembly. For genome assembly, we largely followed the strategy pioneered in the work by Li et al. (38) for the assembly of the giant panda genome. Sequencing reads from both paired-end libraries were first used for building initial contigs. Both sets were preprocessed to eliminate low-quality reads and nucleotides as well as adapter contamination. For initial contig assembly, we used the CLC Assembly Cell De Novo Assembler (version 3.2; CLC Bio, Aarhus, Denmark), which implements a De Bruijn graph-based assembler. A run with a minimum-required contig size of 100 bp and a k-mer length of 31 nt resulted in an assembly with a total length of 1.45 Gbp and a contig N50 of 3,982 bp [i.e., 50% of the assembly (725 Mbp) is in contigs of at least this length]. Initial contigs were subsequently oriented into larger supercontigs (scaffolds) using SSPACE (39). SSPACE aligns paired reads to the contigs using Bowtie (40). SSPACE was used to scaffold contigs in a hierarchical fashion using first links obtained from the PE500 library to generate intermediate supercontigs, which were then used as the input for subsequent runs, with links from individual mate-pair libraries increasing in size. At each stage, a minimum of three nonredundant links was required to join two contigs. This procedure resulted in a final scaffold set with a total length of 1.66 Gbp and an N50 of 225,511 bp. Genome Annotation. Automated gene prediction was undertaken using the automated annotation pipeline MAKER (41, 42). Gene annotations were made using a protein database combining the Uniprot/Swiss-Prot protein database and all king cobra and green anole (A. carolinensis) sequences from the National Center for Biotechnology Information protein database. Ab initio gene predictions were created by MAKER using the programs SNAP (43) and Augustus (44). Gene models were further improved by providing MAKER with all king cobra mRNAseq data generated in this study, which were combined to generate a joint assembly of transcripts using Trinity (45). A total of three iterative runs of MAKER was used to produce the final gene set. Additional extensive manual annotation was performed to establish the intron–exon boundaries of members of venom toxin gene families. mRNA-Seq and Small RNA Libraries. King cobra tissue sequencing libraries were prepared for the venom gland, accessory gland, and a pooled multitissue archive (heart, lung, spleen, brain, testes, gall bladder, pancreas, small intestine, kidney, liver, eye, tongue, and stomach). Total RNA was isolated from each tissue using the Qiagen miRNeasy Kit. Transcriptome libraries were subsequently prepared from 10 µg total RNA (using equal amounts of RNA isolated from each tissue for the pooled multitissue archive) using the Illumina mRNA-Seq Sample Preparation Kit. Total RNA from the same samples was used to prepare the small RNA libraries using the Illumina small RNA v1.5 Sample Preparation Kit. RNAseq and small RNA libraries were sequenced on the Illumina GAIIx sequencing platform. Transcriptome Assembly. Reads for the venom gland, accessory gland, and pooled multitissue archive were coassembled with Abyss (46, 47) with various k values (every even number from 50 to 96). The resulting assemblies were joined by an iterative BLAST and cap3 assembler (48). Coding sequences were extracted using an automated pipeline based on similarities to known proteins or by obtaining coding sequences from the larger ORF of the contigs containing a signal peptide. To map the raw Illumina reads to the coding sequences and determine their tissue bias, raw reads from each library were blasted to the coding sequences using blastn with a word size of 25 (−W 25 switch) and allowing recovery of up to three matches. The three matches were used if they had less than two gaps and their scores were equal to the best score. The resulting blast file was used to compile the number of reads each coding DNA sequence received from each library. miRNA Profiles and in Situ Hybridizations. The small RNA sequences were analyzed using CLC Bio Genome Workbench. Briefly, small RNA sequences were filtered for quality and size, and reads of low quality and lengths less than 17 or greater than 26 nt were discarded. The remaining pool of small RNAs was compared with miRBase release 18 (http://www.miRBase.org) to extract orthologous mature miRNA sequences from each king cobra RNA sample. These miRNAs were subsequently mapped to the king cobra genome, with 70 bp upstream and downstream of the mature sequence extracted as the potential precursor miRNA sequence using PHP scripts and blast (49) 2.2.26+. The expression level of each miRNA was assessed using CLC Bio and compared with data available at the miRNA targets and expression database (http://www.microRNA.org; release August 2010) for the expression profiles of orthologous miRNA genes in mouse and human (e.g., miR-375). Whole-mount in situ hybridizations for miR-375 detection were performed using 5′ digoxigenin-labeled locked nucleic acid (LNA; Exiqon) probes following the protocol in the work by Darnell et al. (19). The standard tissue section in situ protocol in the work by Jostarndt et al. (50) for paraffin-embedded tissues was followed for miR-375 detection in the adult king cobra venom gland. For whole-mount in situ hybridizations in late-stage snake embryos (27 d postoviposition or older), embryos were skinned, and the abdominal wall was cut open followed by an extended probe hybridization for ∼36 h. All miR-375 LNA in situ hybridizations were carried out at 57 °C (22°C below the calculated probe melting temperature of 79°C) along with a no-probe control. miR-196 LNA in situ was carried out at 47 °C as an additional negative control in the adult venom gland. Venom Proteomics. We used king cobra venom extracted from the same animal used for transcriptomics. The venom was reduced, alkylated, digested with trypsin, separated by column chromatography, and analyzed by ESI-ion trap tandem MS. The peptide fragments created by collision-induced dissociation were compared against the assembled king cobra venom gland and accessory gland transcriptomes and a Lepidosaurian (National Center for Biotechnology Information) database using Sequest and Mascot software with a false discovery rate of 0.01. Evolutionary Analyses. King cobra sequences exhibiting homology to toxin families were identified through (i) annotation in the genome or transcriptome and (ii) blast searching the king cobra genome and transcriptome datasets in CLC Main Workbench with representative templates of toxin and nontoxin gene homologs. Coding regions of identified toxin gene loci were aligned using the MUSCLE algorithm (51) with putative paralogs and orthologs from selected vertebrates, including other venomous snakes and the P. molurus bivittatus and A. carolinensis genomes and transcriptomes (12, 16⇓–18). These sequences were obtained by mining GenBank for blast hits and using the datasets in work by Casewell et al. (15). DNA gene trees for each toxin family were reconstructed using Bayesian inference in MrBayes v3.2 (52) incorporating optimized models of sequence evolution selected by MrModelTest v2.3 (53). Each dataset was run in duplicate using four chains for 5 × 106 generations, sampling every 500th cycle from the chain, and using default settings in regards to priors. Tracer v1.4 (54) was used to estimate effective sample sizes for all parameters and verify the point of convergence (burnin), with trees generated before the completion of burnin discarded. The locations of gene expression of snake sequences determined by transcriptomics were mapped on the gene trees to visualize relative expression in different tissue types. Toxin family gene duplication events were inferred by pruning the gene trees to only contain king cobra and Burmese python genes along with a single outgroup sequence. The ensuing gene trees were analyzed using the duplication and loss criterion in iGTP (55) with the following species tree: [outgroup (king cobra, Burmese python)]. For tests of directional selection, we inferred fully resolved maximum likelihood trees from each of the toxin family datasets using the BEST tree-searching algorithm in PHYML (56). The most parsimonious points of recruitment into venom-producing pathways were then reconstructed on these trees, thereby classifying tree branches into venomous and nonvenomous. The method of Yang and Nielsen (57) was implemented in the PAML software package to estimate ω venomous and ω nonvenomous for each toxin family.

Acknowledgments We thank the following persons who helped us or contributed material used in this study: Austin Hughes, Nathan Dunstan, Daniëlle de Wijze, and Youri Lammers. We thank Bas Blankevoort for constructing Fig. 1. This work received funding from the following sources: internal funding from the Naturalis Biodiversity Center (F.J.V., J.W.A., and M.K.R.), a Rubicon Grant from the Netherlands Organization for Scientific Research (to F.J.V.), a research fellowship from the United Kingdom Natural Environment Research Council (to N.R.C.), an Netherlands Organization for Scientific Research Visitor’s Travel Grant from Nederlandse Organisatie voor Wetenschappelijk Onderzoek (to R.J.R.M., R.M.K., and M.K.R.), a studentship from the United Kingdom Biotechnology and Biological Sciences Research Council (to R.B.C.), and a Smart Mix Grant from the Dutch Government (to M.K.R.).

Footnotes Author contributions: F.J.V., N.R.C., and M.K.R. designed research; F.J.V. acquired samples for sequencing and estimated genome size; H.J.J., and R.P.D. prepared sequencing libraries; M.B., and W.P. developed assembly software; C.V.H. assembled the genome; H.J.J., M.Y., D.C., and H.P.S. annotated the genome; J.M.C.R. assembled and annotated RNA-seq libraries; F.J.V., N.R.C., H.M.E.K., and A.S.H. analyzed RNA-seq libraries; A.M.H., D.S., and E.M. annotated and analyzed small RNA libraries; H.M.E.K., I.G., H.P.S., and D.D. annotated and analyzed Hox genes; F.J.V., N.R.C., C.V.H., R.J.R.M., H.M.E.K., A.S.H., R.P.D., R.M.K., and M.K.R. annotated venom toxin genes and performed synteny analyses; N.R.C., R.A.V., and W.W. analyzed gene family evolution; A.M.H. performed miRNA in situ hybridization; A.E.W., and J.M.L. performed lectin in situ hybridization; N.R.C., R.J.R.M., J.J.C., R.A.H., C.R., R.B.C., D.P., L.S., and R.M.K. analyzed the venom proteome; T.A.C., A.P.J.d.K., and D.D.P. contributed Burmese python genome data and assisted with comparative analyses; H.J.J., J.W.A., G.E.E.J.M.v.d.T., R.P.D., H.P.S., and M.K.R. organized sequencing platforms and facilities; F.J.V., N.R.C., W.W., and M.K.R. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: The king cobra genome assembly and reads reported in this paper have been deposited in the GenBank database (bioproject no. PRJNA201683). The transcriptome sequences reported in this paper have been deposited in the GenBank Short Read Archive database (bioproject no. PRJNA222479). The microRNA sequences reported in this paper have been deposited in miRBase, www.mirbase.org.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1314702110/-/DCSupplemental.