Near-chromosomal de novo genome assembly

Using flow cytometry, we estimated the size of the Indian cobra haploid genome at 1.48–1.77 Gb (Extended Data Fig. 1). Cytogenetic analysis revealed a diploid karyotype of 2n = 38, comprising seven pairs of macrochromosomes (MACs), one pair of sex chromosomes (ZZ male or ZW female) and 11 pairs of microchromosomes (MICs), consistent with a previous report27 (Extended Data Fig. 2).

DNA from an adult male Indian cobra (NN01) was used to generate long-read (PacBio and Oxford Nanopore Technologies (ONT)), short-read (Illumina), Chicago28, Hi-C29 and optical mapping (Bionano Genomics (BNG)) data (Supplementary Table 1a,b). Additionally, we generated linked-read 10x Genomics, BNG and short-read Illumina sequence data for a female animal (NN05; Supplementary Note and Supplementary Table 1a,b). Our sequential assembly approach (Fig. 1 and Supplementary Note) resulted in a 1.79-Gb Nana_v5 genome with a scaffold N50 of 223.35 Mb and a BUSCO30 genome completeness score of 94.3% (Table 1 and Supplementary Table 2a–c).

Fig. 1: Schematic of N. naja genome sequencing and assembly. a,b, Long-read (PacBio and ONT) and short-read (lllumina) data (a) were used to build contigs that were then combined with Chicago28 chromatin interaction data (b) to generate scaffolds. c, Scaffolds from BNG optical mapping de novo assembly were combined with those from the previous step to generate super-scaffolds. d, Hi-C sequencing data were used to refine the assembly. e, Electronic fluorescence in situ hybridization (eFISH) was performed using cDNA FISH marker sequences from the Japanese rat snake, E. quadrivirgata. f, SChrom-seq data were used to assign scaffolds to chromosomes. Full size image

Table 1 N. naja assembly summary statistics Full size table

To assign scaffolds from Nana_v5 to chromosomes, we used complementary DNA (cDNA) chromosome marker sequences from a colubrid, the Japanese rat snake Elaphe quadrivirgata31 (Fig. 1e), synteny information from Anolis carolinensis (the green anole lizard genome32) and single-chromosome sequencing data (SChrom-seq; Fig. 1f, Supplementary Table 3a–c and Supplementary Note). We also generated a hybrid 10x BNG genome assembly to identify a 52.1-Mb female-specific W chromosome-linked scaffold (Supplementary Table 3d and Supplementary Note).

Comparison of the N. naja de novo genome assembly to the human33 and goat34 genomes showed that the scaffold N50 of the N. naja genome was 2.5× (87.27 versus 223.35 Mb) and 3.3× (67.79 versus 223.35 Mb) greater than the goat and human genomes previously assembled using the involved, expensive, time-consuming physical and cytogenetic maps developed over a decade or more (Table 2). Compared with the king cobra genome19, also an elapid, the Indian cobra genome contained far fewer scaffolds (296,399 versus 1,897, respectively), and had 929-fold better contiguity (scaffold N50 of 0.24 versus 223.35 Mb, respectively). Also, compared to the 7,034 scaffolds and 179.89-Mb scaffold N50 reported recently for the prairie rattlesnake genome (Crotalus viridis)35, the Indian cobra genome had a higher scaffold N50 and fewer scaffolds (Table 2).

Table 2 Comparison of Nana_v5 assembly to other high-quality genomes Full size table

Genome features

The average DNA base (GC) content of the N. naja genome was 40.46%. While MACs, representing 88% of the genome, had a GC content of 39.83%, that of MICs was 43.50% despite their containing only 12% of the genome (Welch’s two-sample t-test, two-sided P < 0.0001; Fig. 2a). Analysis of the repeat content in the Indian cobra genome in relation to other squamate reptile genomes revealed that 43.22% of the genome was repetitive (~760 Mb; Fig. 2a, Supplementary Table 4a,b, Extended Data Fig. 3 and Supplementary Note).

Fig. 2: Genome architecture of N. naja genome. a, Circos plot of the reference genome assembly (NN01, a male adult (n = 1)) representing N. naja chromosomes (outermost track) from the Nana_v5 assembly, repeat content, gene density and GC content (%). Regions of the genome with GC content higher than average (40.46%) are shown in blue. Regions within the gene density of more than 10 genes are shown as red spikes, while those with 5 to 10 genes are indicated by yellow spikes. Green spikes represent regions with fewer than five genes. The average repeat content is indicated by the red line. All data were plotted in 100-kb windows. The female-specific W-linked scaffold obtained using NN05 DNA is shown on the right. b, Chromosome painting depicting synteny between Indian cobra and rattlesnake genomes. c, Dot-plots showing synteny of the Indian cobra genome with the prairie rattlesnake, chicken or green anole lizard genomes. d, Bar plot of the number of predicted genes and corresponding transcripts observed in Nana_v5. Dashed and solid lines denote average number of genes and transcripts detected in each chromosome along 100-Mb windows, respectively. MICs were combined into one group. Unp, unplaced scaffolds (n = 1,878) containing predicted genes. Full size image

Whole-genome synteny comparison between the Indian cobra and prairie rattlesnake genomes revealed large syntenic blocks between the macro-, micro- and Z-chromosomes (Fig. 2b,c). We observed several fusion/fission events consistent with the difference in chromosome number between these two genomes. In particular, chromosome 4 of the Indian cobra shared syntenic regions with rattlesnake chromosomes 3 and 5, indicating a possible fusion event. In contrast, Indian cobra chromosomes 5 and 6 were syntenic to rattlesnake chromosome 5, indicating a possible fission event (Fig. 2b,c). Comparison of the Indian cobra genome to that of the more distantly related green anole lizard also showed regions of synteny and chromosomal rearrangements (Fig. 2c). Lizard chromosome 2 contained syntenic regions corresponding to chromosomes 4, 5 and 6 in the Indian cobra genome (Fig. 2c). Our synteny analysis also showed that the lizard chromosome 6 is homologous to the Z-chromosome of the Indian cobra (Fig. 2c and Supplementary Table 3b)36. Despite an estimated divergence time of ~280 Ma between snake and chicken (Gallus gallus), we observed synteny between several macro- and microchromosomes. Several regions of chicken chromosomes 1 and 2 showed syntenic blocks across Indian cobra macro-, micro- and Z-chromosomes (Fig. 2c). This indicated large-scale changes in macro- and microchromosome organization between squamate and avian genomes during evolution31,37.

Gene prediction and annotation

We used the MAKER pipeline38 to annotate the genome using protein homology information, in combination with gene expression data from 14 different tissues (n = 26 samples; Figs. 2d and 3, Supplementary Table 1a and Supplementary Note). Overall, we predicted 23,248 protein-coding genes and 31,447 transcripts that included alternatively spliced products encoding 31,036 predicted proteins (Fig. 2d). Of the 23,248 genes, we found 22,116 (95%) on the 19 largest scaffolds corresponding to the numbered chromosomes. A total of 26,216 of the 31,036 predicted proteins (84.4%) contained a canonical start and stop codon. We identified 3,265 of 26,216 proteins (~12.5%) with an N-terminal secretion signal sequence, a feature important for venom gland toxin secretion (Supplementary Table 5). We performed extensive functional annotation of the 31,036 predicted proteins and found that 17,019 (54.8%) had a corresponding ortholog in either the Human Gene Nomenclature Committee database, NCBI’s non-redundant database or the TrEMBL (https://www.ebi.ac.uk/uniprot) database (Supplementary Tables 5 and 7b and Supplementary Note). Comparison of our annotated proteome to that of the king cobra19, prairie rattlesnake35 and green anole lizard genomes32 identified 26,323, 25,505 and 11,820 orthologs in those genomes, respectively.

Fig. 3: N. naja expression body map. Heatmap showing log 2 (CPM) values of differentially upregulated genes (DUGs); FDR < 1% across 14 tissues (sample size n = 6) as indicated. NN01 and NN02 correspond to N. naja specimens obtained from Kerala, India. NN03, NN04, NN05 and NN06 correspond to N. naja specimens obtained from the Kentucky Reptile Zoo. Sg, salivary gland. Full size image

We comprehensively annotated venom-gland-relevant genes by combining the predicted gene models with long-read sequencing data (Supplementary Table 6a and Supplementary Note), toxin gene hidden Markov models (HMM) and manual curation to identify 139 toxin genes from 33 gene families39 that included 19 three-finger toxins (3FTxs), 8 snake venom metalloproteinases (SVMPs) and 6 cysteine (Cys)-rich secretory venom proteins (CRISPs) (Supplementary Table 6b). Near-chromosomal assembly also allowed us to assess the genomic organization of gene families that encoded enzymatic and non-enzymatic toxin proteins involved in venom gland function. In the N. naja genome, 16 major toxin gene families were organized on MACs (Fig. 4a and Supplementary Table 6c). This is in contrast to the genomes of viperids C. viridis (prairie rattlesnake) and Protobothrops flavoviridis (Amami habu), where a majority of the venom gland genes were found on MICs23,35.

Fig. 4: The N. naja venom gene repertoire. a, Genomic organization of N. naja toxin gene families. b–d, Arrayed venom gene organization of three major toxin gene families: 3FTx (b), SVMP (c) and CRISP (d). Genes that show venom-gland-specific expression are colored orange, and those with expression not restricted to venom glands are shown in magenta. Pseudogenes with no evidence for expression are shown in gray. e,f, Comparison showing the ancestral 3FTx (e) and CRISP (f) genes in lizard, and duplicated copies in the Indian cobra and prairie rattlesnake genomes. Orthologous gene pairs are indicated by shaded regions across the corresponding genomic regions. g, Schematic of filtering used to identify the 19 VSTs, and a heatmap showing the corresponding log 2 (CPM) values. NN01 and NN02 correspond to N. naja specimens obtained from Kerala, India. NN03, NN04, NN05 and NN06 correspond to N. naja specimens obtained from the Kentucky Reptile Zoo. FC, fold change; Chr, chomosome. Anatomical abbreviations as in Fig. 3. Full size image

Of the 19 full-length 3FTx genes identified in the genome, 14 were clustered within a 6.3-Mb region on chromosome 3 (Fig. 4b). One 3FTx gene (Nana001KS) was located on chromosome 4 and the remaining four were found on an unassigned scaffold, ScVE01q_1072 (Supplementary Table 6c). Additionally, we identified 10 3FTx pseudogenes that lacked parts of the coding region and were not expressed. The second largest toxin gene family encoded by the N. naja genome consisted of eight SVMPs that were clustered on a MIC 1 (Fig. 4c). A cluster of six CRISPs were found on chromosome 1 (Fig. 4d). Other toxin genes, including natriuretic peptide, C-type lectin, snake venom serine proteinase (SVSP), Kunitz and venom complement-activating gene families, were found to be distributed across the 19 chromosomes while two group I Phospholipase A2 (PLA2) genes and one cobra venom factor (CVF) gene were located on an unassigned scaffold (ScVE01q_344; Supplementary Table 6c). Comparisons of venom gland genes between the C. viridis genome and that of the Indian cobra identified 15 toxin gene families that were unique to the Indian cobra. This included phospholipase B-like toxins and cathlecidins40,41 (Supplementary Table 6d). Assessment of the 139 Indian cobra venom gland toxin genes for orthologs in the king cobra showed that, while 96 genes had a match, 43 did not (Supplementary Table 6e). Although some of the 43 toxins are likely to be unique to the Indian cobra, a majority were not annotated in the king cobra genome probably due to its highly fragmented assembly (Table 2).

Synteny comparisons of the major toxin gene families (3FTx, SVMP and CRISP) in genomes of the Indian cobra, prairie rattlesnake and green anole lizard revealed multiple duplication events in each family involving a paralog of non-venomous origin, leading to co-option/recruitment and expression of the duplicated gene in the venom gland (Fig. 4e,f and Extended Data Fig. 4)19,35,42.

Minimal core venom-ome toxin genes

Analysis of multi-tissue transcriptome data from 26 samples representing 14 different tissues (Supplementary Table 1a) identified 19,426 expressed genes (counts per million (CPM) >1; Fig. 3 and Supplementary Table 7a,b), of which 6,601 common core genes were expressed across all tissues (Supplementary Note).

The venom gland transcriptome (venom-ome) comprised 12,346 expressed genes that included 139 genes from 33 different toxin gene families. Furthermore, differential expression analysis revealed a set of 109 genes from 15 different toxin gene families that were significantly upregulated in the venom gland (fold change >2 and 1% false discovery rate (FDR); Extended Data Fig. 5 and Supplementary Table 7c), and this included 19 toxin genes that were expressed exclusively in the venom gland (Fig. 4g and Supplementary Note). These 19 VST genes are likely to encode the core venom effector toxin proteins, consisting of six neurotoxins, one cytotoxin, one cardiotoxin, one muscarinic toxin, six SVMPs, nerve growth factor (NGF-β), two venom Kunitz serine proteases and a CRISP (Supplementary Table 6b, column L). Additionally, we confirmed the presence of 16 of the 19 VSTs at the protein level using mass spectrometry (Supplementary Table 8).

Functionally diverse 3FTxs

Three-finger toxins are short polypeptides (60–90 amino acids) that belong to a superfamily of non-enzymatic proteins found primarily in elapid snakes43. These small proteins are known to primarily target neuronal receptors including nicotinic acetylcholine receptors (nAChRs), muscarinic acetylcholine receptors, calcium channels and other proteins43. Structurally, they fold into an outstretched, three-finger-like structure where each finger contains a β-hairpin loop that extends from a disulfide bond-stabilized hydrophobic core44,45. 3FTxs typically contain four conserved disulfide bridges, with some containing a fifth disulfide bridge46. Functionally, 3FTxs are broadly classified as neurotoxins, cytotoxins, cardiotoxins and anticoagulants.

While expression of all 19 annotated N. naja 3FTxs was detected in the venom gland, 9 were specific to the venom gland. Of these 19 3FTxs, 14 were classified as conventional 3FTxs with 8 conserved Cys residues, while 4 3FTxs contained 10 Cys residues. Homology-based assessment enabled classification of the 3FTxs into seven neurotoxins, six cytotoxins, four cardiotoxins, one muscarinic toxin and one anticoagulant (Supplementary Table 9). The neurotoxins included two type I short-chain neurotoxins (Nana002KS and Nana005KS), known to interact with muscle nAChR, one type II long-chain neurotoxin (Nana012KS), known to target muscle and neuronal nAChRs, and three putative type III weak neurotoxins containing 10 conserved cysteines (Nana001KS, Nana003KS and Nana004KS), a characteristic feature of both non-conventional and prey-specific 3FTxs47,48. Nana018KS was structurally similar to haditoxin from the king cobra and known to block α7-nAChRs49. Further assessment by structural modeling classified the Indian cobra 3FTxs into four groups (Fig. 5, Extended Data Fig. 6 and Supplementary Table 10). The aromatic residue (Tyr25 or Phe27), crucial for proper folding50 and stability of the antiparallel β-sheet structure, was found to be conserved in all 19 3FTxs (Fig. 5a). Three of the four 10-Cys-residue-containing 3FTxs were non-conventional 3FTxs containing a pair of additional cysteines in loop I, resulting in a longer N terminus loop that could potentially also play a role in stabilizing this loop and contribute to toxin function47 (Fig. 5a,b). The other 10-Cys 3FTx was a long-chain neurotoxin that contained a pair of additional cysteines in loop II (Nana012KS). The charged amino acid residue Arg39, which typically stabilizes native toxin conformation by forming a salt bridge with the C terminus51, was conserved in all identified 3FTxs except in Nana012KS, where it was replaced by a leucine residue (Fig. 5a). The homology model of this protein indicated that Leu39 may exhibit van der Waals interactions and form part of a hydrophobic core comprising Ile35, Phe4, Thr22 and Arg68 (Fig. 5c). Two of the cytotoxins, Nana008KS and Nana010KS, had shorter loops I and III compared to the other 3FTxs (Fig. 5d).

Fig. 5: Characterization of N. naja 3FTx gene family. a, Multiple sequence alignment of the 19 3FTx proteins identified in the Indian cobra genome. Protein names in orange in the alignment indicate VSTs identified using RNA-seq. Conserved Cys residues are highlighted yellow in the alignment. b–e, Ribbon representations of representative 3FTxs from four different structural classes. Disulfide bonds are shown as sticks. The hydrophobic packing of Leu39 and surrounding residues is shown in c. Dashed circles highlight the additional disulfide bonds in Nana001KS, Nana012KS and the unpaired Cys in Nana005KS. f, Superimposition of ribbon models of Nana001KS, Nana003KS and Nana010KS highlighting the differences in loop length and conformation between the distinct classes of 3FTx found in the Indian cobra genome. g, Analysis of evolutionary rates on 3FTx venom genes and their non-venom paralogs. K A and K S values were calculated according to the Nei–Gojobori method. K A and K S with values <1 were not included in further analysis. SNTX, short neurotoxin; LNTX, long neurotoxin; MTLP, muscarinic toxin-like; CTX, cardiotoxin or cytotoxin; Nonc, non-conventional toxin; WTX, weak neurotoxin. SV, snake venom; NV, non-venom. Full size image

Of note, Nana005KS contained nine Cys residues and such unusual 3FTxs have been found in only two other elapids, Micrurus lemniscatus and Micrurus altirostris52. This free Cys residue at position 16 that precedes the conserved second Cys in loop I (Fig. 5b) probably facilitates the formation of covalent homo- or heterodimeric 3FTxs (Supplementary Table 10). Nana005KS was closely related to the short-chain 3FTx Nana002KS and contained a majority of the residues required for activity against muscle nAChRs. For instance, the presence of the positively charged residues Lys25, Lys26 and Arg32 indicates that these toxins might be crucial for envenomation in mammals. In particular, the guanidyl group of Arg32 mimics acetylcholine forming a cation–π interaction with α–α and α–δ interfaces in the nAChRs53.

Acquired mutations affecting the neurotoxin binding site of nAChR have rendered certain species, such as the Egyptian mongoose (Herpestes ichneumon), immune to snake venom54,55,56. Comparison of N. naja nAChR sequence and H. ichneumon nAChR and other representative mammalian species showed that the N. naja nAChR carries a key p.Phe189Asn alteration in the α-neurotoxin site, known to result in decreased sensitivity to short- and long-chain neurotoxins55,56 (Extended Data Fig. 7).

In the final group of 3FTxs, Nana013KS was structurally similar to AncTx-1 (ref. 57), a synthetic ancestral toxin known to interact with β-adrenergic G-protein-coupled receptors. The 3FTx encoded by Nana006KS was found to be a close homolog of ringhalexin, an inhibitor of the extrinsic tenase coagulation complex, from Hemachatus haemachatus, the African ringhals cobra58 (Fig. 5e and Supplementary Table 9). Additionally, we found Nana017KS to be 97% similar to a recently reported Naja atra (μ-EPTX-Na1a)59 Nav1.8 voltage-gated sodium channel inhibitor.

To understand the consequences of genetic variation in the 3FTx family, we assessed the rate of evolution of the major toxin families by computing the numbers of synonymous (K S ) and non-synonymous (K A ) nucleotide substitutions per site for each pairing of toxin gene and its non-toxin paralog. The K A /K S substitution ratio for the 3FTx toxin genes was 2.034 (±0.818), while that of the non-toxin paralogs (Ly-6/UPAR domain-containing genes) was 0.894 (±0.103; Fig. 5g). The observed high K A /K S ratio (>1) indicated diversifying selection leading to rapid divergence and functional diversification of the venom-gland-specific 3FTx genes.

Indian cobra SVMP, CRISPs, PLA2, CVF and growth factor genes

We detected six venom-gland-specific SVMPs that belonged to the P-III class of metalloproteinases (with metalloproteinase/disintegrin/Cys-rich domains) known to be involved in the induction of hemorrhage, inflammation, apoptosis, prothrombin activation and inhibition of platelet aggregation60. The Indian cobra SVMPs were found to evolve less rapidly than 3FTx genes, because the K A /K S ratio for the venom-gland-specific SVMPs was 1.070 (±0.137) when compared to 0.998 (±0.049) observed for metalloprotease domain-containing paralogs (Extended Data Fig. 8). The Indian cobra SVMPs formed a separate cluster compared with the viperid SVMPs (Extended Data Fig. 9). In agreement with this, a seventh Cys residue in domain M12, involved in the disulfide bond exchange for autolysis during secretion or formation of the biologically active disintegrin/Cys-rich domain typical of Viperidae SVMP-PIIIs60, was absent (Extended Data Fig. 9).

We also detected six CRISPs in the venom gland transcriptome that were highly conserved across different snake species (Supplementary Fig. 1). Venom CRISPs have a wide variety of biological effects, including blockade of K+ and/or Ca2+ currents in neurons and blockage of vascular smooth muscle contraction61. Two of the five CRISPs were homologous to venom CRISPs from the Chinese cobra, Naja atra (natrin), and the monocled cobra, Naja kaouthia (UniProtKB: Q7T1K6). Expression of the N. naja natrin homolog (Nana02866) was venom gland specific, and it probably functions as a Kv1.3 potassium channel blocker62.

Additionally, two group I secretory acidic PLA2 genes (Nana39244 and Nana39246) that were highly expressed in the venom and salivary glands showed a high degree of similarity to other elapid venom PLA2s, and contained the characteristic calcium-binding (XCGXGG) and catalytic (DXCCXXHD) motifs63 (Supplementary Fig. 2 and Supplementary Note).

In addition to the major elapid toxin families described above, we detected transcripts from other toxin gene families including hyaluronidases, phospholipase B-like genes, cathelicidins, ohanin (known to induce hypolocomotion and hyperalgaesia)64 and 5′ nucleotidases (Supplementary Note). We also detected l-amino acid oxidase (LAAO) (Nana07858), which is involved in platelet aggregation, edema and hemorrhage. Furthermore, we identified two full-length c-type natriuretic peptide genes, Nana20849 and Nana20852, in the venom-ome. Of the three Kunitz serine protease inhibitors detected in the venom-ome, two were VST gene products (Nana13192 and Nana13193) and these probably function to inhibit serine proteases acting in the hemostatic system65. In addition, two full-length cystatin genes, Nana15538 and Nana35841, were also found expressed in the venom-ome.

Cobra venom factor is a non-lethal protein that resembles the complement C3 protein in structure and function66. Previously, the complete structure of one CVF gene from N. Naja has been reported67. In the present study, we identified three CVF genes (Nana10828, Nana38416 and Nana10826) in the N. naja genome. Nana38416 and Nana10828 contained 40 exons and spanned ~118 and ~75 kb, respectively, on chromosome 2. Isoform sequencing (Iso-seq) data confirmed the expression of the full-length transcripts corresponding to Nana10828 and Nana38416. Though the 5′ genomic structure of Nana10826 was not fully resolved in the currently assembly, the expression of the full-length transcript of Nana10826 was confirmed by iso-seq. Protein sequence alignment showed that Nana38416 was 96% similar to CVF (UniProtKB: Q91132) from N. kaouthia, while Nana10828 was 99% similar to a previously characterized N. naja C3 complement component protein68.

In addition to toxin components, we also identified four full-length PDGF/VEGF-growth factor genes including vascular endothelial growth factor-1 (VEGF1; Nana01393), VEGFC (Nana18254), platelet-derived growth factor (PDGF; Nana34300), placenta growth factor (PGF; Nana05337) and an insulin growth factor (IGF) gene and Nana04360 in the venom-ome. Furthermore, we also identified a venom-gland-specific nerve growth factor (NGF-β; Nana13949) that exhibited high homology to NGFV2 from the spitting cobra Naja sputratrix.