Evolution of snake genome architecture

We sequenced a male and a female five-pacer viper (Deinagkistrodon acutus) to high-coverage (♀ 238 fold, ♂114 fold, Supplementary Table 1), and estimated the genome size to be 1.43 Gb based on k-mer frequency distribution22 (Supplementary Table 2 and Supplementary Fig. 1). Fewer than 10% of the reads, which have a low quality or are probably derived from repetitive regions, were excluded from the genome assembly (Supplementary Table 3). We generated a draft genome using only male reads for constructing the contigs, and female long-insert (2–40 kb) library reads for joining the contigs into scaffolds. The draft genome has an assembled size of 1.47 Gb, with a slightly better quality than the genome assembled using only female reads. The draft genome has high continuity (contig N50: 22.42 kb, scaffold N50: 2.12 Mb) and integrity (gap content 5.6%, Supplementary Table 4), and thus was chosen as the reference genome for further analyses. It includes a total of 21,194 predicted protein-coding genes, as estimated using known vertebrate protein sequences and transcriptome data generated in this study from eight tissues (Fig. 1a, Methods). For comparative analyses, we also annotated 17,392 protein-coding genes in the boa genome (the SGA version from21). 80.84% (17,134) of the viper genes show robust expression (normalized expression level RPKM>1) in at least one tissue, comparable to 70.77% in king cobra (Supplementary Table 5). On the basis of 5,353 one-to-one orthologous gene groups of four snake species (five-pacer viper, boa21, python16 and king cobra10), the green anole lizard23 and several other sequenced vertebrate genomes (Methods), we constructed a phylogenomic tree with high bootstrapping values at all nodes (Fig. 1b). We estimated that advanced snakes diverged from boa and python about 66.9 (47.2–84.4) million years ago (MYA), and five-pacer viper and king cobra diverged 44.9 (27.5–65.0) MYA assuming a molecular clock. These results are consistent with the oldest snake and viper fossils from 140.8 and 84.7 MYA, respectively24.

The local GC content of snakes (boa and five-pacer viper) shows variation (GC isochores) similar to the genomes of turtles and crocodiles, and intermediate between mammals/birds and lizard (Fig. 1c, Supplementary Fig. 2), confirming the loss of such a genomic feature in the green anole lizard23. Cytogenetic studies showed that, like most other snakes, the five-pacer viper karyotype has 2n=36 chromosomes (16 macro and 20 micro-chromosomes)25. Previous work showed that there is extensive inter-chromosomal conservation between the rat snake and the butterfly lizard26. This information enables us to organize 56.50% of the viper scaffold sequences into linkage groups, based on their homology with sequences of known green anole lizard macro-chromosomes (Supplementary Table 6). As expected, autosomal sequences have the same read coverage in both sexes, whereas scaffolds inferred to be located on the viper Z chromosome (homologous to green anole lizard chr6) have coverage in the female that is half that in the male (Fig. 1c). Additionally, the frequency of heterozygous variants on the Z chromosome is much lower in the female than in the male (0.005 versus 0.08%, Wilcoxon signed rank test, P value<2.2e-16,) due to the nearly hemizygous state of Z chromosome in female, while those of autosomes (∼0.1%) are very similar between sexes. These results indicate that our assembly mostly assigns genes to the correct chromosome, which is further supported by comparison of 172 genes’ locations with previous fluorescence in situ hybridization results (Supplementary Data 1)26. The pattern of heterozygous variants also suggests that the viper sex chromosomes are highly differentiated from each other (see below).

47.47% of the viper genome consists of transposable element sequences (TEs), a higher percentage than in any other snake so far analysed (33.95–39.59%), which cannot be explained solely by the higher assembly quality of the viper10,16,21 (Supplementary Tables 7 and 8). The TEs in the viper genome are mostly long interspersed elements (LINE, 13.84% of the genome) and DNA transposons (7.96%, Supplementary Table 7). Sequence divergence of individual families from inferred consensus sequences uncovered recent rampant activities in the viper lineage of LINEs (CR1), DNA transposons (hAT and TcMar) and retrotransposons (Gypsy and DIRS). In particular, there is an excess of low-divergence (<10% divergence level) CR1 and hAT elements in the viper genome only (Fig. 2a). We also inferred earlier propagation of TEs shared by viper and king cobra, which thus probably occurred in the ancestor of advanced snakes. Altogether, these derived insertions resulted in an at least three-fold difference in the CR1 and hAT content between viper and more basal-branching snakes such as the boa and python (Fig. 2b). Meanwhile, the boa and python have undergone independent expansion of L2 and CR1 repeats, so that their overall LINE content is at a similar level to that of the viper and cobra (Fig. 2a, Supplementary Table 7).

Figure 2: Genomic and transcriptomic variation of snake transposable elements. (a) Violin plots showing each type of TE’s frequency distribution of sequence divergence level from the inferred ancestral consensus sequences. Clustering of TEs with similar divergence levels, manifested as the ‘bout’ of the violin, corresponds to the burst of TE amplification. (b) Bar plots comparing the genome-wide TE content between four snake species. TE families were annotated combining information of sequence homology and de novo prediction. (c) TE’s average normalized expression level (measured by RPKM) across different tissues in five-pace viper. Full size image

Most TEs are presumably silenced through epigenetic mechanisms to prevent their deleterious effects of transposition and mediation of genomic rearrangements. Indeed, very few TEs are transcribed in all of the tissues examined, except, unexpectedly, in the brain (Fig. 2c). This brain-specific expression prompted us to test whether some snake TE families might have been co-opted into brain gene regulatory networks. Focusing on highly expressed (RPKM>5) TEs that are located within 5 kb flanking regions of genes, we found that these nearby genes also show a significantly higher expression in brain than in any other tissues (Wilcoxon test, P value<1.1e-40, Supplementary Fig. 3). The expression levels of individual genes are strongly correlated (Spearman’s test, P value<1.35e-08) with those of nearby TEs. These genes are predominantly enriched (Fisher's Chi-square test, Q-value<0.05, Supplementary Figs 4–9) in functional domains of ‘biological process’ compared with ‘cellular component’ and ‘molecular function’, and particularly enriched categories include environmental response (‘response to organic substance’, ‘regulation of response to stimulus’ and ‘sensory perception of light stimulus’) and brain signalling pathways (‘neuropeptide signalling pathway’, ‘opioid receptor signalling pathway’ and ‘regulation of cell communication’ and so on). Further experimental studies are required to elucidate how some of these TEs evolved to regulate gene expression in the brain; these results nevertheless highlight the evolutionary dynamics and potential functional contribution of TEs in shaping snake genome evolution.

Evolution of snake genes and gene families

To pinpoint the critical genetic changes underlying the phenotypic innovations of snakes, we next mapped protein coding genes’ gain and loss (Fig. 1b) onto the phylogenetic tree. We also characterized signatures of lineage-specific adaptive or degenerative evolution (Fig. 3a) measured by their ratios (ω) of nonsynonymous versus synonymous substitution rates (Supplementary Data 2). We inferred a total of 1,725 gene family expansion and 3,320 contraction events, and identified 610 genes that appear to have undergone positive selection and 6,149 with relaxed selective constraints at different branches, using a likelihood model and conserved lineage-specific test27. Genes of either scenario were separated for analysis of their enriched gene ontology (GO) and mouse orthologs’ mutant phenotype terms, assuming most of them have a similar function in snakes.

Figure 3: Evolution of snake genes and gene families. (a) Phylogenetic distribution of mutant phenotypes (MP) of mouse orthologs of snakes. Each MP term is shown by an organ icon, and significantly enriched for snake genes undergoing positive selection (red) or relaxed selective constraints (grey) inferred by lineage-specific PAML analyses. (b) We show the four Hox gene clusters of snakes, with each box showing the ratio of nonsynonymous (dN) over synonymous substitution (dS) rate at the snake ancestor lineage. White boxes represent genes that haven’t been calculated for their ratios due to the genome assembly issue in species other than five-pacer viper. Boxes with dotted line refer to genes with dS approaching 0, therefore the dN/dS ratio cannot be directly shown. Each cluster contains up to 13 Hox genes with some of them lost during evolution. We also marked certain Hox genes undergoing positive selection (in red) or relaxed selective constraints (in green) at a specific lineage above the box. Each lineage was denoted as: S: Serpentes (ancestor of all snakes), H: Henophidia (ancestor of boa and python), B: Boa constrictor, P: Python bivittatus, C: Colubroidea (ancestor of five-pacer viper and king cobra), D: Deinagkistrodon acutus, O: Ophiophagus hannah. (c) Comparing olfactory receptor (OR) gene repertoire between boa, viper and lizard. Each cell corresponds to a certain OR family (shown at the y-axis) gene number on a certain chromosome (x-axis). (d) Pie chart shows the composition of normalized venom gland transcripts of male five-pacer viper. The heatmap shows the normalized expression level (in RPKM) across different tissues of viper and king cobra. We grouped the venom genes by their time of origination, shown at the bottom x-axis. Full size image

Significantly (Fisher’s exact test, P value<0.05) enriched mutant phenotype terms integrated with their branch information illuminated the molecular evolution history of snake-specific traits (Fig. 3a). For example, as adaptations to a fossorial lifestyle, the four-legged snake ancestor28 had evolved an extreme elongated body plan without limbs, and also fused eyelids (‘spectacles’, presumably for protecting eyes against soil29). The latter is supported by the results for the positively selected gene Ereg30 and the genes under relaxed selection Cecr2 (ref. 31) and Ext1 (ref. 32) at the snake ancestor branch (Supplementary Data 3), whose mouse mutant phenotype is shown as prematurely opened or absent eyelids. The limbless body plan has already driven many comparisons of expression domains and coding-sequences of the responsible Hox genes between snakes and other vertebrates7,17. We here refined the analyses to within snake lineages, focusing on sequence evolution of Hox and other genes involved in limb development and somitogenesis. We annotated the nearly complete sequences of 39 Hox genes organized in four clusters (HoxA–HoxD) of the five-pacer viper. Compared with the green anole lizard, the four studied snake species have Hox genes whose sizes are generally reduced, due to the specific accumulation of DNA transposons in the lizard’s introns and intergenic regions (Supplementary Fig. 10). However, snakes have accumulated particularly higher proportions of simple tandem repeat and short interspersed element sequences within Hox clusters (Supplementary Fig. 11), either as a result of relaxed selective constraints and/or evolution of novel regulatory elements. We identified 11 Hox genes as under relaxed selective constraint and one (Hoxa9) as under positive selection (Fig. 3b). Their combined information of gene function and affected snake lineage informed the stepwise evolution of snake body plan. In particular, Hoxa5 (ref. 33), Hoxa11 (ref. 34) and Tbx5 (ref. 35), which specifically pattern the forelimbs in mouse, have been identified as genes under relaxed selective constraint in the common ancestor of all four snakes. Meanwhile, Hoxc11 and Tbx4 (ref. 36), which pattern the hindlimbs in the mouse, and many other limb-patterning genes (for example, Gli3, Tbx18, Alx4) were identified as genes under relaxed selective constraint that evolved independently on external snake branches (Fig. 3b, Supplementary Data 3). These results provide robust molecular evidence supporting the independent loss of hindlimbs after the complete loss of forelimbs in snake ancestors. In the snake ancestor branch, we also identified the genes under relaxed selective constraint Hoxa11, Hoxc10 and Lfng, which are respectively associated with sacral formation37, rib formation8 and somitogenesis speed38 in vertebrates. Their changed amino acids and the expression domains that have expanded in snakes relative to lizards17,19 might have altogether contributed to the ‘de-regionalization’17 and elongation of the snake body plan. In several external branches, we identified Hoxd13 independently as under relaxed selective constraint. Besides its critical roles in limb/digit patterning39, Hoxd13 is also associated with termination of the somitogenesis signal and is specifically silenced at the snake tail relative to the lizard tail7. This finding suggests that body elongation may have evolved more than once among snake lineages. Overall, limb/digit/tail development mutant phenotype terms are significantly enriched in genes under relaxed selective constraint at both ancestral and external snake branches (Fig. 3a), and we identified many such genes in different snake lineages for future targeted experimental studies (Supplementary Data 3 and 4).

Another important adaption to the snakes’ ancestrally fossorial and later ground surface lifestyle is the shift of their dominant source of environmental sensing from visual/auditory to thermal/chemical cues. Unlike most other amniotes, extant snake species do not have external ears, and some basal species (for example, blindsnake) have completely lost their eyes. Consistently, we found mutant phenotype terms associated with hearing/ear and vision/eye phenotypes (for example, abnormal ear morphology, abnormal vision and abnormal cone electrophysiology) are enriched among genes under relaxed selection along all major branches of snakes starting from their common ancestor (Fig. 3a, Supplementary Fig. 12, Supplementary Data 3). Gene families that have contracted in the ancestor of the four studied snake species, and specifically in the viper, are also significantly enriched in GOs’ of ‘sensory perception of light stimulus (GO:0050953)’ or ‘phototransduction (GO:0007602)’ (Fisher's Exact Test, Q-value<9.08e-4; Fig. 1c, Supplementary Data 5). In particular, only three (RH1, LWS and SWS1) out of 13 opsin genes’ complete sequences can be identified in the viper genome, consistent with the results found in python and cobra16. By contrast, infrared receptor gene TRPA1 (ref. 5) and ubiquitous taste-signalling gene TRPM5 (ref. 40) have respectively undergone adaptive evolution in five-pacer viper and the ancestor of boa and python. Gene families annotated with the GO term ‘olfactory receptor (OR) activity’ have a significant (Fisher's Exact Test, Q-value<1.63e-4) expansion in all snake species studied and at some of their ancestral nodes, except for the king cobra (Supplementary Figs 13–26). In the boa and viper, whose genome sequences have much better quality than the other two snake genomes, we respectively annotated 369 and 412 putatively functional OR genes, based on homology search and the characteristic 7-TM (transmembrane) structure (Methods). Both terrestrial species have an OR repertoire predominantly comprised of class II OR families (OR1–14, presumably for binding airborne molecules, Fig. 3c), and their numbers are much higher than the reported numbers in other squamate genomes41. Some (ranging from 18 to 24) class I (OR51–56, for water-borne molecules) genes have also been found in the two species, indicating this OR class is not unique to python as previously suggested41. Compared with the green anole lizard, the boa and viper exhibit a significant size expansion of OR family 5, 11 and 14 (Fisher’s exact test P<0.05), and also a bias towards being located on the Z chromosome (Fig. 3c), leading to higher expression of many OR genes in males than in females (see below). In particular, OR5 in the viper probably has experienced additional expansion events and become the most abundant (with 71 members) family in the genome. Intriguingly, this family is specifically enriched in birds of prey42 relative to other birds, and in non-frugivorous bats versus frugivorous bats43. Therefore, its expansion in the five-pacer viper could have been positively selected for a more efficient detection of prey.

Besides acute environmental sensing, specialized fangs6 and venoms11 (for example, hemotoxins of viper or neurotoxins of elapid) arm the venomous snakes (∼650 species) to immediately immobilize much larger prey for prolonged ingestion, which probably comprised one of the most critical factors that led to the advanced snakes’ species radiation. It has been proposed that the tremendous venom diversity probably reflects snakes’ local adaption to prey44 and was generated by changes in the expression of pre-existing or duplicated genes11,45. Indeed, we found that the five-pacer viper’s venom gland gene repertoire has a very different composition compared with other viper46 or elapid species10 (Fig. 3d). We have annotated a total of 35 venom genes or gene families using all the known snake venom proteins as the query. Certain gene families, including snake venom metalloproteinases (SVMP), C-type lectin-like proteins (CLPs), thrombin-like snake venom serine proteinases (TL), Kunitz and disintegrins, have more genomic copies in the five-pacer viper than other studied snakes or the green anole lizard (Supplementary Table 9), whereas characteristic elapid venom genes such as three-finger toxins (3FTx) are absent from the viper genome. Most venom proteins of both the viper and king cobra have expression restricted to venom or accessary glands, and for both species this is particularly seen for those genes that originated in the ancestor of snakes or of advanced snakes (Fig. 3d). However, elapid- and viper-specific venom genes, that is, those that originated more recently, are usually expressed in the liver of the other species. Such cases include FactorV, FactorX of king cobra, which are expressed in the liver of five-pacer viper, and PLA2-2A of viper (Fig. 3d), which is expressed in the pooled organ of king cobra. This expression pattern suggests that these venom genes may have originated from metabolic proteins and undergone neo-/sub-functionalization, with altered expression.

Evolution of snake sex chromosomes

Different snake species exhibit a continuum of sex chromosome differentiation. Pythons and boas possess homomorphic sex chromosomes, which is assumed to be the ancestral state; the lack of differentiation between the W and Z chromosomes in these species suggests that most regions of this chromosome pair recombine like the autosomes47. Advanced snakes usually have heteromorphic sex chromosomes that have undergone additional recombination suppression47,48. We found that the five-pacer viper probably has suppressed recombination throughout almost the entire sex chromosome pair, as the read coverage in the female that we sequenced is half that in the male (figs 1c and 4). By contrast, the boa’s homologous chromosomal regions show a read coverage pattern that does not differ from that of autosomes and between sexes (Fig. 4). Assuming that these two species share the same ancestral snake sex-determining region, this lack of sex-differentiated region shown in boa suggests that that region is not included in our current chromosomal assembly.

Figure 4: Snake sex chromosomes have at least three evolution strata. The three tracks in the top panel shows female read depths along the Z chromosome relative to the median depth value of autosomes, Z/W pairwise sequence divergence within intergenic regions, and female read depths of W-linked sequence fragments relative to the median depth value of autosomes. Depths close to 1 suggest that the region is a recombining pseudoautosomal region (PAR), whereas depths of 0.5 are expected in a highly differentiated fully sex-linked region where females are hemizygous. The identifiable W-linked fragments are much denser at the region 56–70 Mb, probably because this region (denoted as stratum 3, S3) has suppressed recombination most recently. S2 and S1 were identified and demarcated by characterizing the sequence conservation level (measured by LASTZ alignment score, blue line) between the chrZs of boa and viper. At the oldest stratum S1 where recombination has been suppressed for the longest time, there is an enrichment of repetitive elements on the affected Z-linked region (Gypsy track in red, 100 kb non-overlapping sliding window). And these Z-linked TEs A similar pattern was found in homologous recombining region of boa, but not in lizard. Full size image

In plants, birds and mammals, it has been found that recombination suppression probably occurred by a succession of events. This stepwise recombination loss has led to the punctuated accumulation of excessive neutral or deleterious mutations on the Y or W chromosome by genetic drift, and produced a gradient of sequence divergence levels over time, which are termed ‘evolutionary strata’49,50,51. Advanced snakes have been suggested to have at least two strata12. One goal of our much more continuous genome assembly of the five-pacer viper compared with those of any other studied advanced snakes10,12 (Supplementary Table 4) was to reconstruct a fine history of snake sex chromosome evolution. We assembled 77 Mb Z-linked and 33 Mb W-linked scaffolds (Methods). The reduction of female read coverage along the Z chromosome suggests that there is substantial divergence between Z- and W- linked sequences; this divergence would enable the separate assembly of two chromosomes’ scaffolds. Mapping the male reads confirmed that the inferred W-linked scaffold sequences are only present in the female (Supplementary Fig. 27). The W-linked scaffolds’ density and pairwise sequence divergence values within putative neutral regions along the Z chromosome indicate at least two ‘evolutionary strata’, with the older stratum extending 0–56 Mb, and the younger one extending 56–70 Mb. The boundary at 56 Mb region can also be confirmed by analyses of repetitive elements on the Z chromosome (see below). Consistently, identifiable W-linked fragments are found at the highest density per megabase in the 56–70 Mb region (Fig. 4), suggesting that recombination in this region have been suppressed more recently. The older stratum includes much fewer identifiable fragments that can resolve the actual times of recombination suppression events. To study this region further, we inspected the homologous Z-linked region, whose recombination has also been reduced, albeit to a much smaller degree than that of the W chromosome, after the complete suppression of recombination between Z and W in females. In addition, Z chromosome transmission is biased in males. As males usually have a higher mutation rate than females, due to many more rounds of DNA replication during spermatogenesis than during oogenesis (‘male-driven evolution’)52, Z-linked regions are expected to have a generally higher mutation rate than any other regions in the genome. This male-driven evolution effect has been demonstrated in other snake species12 and also been validated for the snakes inspected in this study (Supplementary Fig. 28). As a result, we expected that regions in older strata should be more diverged from their boa autosome-like homologues than those in the younger strata. This expectation enabled us to identify another stratum (0–42 Mb, stratum 2, S2 in Fig. 4) and demarcate the oldest one (42–56 Mb, S1), by estimating the sequence conservation level (measured by LASTZ alignment score, blue line) between the Z chromosomes of boa and viper. The Z-linked region in the inferred oldest stratum S1 exhibits the highest sequence divergence with the homologous W-linked region and also the highest proportion of repetitive elements (CR1, Gypsy and L1 elements; Fig. 4 shows the example of Gypsy; other repeats are shown in Supplementary Fig. 29). This enrichment of repeats can be explained by the effect of genetic drift53, which has been acting on the Z-linked S1 longer than any other Z-linked regions since the S1 reduced recombination rate in females. As a result, the accumulated repeats of S1 also tend to have a higher divergence level from the inferred ancestral consensus sequences compared with nearby strata (Fig. 4). Unexpectedly, a similar enrichment was found in the homologous region of S1 in boa, despite it being a recombining region and exhibiting the same coverage depth between sexes (Fig. 4, Supplementary Fig. 29). This finding indicates that the pattern is partially contributed by the ancestral repeats that had already accumulated on the proto-sex chromosomes of snake species. Since our current viper sex chromosomal sequences used the green anole lizard chromosome 6 as a reference, rearrangements within this chromosome make it impossible to test whether S2 encompasses more than one stratum.

We dated the three resolved strata by constructing phylogenetic trees with homologous Z- and W-linked gene sequences of multiple snake species. Combining the published CDS sequences of pygmy rattlesnake (Viperidae family species) and garter snake (Colubridae family species)12, we found 31 homologous Z-W gene pairs, representing the three strata. All of their sequences clustered by chromosome (that is, the Z-linked sequences from all the species cluster altogether, separately from the W-linked ones) rather than by species (Supplementary Figs 30–32). This clustering pattern indicates that all three strata formed before the divergence of the advanced snakes and after their divergence from boa and python, that is, about 66.9 million years ago (Fig. 1c).

We found robust evidence of functional degeneration on the W chromosome. It is more susceptible to the invasion of TEs; the assembled sequences’ overall repeat content is at least 1.5 fold higher than that of the Z chromosome, especially in the LINE L1 (2.9 fold) and LTR Gypsy families (4.3 fold) (Supplementary Table 10 and Supplementary Fig. 33). Of 1,135 Z-linked genes, we were only able to identify 137 W-linked homologues. Among these, 62 (45.26%) have probably become pseudogenes due to nonsense mutations (Supplementary Table 11). W-linked loci generally are transcribed at a significantly lower level (Wilcoxon test, P <0.0005), with pseudogenes transcribed at an even lower level relative to their autosomal or Z-linked homologous loci regardless of the tissue type (Supplementary Figs 34 and 35). Given such a chromosome-wide gene loss, as in other snakes12 and the majority of species with ZW sex chromosomes54, the five-pacer viper shows a generally male-biased gene expression throughout the Z-chromosome and probably has not evolved global dosage compensation (Supplementary Fig. 36).