The diverse antifreeze proteins enabling the survival of different polar fishes in freezing seas offer unparalleled vistas into the breadth of genetic sources and mechanisms that produce crucial new functions. Although most new genes evolved from preexisting genic ancestors, some are deemed to have arisen from noncoding DNA. However, the pertinent mechanisms, functions, and selective forces remain uncertain. Our paper presents clear evidence that the antifreeze glycoprotein gene of the northern codfish originated from a noncoding region. We further describe the detailed mechanism of its evolutionary transformation into a full-fledged crucial life-saving gene. This paper is a concrete dissection of the process of a de novo gene birth that has conferred a vital adaptive function directly linked to natural selection.

A fundamental question in evolutionary biology is how genetic novelty arises. De novo gene birth is a recently recognized mechanism, but the evolutionary process and function of putative de novo genes remain largely obscure. With a clear life-saving function, the diverse antifreeze proteins of polar fishes are exemplary adaptive innovations and models for investigating new gene evolution. Here, we report clear evidence and a detailed molecular mechanism for the de novo formation of the northern gadid (codfish) antifreeze glycoprotein (AFGP) gene from a minimal noncoding sequence. We constructed genomic DNA libraries for AFGP-bearing and AFGP-lacking species across the gadid phylogeny and performed fine-scale comparative analyses of the AFGP genomic loci and homologs. We identified the noncoding founder region and a nine-nucleotide (9-nt) element therein that supplied the codons for one Thr-Ala-Ala unit from which the extant repetitive AFGP-coding sequence (cds) arose through tandem duplications. The latent signal peptide (SP)-coding exons were fortuitous noncoding DNA sequence immediately upstream of the 9-nt element, which, when spliced, supplied a typical secretory signal. Through a 1-nt frameshift mutation, these two parts formed a single read-through open reading frame (ORF). It became functionalized when a putative translocation event conferred the essential cis promoter for transcriptional initiation. We experimentally proved that all genic components of the extant gadid AFGP originated from entirely nongenic DNA. The gadid AFGP evolutionary process also represents a rare example of the proto-ORF model of de novo gene birth where a fully formed ORF existed before the regulatory element to activate transcription was acquired.

Evolutionary innovation of new genetic elements is recognized as a key contributor to organismal adaptation. For decades, the treatise of Ohno (1) shaped the paradigm of new gene creation in that it relies on the duplication of a preexisting protein gene. When subjected to selection, adaptive sequence changes in one copy may occur from which a gene with a novel function may emerge (1, 2). Creating new protein-coding genes de novo from noncoding DNA sequences was considered extremely rare. In recent years, however, examples of de novo genes have been reported in diverse animals and plants (see review ref. 3 and studies referenced therein). De novo gene births were generally deduced using a combination of phylogenetics and comparative genomic/transcriptomic analyses or the phylostratigraphy approach (4), which revealed evidence for lineage- or species-specific gene transcripts, whereas the orthologous sequences in sister species were nongenic. These revelations have spurred considerable interest and hypotheses of how de novo genes arise and evolve as well as questions regarding their functional importance (5, 6). Validating new genes identified from sequence-based comparisons is complicated by uncertainties around how comprehensive the genome assemblies and gene expression data are (7, 8). More challenging yet is identifying the selective pressures and molecular mechanisms that created these putative new genes, and the adaptive functions and species fitness they may confer.

In contrast, antifreeze protein genes of polar teleost fishes are unequivocal new genes that confer a clear life-saving function and fitness benefit. The selective pressure that compelled their evolution is also clear. They evolved in direct response to polar marine glaciations, preventing death of fish from inoculative freezing by environmental ice crystals in subzero waters (9, 10). Such a strong life-or-death selective pressure has driven the independent evolution of multiple structurally distinct types of antifreezes: antifreeze peptide (AFP) types I, II, and III and antifreeze glycoprotein (AFGP) in diverse fish lineages where they perform the same ice-growth inhibition function (10). The structural differences lie in their distinct genetic ancestry. Thus, fish antifreezes as a group can richly inform on the diversity of molecular origins and evolutionary mechanisms that produced a vital function.

The well-known mechanism of evolution by gene duplication from a preexisting ancestor as diverse as C-type lectin and sialic acid synthase followed by sequence tinkering by natural selection produced AFP II (11) and AFP III (12), respectively. AFGPs have evolved independently in two unrelated fish lineages at opposite poles: the Antarctic notothenioid fishes (Notothenioidei) and the Arctic/northern codfishes (Gadidae), providing a striking example of protein sequence convergence (9, 13). In both lineages, AFGPs occur as a family of size isoforms composed of varying numbers of repeats of a basic tripeptide unit (Thr-Ala-Ala) with each Thr glycosylated with a disaccharide (10, 14). They are encoded by a family of polyprotein genes, each of which produces a large polyprotein precursor consisting of many tandemly linked AFGP molecules that are then post-translationally cleaved to yield mature AFGPs (13, 15). The Antarctic notothenioid AFGP evolved through a more innovative process than gene duplication and sequence divergence. It exemplifies partial de novo gene evolution. A functionally unrelated ancestral trypsinogenlike protease (TLP) gene provided the secretory signal and a 3′ untranslated sequence of the incipient AFGP. The large repetitive AFGP polyprotein-coding region was generated de novo from duplications of a partly non-sense 9-nt sequence that straddled an intron–exon junction in the TLP, which happened to comprise the three codons for one Thr-Ala-Ala unit (15, 16).

Where from and how the northern gadid AFGP evolved have remained lasting enigmas. Despite voluminous collections of genes and genome sequences available in databases, there are no meaningful homologs to any part of the gadid AFGP to hint at ancestry. This peculiar absence of related genes suggests that the gadid AFGP gene may have originated from nonprotein-coding DNA. Gadid AFGP presumably evolved very recently, in response to the cyclic northern hemisphere glaciation that commenced in the late Pliocene about 3 Mya. We reason it is unlikely that mutational processes could completely obscure even noncoding sequences within such a short evolutionary time such that the extant form of the AFGP nongenic ancestor should remain identifiable. We, therefore, decided to track the AFGP genotype and its homologs within the gadid phylogeny to pinpoint the ancestral DNA site of origin and reconstruct the gadid AFGP evolutionary path. Here, we report the identification of the noncoding founder sequence and the mechanism by which it gave rise to a new functional gadid AFGP gene. Our results also show that the gadid AFGP evolutionary process likely represents a rare example of the proto-ORF model of de novo gene birth (6, 17) where the noncoding founder ORF existed well before the novel gene arose.

Results and Discussion

At a minimum, de novo formation of a functional protein gene requires the acquisition of an ORF encoding the new protein and the basic cis-regulatory elements to activate its transcription and translation. Because AFGPs are secreted plasma proteins, a signal peptide (SP) is also needed to direct cellular export of AFGP molecules into the blood circulation. Reconstructing the formation of the gadid AFGP gene therefore requires elucidating how these essential genic components were generated and became properly linked into a functional whole gene. We began by precisely delineating the components that make up the structure of functional AFGPs in AFGP-bearing gadids. We then juxtaposed them against the structures of the AFGP homologs in more basal non-AFGP-bearing species representing progressively more ancestral states. This enabled us to decipher the essential molecular steps and timing in the de novo formation of the AFGP gene in the gadid lineage.

Phylogenetic Context for the Selected Gadid Species. The phylogenetic tree in Fig. 1 (detailed in SI Appendix, Fig. S1) depicts the relationships of the northern cod species used in this paper. We characterized the AFGP genes or noncoding homologs of seven species (gene structures to the right of the tree, Fig. 1), which were chosen for the strategic positions they occupy in the gadid tree. The monophyletic Gadidae family [sensu (18)] includes both AFGP-bearing and non-AFGP-bearing species; the former occurs in two subclades within the subfamily Gadinae (Fig. 1). The selected AFGP-bearing Boreogadus saida (polar cod) (19) and Gadus morhua (Atlantic cod) (20) represent one gadine subclade, and Microgadus tomcod (Atlantic tomcod) (21) represents the other. The four AFGP-lacking gadids were chosen for their evolutionary distances from the AFGP bearers. Two of them are gadines; Merlangius merlangus (whiting) nests within the AFGP-bearing subclade containing B. saida and G. morhua and thus shares the last common ancestor with all AFGP-bearing species (Fig. 1, blue dot), whereas Trisopterus esmarkii (Norway pout) is basal to the two AFGP-bearing subclades. The other two, Brosme brosme (cusk) and the freshwater Lota lota (burbot) belong to the subfamily Lotinae and are basal species that serve as ancestral proxies before the AFGP trait emerged (Fig. 1).

Fig. 1. Gadid phylogeny and AFGP gene/homolog structures. The phylogenetic tree of Gadidae is a congruent cladogram derived from Bayesian and maximum likelihood trees using complete ND2 gene sequences (SI Appendix, Fig. S1). Light blue branches indicate lineages of the Gadinae subfamily. The two gadine subclades containing AFGP-bearing species (red vertical bars), their most recent common ancestor (blue dot), and the emergence of the AFGP trait are as indicated. The three AFGP-bearing species (AFGP+) and four AFGP-lacking species analyzed in this paper are shaded in blue and yellow, respectively. The structure of their AFGP gene or nongenic homolog is shown to the right. Gray and purple shaded areas indicate homologous regions. Cyan segments are sequence repeats. The dark blue segment is a repetitive AFGP cds or AFGP-like sequence.

Formation of an In-Frame SP-Coding Sequence. Because AFGPs are secreted proteins that function in extracellular fluids to arrest the growth of invading ice crystals, all functional AFGP genes have a proper SP cds (Fig. 2 and SI Appendix, Figs. S5 and S6). We examined the 5′ sequence region to determine how a SP cds was acquired. In sequence alignment, we discovered that the 5′ GCA-rich duplicate II of functional AFGPs lacks a “T” nucleotide that is present in the consensus sequence and persists in M. merlangus AFGP pseudogenes (ψ) and the T. esmarkii noncoding AFGP homolog (Fig. 3 A and C and SI Appendix, Fig. S8). Thus, this indel very likely resulted from a deletion event. This 1-nt deletion produced a 1-nt reading frameshift in the presumptive propeptide (encoded by the 5′ 27-nt duplicate pair), so that the upstream sequence that could supply a SP became linked with the downstream (Thr-Ala-Ala)n cds in a single read-through ORF. The emerging AFGP gene was thus endowed with the necessary secretory signal.
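The effect of a 1-nt deletion on reading-frame continuity can be illustrated with a minimal Python sketch. The two sequences below are invented toy examples, not the gadid sequences, and the helper name `translate_from_start` is hypothetical. In this toy case the undeleted reading happens to run into a stop codon; the text above states only that the deletion brought the two parts into one frame, so the stop codon is an assumption made to keep the example short.

```python
from itertools import product

# Standard genetic code, built in TCAG order (one-letter amino acids, * = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_from_start(dna: str) -> str:
    """Translate from the first ATG until a stop codon or the end of the sequence."""
    start = dna.find("ATG")
    if start < 0:
        return ""
    peptide = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


# Toy sequences (assumptions for illustration only).
UPSTREAM = "ATGAAGCTGGCT"          # stands in for the latent SP-coding part
REPEATS = "ACAGCAGCA" * 3          # three Thr-Ala-Ala units (ACA GCA GCA)
ANCESTRAL = UPSTREAM + "T" + "GAA" + REPEATS + "TAA"   # extra T present
DERIVED = UPSTREAM + "GAA" + REPEATS + "TAA"           # the T deleted

print("with T   :", translate_from_start(ANCESTRAL))
print("T deleted:", translate_from_start(DERIVED))
```

With the extra T, translation from the start codon never reaches the repeats in the Thr-Ala-Ala frame; after the deletion, the upstream part and the repeats are read as one continuous ORF.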

Functionalization of the Emergent AFGP Gene. Forming proper coding regions of the AFGP gene alone would not lead to a gene product unless a minimal promoter was acquired to activate transcription thereby functionalizing the gene. All extant functional AFGPs have a TATA box, the core promoter for transcriptional activation, appropriately placed at 25–30 nt upstream from the presumptive transcription start site, but it is missing in the AFGP-like sequences of the basal species (Fig. 1 and SI Appendix, Fig. S9). To uncover the origin of the promoter region, we examined the sequences of this upstream region in all sequenced species. As already noted above, all AFGP homologs share an ∼240 nt 5′ region of high sequence similarities with functional AFGPs from the Met start codon through the SP cds (gray highlight, Fig. 1 and SI Appendix, Fig. S9), but only M. merlangus shares sequence similarities further upstream with the AFGP 5′ UTR and upstream regulatory region inclusive of the TATA box (SI Appendix, Fig. S9). The counterparts of this further upstream sequence in the basal gadine T. esmarkii and the lotines B. brosme and L. lota are drastically divergent and lack promoter elements (SI Appendix, Fig. S9, nt 1–120/130), but, interestingly, they share high similarities with each other. These results suggest the following. First, the divergent upstream sequences of the AFGP homologs shared by the basal T. esmarkii, B. brosme, and L. lota indicate they occupy a homologous genomic location (the putative ancestral site) that is distinct from the location of the extant AFGP genotype in the more derived M. merlangus and species of the AFGP-bearing clade. This is supported by a complete lack of microsynteny in our comparison of the sequence contigs of the AFGP-homolog loci of B. brosme with those of B. saida AFGP loci. We found none of the predicted neighboring genes are shared between the two species (SI Appendix, Table S1). 
Second, the acquisition of the proximal 5′ promoter region and functionalization of the emerging AFGP gene occurred after the divergence of the Trisopterus lineage. Since a (Thr-Ala-Ala)n-like cds exists in T. esmarkii (SI Appendix, Fig. S7B), expansion of the Ala(GCA)-rich codons that could lead to an AFGP ORF likely began at the ancestral site before the recruitment of a cis-regulatory region. Without such a region, an emerging AFGP-like cds could not be transcribed and would remain as non-sense repetitive DNA in the Trisopterus and more basal lineages. We propose that the cis-promoter region was acquired in the most recent common ancestor of the AFGP-bearing clade through a stochastic translocation of the ancestral AFGP founder region to a new genomic site that happened to contain a TATA motif, thereby conferring transcriptional capability. Although the specific mechanism of the translocation is currently unclear, cryptic transcriptional initiation sites and regulatory signals are deemed prevalent throughout genomes, as increasing evidence suggests that large portions of genomes become transcribed at some time (25). Regarding translational activation, all examined AFGPs and homologs contain the Kozak consensus sequence ACCATGG for eukaryotic translation initiation (26) (Fig. 2 and SI Appendix, Fig. S9). Therefore, this motif likely existed in the founder genomic site and became functionalized when the promoter region was acquired.
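The two sequence checks described in this section, a TATA motif 25–30 nt upstream of the presumptive transcription start site and the Kozak sequence ACCATGG, can be sketched as simple string searches. This is a simplified illustration and not the analysis used in this paper: the function names, the bare "TATA" search string, and both toy sequences are assumptions, and real promoter identification relies on alignments and additional evidence.

```python
def find_motif(seq: str, motif: str) -> list[int]:
    """Return all 0-based start positions of motif in seq (overlaps allowed)."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + 1)
    return positions


def tata_upstream_of_tss(seq: str, tss: int, motif: str = "TATA",
                         lo: int = 25, hi: int = 30) -> bool:
    """True if the motif starts between lo and hi nt upstream of index tss."""
    return any(lo <= tss - p <= hi for p in find_motif(seq, motif))


KOZAK = "ACCATGG"  # consensus quoted in the text

# Toy sequences (invented): the same downstream region with and without
# a TATA-containing upstream region. The TSS is the "A" at index 38.
DOWNSTREAM = "A" + "CTCTC" + KOZAK + "CAGCA"
WITH_PROMOTER = "GC" * 5 + "TATAAA" + "GC" * 11 + DOWNSTREAM
WITHOUT_PROMOTER = "GC" * 5 + "GCGCGC" + "GC" * 11 + DOWNSTREAM
TSS = 38

for name, seq in [("with promoter", WITH_PROMOTER),
                  ("without promoter", WITHOUT_PROMOTER)]:
    print(name,
          "| TATA in window:", tata_upstream_of_tss(seq, TSS),
          "| Kozak present:", bool(find_motif(seq, KOZAK)))
```

The toy pair mirrors the pattern reported above: the Kozak motif is present in both sequences, whereas only one carries a TATA motif at the expected distance from the start site.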

Nongenic Origin of SP and Promoter Region. We experimentally verified that the AFGP SP cds and promoter sequence did not originate from any existing protein-coding genes in the gadid genome. We hybridized the genomic BAC library macroarrays of B. saida and M. tomcod with the AFGP 5′ probe that is specific to this region. The hybridized clones were exactly the same clones that hybridized to the (Thr-Ala-Ala)n cds probe (SI Appendix, Fig. S10). This strongly indicates that no homologs of the SP, 5′ UTR, and promoter regions of AFGP exist outside of the AFGP genomic loci. Thus, the promoter and SP cds of functional AFGP also originated de novo, unassociated with any preexisting protein gene.

Further Verifications of Nongenic Origin of AFGP. Recently evolved genes and the extant homologs of their genetic ancestor often remain as near neighbors in the genome. For example, the AFGP gene family of the Antarctic notothenioids closely clusters with its ancestral homologs: the trypsinogenlike protease genes along with the broader trypsin gene family within an ∼400 kbp region (27). In contrast, we found that none of the neighboring genes (e.g., MAK16 and RAB14) in the AFGP genomic regions of the three AFGP-bearing gadids B. saida, G. morhua, and M. tomcod shares any sequence similarity with AFGPs (SI Appendix, Fig. S3), and, thus, they are evolutionarily unrelated to AFGP. The absence of a potential protein gene ancestor nearby is consistent with gadid AFGP having evolved de novo. We further reasoned that an absence of transcription of the AFGP homologs in the AFGP-lacking gadids would provide compelling support that they are nonfunctional or nongenic DNA. Thus, we performed Northern blot hybridizations of RNA from pancreatic tissue [the site of AFGP synthesis (28)] of the four AFGP-lacking species using their respective species-specific AFGP-homolog sequence as probes and included B. saida for comparison. No transcripts of AFGP homologs were detectable in any of the four AFGP-lacking gadids. Only B. saida pancreatic RNA showed hybridization, with strong intensity to its own AFGP cds probe and with varying intensity to the AFGP homolog probes from the other species, reflecting their various degrees of nt sequence identity (SI Appendix, Fig. S11). Since L. lota, B. brosme, and T. esmarkii are basal to the AFGP-bearing clade, their AFGP homologs must represent the ancestral transcriptionally inactive noncoding form. The AFGP-lacking M. merlangus is nested within the AFGP-bearing clade, and its AFGP homolog most closely resembles a functional AFGP except for inactivating mutations in the (Thr-Ala-Ala)n cds. 
Thus, it represents a subsequent nonfunctionalization into a nontranscribed pseudogene after the emergence of AFGP in the common ancestor of the AFGP-bearing clade. The loss of function relates to the nonfreezing water (Tromsø fjord in this study) M. merlangus inhabits today where antifreeze protection is not needed.

Gadid AFGP Evolved from Entirely Nongenic DNA. Fig. 4 summarizes the foregoing deductions on the noncoding origins of the essential AFGP sequence components and the possible molecular steps in the evolutionary transformation of these components into a complete new functional AFGP. The AFGP founder structure (Fig. 4A) existed in the gadid ancestor as a short noncoding genomic sequence comprising a segment (∼240 nt) with latent-coding exons (bronze segments) that have the potential to form a peptide sequence with properties for a secretory signal. The adjoining 27-nt GCA(Ala)-rich sequence (cyan segment) contained multiple nested 9-nt elements, any of which could become the three codons for the AFGP tripeptide (Thr-Ala-Ala) building block through a 1-nt substitution. Chance duplications of this ancestral 27-nt GCA-rich sequence produced four tandem copies (Fig. 4B). One of the 9-nt AFGP tripeptide-coding elements in the midst of the four copies likely underwent microsatellitelike duplications, producing a budding ORF for the repetitive AFGP tripeptide cds, which began spreading the two pairs of 27-nt GCA-rich duplicates apart to the flanking positions (Fig. 4C). A putative translocation event in the last common ancestor of AFGP-bearing gadids moved the hitherto unexpressed AFGP precursor to a new genomic location that fortuitously contained a TATA motif, thereby enabling transcription (Fig. 4D). Concurrently or subsequently, a 1-nt frameshift deletion in the second 5′ 27-nt duplicate likely occurred and served to link the latent cds for the SP and the downstream AFGP (Thr-Ala-Ala)n repeats in a single read-through ORF. Expression and secretion of the nascent antifreeze protein became possible (Fig. 4E). The smallest (and often the most abundant) functional AFGP isoform (AFGP8) comprises only four tripeptide repeats (10), which could be achieved from a single tripeptide-coding unit through only two tandem duplications (one unit to two, then two to four). 
The fledgling antifreeze protection could, therefore, augment fitness in the individual at the onset of northern hemisphere marine glaciation. Subsequent intensification of environmental selection pressures likely drove the intragenic (Thr-Ala-Ala)n cds expansion forming large AFGP polyprotein genes (Fig. 4F) as well as additional whole gene duplications. The result manifests in the multigene family of AFGP polyproteins (SI Appendix, Fig. S3) and the robust antifreeze activities the AFGP-bearing gadids possess today (10, 13).

Fig. 4. Evolutionary mechanism of the gadid AFGP gene from noncoding DNA. The color codes of the sequence components follow Fig. 1. (A) The ancestral noncoding DNA contained latent signal peptide-coding exons with a 5′ Kozak motif, adjacent to a duplication-prone 27-nt GCA-rich sequence. (B) The 27-nt GCA(Ala)-rich sequence duplicated, forming four tandem copies. (C) A 9-nt element in the midst of the four 27-nt duplicates became the three codons for one AFGP Thr-Ala-Ala unit and underwent microsatellitelike duplication, forming a proto-ORF. (D) A proximal upstream regulatory region was acquired through a putative translocation event. (E) A 1-nt frameshift led to a contiguous SP, a propeptide, and a Thr-Ala-Ala-like cds in a read-through ORF. (F) Intragenic (Thr-Ala-Ala)n cds amplification, fulfilling the antifreeze function under natural selection.
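Two steps of the model summarized above, a 1-nt substitution that turns a GCA-rich 9-nt element into Thr-Ala-Ala codons and the growth from one unit to four by two tandem duplications, can be illustrated with a minimal Python sketch. The 9-nt precursor GCAGCAGCA and the helper names are assumptions chosen for illustration; the actual ancestral element is given in the figures of this paper, not here. The sketch relies only on the standard genetic code, in which GCA encodes Ala and ACA encodes Thr.

```python
# Only the two codons needed for this illustration (standard genetic code).
CODON_TO_AA = {"ACA": "Thr", "GCA": "Ala"}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def translate(seq: str) -> str:
    """Translate an in-frame sequence built only from ACA and GCA codons."""
    return "-".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def tandem_duplicate(array: str) -> str:
    """Duplicate the whole array in tandem, doubling the number of units."""
    return array + array


PRECURSOR = "GCAGCAGCA"   # hypothetical GCA-rich 9-nt element (Ala-Ala-Ala)
UNIT = "ACAGCAGCA"        # one Thr-Ala-Ala unit after a G-to-A substitution

print("substitutions needed:", hamming(PRECURSOR, UNIT))
print("before:", translate(PRECURSOR), "| after:", translate(UNIT))

cds = UNIT
for _ in range(2):        # two tandem duplications: 1 -> 2 -> 4 units
    cds = tandem_duplicate(cds)
print("units:", len(cds) // len(UNIT), "|", translate(cds))
```

Doubling is only one possible duplication scheme; microsatellitelike slippage could also add units one at a time, in which case reaching four units would take more events.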