Discovery of FLERVs in amphibian and fish genomes

By using tBLASTn and the reverse transcriptase (RT) protein of CoeEFV as a screening probe, 1,752 RT sequences were retrieved from publically accessible nucleotide GenBank databases, including the database of GenBank non-redundant nucleotide sequences, expressed sequence tags, high throughput genomic sequences, whole genome shotgun sequences and transcriptome shotgun assembly sequences. Together with 304 additional RT sequences from seven retroviral genera, phylogenetic analyses showed that 161 sequences from 28 distinct species (4 salamanders, 1 frog, 20 ray-finned fish, 2 lobe-finned fish and 1 shark) cluster together with the known mammalian FVs and fish ERVs previously identified as FV-like. We therefore only focused on these 161 retroviruses, and defined them as FLERVs.

To identify potentially full-length FLERVs, we extended the sequences on both sides and BLASTed the sequences against themselves to search for long terminal repeats (LTRs)—the key characteristic feature of ERVs that defines the boundary of the element. By doing so, we discovered one potentially full-length novel FLERV in the eastern newt (Notophthalmus viridescens) and four in the midas cichlid (Amphilophus citrinellus) genomes. We designated them ‘NviFLERV’ and ‘AciFLERV’, respectively. Furthermore, upon the ERV insertion, the target site DNA is also duplicated, resulting in small target site duplications (TSDs) flanking the ERVs19. We were able to identify NviFERLV’s TSDs as well as those of the four AciFLERVs (Supplementary Note 1), supporting that they are bona fide ERVs, and are not recombinants of multiple elements. Furthermore, we also found five potentially full-length FLERVs in the zebrafish (Danio rerio) and three in the West Indian Ocean coelacanth (Latimeria chalumnae) genomes. These ERVs have already been described as DrFV-1 (ref. 14) and CoeEFV12, respectively.

To search for additional FLERVs, we screened the five publically accessible nucleotide databases again by using the RT protein sequences of SloEFV, NviFLERV and AciFLERV. Thirty-two RT sequences from four additional ray-finned fish species were identified as FV-like, phylogenetically grouping with mammalian FVs and fish FLERVs. By expanding the sequences and based on the identification of their LTRs and TSDs, we discovered two more potentially full-length FLERVs in annual killifish (Austrofundulus limnaeus), and designated them ‘AliFLERV’. We were able to identify TSDs of only one AliFLERV however, as the contig that harbours the other AliFLERV does not contain complete LTRs (Supplementary Note 1). In total, 193 sequences from 32 unique vertebrate species were identified as FLERVs (Supplementary Table 1), and their phylogeny together with other retroviruses is shown in Fig. 1. Our analyses show that they are indeed most closely related to the known FVs and FLERVs with strong support (bootstrap clade support=95% and Bayesian posterior probability=1.00). Detailed phylogenetic relationships among FVs and FLERVs, inferred from RT sequences are shown in Supplementary Fig. 1.

Figure 1: Retroviral phylogeny illustrating how FVs and FLERVs relate to other retroviruses. The un-rooted phylogenies were estimated from a reverse transcriptase protein alignment. Since both Bayesian maximum clade credibility and maximum-likelihood trees have very similar topologies, only the Bayesian tree is shown. The scale bar (black solid line; bottom right corner) represents genetic divergence of length 0.2 in units of amino acid substitutions per site. The numbers on the nodes are bootstrap (first) and Bayesian posterior probability (second) clade support values. Full size image

Our RT phylogeny strongly supports monophyletic clades of salamander, lobe-finned fish and ray-finned fish FLERVs (bootstrap clade support >84% and Bayesian posterior probability=1.00 for all clades). The clade of mammalian FVs was strongly supported by the Bayesian phylogenetic analysis (Bayesian posterior probability=0.99), but not by the maximum-likelihood method (bootstrap clade support=57%). On the other hand, we found that the frog FLERV clusters together with shark FLERVs (bootstrap clade support=74% and Bayesian posterior probability=1.00), and the shark FLERVs appear to form two separate clades; however, the latter phylogenetic pattern is not strongly supported (bootstrap clade support=38% and Bayesian posterior probability=0.73). We also observed that mammalian FVs, salamander FLERVs and lobe-finned fish FLERVs cluster together (bootstrap clade support=90% and Bayesian posterior probability=1.00), and ray-finned fish and shark FLERVs are basal to this clade, reflecting the host history. The phylogeny also shows that mammalian FVs are more closely related to salamander FLERVs than lobe-finned fish FLERVs, mirroring the host evolutionary relationship and thus consistent with a long-term co-speciation history between this retroviral lineage with their tetrapod hosts. However, again, the support for this relationship is low (bootstrap clade support=49% and Bayesian posterior probability=0.88), likely because the RT sequences used to reconstruct the tree were short (130 aa), and thus, the topology of this tree should not be over-interpreted. Indeed, as noted by others5,20, RT trees are suitable only for broad classification of viruses. Below, we reconstructed a phylogeny of longer Pol protein sequences to determine the relationship among FVs and FLERVs more precisely (see ‘Results’ section: Phylogenetic analyses). Interestingly, our RT phylogenetic analysis shows that some fish and amphibian genomes harbour two FLERV lineages, including the genomes of Japanese fire belly newt (Cynops pyrrhogaster), fathead minnow (Pimephales promelas), Atlantic cod (Gadus morhua), midas cichlid (Amphilophus citrinellus) and Amazon molly (Poecilia Formosa), as well as West Indian Ocean coelacanth (Latimeria chalumnae) (Supplementary Fig. 1). To our knowledge, no known mammals have been shown to harbour two distinct lineages of FVs.

Characterization of full-length FLERVs

To characterize our novel amphibian and fish FLERVs, we first focused on the potentially full-length FLERVs, NviFLERV, AciFLERV and AliFLERV, and used mammalian FVs as the main reference for comparison (Fig. 2; see detailed genome annotation in Supplementary Fig. 2 and Supplementary Note 1). We did not characterize DrFV-1 and CoeEFV as they have been previously described12,14. We found that many of the potentially full-length NviFLERV, AciFLERV and AliFLERV contain large insertion/deletion mutations, in-frame stop codon and frameshift mutations as well as transposable elements (TEs), making genome annotation difficult and potentially inaccurate. To address this problem, we used the full-length elements to retrieve additional (fragmented) elements from their respective genomes using BLASTn, and reconstructed the maximum-likelihood sequence of the basal node on the phylogeny. This inferred ancestral sequence was used for genome annotation. Unfortunately, only two NviFLERV elements were found in N. viridescens, designated ‘NviFLERV-1’ and ‘NviFLERV-2’, and thus its ancestral sequence could not be reconstructed reliably.

Figure 2: Genomic organizations of FVs and FLERVs. NviFLERV-1 and NviFLERV-2 are endogenous viral elements present within the genome of N. viridescens. This is in contrast to AciFLERV and AliFLERV, which were annotated by using their maximum-likelihood ancestral sequences, reconstructed from 9 and 23 elements, respectively. Under each element are the distributions of stop codons (TAA, TAG and TGA; pink) and start codon (ATG; blue) in frame +1 (top), +2 (middle) and +3 (bottom), used to determine potential protein coding regions (blue thin vertical dashed lines). ‘t’-subscription indicates that the domain is truncated (5′-truncated: preceding the domain name; 3′-truncated: following the domain name; blue thick vertical dashed lines indicate where domains are truncated). Prototype foamy virus (PFV) was included as a reference. The scale bar (black solid line; top right corner) represents a nucleotide length of 1 kilobase. Full size image

NviFLERV-1 is a complete full-length FLERV (9,200 nt), containing putative gag, pol and env genes that are similar to those of simian FVs (SFVs), flanked by 5′- and 3′-LTRs. Situated on the 3′ end of the 5′-LTR is a tRNAAsn-utilising putative primer binding site (PBS), identified via sequence similarity, similar to that of prosimian galago FV6, but different from those of other mammalian FVs, which have tRNALys-utilising PBSs. Several in-frame stop codons and frameshift mutations were found in protein coding regions, indicating that it is defective, typical for an ERV. Between the env and 3′-LTR is a stretch of uniquely-mapping 492 nt, 164 aa, that does not exhibit similarity to any genes apart from a gene of Porcine reproductive and respiratory syndrome virus (BLASTx: AEQ61854; E=6 × 10−4). This relatively high E-value of 6 × 10−4 makes it unlikely that they are homologues; nevertheless since mammalian FVs contain accessory genes in this region (bel1 and bel2), and because it exhibits some similarity to a viral gene, we hypothesize that this nucleotide region might be an accessory gene of the progenitor of NviFLERV-1. Indeed, further analyses revealed that other salamander FLERVs also contain this gene (Table 1, see below), supporting that it has a viral origin and is not a host gene. We in turn annotated the gene as an ‘acc’ gene.

Table 1 Genomic structure of amphibian and fish FLERVs. Full size table

The fact that NviFLERV-1 has paired 5′- and 3′-LTRs enables us to estimate its age. Due to the process of retroviral genomic integration, the paired LTRs are at first identical, but after becoming endogenous, the two gradually diverge from one another3. Assuming that LTRs evolve neutrally after endogenisation, it is possible to calculate the integration date of the element, ie, its age, from the genetic distance of the two paired LTRs. The synonymous substitution rate of amphibian evolution was estimated to be ∼0.924–1.53 × 10–9 substitutions per site per year21. Since synonymous substitutions are largely neutral, we considered it to be a reasonable approximation for a neutral rate of evolution of amphibians. We estimated the pairwise distance of the paired LTRs to be 4.4%. Given this estimated LTR pairwise distance and the synonymous substitution rate, we estimated NviFLERV-1 to be ∼14 Myr old. The fact that we could identify the TSDs of NviFLERV-1 indicates that NviFLERV-1 is likely a bona fide ERV and its ancient age estimate is not an artefact of recombination between multiple ERVs.

NviFLERV-2 is a truncated element (4,391 nt), containing a 5′-truncated 5′ LTR, a complete gag gene, and a partial pol gene. A tRNAAsn-utilising PBS similar to that of NviFLERV-1, and prosimian galago FV, was also identified by via sequence similarity. In contrast to NviFLERV-1, NviFLERV-2 gag and pol genes do not contain in-frame stop codon or frameshift mutations. This raises two possibilities: either (i) that NviFLERV-2 is so young that it has not yet gained any in-frame stop codon or frameshift mutations or (ii) that it is a result of viral genomic contamination during the eastern newt genomic sequencing. To evaluate these possibilities, we screened short read archives for reads that span across the viral–host junction. Numerous reads mapping the junction were found, supporting that NviFLERV-2 is indeed a (very young) ERV, and not a result of viral genomic contamination.

Unlike NviFLERVs, we found 9 elements of AciFLERVs and 23 elements of AliFLERVs, allowing us to reconstruct their ancestral sequence for better genomic annotation. The original elements contain in-frame stop codon and frameshift mutations as well as TEs and large insertion/deletion mutations, confirming that they are indeed ERVs. The inferred ancestral sequence of AciFLERV and AliFLERV are 17,409 and 17,490 nt long, respectively, much longer than the length of typical mammalian FV genomes, ∼10 kb. It is noteworthy that these lengths are not inflated by TEs and/or large insertions as they were removed during the reconstruction of ancestral sequences. By investigating the distribution of start and stop codons, we determined AciFLERV to have 8 open reading frames (ORFs; ORF-1 to -8, from the 5′ end of the genome) flanked by 5′- and 3′ LTRs. Again, this is very different from mammalian FVs, which has only 5 ORFs, including (from the 5′ end) gag, pol and env genes followed by two accessory genes bel1 and bel2.

On the basis of sequence similarity and manual sequence inspection, we identified a tRNALys-utilising PBS on the 3′ end of the 5′-LTR, similar to those utilized by most mammalian FVs, and determined that ORF-2, and ORF-5 are pol and env genes, respectively. We in turn annotated ORF-1 as gag gene, and ORF-6, −7 and −8 as accessory genes based on our knowledge of retroviral genomic structure. Further inspection revealed that ORF-6 and −7 proteins are highly similar (BLASTp: E=1 × 10−26), indicative of paralogy. We hence annotated ORF-6 (759 nt, 253 aa), −7 (753 nt, 251 aa) and −8 (702 nt, 234 aa) as an ‘acc1a’, an ‘acc1b’ and an ‘acc2’ gene, respectively. Acc1a and acc1b are found in the same position as bel1, while acc2 in the same position as bel2. The hypothetical proteins of these accessory genes, however, do not exhibit similarity to any known proteins, and/or contain any known conserved domains. Further studies of their functions and structures are required to shed more light on these genes. Similarly, ORF-3 (1,395 nt, 465 aa) and −4 (1,002 nt, 334 aa) proteins, which are much longer than those of ORF-6, −7 and −8, also do not contain any known conserved domains; therefore, it is unclear what they are. Nevertheless, it is known that some retroviruses possess accessory genes between pol and env. For example, the lentiviral human immunodeficiency virus contains accessory genes in this region, vif, vpr and vpu, which are essential for viral replication, assembly and release22. We thus hypothesize that ORF-3 and −4 are viral accessory genes, and designated them ‘acc3’ and ‘acc4’, respectively. Again, our analyses revealed that other ray-finned fish FLERVs also contain these genes (Table 1, see below), supporting that they are indeed accessory genes of these FV-like viruses.

These gene annotations were then transferred to AliFLERV via protein sequence similarity. By examining the distribution of start and stop codons, we determined AliFLERV to have 6 ORFs (ORF-1 to -6) flanked by 5′- and 3′ LTRs. Our analyses suggest that ORF-1 to -6 are gag, pol, acc3, env, acc1 and acc2 genes, respectively. A tRNALys-utilising PBS was also identified after the 5′-LTR via sequence similarity as anticipated, similar to those of AciFLERV and most mammalian FVs. Detailed genomic annotations, such as PBS and TSD sequences, gene lengths and locations of the genes on the contigs, are in Supplementary Note 1.

To estimate the age of AciFLERV and AliFLERV, we used the LTR-dating method as described above. Four pairs of AciFLERV paired LTRs were found, and their pairwise distances were estimated to be between 0.2 and 1.1%. Likewise, we obtained two pairs of AliFLERV paired LTRs, and estimated their pairwise distance to be 3.1 and 4.9%. The neutral rate of fish genomic evolution has been reported to be ∼1.46 × 10−8 substitutions per site per year23. Given the estimated pairwise distances and the neutral rate of fish evolution, we inferred the oldest element of AciFLERV and AliFLERV to be 0.377 and 1.68 Myr old, respectively, which are surprisingly young given the presence of TEs within their protein coding regions and their high copy numbers (at least 9 for AciFLERV and 23 for AliFLERV). This finding indicates that these FLERVs may still be active, and/or that they have been rapidly proliferating by helper infectious retroviruses via complementation in trans. Alternatively, it also could be that these groups of young FLERVs represent concomitant germ-line infections of closely related viruses. These young ages, however, should be interpreted with caution, as LTR-dating could severely underestimate the ages of ERVs if LTR gene conversion happens, reducing the divergence between paired LTRs24. Nevertheless, this is unlikely as LTR gene conversion had to happen for all six elements to explain the observation. Further analyses revealed that the TEs within these FLERVs (none of which share the same genomic location) belong to TE groups that contain numerous highly similar elements, a strong indication of recent TE bursts. This observation might explain the presence of TEs in AciFLERV and AliFLERV despite their young age.

Characterization of fragmented FLERVs

We used the annotated genomes of NviFLERV, AciFLERV and AliFLERV, as well as those of CoeEFV and DrFV-1, to further annotate the genomic structure of other fragmented FLERVs in other vertebrate genomes via protein sequence similarity (Table 1). We were able to detect the presence of NviFLERV-like gag, pol, env and acc genes in the genomes of the Japanese fire belly newt (Cynops pyrrhogaster) and Iberian ribbed newt (Pleurodeles waltl). Phylogenetic analyses showed that Gag, Env and Acc proteins of C. pyrrhogaster FLERVs (CpyFLERVs) form two separate clades (Supplementary Fig. 3), confirming the existence of two lineages of CpyFLERVs as initially suggested by the RT phylogenetic analyses. The presence of NviFLERV-like gag, pol and env genes were also detected in the genome of the Ezo salamander (Hynobius retardatus), but not the NviFLERV-like acc gene. This could be due to the lack of sequence data, genomic deletion, and/or high degree of sequence divergence. Our analyses also detected a NviFLERV-like gag gene in the genome of the smooth newt (Lissotriton vulgaris), a salamander that was not initially listed as a species containing FV-like RT sequences (Supplementary Table 1). Again, the absence of sequence data and/or genomic deletion might explain this result. Lastly, we could not detect any genes other than pol in the western clawed frog (Xenopus tropicalis).

Regarding ray-finned fish FLERVs, while our analyses detected AciFLERV/AliFLERV-like pol genes in all of the FLERVs (pol: 27/27), AciFLERV/AliFLERV-like gag, env and acc1 genes could only be detected in about half of the FLERVs (gag: 12/27; env: 12/27, acc1: 12/27), and acc2 genes in about a fifth of the FLERVs (5/27). (Note that when there are multiple FLERV lineages present in a single genome, the genomic structure of each lineage was annotated separately by keeping track of which contigs the genes were found in.) Interestingly, in addition to the four fish species, our analyses revealed that sablefish (Anoplopoma fimbria) also contain two lineages of FLERVs; one with detectable gag, pol, acc3, env and acc2 genes, and another with only a detectable pol gene. Sablefish was not listed as a species containing two FLERV lineages in the initial RT analyses as the RT coding region for the former lineage was absent from the database. Another interesting observation was that, when the acc1 gene was detected, only a single copy was found in all cases, except in AciFLERV. This suggests that the acc1 duplication in the AciFLERV progenitor occurred very recently, and is lineage-specific. Our analyses confirmed the presence of acc3 and acc4 in 14 and 10 ray-finned fish FLERVs, respectively. The phylogenetic distribution of acc3 and acc4 was further examined to shed more light on their evolutionary history (see below). Finally, our analyses detected DrFV-like gag, pol and env genes in the genome of the fathead minnow (P. promelas), and CoeEFV-like gag, pol and env genes in the Indonesian coelacanth (Latimeria menadoensis), but could not detect any genes other than pol in Australian ghostshark (Callorhinchus milii). Again, we note that the absence of evidence is not the evidence of absence; the fact that we could not detect some genes in several fish FLERVs could be due to the lack of sequence data, genomic deletion, genetic divergence or indeed the genuine absence of the genes. To distinguish these possibilities requires genomic examination of their exogenous viral counterparts.

Phylogenetic analyses

To investigate the phylogenetic relationship between FLERVs and FVs in more detail, we estimated their Bayesian phylogeny based on a Pol protein alignment (Fig. 3). A. fimbria FLERV (AfiFLERV) that contains detectable gag, pol, acc3, env and acc2 genes was not included in this analysis because its Pol sequence was extremely short (42 aa after alignment curation). Also, note that the phylogenetic placement of L. vulgaris FLERV (LvuFLERV) was determined separately by using a phylogenetic analysis of Gag proteins, as its Pol protein is not available.

Figure 3: Coevolution of FVs and FLERVs and their vertebrate hosts. A Bayesian phylogeny of FVs and FLERVs (left) is compared with the published vertebrate host cladogram35,36,37,38 (right). Preceding viral names are the contig accession numbers containing viral sequences. Class III retroviruses were used to root the viral tree. Solid grey lines between the two trees indicate viral–host associations. The scale bar (black solid line; underneath the virus phylogeny) represents genetic divergence of length 0.3 in units of amino acid substitutions per site, and Arabic numerals on nodes are Bayesian posterior probability node support values. Roman numerals indicate the nodes of which the total per-lineage substitutions to the chimpanzee simian FV (U14327 SFVcpz) were used to construct the time-dependent rate phenomenon model (Fig. 4). The presence of acc3 (orange squares) and acc4 (red squares) is mapped onto the viral phylogeny. The most parsimonious timing of acc3 and acc4 acquisition is indicated (orange and pink arrows, respectively). Full size image

Overall, the results from phylogenetic analyses of RT and Pol protein sequences are largely consistent. We found that, as shown by the RT phylogenetic analysis, ray-finned fish, lobe-finned fish and salamander FLERVs, as well as mammalian FVs all form well-supported monophyletic clades (Bayesian posterior probability >0.99 for all clades), with tetrapod and lobe-finned fish FVs/FLERVs grouping together to the exclusion of ray-finned fish and shark FLERVs (Bayesian posterior probabilities=0.98). Unlike the RT phylogeny however, the Pol phylogeny shows that shark FLERVs are paraphyletic instead of forming two separate clades; nevertheless, this phylogenetic pattern is not well-supported (Bayesian posterior probability=0.72), similar to that in the RT phylogeny. Thus, it is still unclear how the progenitor of shark FLERVs interacted with their hosts.

As previously reported5,6,25, we found strong evidence supporting the co-speciation history of mammals and their FVs (maximum number of co-speciation events inferred=13/15 (86.67%); random tip mapping: N=500, P<0.002). Unlike in the case of mammals, several cross-species transmissions were found among ray-finned fish (Fig. 3). Despite this observation however, our analyses still showed that ray-finned fish FLERVs co-diverge broadly with their hosts with strong support (maximum number of co-speciation events inferred=18/27 (66.67%); random tip mapping: N=500, P<0.002). In the case of salamanders, no significant evidence was found for a history of co-speciation with their FV-like viruses (maximum number of co-speciation events inferred=4/5 (80%); random tip mapping: N=500, P=0.094).

We also mapped the presence of acc3 and acc4 onto the phylogeny to determine the history of their acquisition. We found that they are exclusively limited to just one clade of ray-finned fish FLERVs (Fig. 3), suggesting that they were acquired only once and perhaps both at the same time. This result further supports the phylogenetic distinctiveness of the two lineages of ray-finned fish FLERVs. Nonetheless, we noted that acc3 and acc4 genes could not be detected in some of the FLERVs in this clade. This could be due to the lack of sequence data or indeed multiple independent gene deletions as indicated by the non-monophyly of the absence of acc3 and acc4 genes (Fig. 3).

Regarding the deeper evolutionary history, our analyses show that mammalian FVs are most closely related to lobe-finned fish FLERVs (posterior probability >0.99) and basal to this clade are salamander FLERVs (Bayesian posterior probability >0.99). This branching pattern, however, does not match that of the hosts, where mammals are more closely related to amphibians than lobe-finned fish. We note that this topology was not supported by the phylogeny estimated from a much shorter RT alignment (130 aa), which shows a sister-taxon relationship between mammalian FVs and salamander FLERVs (Fig. 1), but with no statistical support. Indeed, when the tree estimation uncertainty is taken into account, this phylogenetic pattern falls within the RT tree estimation uncertainty. Combined, our results suggest that there were likely one or more ancient transmissions of FV-like viruses between tetrapods. We also found that, as shown by the RT phylogenetic analysis, instead of clustering with salamander FLERVs, the frog FLERV (XtrFLERV) was inferred to cluster with shark FLERVs (CmiFLERVs) with strong support (Bayesian posterior probability >0.99), again indicative of viral cross-class transmission. Nevertheless, interpretation of these findings could be complicated by incomplete lineage sorting. Incomplete lineage sorting is a phenomenon where multiple lineages of viruses continue to exist after the host basal diversification, but are sorted into different host lineages in the subsequent host divergence, and this could give raise to mismatches in virus/host evolutionary histories despite virus-host co-speciation. Temporal evidence can be used to differentiate between this phenomenon and viral cross-class transmissions (see ‘Discussion’ section). Another possibility is that these mismatches in virus/host evolutionary histories are a result of neutral genetic changes accumulating within FLERVs. However, a study has shown that neutral genetic changes only increase the branch length and decrease the clade support without altering the tree topology6; thus it is unlikely that this is the case. It may nonetheless explain the extremely long branch of XtrFLERV (Fig. 3, left).

Evolutionary timescale of FLERVs

Studies have shown that the rate estimates of mammalian FV evolution are time dependent16,18, and that this time-dependent rate phenomenon (TDRP) can be empirically described well by a power-law decay function16. In fact, it has been demonstrated that virtually all viruses exhibit this evolutionary feature, and that the TDRP pattern is extremely stable across a very large timescale, spanning nine orders of magnitude, and across a wide range of host organisms from plants to animals15. A consequence of this phenomenon is that the relationship between total per-lineage substitutions (S) and evolutionary timescales (T) will also be a power-law16. To estimate the timescale of the FLERVs, we first built a TDRP model describing the relationship between S and T estimates of mammalian FVs by tracing the chimpanzee SFV lineage backward in time (U14327 SFVcpz; Fig. 3). This was achievable due to the remarkably stable FV-mammal co-speciation history, allowing us to directly infer viral evolutionary timescales from those of mammals, and thereby obtaining a set of corresponding S and T estimates for model construction16 (Table 2). We then extrapolated the model to estimate the timescale of FLERVs from their S estimates, under an explicit assumption that the TDRP dynamics of mammalian FVs are the same as those infecting their ancient ancestral (perhaps marine) vertebrates (Fig. 4 and Table 2).

Table 2 Evolutionary timescales of FVs and FLERVs. Full size table

Figure 4: Evolutionary timescale of FLERVs estimated by using the power-law rate-decay model. The total per-lineage amino acid substitutions (S) from various nodes to the chimpanzee simian FV (U14327 SFVcpz) are plotted against corresponding evolutionary timescales (T). The S and T estimates are labelled with Roman numerals (I–XI), referring to nodes in Fig. 3. Solid black dots are median estimates, and the associated 95% HPDs intervals are indicated by error bars. The T estimates of node I–IX were directly inferred from those of their hosts35,39,40. 7,500 power-law TDRP models were fitted to the posterior distributions of the S and T estimates of the nodes I–IX (grey lines). The median model parameter values, adjusted R2 scores, and corresponding 95% HPDs (in the parentheses) are shown in the bottom right. The model was extrapolated to infer the branching date of lobe-finned fish FLERV lineage (node X), the separation date of salamander FLERV lineage (node XI) and the age of the entire FV/FLERV clade (node XII). See Table 2 for the values. Full size image

Our analyses showed that the model can describe the dynamics of FV substitutions very well (adjusted R2=0.954, 95 percent highest probability density interval (95% HPD)=0.912–0.987). By assuming that viruses infecting modern-day mammals and their ancient vertebrate ancestors share the same TDRP dynamics, and based on the posterior distribution of the Bayesian Pol phylogeny, the clade of mammalian FVs and lobe-finned fish FLERVs was estimated to be ∼263 (95% HPD=195–342) Myr old. The separation of the salamander FLERV lineage was inferred to have happened ∼348 (95% HPD=251–478) Ma. The age of the entire clade of FVs/FLERVs was estimated to be ∼455 (95% HPD=304–684) Myr old.

Since our tree contains both extant retroviruses (which evolve under pure viral evolutionary rates) and ERVs (which have mixed rates of host and virus evolution) (Fig. 3), we note that some branches in our tree have mixed rates of virus and host evolution, and because our model was built based on evolutionary dynamics of extant viruses, these mixed rates of ERV evolution could potentially bias our date estimates. Nevertheless, since we did not impose a molecular clock onto the tree (see ‘Methods’ section), each branch can therefore vary their length and height independently of time, allowing them to have different rates. Subsequently, those mixed rates of evolution will tend to limit to terminal branches that lead to ERVs. We avoided these branches with mixed rates of ERV evolution in our date estimation by measuring the node heights from SFVcpz tip down the tree, assuming that internal branches reflect only exogenous viral evolution and do not contain any evolutionary periods in which viruses might have been infectious ERVs.

As an independent verification that our date estimates are not biased by the mixed rates of ERV evolution, we performed phylogenetic analyses and date estimation by using FLERV Pol ancestral sequences, from which the host evolution has been removed. Indeed, we obtained similar results, estimating the divergence date of the lobe-finned fish FLERV lineage to be ∼267 (95% HPD=150–452) Ma, the separation date of the salamander FLERV lineage to be ∼342 (95% HPD=157–603) Ma and the overall age of FVs/FLERVs to be ∼473 (95% HPD=225–897) Myr old (Supplementary Fig. 4). Overall, these similarities in the date estimates suggest that we have successfully avoided the problem of mixed rates of ERV evolution in terminal branches. We note that the date estimate uncertainties are significantly larger than those of our initial date estimates. This is to be expected however, as there were fewer data points available for the TDRP model calibration after the exclusion of endogenous FVs for which ancestral sequences could not be reconstructed due to the lack of data (nine data points in the initial analyses, Fig. 4, compared to six data points in this analysis, Supplementary Fig. 4). In addition to this, it is likely that the ancestral sequence reconstruction may have also introduced some extra uncertainty into the genetic divergence estimation, contributing further to the increase in the date estimate uncertainties.