Distribution and abundance of TEs across species

Our findings show that there are two phases in HT: effective insertion of the TE, followed by expansion throughout the genome. Figure 1 shows that both BovB and L1 elements are present in a diverse array of species including mammals, reptiles, fish, amphibians, arthropods and primitive species such as sea urchins and sea squirts. Both retrotransposons also have a patchy distribution across our sampled eukaryotes. The main difference between BovB and L1 lies in the number of colonised species. BovBs are only present in 72 of the 759 species analysed, strictly within animals, so it is easy to trace their HT between the distinct clades (e.g. squamates, ruminants). In contrast, L1s encompass a total of 559 species, including plants, animals and several fungal species, and they are ubiquitous across the well-studied therian mammals. The only species which appear to have BovBs yet lack L1s are the two monotremes, platypus and echidna.

Fig. 1 Presence and coverage of L1 and BovB elements across eukaryotes. The Tree of Life [59] was used to infer a tree of the 759 species used in this study; iTOL [58] was used to generate the bar graph and final graphic. The black arrow marks the proposed L1 HT event into therian mammals 160–191 MYA. Branches are coloured to indicate which species have both BovB and L1 (green), only BovB (orange), only L1 (blue) or neither (black). Bar graph colours correspond to BovB (orange) and L1 (blue). Connections indicate possible HT events involving BovB (yellow) or L1 (red) elements. An interactive and downloadable version of this figure is available at: http://itol.embl.de/shared/atma Full size image

The abundance of TEs differs greatly between species. As shown in Fig. 1, mammalian genomes are incredibly susceptible to BovB and L1 expansion. More than 17% of the cow genome comprises these TEs (12% BovB, 5% L1; see Additional file 1: Table S4). This is without considering the contribution of TE fragments [18] or derived short interspersed elements (SINEs), boosting retrotransposon coverage to > 50% in some mammals [1]. Even within mammals, there are noticeable differences in copy number; for example, bats and equids have a very low number of full-length BovBs (< 50 per genome) compared to the thousands found in ruminants and Afrotherian mammals. The low copy number here is TE-specific rather than species-specific; there are many L1s in bats and equids. Hence, the rate of TE propagation is determined both by the genome environment (e.g. mammal versus non-mammal) and the type of retrotransposon (e.g. BovB vs L1).

Widespread HT of BovB in animals

To develop a method for identifying HT events, we used BovB, a TE known to undergo HT. First, we generated a representative BovB phylogeny using consensus and centroid approaches (see ‘Methods’ for details). Figure 2a shows the centroid BovB tree, where the centroid for each species was the longest intact BovB sequence. The phylogeny supports previous results [8]—with the topology noticeably different from the tree of life (Fig. 1)— although we were able to refine our estimates for the times of insertion. For example, the cluster of equids includes the white rhinoceros, Ceratotherium simum, suggesting that BovBs were introduced into the most recent common ancestor before these species diverged. The low copy number in equids and rhinoceros, observed in Fig. 1, is not because of a recent insertion event; the most likely explanation is that the donor BovB inserted into an ancestral genome, was briefly active, but lost its ability to retrotranspose and was subsequently vertically transmitted.

Fig. 2 HT of BovB retrotransposons. a Representative BovB tree inferred using nucleotide centroid sequences, where the centroid was the longest intact BovB sequence for each species (min cut-off length 2 kb, max cut-off length 4 kb for species with chimeric BovBs). MUSCLE [50] was used to align sequences, Gblocks [51] was used to extract conserved blocks, FastTree [52] was used to infer a maximum likelihood phylogeny. FastTree branch support values are shown and branches are coloured taxonomically. RTE sequence from Schistosoma mansoni was included as an expected outgroup. b BovB subgroups in bats and frogs. All bat, frog, horse and rhino BovB nucleotide sequences in the range of 2–4 kb were grouped into a file. As above, MUSCLE [50] was used to align sequences, Gblocks [51] was used to extract conserved blocks, FastTree [52] was used to infer a maximum likelihood phylogeny and branches are coloured taxonomically. The Xenopus laevis subgroup (left) contains sequences from Perissodactyla (horses and rhino) and Chiroptera (bats). The Xenopus tropicalis subgroup (right) forms a distinct clade, clustering with the majority of the bat BovB sequences Full size image

The placement of arthropods in the BovB tree is intriguing, revealing potential HT vectors and the origin of BovB retrotransposons. For example, the RTE-like BovBs from butterflies, moths and ants appear as sister groups to the main BovB clade. This suggests that BovB TEs may have arisen as a subclass of ancient RTEs, countering the belief that they originated in squamates [14]. Within the central clade, we see a scattering of possible vector species including a leech (Helobdella robusta), two scorpion species (Mesobuthus martensii and Centruroides exilicauda) and a locust (Locusta migratoria). But the most interesting arthropod species is Cimex lectularius, the common bed bug, known to feed on animal blood. The full-length BovB sequence from Cimex shares > 80% identity to viper and cobra BovBs; their reverse transcriptase domains share > 90% identity at the amino acid level. Together, the bed bug and leech support the idea [8, 19] that blood-sucking parasites can transfer retrotransposons between the animals they feed on.

Our mining of BovB sequences further revealed two concurrent BovB subgroups in bats and frogs. Two frog species (Xenopus laevis and Xenopus tropicalis) each contain a single intact BovB sequence > 2 kb in length (and numerous fragments), but these two sequences are very different and correlate with the two distinct BovB subgroups observed in bats (Fig. 2b). This seems to indicate at least two independent insertion events, somehow connecting Xenopus laevis with the ‘horse-like’ BovB group, and Xenopus tropicalis with the bat-specific BovB group (most similar to the BovBa-1_EF consensus from RepBase [20]). Without intermediary species, it is difficult to infer the chain of events that led to these patterns.

Finally, to exhaustively search for all cases of BovB HT, we tested several all-against-all clustering approaches to detect individual HT candidate sequences. We first replicated the method described in El Baidouri et al. [21], which uses BLAST [22] to compare all sequences within a database, and SiLiX [23] to extract discordant clusters. This worked well for recent BovB transfers (e.g. Cimex lectularius—snakes) but failed to identify ancient transfer events and required considerable computational time and power. Next, we tested VSEARCH [24], which is orders of magnitude faster than BLAST [22]. A total of 174,510 BovB sequences were clustered in < 15 min on a high-performance computing cluster with 16 cores. We clustered full-length nucleotide sequences, nucleotide sequences from just the open reading frames (ORFs) and amino acid sequences from extracted reverse-transcriptase (RT) domains (see ‘Methods’).

Many of the resulting clusters contained BovBs from closely related species, e.g. cow and yak. To find the most compelling HT events, we imposed the restriction that clusters had to contain BovBs from species that belonged to different eukaryotic Classes (e.g. Mammalia and Insecta). We performed a machina validation for each candidate HT cluster: pairwise alignments of the flanking regions to rule out possible contamination or orthologous regions; phylogenetic reconstructions to confirm discordant relationships; and reciprocal best hit checks to confirm correct clustering (see ‘Methods’). Combining both nucleotide and amino acid results, a total of 67 HT clusters were detected (visualised as connections in Fig. 1, described in detail in Additional file 1: Table S5). This includes recent transfers between ruminants and reptiles, often grouped with bed bug or locust BovBs (as shown in Fig. 2a), and older transfers between scorpions and fish, mayfly and a Myotis bat, rotifer and leech. The Pogona vitticeps lizard appears in numerous different animal groupings, suggesting a high level of active retrotransposition (Additional file 2: Figure S38) and subsequent HT.

Altogether, our results demonstrate that the HT of BovB elements is even more widespread than previously reported, providing one of the most compelling examples of non-LTR HT across animals.

Possible L1 HT in aquatic species and plants

We carried out the same exhaustive search in L1s, which presented a challenge because of greater divergence and the sheer number of vertically inherited copies. Producing a consensus for each species was impractical as most species contained a mixture of old degraded L1s and young intact L1s. Instead, we used the all-against-all clustering methods on the collated dataset of L1 nucleotide sequences > 3 kb in length (> 1 million sequences total). Once again, VSEARCH [24] was substantially faster and identified more potential HT candidates than the BLAST+SiLiX method [21,22,23]. This is likely due to a crucial difference in clustering algorithms; SiLiX uses single linkage to draw connections between sequences, which is effective for recent HT events but clusters all ‘degraded’ elements into a single group. In contrast, VSEARCH relies on centroid/average linkage, and is thus more appropriate for ancient HT events (where the centroid is ideally the transferred TE).

Over 9000 clusters contained L1s from at least two different species: these were our HT candidates. The vast majority of these clusters contained L1s from closely related species. As before, to improve recognition of HT vs vertical inheritance, we looked for families displaying cross-Class or cross-Phylum transfer. We clustered full-length nucleotide sequences (Fig. 3a), nucleotide ORFs (Fig. 3b) and amino acid RT domains (Fig. 3c). We checked for discordance compared to orthologs (Fig. 3d), absence in neighbouring species and elevated sequence identity compared to flanking regions. To confirm the ortholog trees (particularly for species with no known ortholog data), we used TimeTree [25] to estimate species divergence times and infer species relationships from previous studies and fossil records (Fig. 4). By using the procedure we had established for BovB elements (see ‘Methods’), we were able to retain 18 L1 clusters as potential HT events that span different eukaryotic Classes or Phyla (Additional file 1: Table S6). Additional clusters which looked promising but could not be confirmed due to short scaffolds in the draft assembly or lack of functional domains in the ORFs are also listed in Additional file 1: Table S6 (marked as likely contamination or likely artefacts, respectively).

Fig. 3 L1: TE trees of cross-Phylum HT clusters. a Full-length nucleotide L1s. Phylogeny of a putative cross-Phylum L1 HT event involving ancient Tx-like L1 elements in aquatic species. MUSCLE [50] was used to align the sequences, Gblocks [51] was used to extract conserved blocks from the alignment, FastTree [52] was used to infer a phylogeny and FigTree was used for tree rendering. b Nucleotide ORFs. Same as (a), but this family was found from the all-against-all clustering of ORF sequences rather than full-length L1s. c Amino acid RT domains. Same as (a, b), but these families were found from the all-against-all clustering of amino acid RT domains. d Species tree inferred from gene orthologs (specifically, P-type ATPase). Sequences were obtained from OrthoDB [60], aligned with MUSCLE and the maximum likelihood phylogeny was inferred using FastTree. Note that there was no information available for Priapulus or Mnemiopsis Full size image

Fig. 4 L1: expected species trees. TimeTree [25] using published data from thousands of studies to infer the relationships of the species involved in all six putative L1 cross-Phylum HT events. Background is coloured to match the ages in the geological timescale. Note that there was no information available for Lottia gigantea Full size image

All of the cross-Phylum clusters involve marine eukaryotes, with potential vector species such as the Pacific oyster (Crassostrea gigas), the catus worm (Priapulus catus) and a sea worm usually found in coastal mud or sand (Saccoglossus kowalevskii) (Fig. 4). Notably, all of the cross-Phylum clusters contain the diverse Tx-like L1s originally discovered in Xenopus frogs [26, 27]. Likewise with the cross-Class clusters, with the exception of one plant cluster based on RT domains (r_1111 in Additional file 1: Table S6). In contrast to BovBs, there is no strong evidence to suggest ongoing L1 HT in mammals. Relaxing our clustering criteria (e.g. to include cross-Order or cross-species HT candidates) resulted in sporadic groupings of different mammals – most likely clustered together because they all contained ‘dead’, inactive L1s.

Finally, our mining of L1 HT candidates led to the serendipitous discovery of a chimeric L1-BovB element present in cattle genomes (Bos taurus and Bos indicus), shown in Fig. 5a. This rearranged copy likely arose from a recently active L1 element (98% identical to the canonical Bos L1-BT [20]) inserting into an active BovB (97% identical to Bos BovB [20]). Ruminants are the only mammals that currently have active lineages of both BovB and L1 elements (Fig. 5b), creating the ideal genomic environment for the genesis of chimeric repetitive elements. The hybrid element contains two RT domains and high similarity to active L1/BovB elements, although there is little evidence to suggest transcription (Additional file 2: Figure S55). Nonetheless, it raises an important question: can L1 elements to be transferred throughout mammals by being transduced in other, more prolific TEs, such as BovBs?