The overlap between mimiviruses and parasitic microbes is significant

The HMMs detected a large number of domains (4,679) and a rich repertoire of FSFs (304 distinct FSFs) in the medium-to-very-large viral proteomes we sampled (Table 2). Six of these FSFs were viral-specific and had no representation in the cellular proteomes (Table 3). These viral-specific FSFs are responsible for functions unique to viruses, such as attachment to host cell receptors and DNA, inhibition of caspases to trigger anti-apoptosis, and acting as major capsid proteins (Table 3). The median proteomic coverage (i.e., proteins with FSF assignments / total number of proteins) in viral proteomes was 34%, while A. polyphaga mimivirus had the highest coverage (59%) (Table 2). A comparison with the cellular supergroups (Additional file 1: Table S1) revealed that the proteomic coverage of mimivirus was higher than the median coverage in eukaryotes (55%) but lower than in the bacterial (66%) and archaeal (61%) proteomes. The range of assignments in cells varies from 22–80% in Eukarya, 44–88% in Bacteria and 52–71% in Archaea (Additional file 1: Table S1). We note that mimiviruses overlap with cellular species that are similar in genome size and lifestyle. For example, despite considerable proteomic coverage, mimiviral assignments were restricted to only 163 distinct FSFs, a rather poor repertoire when compared, for example, with the FSFs present in the proteomes of free-living (FL) organisms (ranging from 407 FSFs in Staphylothermus marinus to 1,084 in Capitella sp.). The number of FSFs was, however, comparable to that of small organisms with reduced genomes or parasitic/symbiotic lifestyles [e.g., Guillardia theta (189 distinct FSFs), Nanoarchaeum equitans (211 FSFs) and Candidatus Hodgkinia cicadicola (115 FSFs)].
The average reuse level of mimiviral FSFs (total FSFs/distinct FSFs) was quite low as well (530/163 = 3.25) but still comparable to that of organisms with similar genome sizes or lifestyles (e.g., 3.03 in Staphylothermus marinus, 1.42 in Candidatus Hodgkinia cicadicola, 2.01 in Nanoarchaeum equitans, and 2.48 in Guillardia theta). In summary, a significant number of FSFs exist in the proteomes of dsDNA viruses, including some that are not encoded by cells. Mimiviruses have a genome size comparable to that of numerous small bacteria and also share with them a very simple proteome. Both features overlap significantly with those of parasitic unicellular organisms.
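The two ratios used above can be written out as a minimal sketch. The function names are ours and purely illustrative. The reuse figures (530 total and 163 distinct FSFs) come from the text, whereas the protein counts passed to the coverage function are hypothetical placeholders, because only the resulting percentage (59%) is reported here.

```python
def proteomic_coverage(proteins_with_fsf: int, total_proteins: int) -> float:
    """Fraction of proteins in a proteome that carry at least one FSF assignment."""
    return proteins_with_fsf / total_proteins


def reuse_level(total_fsf_domains: int, distinct_fsfs: int) -> float:
    """Average number of times each distinct FSF is used in a proteome."""
    return total_fsf_domains / distinct_fsfs


# Values reported in the text for mimivirus: 530 FSF domains, 163 distinct FSFs.
mimivirus_reuse = reuse_level(530, 163)
print(round(mimivirus_reuse, 2))  # 3.25

# Hypothetical counts, chosen only so that the ratio equals the reported 59%.
example_coverage = proteomic_coverage(59, 100)
print(f"{example_coverage:.0%}")
```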

Table 2 List of dsDNA viruses sampled along with family name, total number of proteins, number of total and unique FSFs detected by the SUPERFAMILY HMMs (E-value cutoff of 0.0001) and proteomic coverage

Table 3 Viral-specific FSFs and their molecular functions

The distribution of domain structures is biased but shows widespread representation of viral FSFs

FSFs are not equally distributed among the proteomes of Archaea (A), Bacteria (B), Eukarya (E), and viruses (V). Some FSFs are unique to one supergroup (groups A, B, E or V), while others are shared by two (AB, AE, AV, BE, BV, EV), three (ABE, ABV, AEV, BEV) or all four (ABEV) supergroups. A Venn diagram (Figure 1A) describes the FSF distribution and highlights the differential enrichment of viral FSFs within these taxonomic groups. All cellular taxonomic groups share FSFs with the viral supergroup. ABE is the most populated group with 557 FSFs, BE is the second largest with 291 FSFs and ABEV is the third largest with 229 FSFs. Eukaryotes have the highest number of supergroup-specific FSFs, with 335 (~19%) of the total FSFs present only in eukaryotes, followed by Bacteria, Archaea and viruses with 163 (9.37%), 22 (1.26%) and 6 (0.345%) supergroup-specific FSFs, respectively. This complete and unique distribution of FSFs in supergroups suggests that viruses with medium-to-very-large proteomes maintain considerable structural diversity despite their reduced genomes and parasitic lifestyle. The lower number of viral-specific FSFs can be explained by the fact that the current focus in genomics is on sequencing viral genomes of medical importance [47]. The discovery and sequencing of viruses with large genomes (e.g., mimiviruses, megaviruses, and mamaviruses) is expected to add to the number of viral-specific structures. However, we expect that the relative patterns of FSF sharing with the cellular supergroups will remain conserved.
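The assignment of FSFs to Venn groups can be illustrated with a short sketch, assuming each FSF is described by the set of supergroups in which it was detected. The function names and the two placeholder FSFs (`fsf_x`, `fsf_y`) are hypothetical. The memberships of c.37.1 (ABEV) and d.104.1 (ABE) follow the text, and the percentages are recomputed from the reported counts out of 1,739 FSFs.

```python
from collections import Counter

SUPERGROUP_ORDER = "ABEV"  # Archaea, Bacteria, Eukarya, viruses


def venn_group(present_in: set) -> str:
    """Label such as 'ABEV' or 'BE' built from the supergroups that encode an FSF."""
    return "".join(s for s in SUPERGROUP_ORDER if s in present_in)


def count_groups(fsf_to_supergroups: dict) -> Counter:
    """Number of FSFs falling into each Venn group."""
    return Counter(venn_group(groups) for groups in fsf_to_supergroups.values())


toy_census = {
    "c.37.1": {"A", "B", "E", "V"},   # present in all proteomes (from the text)
    "d.104.1": {"A", "B", "E"},       # absent from sampled viruses (from the text)
    "fsf_x": {"E"},                   # hypothetical eukaryote-specific FSF
    "fsf_y": {"B", "E"},              # hypothetical FSF shared by Bacteria and Eukarya
}
counts = count_groups(toy_census)
print(dict(counts))

# Share of supergroup-specific FSFs among the 1,739 FSFs reported in the text.
total_fsfs = 1739
for label, n in {"E": 335, "B": 163, "A": 22, "V": 6}.items():
    print(label, f"{100 * n / total_fsfs:.3g}%")
```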

Figure 1 History of protein domain structures. A. The Venn diagram shows the distribution of FSFs in the taxonomic groups. Viral families included in the analysis: Adenoviridae, Ascoviridae, Asfarviridae, Corticoviridae, Iridoviridae, Marseilleviridae, Mimiviridae, Phycodnaviridae, Poxviridae, Rudiviridae, and Tectiviridae (see Table 2). B. Phylogenomic tree of protein domain structure describing the evolution of 1,739 FSFs in 1,037 proteomes (463,915 steps; consistency index CI = 0.051; retention index RI = 0.795; tree skewness g1 = −0.127). Taxa are FSFs and characters are proteomes. Terminal leaves were not labeled, as they would not be legible. C. Distribution index (f, the number of species using an FSF/total number of species) of each FSF plotted against relative age (nd, number of nodes from the root/total number of nodes) for the four supergroups and individually for sampled viruses, Archaea, Bacteria, and Eukarya. D. Boxplots displaying the distribution of FSFs in viral and cellular taxonomic groups with respect to age (nd). Vertical lines within each distribution represent group median values. Dotted vertical lines represent important evolutionary events in the evolution of proteomes.

Reductive evolutionary processes explain viral make up

We generated trees of FSF domains from linearly ordered multistate phylogenetic features (FSFs as taxa and proteomes as characters) using maximum parsimony (MP) (Figure 1B). Trees of FSFs are rooted and highly unbalanced. However, the imbalance in trees results from the accumulation of protein domains in proteomes (i.e., genomic abundance) and portrays a biological process and not a phylogenetic artifact. From the FSF tree, we calculated the age of each FSF defined as the node distance (nd) [28]. nd is given on a relative 0–1 scale, with nd = 0 representing the origin of protein domains and nd = 1 the present [28]. nd is a good proxy for the age of each FSF, is linearly proportional to geological time, follows a molecular clock and can be used to accurately date domains defined at FF and FSF levels [39]. When plotted against the fraction of proteomes encoding each FSF (i.e., distribution index; f), nd described unprecedented patterns in the evolution and origins of proteomes in the total dataset and individual supergroups (Figure 1C).
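The two quantities plotted in Figure 1C can be sketched as follows. This is only an illustration under stated assumptions: the tree is encoded as nested tuples with FSF names as leaves, nd is taken as the number of internal nodes crossed between the root and a leaf divided by the largest such count in the tree, and all names (`node_counts`, `node_distance`, `distribution_index`, the toy FSFs and proteomes) are hypothetical. The actual calculation follows [28].

```python
def node_counts(tree, above=0, out=None):
    """Count internal nodes crossed between the root and each leaf (root excluded)."""
    if out is None:
        out = {}
    if isinstance(tree, str):  # a leaf is one FSF
        out[tree] = above
        return out
    for child in tree:
        # Descending into an internal child adds one node to the path of its leaves.
        step = 0 if isinstance(child, str) else 1
        node_counts(child, above + step, out)
    return out


def node_distance(tree):
    """Relative age nd on a 0-1 scale (0 = most ancient, 1 = most recent)."""
    counts = node_counts(tree)
    deepest = max(counts.values())
    if deepest == 0:
        return {leaf: 0.0 for leaf in counts}
    return {leaf: c / deepest for leaf, c in counts.items()}


def distribution_index(abundance_by_proteome):
    """f: fraction of proteomes in which the FSF occurs at least once."""
    present = sum(1 for n in abundance_by_proteome.values() if n > 0)
    return present / len(abundance_by_proteome)


# Toy, highly unbalanced tree of four hypothetical FSFs.
toy_tree = ("fsf_1", ("fsf_2", ("fsf_3", "fsf_4")))
nd = node_distance(toy_tree)
print(nd)  # {'fsf_1': 0.0, 'fsf_2': 0.5, 'fsf_3': 1.0, 'fsf_4': 1.0}

# Hypothetical abundance counts of one FSF in four proteomes.
f = distribution_index({"proteome_a": 3, "proteome_b": 0, "proteome_c": 1, "proteome_d": 2})
print(f)  # 0.75
```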

We note that the majority of the viral FSFs originated either very early or very late, showing a clear bimodal pattern of domain appearance (red circles in the tree of Figure 1B and timelines of Figure 1C). The distribution of FSFs in the total dataset revealed that the most ancient FSF, the P-loop-containing NTP hydrolase (c.37.1), was present in all the proteomes (f = 1), including the viral proteomes (Figure 1C: Total). In total, 28 ancient FSFs had an f > 0.947 and were present in almost all cellular and most viral proteomes. However, the representation of FSFs decreased along the timeline with increasing nd until f approached 0 at about nd = 0.587 (Figure 1C: Total). The steady drop in the f value for ancestral FSFs (nd < 0.587) defines the reductive model of evolution for viral and microbial superkingdoms. We hypothesize that very early in the evolutionary timeline (nd < 0.587), f values smaller than 1 indicated loss of an existing FSF in a few proteomes. In general, the probability for a few proteomes to lose an FSF was higher than the probability for the rest of the proteomes to discover the same FSF simultaneously (much like the probabilistic model for insertions and deletions in sequence alignment) [28]. This differential loss of structures probably triggered the early diversification of lineages emerging from an ancestral community (read below) [28]. The f value approached a minimum at nd = 0.587. Beyond this point, an opposite trend took place and the representation of FSFs in proteomes increased with increasing nd. We explain the increase in f value for younger FSFs (nd > 0.587) by evolutionary forces initiating the emergence of diversified supergroups (i.e., A, B, E, and V). These forces were primarily responsible for genome expansion in proteomes (especially in Eukarya) by evolutionary processes including gene shuffling, domain rearrangements, and HGT [28, 46].

Distribution plots for the individual supergroups confirmed that, in general, the most ancient FSFs (nd = 0–0.4) were shared by most proteomes (Figure 1C). However, the representation of ancient FSFs decreased in time, first in the viral proteomes (Figure 1C: Viruses), and then in the cellular proteomes, starting with Archaea (Figure 1C: Archaea), then Bacteria (Figure 1C: Bacteria) and finally Eukarya (Figure 1C: Eukarya). The decrease in the representation of ancient FSFs (as explained above) is consistent with the reductive tendencies described previously for the cellular proteomes [28, 38]. We propose that both sampled dsDNA viruses and Archaea experienced high levels of genome reduction through loss of ancient FSFs. While, in general, they maintained small proteomic representations of younger FSFs, FSF representation increased considerably in the eukaryal proteomes as f reached 1 again at nd = 1 (primarily triggered by domain rearrangements) (Figure 1C: Eukarya) [46].

Appearance of supergroups

The distribution of FSFs with respect to age (nd) (Figure 1D) revealed that ABEV was the most ancient taxonomic group, with nd ranging from 0 to 1 and a median nd of 0.2324. This confirmed that the majority of the FSFs shared between giant viruses and cells were ancient, providing further support to the hypothesis of early coexistence of giant viruses with cellular ancestors. The appearance of the ABEV taxonomic group was followed by the ABE, BEV, and BE taxonomic groups, in that order. The late appearance of supergroup-specific taxonomic groups suggests that giant viruses and Archaea diversified much later and concurrently with Eukarya (nd = 0.5867) (Figure 1D). Interestingly, the appearance of BV, EV and AV FSFs occurs soon after the appearance of the respective supergroup-specific FSFs or after the diversification of the respective supergroups (i.e., B, E and A). We hypothesize that these FSFs were discovered when dsDNA viruses began to infect their hosts and adopted a parasitic lifestyle. This occurred when lineages of diversified cellular organisms were already in existence. Therefore, parasitic adaptation in the viral proteomes appears to be an afterthought, most likely triggered by massive amounts of genome reduction experienced very early in evolution (nd < 0.587).

Early reductive evolution of the translational machinery of giant viruses

The loss of FSFs was abrupt and massive for viruses. It started very early in evolution but substantially dropped in the nd = 0.4–0.6 range (Figure 1C: Viruses). Its first effects were on the repertoire of aminoacyl tRNA synthetase (aaRS) enzymes that are responsible for the algorithmic implementation of the genetic code [64–66]. The class II aaRS and biotin synthetase FSF (d.104.1) was the first domain structure to be completely lost in the viral proteomes we sampled (f = 0; nd = 0.0516) (Figure 1C: Viruses; boxplot for ABE in Figure 1D). This FSF includes the catalytic domain of class II aaRS enzymes that, along with class I aaRS enzymes, charge tRNAs with the correct amino acids and are central components of the protein translation machinery [65, 66]. It has been reported that the mimivirus genome encodes four class I aaRS enzymes (TyrRS, MetRS, ArgRS, and CysRS) but no class II enzymes [5]. Our census of FSFs confirms these findings. This partial enzymatic set of aaRSs was proposed to be a likely remnant of a primordial translational apparatus that was once present in the genome of its ancestor (virus or more likely cell) [3]. The recent discovery of megavirus (a distant phylogenetic relative of mimivirus that is not included in this study) led to the identification of seven aaRSs in the megavirus genome, including both class I (TyrRS, MetRS, ArgRS, CysRS, TrpRS, and IleRS) and class II (AsnRS) enzymes [3]. Megavirus is therefore the only virus known to possess both class I and class II aaRSs. We studied the genomic distribution of 28 FF domains that make up the structural components of aaRS enzymes in the 1,037 proteomes of the total proteome dataset of cells and viruses (Figure 2). These structures are catalytic, editing, trans-editing, anticodon-binding and accessory domains of aaRS enzymes (Additional file 2: Table S2). As we illustrate with LeuRS, each of these domains contributes its own history to the evolutionary makeup of individual aaRS enzymes (Figure 2A).
The vast majority of the FF domains were only present in cells. The viruses we sampled encoded only four instances of the catalytic domain of class I aaRS (c.26.1.1) and four total instances of the anticodon-binding domains of both class I (a.27.1.1) and class II aaRS (c.51.1.1) per proteome, all in mimivirus (Figure 2B). The fact that the anticodon-binding domain typical of class II ProRS, ThrRS, GlyRS and HisRS is present even if the corresponding class II catalytic domain is absent is remarkable and suggests that reductive evolutionary processes are still actively at play. The average number of catalytic domains per proteome was substantially larger in cells (ranging 16–37) and increased in the order Archaea, Bacteria and Eukarya (see pie charts of Figure 2B), following corresponding increases in genome complexity. The exception is the highly reduced eukaryotic genome of Guillardia theta, which contains only one instance of an aaRS catalytic domain (corresponding to a class II enzyme) and resembles mimivirus. More importantly, we note that viruses did not contain editing or trans-editing domains, which always appear together and at rather constant ratios with anticodon-binding domains in cellular organisms and are especially abundant in Bacteria (bar diagrams, Figure 2B). Similarly, viruses did not contain any of the many other C-terminal and N-terminal accessory domains. These domains are typical of aaRSs and enhance their functional repertoire, especially in Eukarya [64]. As expected, a ToD reconstructed from the genomic abundance counts of these FFs (with FFs as taxa) described the origin and evolution of aaRS domains and revealed interesting patterns (Figure 2C). Catalytic and editing domains appeared at the base of the tree very early in evolution, generally before the anticodon-binding domains necessary to establish crucial aminoacylation specificities.
For example, the ValRS/IleRS/LeuRS editing domain FF (b.51.1.1) appeared before the anticodon binding domains that are present in viruses. Remarkably, the progression of loss of domains of aaRSs in supergroups (Figure 2C) follows the progression observed for the entire proteomic dataset (Figure 1). The reductive evolutionary tendencies of the aaRS enzymes are therefore not atypical.

Figure 2 Evolution of the major domains of aminoacyl-tRNA synthetase (aaRS) enzymes. A. The leucyl-tRNA synthetase (LeuRS) enzyme in complex with tRNALeu (PDB entry 1WZ2) with its three domains (catalytic, editing and anticodon-binding) colored according to their age of origin. Domain ages were derived from a ToD at FF level of structural complexity [41]. Note how the variable arm of tRNA makes crucial contact with the anticodon-binding domain, which is evolutionarily derived, while the acceptor arm contacts the ancient catalytic domain in pre-editing conformation. B. Occurrence (box plots) and abundance (pie charts) of 28 fold family (FF) domains of aaRS enzymes with known structures in the total genomic dataset of 1,037 cellular organisms and viruses. The name and function of domains are described in Table S2 of the Additional file. C. Phylogenomic tree of protein domain structures describing the evolution of the 28 FFs of aaRSs in the 1,037 proteomes (982 parsimony informative characters; 26,638 steps; CI = 0.8479; RI = 0.8742; g 1 = −0.1.401). Taxa are aaRS FF domains and characters are proteomes. FFs are labeled with SCOP concise classification strings. Numbers on the branches indicate bootstrap values. FF domains present in viruses are highlighted in red. Note that d.104.1.1 has been identified in megavirus (not included in this study).

uToLs identify viruses as a distinct supergroup along with cellular superkingdoms

We reconstructed rooted uToLs built from the genomic abundance counts of individual FSFs (total character set: 1,739 FSFs) as previously described [28, 29]. For this reconstruction, we excluded cellular organisms with P and OP lifestyles (in order to remove noise from the data) and sampled 50 proteomes equally from each supergroup [29]. The reconstruction produced trees in which organisms in Archaea, Bacteria, Eukarya, and viruses formed four distinct groups, with viruses placed as the most ancient group and Archaea as the second oldest. Figure 3A gives a radial representation of an example phylogeny of randomly sampled taxa. The viral supergroup is discriminated from the other superkingdoms by 72% bootstrap support. Both the viral and archaeal supergroups were always paraphyletic, while Bacteria and Eukarya appeared monophyletic (Figure 3A). Reconstructions supported the early divergence of viruses and Archaea relative to Bacteria and Eukarya [28–30]. Because the FSF repertoire of sampled viruses (a total of 304 FSFs) is considerably smaller than the FSF repertoires in cellular supergroups (885 FSFs in Archaea; 1,312 FSFs in Bacteria; and 1,508 FSFs in Eukarya) (Figure 1A), we also reconstructed the uToL from the ABEV taxonomic group (Additional file 3: Figure S1). The ABEV taxonomic group includes 229 FSFs that are encoded by both sampled viruses and cells and is the most ancient group, with a median nd of 0.2324 (Figure 1D). This exercise reduced the effect of the supergroup-specific structures in Archaea, Bacteria and Eukarya, which are far more numerous than the viral-specific structures (22, 163 and 335 vs. 6), and eliminated any bias resulting from the phylogenomic model (i.e., we consider primordial proteomes to encode very few structures and root trees by structural absence).
The uToL and network tree diagram (also read below) reconstructed from the set of 229 universal FSFs resulted in a topology which overall favored the previous reconstructions (Additional file 3: Figure S1). Viruses were identified as a separate group along with superkingdoms Archaea, Bacteria and Eukarya with the cellular world stemming from viruses and Archaea (Additional file 3: Figure S1).

Figure 3 Universal tree of life (uToL) and proteomic diversity. A. One optimal most parsimonious phylogenomic tree describing the evolution of 200 proteomes (50 each from Archaea, Bacteria, Eukarya and viruses; virus families are listed in Table 2) generated using the census of abundance of 1,739 FSFs (1,517 parsimoniously informative sites; 62,061 steps; CI = 0.156; RI = 0.804; g1 = −0.325). Terminal leaves of viruses (V), Archaea (A), Eukarya (E) and Bacteria (B) were labeled in red, blue, black and green, respectively. Numbers on the branches indicate bootstrap values. B. FSF diversity (number of distinct FSFs in a proteome) plotted against FSF abundance (total number of FSFs that are encoded) for 200 proteomes. Major families/phyla/kingdoms are labeled. Both axes are in logarithmic scale.

A plot that describes the interplay between diversity (use) and abundance (reuse) of FSFs (total number of distinct FSFs versus the total number of FSF domains that are encoded in a proteome) shows viruses have the simplest proteomes, followed progressively by Archaea, Bacteria and Eukarya, in that order (Figure 3B). Organisms follow a congruent trend towards structural diversity and organismal complexity. This trend confirms our initial evolutionary model of proteome growth that we use for the rooting of the uToL and again supports the ancestrality of viruses and Archaea [28].

Phylogenomic networks give unity to sampled viruses

When the evolutionary model involves processes like gene gain/loss, duplications, and HGT, it is appropriate to provide an abstract representation of the phylogeny using networks [57]. A phylogenomic network is expected to reflect the evolutionary tree when there is no conflict between data and the tree and aids in phylogenetic analysis. A network tree reconstructed by the agglomerative NeighborNet algorithm [57] identified viruses as a distinct supergroup along with cellular superkingdoms (Figure 4). Each edge on the network represented a split of taxa. The splits discriminating viruses and eukaryotes from the rest of the supergroups were supported by 100% bootstrap support. The network tree is congruent with the phylogeny recovered in Figure 3A and defies theories attributing large proteomes of dsDNA viruses to massive amounts of HGT from cells. In contrast, the resulting network gave unity to the sampled viruses and suggests vertical acquisition of their gene repertoires (no mixing of viruses with cells was observed in the network) (Figure 4). However, we realize that HGT from cells to viruses harboring smaller proteomes or RNA genomes might be occurring at different (or higher) levels.

Figure 4 Network tree visualization of the supergroups. Network tree generated from the presence/absence matrix of 1,739 FSFs in 200 proteomes sampled equally from the four supergroups. The number of non-constant sites was 1,581. Nodes in the network tree are proteomes and are represented by rectangles labelled red, blue, green, and black for viruses, Archaea, Bacteria and Eukarya, respectively. Numbers on the major splits indicate bootstrap values.

Viruses enhance planetary biodiversity

The spread (f) of viral FSFs relative to cellular FSFs in the individual proteomes of Archaea, Bacteria and Eukarya appeared considerably biased (Figure 5). When compared to the cell-specific FSFs, FSFs shared by viruses and cells were significantly widespread in the proteomes of a superkingdom. Viruses hold 294, 265, and 239 FSFs in common with Eukarya, Bacteria and Archaea, respectively. Median f values of these FSFs were considerably higher than those of corresponding cellular FSFs in Eukarya (0.978 vs. 0.416), Bacteria (0.8826 vs. 0.329), and Archaea (0.742 vs. 0.514) (Figure 5). This bias is remarkable in the case of Eukarya where nearly all (98%) the proteomes were enriched with viral FSFs (Figure 5). Archaeal and bacterial proteomes were also enriched with viral FSFs but at lower levels. Remarkably, patterns of enrichment follow patterns of reductive evolution in the superkingdoms (i.e., Archaea < Bacteria < Eukarya). The popularity and abundance of viral FSFs in cellular proteomes suggests that viruses have been a very active and crucial factor in mediating domain transfer between cellular species and enhancing biodiversity. These domains are present in a remarkably diverse array of cellular hosts, ranging from small microbes to complex vertebrates, providing further support to the ancient and primordial nature of viruses [16] and highlighting their crucial contribution to the biosphere [47].
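The comparison summarized in Figure 5 amounts to splitting the FSFs of one superkingdom by whether they also occur in viruses and comparing the median f of the two sets. The sketch below assumes a mapping from FSF identifiers to f values within a superkingdom. The function name, the identifiers and the f values are hypothetical and do not reproduce the reported medians.

```python
from statistics import median


def median_f_by_sharing(f_by_fsf: dict, viral_fsfs: set):
    """Median f of FSFs shared with viruses and of FSFs found only in cells.

    Raises statistics.StatisticsError if either set is empty.
    """
    shared = [f for fsf, f in f_by_fsf.items() if fsf in viral_fsfs]
    cell_only = [f for fsf, f in f_by_fsf.items() if fsf not in viral_fsfs]
    return median(shared), median(cell_only)


# Hypothetical f values for six FSFs in one superkingdom.
toy_f = {
    "fsf_1": 0.9, "fsf_2": 0.8, "fsf_3": 0.7,   # also present in viruses
    "fsf_4": 0.5, "fsf_5": 0.3, "fsf_6": 0.1,   # cell-specific
}
toy_viral = {"fsf_1", "fsf_2", "fsf_3"}

shared_median, cell_median = median_f_by_sharing(toy_f, toy_viral)
print(shared_median, cell_median)  # 0.8 0.3
```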

Figure 5 Enrichment of viral FSFs. Boxplots comparing the distribution index (f) of FSFs shared or not shared with viruses for each cellular superkingdom. Pie charts above each superkingdom represent distribution of FSFs in taxonomical groups within each superkingdom.

Functional makeup of viral proteomes

We studied the molecular functions of 293 (out of 304) FSFs in viral proteomes using the functional annotation scheme described by Vogel and Chothia [59–61]. For the remaining 11 FSFs, functional annotations were not available. When plotted against time (nd), we note that a majority of the viral FSFs appeared either very early (nd < 0.4; n1 = 164) or very late (0.6 < nd < 1.0; n3 = 118) (Figure 6), supporting the timelines of Figure 1C. For the ancient FSFs (nd < 0.4), we note that most of the viral FSFs perform metabolic functions, followed by Information, Intracellular processes, Regulation, General, Other and Extracellular processes, in that order. This order matches the functional distribution described previously for the cellular superkingdoms [27]. A significant drop in the number of FSFs/functions is seen in the nd range 0.4–0.6 (n2 = 11), which is the period marked by massive gene loss in both viruses and cellular organisms. In contrast, a relatively even distribution of functions is seen in the nd range 0.6–1.0, which is the period marked by superkingdom diversification and genome expansion in Eukarya [28]. The specific functions acquired by viruses during this late period include those related to Extracellular processes (toxin/defense, immune response, cell adhesion), General (protein interaction, general, ion binding, small molecule binding), and Other (viral proteins and proteins with unknown functions) (Figure 7). We hypothesize that viruses acquired these functions in order to adapt to the parasitic lifestyle after suffering massive gene loss between nd 0.4 and 0.6. This is also evident from the appearance of superkingdom-specific taxonomic groups (AV, BV, and EV) after the appearance of the respective superkingdoms in Figure 1D. In contrast, the number of FSFs corresponding to Metabolism, Information, Intracellular processes, and Regulation is lower in this late period than at nd < 0.4.
The significant differences in the distribution of molecular functions between the early (nd < 0.6) and late (nd > 0.6) periods of the evolutionary timeline suggest that viruses started very much like cells (possibly as integral components of cells), experienced massive amounts of genome reduction and finally acquired specific structures and functions needed for a parasitic lifestyle in an expanding cellular world.
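The three periods used above (nd < 0.4, 0.4–0.6 and nd > 0.6) can be expressed as a simple binning step. This is a minimal sketch: the function names and the placeholder FSFs are hypothetical, and the treatment of values falling exactly on 0.4 or 0.6 is our assumption, since the text does not state it. The nd of d.104.1 (0.0516) and the three reported counts are taken from the text.

```python
from collections import Counter


def nd_period(nd: float, early_cutoff: float = 0.4, late_cutoff: float = 0.6) -> str:
    """Assign an FSF to the early, middle or late period of the timeline."""
    if nd < early_cutoff:
        return "early"
    if nd > late_cutoff:
        return "late"
    return "middle"  # assumption: boundary values count as middle


def count_by_period(nd_by_fsf: dict) -> Counter:
    """Number of FSFs that appeared in each period."""
    return Counter(nd_period(nd) for nd in nd_by_fsf.values())


toy_nd = {
    "d.104.1": 0.0516,   # nd reported in the text
    "fsf_mid": 0.5,      # hypothetical
    "fsf_late_1": 0.75,  # hypothetical
    "fsf_late_2": 0.9,   # hypothetical
}
period_counts = count_by_period(toy_nd)
print(dict(period_counts))

# Consistency check on the reported counts: n1 + n2 + n3 = annotated FSFs.
reported = {"early": 164, "middle": 11, "late": 118}
print(sum(reported.values()))  # 293
```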

Figure 6 Functional distribution of viral FSFs in major functional categories. Histogram comparing the number of viral FSFs corresponding to major functional categories plotted against nd. The distribution of functions that appeared early and late is significantly biased. Numbers on top of individual bars indicate total number of FSFs corresponding to each functional category.

Figure 7 Functional distribution of viral FSFs in minor functional categories. Histograms comparing the number of viral FSFs corresponding to each of the minor categories within each major functional category.

Effect of HGT

Domains defined at the FSF level are evolutionarily more conserved than domain sequences [23, 29] and the evolutionary impact of HGT is limited at such levels of structural organization [31, 32, 34]. However, HGT-derived domain structures are expected to be overrepresented in proteomes [41, 67], and a significant enrichment of FSFs in viruses (in the ABEV, BEV, AEV, ABV, AV, BV, and EV taxonomic groups) would be taken as an indication that viruses have acquired FSFs from their cellular hosts via HGT. We calculated the probability of enrichment of a particular taxonomic group using the hypergeometric distribution and found that only the ABEV FSF group was significantly overrepresented (P < 0.05). All the other taxonomic groups were significantly underrepresented (P < 0.05) with the exception of the AEV FSF group, which was overrepresented at statistically non-significant levels (P = 0.29, Table 4: Sampled viruses). Because HGT is thought to have played an important role in the evolution of prokaryotes, especially bacteria [68], we also evaluated the enrichment of FSFs in bacterial taxonomic groups (ABEV, BEV, ABV, ABE, BE, AB, and BV) as a control for the enrichment test on viruses. We found that all the bacterial taxonomic groups were significantly overrepresented (P < 0.05), as expected, except for the ABEV and BV groups (Table 4: Bacteria), supporting high levels of HGT in Bacteria [68]. The significant underrepresentation of viral taxonomic groups indicates that FSFs encoded by giant viruses were not transferred laterally from their cellular hosts, though they can still contribute innovations to the structural makeup of cells [28].
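A hypergeometric enrichment test of this kind can be sketched with the standard library alone. The parameter names (N for the population size, K for the number of population items belonging to the group, n for the sample size, k for the number of sampled items belonging to the group) and the toy numbers are our assumptions. How the population and sample were defined for Table 4 is not stated in this section, so the example does not reproduce the reported P values.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) when drawing n items without replacement from N, K of which are 'hits'."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_p_values(k: int, N: int, K: int, n: int):
    """One-sided P values for over-representation (X >= k) and under-representation (X <= k)."""
    lowest = max(0, n - (N - K))   # fewest hits a sample of size n can contain
    highest = min(K, n)            # most hits a sample of size n can contain
    p_over = sum(hypergeom_pmf(i, N, K, n) for i in range(max(k, lowest), highest + 1))
    p_under = sum(hypergeom_pmf(i, N, K, n) for i in range(lowest, min(k, highest) + 1))
    return p_over, p_under


# Hypothetical toy numbers: 10 items in total, 4 in the group, 3 drawn, 2 of them hits.
p_over, p_under = enrichment_p_values(k=2, N=10, K=4, n=3)
print(round(p_over, 4), round(p_under, 4))  # 0.3333 0.9667
```

A group would be called overrepresented when `p_over` falls below the chosen threshold (0.05 in the text) and underrepresented when `p_under` does.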

Table 4 Statistical test for the enrichment of FSFs in taxonomic groups using hypergeometric distribution for sampled viruses and bacteria