Domains are modules within proteins that can fold and function independently and are evolutionarily conserved. Here we compared the usage and distribution of protein domain families in the free-living proteomes of Archaea, Bacteria and Eukarya and reconstructed species phylogenies while tracing the history of domain emergence and loss in proteomes. We show that both gains and losses of domains occurred frequently during proteome evolution. The rate of domain discovery increased approximately linearly in evolutionary time. Remarkably, gains generally outnumbered losses and the gain-to-loss ratios were much higher in akaryotes than in eukaryotes. Functional annotations of domain families revealed that both Archaea and Bacteria gained and lost metabolic capabilities during the course of evolution while Eukarya acquired a number of diverse molecular functions including those involved in extracellular processes, immunological mechanisms, and cell regulation. Results also highlighted significant contemporary sharing of informational enzymes between Archaea and Eukarya and metabolic enzymes between Bacteria and Eukarya. Finally, the analysis provided useful insights into the evolution of species. The archaeal superkingdom appeared first in evolution by gradual loss of ancestral domains, bacterial lineages were the first to gain superkingdom-specific domains, and eukaryotes (likely) originated when an expanding proto-eukaryotic stem lineage gained organelles through endosymbiosis of already diversified bacterial lineages. The evolutionary dynamics of domain families in proteomes and the increasing number of domain gains are predicted to redefine the persistence strategies of organisms in superkingdoms, influence the makeup of molecular functions, and enhance organismal complexity by the generation of new domain architectures. These dynamics highlight ongoing secondary evolutionary adaptations in akaryotic microbes, especially Archaea.

Proteins are made up of well-packed structural units referred to as domains. Domain structure in proteins is responsible for protein function and is evolutionarily conserved. Here we report global patterns of protein domain gain and loss in the three superkingdoms of life. We reconstructed phylogenetic trees using domain fold families as phylogenetic characters and retraced the history of character changes along the many branches of the tree of life. Results revealed that both domain gains and losses were frequent events in the evolution of cells. However, domain gains generally outnumbered losses, a trend that was consistent in the three superkingdoms. The rate of domain discovery was nevertheless highest in akaryotic microbes. Domain gains occurred throughout the evolutionary timeline, albeit at a non-uniform rate. Our study sheds light on the evolutionary history of living organisms and highlights important ongoing mechanisms that are responsible for secondary evolutionary adaptations in the three superkingdoms of life.

Funding: This research was supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to GCA and grants from KRIBB Research Initiative Program and from the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013) to KMK. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Retracing the history of changes in the occurrence and abundance of FF domains on each branch of the reconstructed ToLs revealed that FFs were subject to high rates of gain and loss. Domain gains generally outnumbered losses but both occurred with high frequencies throughout the evolutionary timeline and in all superkingdoms. Remarkably, the gain-to-loss ratios increased with evolutionary time and were higher in the late evolutionary periods. Finally, functional annotations of FFs illustrated significant differences between superkingdoms and described modern tendencies in proteomes.

Here, we describe the evolutionary dynamics of protein domains grouped into fold families (FFs) and model the effects of domain gain and loss in the proteomes of 420 free-living organisms that have been fully sequenced and were carefully sampled from Archaea, Bacteria, and Eukarya (Dataset S1). The 420-proteome dataset was previously used by our group to reconstruct the evolutionary history of free-living organisms (see [27]) and was updated here to account for recent changes in protein classification and functional annotation. The dataset is very well annotated, especially regarding organism lifestyles that are otherwise problematic to assign, has already produced patterns of protein and proteome evolution that are very useful (including those described in [27]), and has produced timelines of FF evolution that are being actively mined. We conducted phylogenomic analyses using the abundance (total redundant number of each FF in every proteome) [28], [29] and occurrence (presence or absence) [30], [31] counts of FFs as phylogenetic characters to distinguish the 420 sampled taxa (i.e. proteomes). FF information was retrieved from the Structural Classification of Proteins (SCOP) database, which is considered a ‘gold standard’ for the classification of protein domains into different hierarchical levels [32]. Current SCOP definitions group protein domains with high pair-wise sequence identity (>30%) into a common FF, FFs that are evolutionarily related into fold superfamilies (FSFs), FSFs with similar secondary structure arrangement into folds (Fs), and Fs with common secondary structure elements into a handful of protein classes [33], [34]. A total of 110,800 SCOP domains (ver. 1.75) are classified into a finite set of only 1,195 Fs, 1,962 FSFs and 3,902 FFs. The lower number of distinct FSFs and FFs suggests that domain structure is far more conserved than molecular sequence (e.g. see [35]) and is reliable for phylogenetic studies involving the systematic comparison of proteomes [27]. Another advantage of using SCOP domains is the consideration of known structural and inferred evolutionary relationships in classifying domains into FFs and FSFs [36]. In comparison, evolutionary relationships for the majority of the Pfam domains are unknown. We further restricted the analysis to include only FF domains as they are conserved enough to explore both the very deep and derived branches of the tree of life (ToL) and are functionally orthologous [37]. In contrast, FSF domains represent a higher level in SCOP hierarchy and are more conserved than FFs but may or may not be functionally orthologous. Moreover, high conservation of FSF domains is useful for exploring the deep branches of the ToL but may not be very informative for the more derived relationships.

In addition to the frequent reuse of domains, the dynamics between gains and losses also impacts the evolution of proteome repertoires [7] – [9] . Previous studies identified high rates of gene gains and losses in 12 closely related strains of Drosophila [7] , Prochlorococcus (a genus of cyanobacteria) [16] , and 60 isolates of Burkholderia (a genus of proteobacteria) [17] . A recent analysis of Pfam domains [18] revealed that ∼3% of the domain sequences were unique to primates and had emerged quite recently [19] . This implies that emergence of novel domains is an incessant evolutionary process [1] . In contrast, different selective pressures can lead to loss of domains in certain lineages and trigger major evolutionary transitions. For example, the increased rate of domain loss has been linked to reductive evolution of the proteomes of the archaeal superkingdom [20] , adaptation to parasitism in cells [21] (e.g. transition from the free-living lifestyle to obligate parasitism in Rickettsia [22] ), and ‘de-evolution’ of animals [23] , [24] from their common ancestor. In these studies, gain and loss inferences were restricted to only particular groups of phyla or organisms. A global analysis involving proteomes from the three superkingdoms remained a challenge. Finally, changes to domain repertoires are also possible by HGT that is believed to occur with high frequency in microbial species, especially Bacteria [25] , [26] .

Different mechanisms have been described to explain the evolution of domain repertoires in cells [3] . These include the reuse of existing domains [2] , [6] , interplay between gains and losses [7] – [9] , de novo domain generation [1] , and horizontal gene transfer (HGT) [10] . Domains that appeared early in evolution are generally more abundant than recently emerged domains and can be reused in different combinations in proteins. This recruitment of ancient domains is an ongoing evolutionary process that leads to the generation of novel domain architectures (i.e. ordering of domains in proteins) by gene fusion, exon recombination and retrotransposition [2] – [4] , [11] . For example, aminoacyl-tRNA synthetases are enzymes that charge tRNAs with ‘correct’ amino acids during translation [12] , [13] . These crucial enzymes are multidomain proteins that encode a catalytic domain, an anticodon-binding domain, and in some cases, accessory domains involved in RNA binding and editing [13] . Evolutionary analysis suggests that these domains were recruited gradually over time [14] . In fact, recruitment of ancient domains to perform new functions is a recurrent phenomenon in metabolism [15] .

Proteins are biologically active molecules that perform a wide variety of functions in cells. They are involved in catalytic activities (e.g. enzymes), cell-to-cell signaling (hormones), immune response initiation against invading pathogens (antibodies), decoding genetic information (transcription and translation machinery), and many other vital cellular processes (receptors, transporters, transcription factors). Proteins carry out these functions with the help of well-packed structural units referred to as domains. Domains are modules within proteins that can fold and function independently and are evolutionarily conserved [1] – [4] . It is the domain make up of the cell that defines its molecular activities and leads to interesting evolutionary dynamics [5] .

We conducted a GO enrichment analysis [56] , [57] on FF domains to identify biological processes [58] , [59] that were significantly enriched. For this purpose, the list of FF domains was given as input to domain-centric Gene Ontology (dcGO; http://supfam.org/SUPERFAMILY/dcGO ) resource and the most specific and significant associations to GO terms corresponding to different biological processes were retrieved. The statistical significance was evaluated by P-value computed under the hypergeometric distribution [56] , while the false discovery rate (FDR) was set to default at <0.01 [60] .
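The hypergeometric test mentioned above can be illustrated with a minimal sketch. This is not the dcGO implementation; the function name and the toy numbers are assumptions made for illustration only. It computes the probability of seeing at least k annotated FFs in a list of n FFs, when K of the N background FFs carry the GO term.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """Upper-tail hypergeometric P-value, P(X >= k).

    N: background FFs, K: background FFs annotated with the GO term,
    n: FFs in the input list, k: input FFs annotated with the term.
    (Hypothetical helper, not part of dcGO.)
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)

# Toy example: 10 background FFs, 5 annotated, 5 in the list, all 5 annotated.
p_value = hypergeom_enrichment_p(10, 5, 5, 5)
print(p_value)
```

In a real analysis the resulting P-values would still need a multiple-testing correction, such as the FDR threshold described above.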

We used the SUPERFAMILY functional annotation scheme (based on SCOP 1.73) to study the functional roles of FF domains in our dataset [53] – [55] . The SUPERFAMILY annotation assigns a single molecular function to FSF domains (and by extension to its descendant FFs). The annotation scheme gives a simplified view of the functional repertoire of proteomes using seven major functional categories including, i) metabolism, ii) information, iii) intracellular processes, iv) extracellular processes, v) general, vi) regulation and vii) other (includes domains with either unknown or viral functions). We assumed that FFs grouped into an FSF performed the same function that was assigned to their parent FSF. While this simplistic representation does not demonstrate the complete functional capabilities of a cell, it is sufficient to illustrate the major functional preferences in proteomes (refer to [21] for further description and use of the functional annotation scheme in large-scale proteomic studies).

To determine the relative age of FF domains in our dataset, we reconstructed trees of domains (ToDs) from the abundance and occurrence matrices used in the reconstruction of ToLs. The matrices were transposed, treating FFs as taxa and proteomes as characters. The reconstructed ToDs described the evolution of domains grouped into FFs and identified the most ancient and derived FFs (refer to [27] for an elaborate description and discussion on ToDs). To root the trees, we declared character state ‘N’ as the most ancestral state. This axiom of polarization considers that history of change for the most part obeys the ‘principle of spatiotemporal continuity’ (sensu Leibniz) that supports the existence of Darwinian evolution. Specifically, it considers that the abundance and diversity of individual FFs increase progressively in nature by gene duplication (and associated processes of subfunctionalization and neofunctionalization) and de novo gene creation, even in the presence of loss, lateral transfer or evolutionary constraints in individual lineages. Consequently, ancient domains have had more time to accumulate and increase their abundance in proteomes. In comparison, domains originating recently are less abundant and are specific to fewer lineages. We note that the N to 0 polarization is supported by the observation that FFs that appear at the base of the ToDs are structures that are widespread in metabolism and are considered to be of very ancient origin (e.g. [27]). The age of each FF was drawn directly from the ToDs using a PERL script that calculates the distance of each node from the root. This node distance (nd) is given on a relative scale and portrays the origin of FFs from 0 (most ancient) to 1 (most recent). The geological ages of FFs were derived from a molecular clock of protein folds [51], [52] that was used to calibrate important events in proteome evolution.
We have previously shown that nd correlates with geological time, following a molecular clock that can be used as a reliable approximation to date the appearance of protein domains [51] , [52] .
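As a rough illustration of the node distance (nd), the sketch below counts the branches between the root and each leaf of a rooted tree and rescales the counts to the 0 to 1 range. It is not the PERL script used in the study; the parent-map tree representation, the min-max rescaling, and the toy tree are all assumptions.

```python
def node_distances(parent):
    """Relative node distance (nd) of each leaf in a rooted tree.

    `parent` maps every node to its parent; the root maps to None.
    Assumption: nd is the root-to-leaf branch count, rescaled so the
    most basal leaf gets 0 and the most derived leaf gets 1.
    """
    internal = {p for p in parent.values() if p is not None}
    leaves = [node for node in parent if node not in internal]

    def depth(node):
        steps = 0
        while parent[node] is not None:
            node = parent[node]
            steps += 1
        return steps

    depths = {leaf: depth(leaf) for leaf in leaves}
    lo, hi = min(depths.values()), max(depths.values())
    if hi == lo:
        return {leaf: 0.0 for leaf in leaves}
    return {leaf: (d - lo) / (hi - lo) for leaf, d in depths.items()}

# Toy comb-like tree of four hypothetical FFs.
toy_tree = {
    "root": None,
    "FF1": "root", "n1": "root",
    "FF2": "n1", "n2": "n1",
    "FF3": "n2", "FF4": "n2",
}
nd = node_distances(toy_tree)
print(nd)
```

In this toy tree, FF1 branches off at the base and receives nd = 0, while FF3 and FF4 are the most derived and receive nd = 1.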

The spread of each FF was given by its distribution index (f-value), defined by the total number of proteomes encoding a particular FF divided by the total number of proteomes. The f-value ranges from 0 (absence from all proteomes) to 1 (complete presence).
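The distribution index is a simple proportion, and a minimal sketch makes the definition concrete. The proteome names and FF assignments below are invented for illustration.

```python
def distribution_index(occurrence, ff):
    """f-value: proteomes encoding `ff` divided by all proteomes.

    `occurrence` maps a proteome name to the set of FFs it encodes.
    """
    present = sum(1 for ffs in occurrence.values() if ff in ffs)
    return present / len(occurrence)

# Hypothetical occurrence data for four proteomes.
occurrence = {
    "proteome_A": {"b.43.4.2", "c.37.1.12"},
    "proteome_B": {"c.37.1.12"},
    "proteome_C": {"c.37.1.12", "a.4.5.1"},
    "proteome_D": {"c.37.1.12"},
}
print(distribution_index(occurrence, "c.37.1.12"))  # present in every proteome
print(distribution_index(occurrence, "b.43.4.2"))   # present in one of four
```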

To determine congruence between abundance and occurrence trees, we used the nodal module implemented in the TOPD/FMTS package ver. 3.3 [50] . The module takes as input a set of trees in Newick format and calculates a root mean squared deviation (RMSD) value for each pairwise comparison. The RMSD value is 0 for identical trees and increases with incongruence. To evaluate the significance of calculated RMSD values, we implemented the ‘Guided randomization test’ with 100 replications to determine whether the calculated RMSD value was smaller than the chance expectation. The randomization test randomly changes the positions of taxa in trees, while maintaining original tree topology, and calculates an RMSD value for each random comparison [50] . The result is a random distribution of RMSD values with a mean and standard deviation. The calculated RMSD value was compared with the mean of the random distribution to determine whether the observed differences were better than what would be expected merely by chance.
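The logic of the comparison can be sketched as follows. This is not the TOPD/FMTS code: it assumes that pairwise nodal distances between taxa have already been extracted from each tree, and it approximates the randomization by shuffling taxon labels on one tree. All names and the toy distances are hypothetical.

```python
import math
import random
from itertools import combinations
from statistics import mean, stdev

def rmsd(d1, d2, taxa):
    """Root mean squared deviation between two pairwise-distance tables."""
    pairs = [frozenset(p) for p in combinations(taxa, 2)]
    return math.sqrt(sum((d1[p] - d2[p]) ** 2 for p in pairs) / len(pairs))

def randomized_rmsds(d1, d2, taxa, replicates=100, seed=0):
    """RMSD values after randomly relabelling the taxa of the second tree."""
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        shuffled = list(taxa)
        rng.shuffle(shuffled)
        relabel = dict(zip(taxa, shuffled))
        d2_perm = {
            frozenset((a, b)): d2[frozenset((relabel[a], relabel[b]))]
            for a, b in combinations(taxa, 2)
        }
        values.append(rmsd(d1, d2_perm, taxa))
    return values

# Toy distances: sister taxa are 1 apart, all other pairs 2 apart.
taxa = ["A", "B", "C", "D"]

def pairwise(sister_pairs):
    return {
        frozenset(p): (1 if frozenset(p) in sister_pairs else 2)
        for p in combinations(taxa, 2)
    }

tree1 = pairwise({frozenset("AB"), frozenset("CD")})
tree2 = pairwise({frozenset("AC"), frozenset("BD")})

observed = rmsd(tree1, tree2, taxa)
random_values = randomized_rmsds(tree1, tree2, taxa)
print(observed, mean(random_values), stdev(random_values))
```

An observed value well below the mean of the randomized values would indicate that the two trees are more similar than expected by chance.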

We considered the genomic abundance [28], [29] and occurrence [30], [31] of 2,397 FFs as phylogenetic characters to reconstruct phylogenies describing the evolution of 420 free-living organisms (i.e. taxa) using maximum parsimony. The raw abundance values of each FF in every proteome (g_ab) were log-transformed and divided by the logarithm of the maximum value in the matrix (g_max) to account for unequal proteome sizes and variances [29], [43]. The transformed abundance values were then rescaled from 0 to 23 (scaling constant) in an alphanumeric format (0–9 and A–N) to allow compatibility with the phylogenetic reconstruction software. The transformed abundance matrix with 24 possible character states was imported into PAUP* 4.0b10 [44] for the reconstruction of abundance trees. For occurrence trees, we simply used 0 and 1 (indicating absence and presence) as the valid character state symbols. We polarized both abundance and occurrence trees using the ANCSTATES command in PAUP* and designated character state 0 as the ancestral state, since the most ancient proteome is closer to a simple progenote organism that harbors only a handful of domains [20], [38]. The stem lineage of this organism gradually increased its domain repertoire, supporting the polarization from 0 to N and Weston's generality criterion, in which the taxic distribution of a set of character states is a subset of the distribution of another [45], [46]. Phylogenetic trees are adequately interpreted only when rooted. Rooting provides direction to the flow of evolutionary information and is useful to study species adaptations. In this study, we chose to root trees using the Lundberg method [47]. This scheme first determines the most parsimonious unrooted tree, which is then attached to a hypothetical ancestor. The hypothetical ancestor may be attached to any of the branches in the tree. However, only the branch that gives the minimum increase in overall tree length is selected [48].
This branch, which exhibits the largest number of ancestral (plesiomorphic) character states, was specified using the ANCSTATES command in PAUP*. Thus, Lundberg rooting automatically roots the trees while preserving the principle of maximum parsimony. This method is simple and free from artificial biases introduced by alternative rooting methods (e.g. the outgroup method). While selection of an appropriate outgroup to root the ToL is virtually impossible, Lundberg rooting provides a parsimonious estimate of the overall phylogeny and should be considered robust as long as the assumptions used to root the trees are not proven false. To evaluate support for the deep branches of ToLs, we ran bootstrap (BS) analysis with 1,000 replicates. Character state changes were recorded by specifying the ‘chglist’ option in PAUP*. Trees were visualized using Dendroscope ver. 3.0.14b [49].
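The abundance rescaling described in this section can be sketched in a few lines. This is a hedged illustration and not the authors' code: the function name is invented, and the +1 offset inside each logarithm is an assumption introduced here so that a count of zero maps to state 0.

```python
import math

# 24 character-state symbols: 0-9 followed by A-N.
STATES = "0123456789ABCDEFGHIJKLMN"

def encode_abundance(g_ab, g_max, scale=23):
    """Map a raw FF abundance to one of 24 alphanumeric character states.

    g_ab: abundance of an FF in one proteome; g_max: largest value in the
    matrix (must be at least 1). The +1 offset is an assumption.
    """
    normalized = math.log(g_ab + 1) / math.log(g_max + 1)
    return STATES[round(scale * normalized)]

print(encode_abundance(0, 1000))     # absent FF
print(encode_abundance(25, 1000))    # intermediate abundance
print(encode_abundance(1000, 1000))  # most abundant FF in the matrix
```

Log-scaling compresses the very large abundance values typical of eukaryotic proteomes, so that a handful of highly duplicated FFs does not dominate the character matrix.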

The 420-proteome dataset used in this study included proteomes from 48 Archaea, 239 Bacteria, and 133 Eukarya. The dataset did not include any parasitic organisms as they harbor reduced proteomes and bias the global phylogenomic analyses (e.g. [38]). FFs were assigned to proteomes using SUPERFAMILY ver. 1.73 [39] hidden Markov models [40], [41] at an E-value cutoff of 10⁻⁴ [42]. A total of 2,397 significant FF domains were detected in the sampled proteomes. The definitions of eight FFs in the 420-proteome dataset were updated in SCOP ver. 1.75 and were therefore renamed in our dataset. FFs were referenced using SCOP concise classification strings (css) (e.g. ‘Ferredoxin reductase FAD-binding domain-like’ FF is b.43.4.2, where b represents the class [all-beta proteins], 43 the fold, 4 the FSF and 2 the FF).
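The structure of a css identifier can be made explicit with a small parser. The helper names below are hypothetical; only the b.43.4.2 example comes from the text.

```python
from typing import NamedTuple

class ScopId(NamedTuple):
    cls: str   # structural class, e.g. "b" for all-beta proteins
    fold: int  # fold (F)
    fsf: int   # fold superfamily (FSF)
    ff: int    # fold family (FF)

def parse_css(css):
    """Split an FF-level css such as 'b.43.4.2' into its four levels."""
    cls, fold, fsf, ff = css.split(".")
    return ScopId(cls, int(fold), int(fsf), int(ff))

def parent_fsf(css):
    """Return the css of the FSF that contains the given FF."""
    return css.rsplit(".", 1)[0]

print(parse_css("b.43.4.2"))
print(parent_fsf("b.43.4.2"))
```

A helper like `parent_fsf` is the kind of lookup needed whenever an annotation defined at the FSF level is propagated to its descendant FFs.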

Results

We first describe the patterns of FF use and reuse in superkingdoms and then build on this knowledge to infer the meanings of domain gain and loss in proteomes.

Global patterns of domain gains and losses

To quantify the relative contributions of domain gains and losses impacting the evolution of superkingdoms, we retraced the history of character state changes (i.e. changes in the abundance or occurrence of FFs) on each branch of the reconstructed ToLs. For each FF domain, we counted the number of times it was gained and lost in different branches of the phylogenetic tree. Gains were recorded when the abundance/occurrence of a particular FF at a node was higher than the corresponding value at the immediate ancestral node. In contrast, losses were recorded when the abundance/occurrence of a particular FF at a node was lower than at the immediate ancestral node. Because we allowed character changes in both forward and backward directions (Wagner parsimony), each FF character could be both gained and lost a number of times across the many branches of the ToL. This assumption is reasonable as different lineages of organisms utilize domain repertoires differently. Because abundance counts are expected to be higher in the eukaryotic species (especially in metazoa) due to increased gene duplication events and a persistence strategy that favors flexibility and robustness (Figure 1D) [64], we also considered gain and loss statistics from the occurrence trees. To evaluate the performance of both models, we first compared the number of FFs that were gained (i.e. net sum above zero) and lost (net sum below zero) in both reconstructions. Out of the total 2,397 (2,262 parsimony informative) FF domains in the abundance model, 1,955 (86%) were gained, while only 236 (10%) were lost (Dataset S2). In contrast, the occurrence model identified 60.1% of FFs as gained (1,353/2,249) and 30.5% (686/2,249) as lost (Dataset S3). Nearly 96% (1,300/1,353) of the occurrence gains were also gained in abundance while only 26% (178/686) of the losses were common to both models.
This suggested that abundance included nearly all the occurrence gains and likely overestimated the number of gains (due to gene duplications and domain reuse). In contrast, occurrence led to more balanced distributions and likely overestimated losses (see below). To provide additional support to the gain/loss model, we pruned taxa from the original ToLs leaving only one superkingdom and recalculated character state changes on the pruned trees. This eliminated any biases resulting from the differences in the persistence strategies of the three superkingdoms and yielded four phylogenetic trees, Total (taxa = 420, total FF characters = 2,397), Archaea (48, 703), Bacteria (239, 1,510) and Eukarya (133, 1,696). For each of the four trees, we calculated the sum of gain and loss events for all parsimony informative FF characters and represented the values in boxplots (Figure 4A). In all distributions, medians were above 0, indicating that the sum of net gains and losses was a non-negative number for both abundance (Figure 4A:abundance) and occurrence (Figure 4A:occurrence) models. The exception was the eukaryal tree pruned from the occurrence model, for which the median was exactly zero. The result revealed that while both gains and losses occurred quite frequently, the former was more prevalent in proteome evolution.
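The counting rule for gains and losses can be sketched as follows. In the study these changes were read from the PAUP* output; the parent-map tree, the reconstructed node states, and the function name used here are hypothetical stand-ins.

```python
def count_gains_losses(parent, states):
    """Count gain and loss events for one FF character over all branches.

    parent: maps each node to its immediate ancestor (root maps to None).
    states: reconstructed abundance or occurrence state at every node.
    A branch is a gain if the descendant state is higher than the
    ancestral state, and a loss if it is lower.
    """
    gains = losses = 0
    for node, ancestor in parent.items():
        if ancestor is None:
            continue  # the root has no incoming branch
        if states[node] > states[ancestor]:
            gains += 1
        elif states[node] < states[ancestor]:
            losses += 1
    return gains, losses

# Toy tree with three taxa (A, B, C) and one internal node (n1).
parent = {"root": None, "n1": "root", "A": "root", "B": "n1", "C": "n1"}
states = {"root": 0, "n1": 1, "A": 0, "B": 2, "C": 0}

gains, losses = count_gains_losses(parent, states)
net = gains - losses  # above zero: FF counted as gained; below zero: lost
print(gains, losses, net)
```

Here the FF is gained on two branches and lost on one, so its net sum is positive and it would be classified as gained.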


Figure 4. Global patterns of gains and losses in superkingdoms. A) Sum of gains and losses for each FF domain is represented in boxplots for Total, Archaea, Bacteria, and Eukarya reconstructions using abundance and occurrence models. Numbers in parentheses indicate total number of parsimony informative characters in each analysis. A horizontal red line passes through zero on the x-axis. B) Histograms comparing the relative counts of gains and losses for each FF domain character, plotted on the nd scale. Bars in red and blue indicate gains and losses, respectively. The global gain-to-loss ratios are listed along with the total numbers of gain and loss events. n is the number of parsimony informative characters in each analysis. C) Histograms comparing the distribution of FF gains and losses in Archaea, Bacteria and Eukarya. Bars in red and blue indicate gains and losses, respectively. The x-axes indicate evolutionary time. Numbers in parentheses indicate total number of proteomes in each dataset. https://doi.org/10.1371/journal.pcbi.1003452.g004

The histograms in Figure 4B describe the distributions of gain and loss counts for all parsimony informative FF characters in the Total dataset. When plotted against evolutionary time (nd), results highlighted remarkable patterns in the evolution of domain repertoires. Domain gains outnumbered losses in both abundance (80,904 gains vs. 47,848 losses) and occurrence (17,319 vs. 13,280) tree reconstructions (Figure 4B). The gain-to-loss ratios were 1.69 and 1.30, respectively, indicating an excess of 69% and 30% in gains relative to losses. Relative differences in the numbers of gains (red) versus losses (blue) suggested that gains increased with the progression of evolutionary time in both reconstructions (see below). We note that different evolutionary processes may be responsible for shaping the proteomes in individual superkingdoms.
For example, the origin of Archaea has been linked to genome reduction events [20], [84], while HGT is believed to have played an important role in the evolution of bacterial species [25]. In contrast, eukaryal proteomes harbor an increased number of novel domain architectures that are a result of gene duplication and rearrangement events [6], [43]. Therefore, to eliminate any biases resulting from the effects of superkingdoms in the global analysis (Figure 4B), we recalculated the history of character changes on the pruned superkingdom trees recovered earlier (Figure 4C). For abundance reconstructions, the exercise supported earlier results where the number of gains was significantly higher than the corresponding number of losses for Archaea (4,616 vs. 2,009), Bacteria (36,606 vs. 20,196), and Eukarya (40,515 vs. 25,036) (Figure 4C: abundance). The overall gain-to-loss ratios decreased from 2.30 in Archaea to 1.81 in Bacteria and 1.62 in Eukarya (Figure 4C: abundance). The increased gain-to-loss ratios in akaryotic microbial species are remarkable; they imply that the rate of gene discovery in akaryotic microbes (by de novo creation, gene duplication, acquisition by HGT and/or recruitment) is higher than the rate in eukaryotes. This tendency in microbial species could be a novel ‘collective’ persistence strategy to compensate for their economical proteomes. For histograms representing occurrence models, global gain-to-loss ratios decreased in the order Archaea>Bacteria>Eukarya (Figure 4C: occurrence). Remarkably, the ratio in Eukarya dropped below 1, indicating prevalence of domain loss events relative to gains. This result supports recent studies that have proposed the evolution of newly emerging eukaryal phyla via genome reduction [85].
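The reported gain-to-loss ratios follow directly from the event counts quoted above, as this short arithmetic check shows. The counts are taken from the text; the variable names are illustrative.

```python
# (gains, losses) as reported for the abundance and occurrence reconstructions.
event_counts = {
    "Total (abundance)": (80904, 47848),
    "Total (occurrence)": (17319, 13280),
    "Archaea (abundance)": (4616, 2009),
    "Bacteria (abundance)": (36606, 20196),
    "Eukarya (abundance)": (40515, 25036),
}

ratios = {
    label: round(gains / losses, 2)
    for label, (gains, losses) in event_counts.items()
}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}")
```

A ratio of 1.69 means 69% more gain events than loss events; a ratio below 1 indicates that losses predominate.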

Accumulation of gains and losses in evolutionary time

When partitioned into the early, intermediate, and late evolutionary epochs, the gain-to-loss ratios exhibited an approximately linear trend towards increasing gains (Figure 5). For abundance, the ratios increased from 1.32 in the early epoch to 1.45 in the intermediate and 1.96 in the late evolutionary epochs. Similar trends were also observed for occurrence, with calculated ratios of 0.61, 0.97, and 1.68, respectively (Figure 5A). In fact, both gains and losses increased linearly with evolutionary time in all reconstructions. However, the accumulation of gains outpaced that of losses (Figure 5). Remarkably, the occurrence model suggested predominant losses in the first two phases of evolution (0.61 and 0.97) that were compensated by a significantly higher number of gains (1.68) in the late epoch. In contrast, abundance failed to illustrate this effect and indicated overwhelming gains in all evolutionary epochs.


Figure 5. Cumulative numbers of gains and losses. Scatter plots reveal an approximately linear trend in the accumulation of FF gains and losses in both the global analysis (A) and in individual superkingdoms (B). Gains are shown in red and losses in blue. The three evolutionary epochs are marked with corresponding gain-to-loss ratios in italics. https://doi.org/10.1371/journal.pcbi.1003452.g005

When looking at the individual epochs for pruned trees (Figure 5B), we noticed that the rate of domain gain increased with time (as before) (Figure 5A). However, the ratios in the initial two evolutionary epochs were considerably higher in Archaea for both the abundance and occurrence models. For example, Archaea exhibited gain-to-loss ratios of 2.06 and 2.14, in comparison to 1.26 and 1.39 in Bacteria, and 1.55 and 1.67 in Eukarya for early and intermediate evolutionary epochs (Figure 5B:abundance). In contrast, Bacteria exhibited an overwhelming gain-to-loss ratio of 2.88 in comparison to 2.67 in Archaea and 1.61 in Eukarya, in the late evolutionary epoch. Overall, the gain-to-loss ratios increased with evolutionary time in all superkingdoms with the sole exception of Eukarya, which had a lower ratio in the late (1.61) compared to the intermediate (1.67) epoch (Figure 5B:abundance). Results based on occurrence indicated similar trends but with relatively more balanced gain-to-loss ratios and still highlighted the abundance of domain gains in evolution. The individual ratios were 1.42, 1.66, and 2.44 in Archaea, 0.60, 0.91, and 2.61 in Bacteria, and 0.51, 0.95, and 0.95 in Eukarya (Figure 5B:occurrence). Both Bacteria and Eukarya showed increased levels of ancient domain loss. However, Bacteria compensated for this decrease by engaging in massive gain events during the late evolutionary epoch (ratio of 2.61).
In contrast, Eukarya exhibited an even exchange between FF gain and loss events (ratio = 0.95) in both the intermediate and late epochs. Occurrence results also supported the evolution of Eukarya by gene loss, which is in line with recently published analyses [23], [85]. Abundance also indicated this drop in gene discovery rate for recent domains in Eukarya. However, the drop appears to be compensated by increased duplications of other domains that lead to an increase in the overall number of domains that are gained (Figure 5B: abundance). This apparent discrepancy can be explained by the power of both models in depicting true evolutionary relationships between organisms. Abundance accounts for a number of evolutionary processes such as HGT, gene duplication, and gene rearrangements while occurrence merely describes presence and absence of FFs and because of its more ‘global’ nature fails to illustrate a complete evolutionary picture (Discussion).

Effect of unequal sampling of proteomes

To test whether unequal sampling of proteomes per superkingdom was contributing any bias to the calculations of domain gains and losses, we extracted 100 random samples of 34 proteomes each from the three superkingdoms and generated 100 random trees. From each of the random trees, we recalculated the gain-to-loss ratios using both abundance and occurrence models (Figure 6). Random and equal sampling supported the overall conclusion that gains were overwhelming during the evolution of domain repertoires (Figure 6). The median ratios for random trees were 2.47 in Archaea, 2.35 in Eukarya, and 2.34 in Bacteria for abundance reconstructions (Figure 6A). In comparison, the ratios decreased from 2.11 in Archaea to 1.93 in Bacteria and 1.11 in Eukarya for occurrence reconstructions (Figure 6B). Based on the results of random and equal sampling, we safely conclude that the gain of domains in proteomes is a universal process that occurs in all three superkingdoms of life. Moreover, the gain-to-loss ratios increase with time (Figure 5) and their effects are directly responsible for evolutionary adaptations in superkingdoms (Discussion). We also propose that using abundance increases the reliability of the phylogenomic model and accounts for many important evolutionary events, a feat that is not possible when studying occurrence.
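The equal-sampling step can be sketched as below. The sketch assumes that each replicate draws 34 proteomes without replacement from every superkingdom, which is one reading of the description above; the proteome identifiers, the function name, and the fixed seed are hypothetical, and tree reconstruction itself is not shown.

```python
import random

def equal_samples(proteomes_by_superkingdom, size=34, replicates=100, seed=0):
    """Draw `replicates` samples of `size` proteomes per superkingdom."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        {
            superkingdom: rng.sample(ids, size)
            for superkingdom, ids in proteomes_by_superkingdom.items()
        }
        for _ in range(replicates)
    ]

# Placeholder identifiers matching the dataset sizes (48, 239, 133).
proteomes = {
    "Archaea": [f"A{i}" for i in range(48)],
    "Bacteria": [f"B{i}" for i in range(239)],
    "Eukarya": [f"E{i}" for i in range(133)],
}

samples = equal_samples(proteomes)
print(len(samples), {sk: len(ids) for sk, ids in samples[0].items()})
```

Each sample would then be used to build one tree, from which the gain-to-loss ratio is recalculated.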


Figure 6. Equal sampling of proteomes. Boxplots comparing the distribution of net gains and losses in 100 random phylogenetic trees for both abundance (A) and occurrence (B). Numbers in parentheses indicate group median values. https://doi.org/10.1371/journal.pcbi.1003452.g006