To draw reliable conclusions about universally valid mechanisms in protein evolution, a sufficient number of observable rearrangements must be explained by the six rearrangement events defined in this manuscript (fusion, fission, terminal loss/emergence and single domain loss/emergence; see Methods). To this end, we reconstructed the ancestral domain content and arrangements at all inner nodes of the phylogenetic trees of five eukaryotic clades (vertebrates, insects, fungi, monocots and eudicots). For every domain arrangement that differs from the parental node, we examined whether the change could be explained uniquely by one of the six events.

Unique solutions are either exact solutions, where only a single event can explain the arrangement change, or non-ambiguous solutions, where multiple events of the same type can explain a new arrangement (e.g. ABC: A+BC / AB+C). Only unique solutions were further analysed in detail to focus on changes which can be explained with certainty (Additional file 2). Unique solutions can explain 50% to 70% of all observed new arrangements, depending on the analysed phylogenetic clade (Fig. 1).

Fig. 1 Frequency of the different solution types. Exact and non-ambiguous solutions can be found in about 50% of the cases

However, a small percentage of new arrangements can be explained by multiple different event types, i.e. ambiguous solutions (e.g. ABC: ABC-D / AB+C). Besides these ambiguous solutions, some new arrangements cannot be explained by any of the defined single-step events. These so-called complex solutions (25%-50%) would require several successive single-step events.
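The four solution types can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names are hypothetical, only two of the six events (fusion and loss of one terminal domain) are modelled, and arrangements are represented as tuples of domain names. Under these assumptions, anything that the two modelled events cannot explain is labelled complex.

```python
from typing import List, Set, Tuple

Arrangement = Tuple[str, ...]


def candidate_events(new: Arrangement, parental: Set[Arrangement]) -> List[Tuple[str, str]]:
    """List single-step events that could produce `new` from the parental node.

    Simplifying assumption: only fusion and loss of one terminal domain are
    modelled; the remaining four event types are left out of this sketch.
    """
    events = []
    # Fusion: `new` is the concatenation of two parental arrangements.
    for i in range(1, len(new)):
        left, right = new[:i], new[i:]
        if left in parental and right in parental:
            events.append(("fusion", "".join(left) + "+" + "".join(right)))
    # Terminal loss: a parental arrangement that is one domain longer at one end.
    for p in sorted(parental):
        if len(p) == len(new) + 1 and (p[:-1] == new or p[1:] == new):
            lost = p[-1] if p[:-1] == new else p[0]
            events.append(("terminal loss", "".join(p) + "-" + lost))
    return events


def classify(events: List[Tuple[str, str]]) -> str:
    """Map a list of candidate events to one of the four solution types."""
    if not events:
        return "complex"
    if len(events) == 1:
        return "exact"
    event_types = {event_type for event_type, _ in events}
    return "non-ambiguous" if len(event_types) == 1 else "ambiguous"


abc = ("A", "B", "C")
# Two fusions (A+BC and AB+C) explain ABC: same event type, so non-ambiguous.
two_fusions = {("A",), ("B", "C"), ("A", "B"), ("C",)}
# A fusion (AB+C) and a terminal loss (ABCD-D) both explain ABC: ambiguous.
mixed = {("A", "B", "C", "D"), ("A", "B"), ("C",)}

print(classify(candidate_events(abc, two_fusions)))
print(classify(candidate_events(abc, mixed)))
```

The two toy inputs mirror the examples given in the text: ABC explained by A+BC or AB+C, and ABC explained by either a terminal loss or a fusion.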

Comparison between clades

One major goal of this study is to find, besides clade-specific differences, universally valid evolutionary mechanisms of protein innovation that are present in all clades. Therefore, we analysed whether common patterns in domain rearrangements can be observed by measuring the relative contribution of each rearrangement event and comparing these contributions between the different clades (see Table 1 and Additional file 4).

Table 1 Frequencies of the six rearrangement events (in %)

The percentage of fusion events in our study ranges from 29% of all observed events in fungi to 64% in monocots. Fungi are the only clade in which fusion is not the most frequent event type; there, single domain loss is most frequent. Furthermore, in all clades except fungi, fissions and terminal losses account for a similar percentage of all domain rearrangements. In fungi, loss of terminal domains accounts for twice as many rearrangements as fissions. The exceptional distribution of event frequencies in fungi compared to the other clades is discussed below.

The two emergence categories, terminal and single domain emergence, contribute only 0.13% to 3.89% of all events, which shows that domain emergence is indeed rare compared to the much higher number of domain rearrangements and losses.

We observed three general patterns of the ranks of rearrangement events corresponding to the taxonomic kingdoms of animals, fungi, and plants. In the first pattern, observed in animals (i.e. vertebrates and insects), the most frequent domain rearrangement event is domain fusion (32% and 42% of rearrangements respectively), followed by single domain loss (27% and 20%) and terminal domain loss (21% and 19%). Arrangement gain by fission is slightly less common (20% and 17%), but still more frequent than the very low rates of single domain emergence (0.6% and 1.7%) and terminal emergence (0.1% and 0.4%).

The functional analysis of gained arrangements in insects (Additional file 5) using GO term enrichment reveals that olfaction-related adaptations (represented by the GO terms ’sensory perception of smell’, ’olfactory receptor activity’ and ’odorant binding’) are overrepresented. Other overrepresented GO terms include ’sensory perception of taste’ and ’structural constituent of cuticle’.

We did not find expansions of vertebrate specific GO terms at the root of vertebrates. However, we found overrepresented GO terms related to binding (e.g. ’protein binding’, ’nucleic acid binding’) and terms related to signal transduction (Additional file 6).
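Overrepresentation of a GO term among gained arrangements is commonly assessed with a one-sided hypergeometric (Fisher-type) test. The sketch below assumes that kind of test purely for illustration; the procedure actually used here is the one described in the Methods and may differ. The function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: all annotated arrangements (background)
    K: background arrangements carrying the GO term
    n: gained arrangements at the node of interest
    k: gained arrangements carrying the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probabilities of observing k, k+1, ..., upper annotated hits.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Invented counts: 40 of 5000 background arrangements carry the term,
# and 6 of the 100 gained arrangements carry it.
p = enrichment_p_value(k=6, n=100, K=40, N=5000)
print(f"p = {p:.3g}")
```

In practice, p-values obtained this way for many GO terms would additionally need a multiple-testing correction before terms are called significant.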

The distribution and rank of rearrangement rates in fungi (Additional file 7) resemble those of animals, with the only qualitative difference being that single domain losses are more frequent than fusions. A more detailed analysis of this phenomenon can be found below.

The third pattern of arrangement changes is observed in plants, i.e. monocots and eudicots. As in metazoans, but with an even higher percentage, the majority of new arrangements is explained by fusion (64% and 58%). The fission of one arrangement into two new arrangements is the second most frequent mechanism (12% and 16%) followed by slightly smaller numbers of terminal (11% and 13%) and single domain loss (10% and 10%).
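The rank patterns described for animals and plants can be reproduced from the percentages quoted in the text. The sketch below uses only those quoted values; fungi and the plant emergence categories are omitted because their full values are given only in Table 1. The variable and function names are illustrative.

```python
# Percentages as quoted in the text; Table 1 holds the complete values.
reported = {
    "vertebrates": {
        "fusion": 32, "single domain loss": 27, "terminal loss": 21,
        "fission": 20, "single domain emergence": 0.6, "terminal emergence": 0.1,
    },
    "insects": {
        "fusion": 42, "single domain loss": 20, "terminal loss": 19,
        "fission": 17, "single domain emergence": 1.7, "terminal emergence": 0.4,
    },
    "monocots": {
        "fusion": 64, "fission": 12, "terminal loss": 11, "single domain loss": 10,
    },
    "eudicots": {
        "fusion": 58, "fission": 16, "terminal loss": 13, "single domain loss": 10,
    },
}


def rank_events(freqs):
    """Return event names ordered from most to least frequent."""
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return [event for event, _ in ordered]


ranks = {clade: rank_events(freqs) for clade, freqs in reported.items()}
for clade, order in ranks.items():
    print(f"{clade}: " + " > ".join(order))
```

Both animal clades share one ordering of the four main events and both plant clades share another, which is the distinction between the first and third pattern.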

Some GO terms enriched in gained arrangements at the roots of the plant clades might be related to plant development and evolution, e.g. ’recognition of pollen’ in both plant clades and ’plant-type cell wall organization’ in eudicots (Fig. 2 and Additional file 8).

Fig. 2 Number of rearrangement events across the eudicot phylogeny. Digit representation of the total number of rearrangement events at a specific node is indicated next to the pie chart. For details on ’Outgroups’ see Methods. Significant GO terms in gained domain arrangements are shown in a tag cloud (box). GO terms that might point to eudicot specific evolution are: ’recognition of pollen’ and ’plant-type cell wall organization’

Domain loss in fungi

We analysed the distribution of domain arrangement sizes in the five clades (see Additional file 9) to find possible explanations for the different patterns of event frequencies mentioned above. The results show that a strikingly high number of fungal domain arrangements consist of just a single domain and that fungal arrangements are generally much shorter than those of vertebrates or insects. Both plant clades, monocots and eudicots, also have much shorter domain arrangements than the metazoan clades.

We found that both plant clades show the highest copy number of domain arrangements. Eudicots have 5.79 copies on average per single domain arrangement per species, while monocots have 5.64. This high number of duplications of the same domain arrangement could be explained by multiple whole genome duplications in these clades. Vertebrates follow with 1.93 copies per single domain arrangement and finally insects (1.27), while fungi show the lowest duplication count (1.15).

Effects of domain rearrangements

The general rates of rearrangement events and their distribution in a given phylogenetic tree can provide an insight into the evolutionary history of a whole clade as well as general adaptational processes in certain lineages. However, by taking a more detailed look at the specific domains involved in the rearrangement events at specific time points, we can trace back some major steps in the evolutionary history of the studied species. Here, we show three examples of new or outstanding functions at specific nodes in the evolution of vertebrates, plants and insects which can be related to the emergence of new domains or domain arrangements.

The origin of hair and adaptations of the immune system in mammals

One remarkable pattern in the distribution of rearrangement events in the vertebrate phylogeny is the high rate (33%) of single domain emergences at the root of all mammals. This is the highest percentage of single domain emergences at any node in the vertebrate tree. A closer investigation of the function of these emerged domains shows that ∼30% of them (domains of unknown function excluded) are associated with hair. This finding is a strong signal for the origin of hair or fur in the common ancestor of all mammals.

One of the most important structural protein families of mammalian hair is the keratin-associated protein family (KRTAPs). Hair keratins are embedded in an inter-filamentous matrix consisting of KRTAPs located in the hair cortex. Two major types of KRTAPs can be distinguished: high-sulfur/ultra-high-sulfur and high-glycine/tyrosine KRTAPs [22]. Three of these high-sulfur proteins can be found in the set of emerged domains as ’Keratin, high sulfur B2 protein’ (Pfam-ID: PF01500), ’Keratin-associated matrix’ (PF11759) and ’Keratin, high-sulphur matrix protein’ (PF04579). The proteins are synthesised during the hair matrix cell differentiation and form hair fibres in association with hair keratin intermediate filaments. Another domain that can be found in this set is the ’PMG protein’ (PF05287) domain, which occurs in two genes in mice (PMG1 and PMG2) that are known to be expressed in growing hair follicles and are members of a KRTAP gene family [23]. PMG1 and PMG2 are additionally involved in epithelial cell differentiation, while a further member of the emerged domains - ’KRTDAP’ (PF15200) - is a keratinocyte differentiation-associated protein. Keratinocytes are a cell type of the epidermis, the layer of the skin closest to the surface [24]. The KRTDAP related gene was isolated in rats between skin of prehair-germ stage embryos and hair-germ stage embryos, and shows high expression in regions of the hair follicle [25]. We can infer that the emergence of hair and fur also involved adaptation and restructuring of the skin, resulting in novel skin cell types and cell differentiation regulation mechanisms. Furthermore, the skin, and keratinocytes in particular, act as a first barrier against environmental damage and pathogen infestation and are therefore related to the second barrier, the immune system. Indeed, immune system related domains are the second biggest group in these emerged domains (>20% of domains with known function). 
As an example, the ’Interleukin’ domain (PF03487) emerged at the root of mammals and is associated with a group of secreted proteins and signalling molecules. The mammalian immune system is highly dependent on interleukins with certain deficiencies linked to autoimmune diseases and other immune system defects [26]. ’Lymphocyte activation family X’ is a domain also found in this set (PF15681), which is membrane-associated and expressed in B- and T-cells in addition to other lymphoid-specific cell types [27]. Additionally, out of all events occurring at the root of mammals, ’regulation of lymphocyte activation’ is an overrepresented term in the GO term enrichment analysis (see Additional file 10). These results reinforce the importance of the immune system for the early evolution of mammals.

Resistance to fungi in wheat

The functional analysis of gained domain arrangements using GO terms revealed an interesting pattern for the node leading to Triticeae, which includes the two wheat species Triticum urartu and Triticum aestivum as well as the grass species Aegilops tauschii. Five of the 15 enriched GO terms in Triticeae can be related to resistance to fungal pathogens via three different mechanisms. Chitinases are enzymes known to be involved in fungal resistance in plants and have been extensively studied in wheat species [28, 29]. The ability of these enzymes to degrade chitin, a primary component of fungal cell walls, can lead to the lysis of fungal cells and thereby provide resistance against them. We found the three significant GO terms ’chitin catabolic process’, ’cell wall macromolecular catabolic process’ and ’protein phosphorylation’ related to chitinases, which explain the innate fungal resistance of wheat and can also be utilized in genetic engineering to enhance fungal resistance in other crop plants [30]. The GO term ’protein kinase activity’ and the underlying serine/threonine kinase have also been shown to be used in plant defense against fungi [31]. Another mechanism of fungal resistance is based on an ATP-binding cassette transporter, which is used in many crop plants [32]. We relate the GO term ’ATP binding’ to this function of fungal resistance. Overall, the gained arrangements in Triticeae can be linked to the increased resistance of this clade to fungal pathogens.

Eusociality in bees

We found an example of interesting GO terms enriched at a node in Apidae, i.e. in the last common ancestor of the honey bee Apis mellifera and the bumblebee Bombus terrestris. This node marks one of the transitions from solitary to eusocial bees [33]. The overrepresented GO terms that relate to the evolution of eusociality comprise ’embryonic morphogenesis’, ’insulin-like growth factor binding’ and ’regulation of cell growth’ [33] and are additionally expanded in the species Bombus terrestris and Apis cerana. Insulin and insulin-like signalling (IIS) pathways have been shown to be differently expressed between castes in the honey bee and to play a role in caste differentiation [34, 35]. Additionally, IIS modifies the foraging behaviour of honey bee workers [36]. The functions of some domains associated with overrepresented GO terms can possibly be related to the emergence of eusociality, either because the domains are involved in development or because they have been shown to be differentially expressed between castes. Two domains are associated with growth factors, ’Insulin-like growth factor binding’ (PF00219) [34, 35] and ’EGF-like domain’ (PF00008). Knockdown experiments have shown that epidermal growth factor (EGF) is involved in caste differentiation in the honey bee [37, 38]. Several domains have been found to be differentially expressed between queens and workers in the honey bee and might be related to eusociality [39], namely ’Fibronectin type III domain’ (PF00041), ’Protein kinase domain’ (PF00069), ’Myb-like DNA-binding domain’ (PF00249) and ’Insect cuticle protein’ (PF00379). ’Insect cuticle protein’ is also suspected to play a role in the transition from solitary to eusocial bees [40].