Inference of homology groups

To perform a genome wide examination of novelty, we designed a bioinformatics pipeline to identify homologous groups of proteins within and between genomes, and determine their phylogenetic origin (Methods section and Supplementary Fig. 1). In contrast with other approaches (e.g., phylostratigraphy5,6), we use a large and representative taxon sampling and reciprocal sequence comparison7 combined with Markov clustering8 of complete proteomes to identify homology instead of one-way BLAST. We also define gene age based on occupancy in all taxa of a clade (see below) instead of using a single species as an anchor. While this is a robust methodology, it has some limitations seen in other BLAST-based analyses in that homology assignment might be affected by gene fusions, fissions, exon shuffling, lateral gene transfer, or presence of repetitive motifs9. We compared 62 genomes throughout Metazoa and their outgroups, including 44 genomes for 14 metazoan phyla (Fig. 1, Supplementary Fig. 2, Supplementary Data 1). These genomes were generated with different sequencing technologies, coverage, and gene prediction software. The completeness of these genomes was assessed using BUSCO10 (Supplementary Data 1, Supplementary Fig. 3); 10 animal genomes showed percentages of missing genes above 15%, but five of them belong to animals that had suffered known genome reductions, while the others are within clades in which other genomes display very low values of missing genes in the BUSCO analyses. After performing all-vs-all BLAST, MCL displays similarities in two-dimensional space, within which the distance between proteins is proportional to the e-value of their BLAST comparison. MCL then applies graph theory and hidden Markov models to cluster sets of proteins based on their proximity in that space8. All 1,494,207 proteins were clustered in 268,440 homology groups (HG); all human genes are grouped within 9451 HG and fruit fly genes within 7618 HG (Supplementary Data 4).

Fig. 1 Reconstruction of ancestral genomes. Evolutionary relationships of the major groups included in his study2. Different categories of HG are indicated in each node, from top to bottom, Ancestral HG, Novel HG, Novel Core HG, and Lost HG. Values assume sponges as the sister group to other animals, and placozoans as sister group to Planulozoa (=Cnidaria + Bilateria); alternative phylogenetic hypotheses are explored in Supplementary Data 3-8. Organism outlines from phylopic.org and the authors Full size image

The inferred HG provide a contrast between the hierarchical nature of gene classifications used in the literature (e.g., gene families, classes, and superfamilies) and their evolutionary dynamics (clustering based on sequence divergence), drawing a parallel with organisms’ classical taxonomy and their phylogenetic relationships. A given HG can either include genes traditionally defined as gene families (e.g., Iroquois gene family, part of the TALE class, which is part of the homeobox genes superfamily11), as gene classes (e.g., POU class of the homeobox superfamily, containing multiple gene families), or as superfamilies (e.g., Wnt ligands, containing multiple gene classes that comprise diverse gene families). The recovery of traditional gene families/classes/superfamilies demonstrates the reliability of the clustering approach taken here. However, the evolutionary history of some genes is evident from their clustering; for example, even though the Iroquois gene family is classified within the TALE class of the homeobox superfamily and emerged by duplication of a TALE gene, here it is clustered in a distinct HG separate from other TALE genes due to high sequence divergence. Hence, each HG defines a set of proteins that have distinctly diverged from others; it may contain one ortholog (gene family) or multiple paralogs (gene classes or superfamilies), regardless of their mechanism of origin and traditional classification. We believe this assignment is relevant from an evolutionary perspective as it highlights groups of proteins that are highly diverged from others. In contrast to approaches that atomize HG down to gene families based on further dissection of BLAST e-values (e.g., OrthoMCL12), these large clusters of orthologs and/or paralogs are less prone to misassignments caused by evolutionary patterns associated with gene family expansions, such as asymmetric evolution13.

Phylogenetic genomics

A phylogenetically aware parsing script, which takes into consideration the evolutionary relationships between taxa together with the taxonomic occupancy of the HG, was written to reconstruct the HG present in each node. We assume that a given homology group can only emerge once, but may be lost multiple times. We extracted (a) ‘Ancestral HG’: all those present in the Last Common Ancestor (LCA) of a given clade, (b) ‘Novel HG’: those present in the LCA of a clade but not outgroups, (c) ‘Novel Core HG’: a subset of novel HG present in every representative species within the clade, but permitting one absence, and (d) ‘Lost HG’: HG lost in the LCA of a clade (Fig. 1, Supplementary Data 2 to 8). The definition of ‘Core’ HG (Supplementary Note 1, Supplementary Data 2) allows for absence of the HG in one member of the ingroup to accommodate incompleteness of some genomes; ancestral lists of HG are therefore not affected by these genomes, as these HG are defined by their presence in at least two genomes belonging to the two main lineages within a clade. The Novel Core HGs were further validated by running BLASTP searches against the complete GenBank database (Methods section and Supplementary Data 9). We propose that the high level of retention of Novel Core genes is an indicator of functional importance. We placed poriferans as the sister group to all the other animal lineages, but alternative phylogenetic hypotheses14,15 do not change the general patterns described below (Supplementary Data 3-8). The full list of gene IDs for each category of HG and different ancestors can be found in Supplementary Data 10.

The genome of the Urmetazoon

Our analyses infer the LCA of extant metazoans contained a total of 6331 Ancestral HG (4929 HG if ctenophores are sister group to metazoans, Supplementary Data 3 and 4). This will be below the total number of genes in the genome, since HG can contain multiple genes and some HG could have been lost from all surviving lineages. Tracing forward in time, human and fruit fly genomes have 13,034 and 8270 genes, respectively, belonging to these Ancestral HG; hence, 55–60% of human and fly protein-coding genes belong to HG present in the first animals. Interestingly, only 922 of these 6331 Ancestral HG are retained by 43 or 44 of the animal genomes, highlighting extensive gene loss or divergence across metazoan evolution. As the ancestral functions of these HG cannot be directly inspected, the functions of their descendants in modern genomes were used as a proxy. Gene ontology (GO) analyses16 using fruit fly, which has extensive GO annotations, indicate that HG derived from the first animal genome were abundant in protein classes implicated in gene regulation (e.g., nucleic acid binding, transcription factors) and catalytic activities (e.g., hydrolases, transferases, and oxidoreductases; Supplementary Data 4). The most abundant GO pathways include many signalling pathways (e.g., Wnt, TGF-beta, cadherin signalling, integrin signalling; Supplementary Data 4). These findings are consistent with previous studies based on whole-genome comparisons4,17,18. At the level of both protein functional classes and pathways, there is considerable similarity between the ancestral animal genome and the extant genomes of fruit fly and human, and with those of successively earlier holozoan nodes (Supplementary Data 4).

Genomic novelty in the origin of animals

Concerning gene novelty, we infer the ancestral metazoan genome included a remarkable 1189 Novel HG; this number is similar to that inferred by previous study centred on the genome of a demosponge19. Our analyses indicate a threefold increase compared to novelties in the previous nodes (389, 340, and 399 novel HG in the older holozoan nodes; Fig. 1, Supplementary Data 3). The Novel HG comprise 19% of the total HG in the first metazoan, compared to only 8–10% in most older nodes examined; Planulozoa and Bilateria LCA nodes also have high proportions of Novel HG (Figs. 1, 2a, Supplementary Note 3, Supplementary Data 5). If ctenophores are considered sister group to the other animals, 782 Novel HG are recovered (Supplementary Data 3), a twofold novelty increase (16% of the HG in the genome, Supplementary Data 5). In addition, the metazoan LCA has a fivefold increase of Novel HG that are subsequently retained in all (or almost all) descendent lineages (Fig. 2b, Supplementary Note 3, Supplementary Data 5), independently of the nature of the earliest animal lineage. The 1189 metazoan Novel HG contain a large number of regulatory functions compared to the metazoan Ancestral HG set (e.g., 23 vs 6% transcription factors, 11 vs 4% signalling), and are depauperate in enzymatic and metabolic functions (Supplementary Data 4 and 6). Comparing the Novel HG of each phylogenetic node, the number genes for several protein classes displays a peak in the animal ancestor (Fig. 2c); the novel functions which are more abundant in the LCA of animals (Supplementary Data 6) are nucleic acid binding (23%), transcription factors (23%), and signalling molecules (11%). Thus, the first animal genome was not only showing a higher proportion of Novel HG, but these also perform major multicellular functions in the modern fruit fly genome. The implication is that the transition was accompanied by an increase of genomic innovation, including many new, divergent, and subsequently ubiquitous genes encoding regulatory functions associated with animal multicellularity.

Fig. 2 Novelty in ancestral genomes. a Proportion of Novel HG in the Ancestral HG for different holozoan ancestors. b Percentage of Core HG that are novel, and percentage of highly preserved genes among the Novel HG across different LCA. c Number of Protein Class GO hits for the fruit fly representatives of the Novel HG for the various phylogenetic nodes Full size image

Twenty five novel core groups of genes in animals

We identified which novel gene functions were more retained through evolution of animals. We find a total of 25 Novel Core HG: protein groups emerging in the genome of the first animal and still present in at least 43 of the 44 metazoan genomes examined (Table 1); these are independent of alternative phylogenetic scenarios at the root of animals. For these 25 HG, we give examples of modern bilaterian gene families contained in these HG. Together they cover the spectrum of classical functions linked to animal multicellularity: gene regulation, signalling, cell adhesion and cell cycle. Earlier opisthokont LCA have much lower numbers of Novel Core HG (Fig. 1, Supplementary Data 3 and 7), supporting the importance of genetic innovation in the emergence of animals. Of the 25 HG, eight encode key components of two major signalling pathways: Wnt and TGF-Beta. Thus, for the Wnt pathway we find the ligand, a GPCR membrane receptor of the pathway (frizzled), a component of the system that translocates signal to the nucleus (armadillo/beta-catenin), and the downstream effector (pangolin). Both pathways have been hypothesized to be key to the transition to animal multicellularity, possibly through provision of an axial body patterning system20; their recovery confirms the validity of our approach to find biologically relevant genes. Seven of the Novel Core HG encode transcription factors (Table 1): NKL, SIX and POU homeodomain proteins (implicated in developmental decisions), Hes bHLH proteins (effectors of Notch signalling), bHLH-PAS proteins (responders to environmental and physiological signals), a bHLH group containing twist-related genes (cell lineage determination) and Hand (development of muscle and nervous systems), and a group of ETS genes (winged helix-turn-helix domain). Some, but not all, of these genes had previously been postulated as arising at the origin of Metazoa21,22,23,24,25,26,27,28,29,30.

Table 1 Novel Core genes in the Animal Kingdom Full size table

Several of the 25 Novel Core HG have not been previously implicated in the rise of metazoans; we suggest these overlooked proteins were equally important. They include CPEB proteins which bind the polyA tail of mRNA and activate or repress translation31, adding a new layer to the regulation of gene expression in metazoans. Three other HG contain genes involved in cell adhesion, a key multicellular process; for example, fermitin and liprin are involved in activation of the integrin pathway and other cell-extracellular matrix interactions via focal adhesions32. The origin of these genes may have been pivotal to the origins of epithelia in poriferans and other Metazoa33. Two HG contain genes involved in cell cycle regulation, again critical in a multicellular organism: MADD and RUN. There is evidence that MADD interacts with pathways linked to apoptosis34, and the nematode orthologue is a guanine exchange factor (GEF)35. RUN is an effector of the Ras-like GTPase signalling pathway associated with cell proliferation36. Finally, there are four HG with functions related to the nervous system, striking considering our dataset includes animals with no nervous system (sponges and placozoans). These HG include neurotransmitter receptors, which had previously been associated with animal origins18,37, but also CADPS and RIM proteins. CADPS and RIM regulate the fusion of vesicles with the cell membrane in neuroendrocine cells and in the presynaptic neurons, respectively38; both interact with soluble NSF-attachment protein receptors (SNARE), which are known for having undergone expansions in the origin of animals3. The study of those genes in early animal lineages could be important to understand the genesis of the nervous system.

Gene losses

Finally, ‘Lost HG’ (Fig. 1, Supplementary Data 2, 3, and 8) display a different temporal pattern to the ancestral and novel complements. The numbers of Lost HG are more consistent across nodes (347–520), with Metazoa and Bilateria showing the highest numbers (503 and 520, respectively). The exception is the Planulozoa LCA with 72 HG losses; however, inference of HG losses is sensitive to the identity of the immediate sister group of the clade of interest, and this is the area of the tree with highest uncertainty. The identification of the biological functions of ‘Lost HG’ (Supplementary Data 8) is hampered by their absence in model organisms, not only in the in-groups (e.g., humans, fruit fly, etc.) but also in key outgroups (e.g., Arabidopsis thaliana, Saccharomyces cerevisiae, etc.). The largest portion of annotated HG contains enzymes, including some kinases; developmentally relevant genes barely feature among the ‘Lost HG’, with few exceptions, such as Bromodomain containing transcription factors and WRKY genes in the LCA of Holozoa, or a Glutamate receptor in the origin of Bilateria (Supplementary Data 8). Two nodes with the highest values of both lost and novel HG are the Metazoa and Bilateria, which indicates that these transitions were associated with high turnover of genes and greater genomic plasticity.