Drosophila species, Caenorhabditis species and primates have up to hundreds of active foreign genes

To determine the scale of HGT across well-characterised taxonomic groups, we examined 12 Drosophila species, four Caenorhabditis species and 10 primates (Figure 1) for which high quality genomes and transcriptomes are available. For each transcribed gene, we calculated the HGT index, h (the difference between the bitscores of the best non-metazoan and the best metazoan matches), which gives a relative quantitative measure of how well a given gene aligns to non-metazoan versus metazoan sequences, with positive numbers indicating a better alignment to non-metazoan sequences [12]. For example, the C. elegans gene gut-obstructed 1 (gob-1), which encodes a trehalose-6-phosphate phosphatase, has a best non-metazoan match with a bitscore of 135 and a best metazoan match with a bitscore of 39.3 resulting in an HGT index of 95.7. As we were interested in more than just very recent HGT, we excluded members of the test species’ phylum from the metazoan matches. This allowed us to identify HGT over evolutionary periods encompassing hundreds of millions of years, as opposed to only identifying HGT that occurred since the test species’ divergence from its most closely related species (likely up to tens of millions of years). Hereafter, when we refer to matches to metazoan sequences, we mean these subsets.

Figure 1 Phylogenetic relationships of the main taxonomic groups studied. The blue numbers indicate the ortholog groups mapping to each branch (HGT events). Events may have occurred anywhere along the branch, not just where the number is indicated. Events found at the base of the tree have occurred anywhere between the origin of the phylum and the base of the tree. Trees are not drawn to scale with each other. Full size image

We first identified a base level of HGT (called class C) by using conservative thresholds of h ≥30 (as in [12]) (meaning that the gene aligns much better, and is therefore much more similar, to non-metazoan genes) and bitscore of best non-metazoan match ≥100 (thereby excluding bad alignments to non-metazoans). The example given above (gob-1) passes these thresholds and is therefore at least class C HGT. This per-species information was then combined for each taxon (Drosophila, Caenorhabditis and primates) to construct ortholog groups. For each ortholog group we calculated the average h value of all members (h orth ) and defined the genes with h orth ≥30 as class B, a subset of class C. These genes are, on average, predicted as HGT in all tested species they are found in. The gene gob-1 has homologs in C. brenneri, C. briggsae and C. japonica, with values of h = 102, h = 97.1 and h = 86.4 respectively, giving an average h (h orth ) of 95.3 and as such gob-1 (ands its homologs) are also class B HGT. Finally, we applied a still more stringent filter to define class A foreign genes (a subset of class B), which had only very poor alignments to metazoan sequences and whose orthologs, as used to define class B, also had similarly poor alignments to metazoan sequences. To do this, we identified those sequences whose best match to a metazoan had a bitscore <100 and whose ortholog groups contain no genes with metazoan matches of bitscore ≥100 (Figure 2A). The gene gob-1 has no metazoan matches with bitscore ≥100 (best metazoan match = 39.3) and the same is true for its homologs (best matches of 37, 38.9 and 36.6, respectively), as such it is also class A HGT.

Figure 2 HGT genes by class. ( A ) The left panel shows a schematic representation of the HGT classes: class B and C genes have h index ≥ 30 and bitscore of the best non-metazoan blastx hit ≥ 100 (they are distinguished by h orth , which is not shown on this figure), while class A genes must additionally have bitscore <100 for the best metazoan blastx hit. The right panel shows the scores for all genes in H. sapiens, colour-coded according to their classification (class A: red, class B: orange, class C: blue, native genes: grey). ( B ) Box-plot of the number of genes in each class, for the three main taxa analysed (Drosophila spp. Caenorhabditis spp., primates species), colour-coded according to the same scheme (class A: red, class B: orange, class C: blue). Full size image

We then performed phylogenetic analyses for all genes of each of the above classes and found that an average of 55% of all class C genes, 65% of all class B genes and 88% of all class A genes were phylogenetically validated as foreign. This validation and further manual analysis (Additional files 1 and 2) suggested that, while false positives are minimised as C → B → A, some true positives are also lost. Therefore, class A genes represent a minimum estimate of the level of HGT for a given species.

We found that Caenorhabditis species have, on average, 173, 127 and 68 genes in HGT classes C, B and A, respectively. In contrast, Drosophila species have fewer active foreign genes with, on average, 40 genes in class C, 25 in class B, and only four in class A. Primate HGT levels fall between those of the invertebrate taxa, with an average of 109, 79 and 32 genes per species in classes C, B and A, respectively (Figure 2B, Additional files 2 and 3).

Identified foreign genes are unlikely to be explained by alternative hypotheses

To verify that the foreign genes we identified do indeed belong to the species under study and are not contamination (this is a problem in a number of animal genome sequences; see ‘Phylogenetic validation’ in Additional file 1), we tested whether they were found on the same genomic scaffolds as (that is, were linked to) genes of metazoan origin (native genes). Across all species we found an average of only nine class C genes per species (6.6% of foreign genes) that were not linked to native genes (Additional file 2), with correspondingly low proportions for class B and A genes. Demonstration of such high levels of linkage was only possible due to the high quality of the assemblies of these genomes. Although most species showed a high degree of linkage, there were three outliers (see Additional file 1), but even if all unlinked genes were contamination, which is not necessarily the case, this would have a minimal impact on the levels of HGT detected.

An alternative hypothesis to explain our data is that the genes we label as foreign in any single species are actually the result of conventional vertical descent, but have been lost in all other animal lineages. The parsimony principle tells us that we should choose the simplest explanation, which might be determined by comparing the rate of gene loss and the rate of gene gain by HGT. However, while the rate of gene loss over time can be estimated, at this point we cannot accurately estimate the rate of HGT over anything less than the time since the common ancestor of all metazoans, due to limited data. The rates that should actually be compared are the rates of gene loss and HGT at the earliest branches of the eukaryotic tree, but these rates are especially difficult to determine as the very long periods of time involved mean that ortholog determination (necessary to find which genes have been lost/gained) is hard. Furthermore, published estimates of the rate of gene loss typically treat all genes as equal, but the actual rate varies between types of genes and types of organisms (for example, parasites have higher loss rates [25,26]). As HGT involves the transfer of only a subset of genes (see section ‘Many horizontally acquired genes code for enzyme activities’, below), a generic gene loss rate is not comparable to the HGT rate.

Given these difficulties we attempted to differentiate between the two hypotheses with a different method. We looked at the functions of foreign genes and compared them to those of native genes that are known to have been lost in all other animal lineages, but were not predicted as foreign (genes for which the alternate hypothesis is true) and found significant differences between the foreign genes we identified and native genes fulfilling these criteria (see section ‘Many horizontally acquired genes code for enzyme activities’, below). Therefore, while we cannot entirely discount the gene loss hypothesis, it seems an unlikely explanation for the tens or hundreds of foreign genes per genome that we observe.

Identification of new foreign genes and confirmation of previously reported examples

The first report of the human genome sequence highlighted 223 protein sequences (of which 113 were confirmed as present in the genome by PCR) that were proposed to originate from bacteria by HGT [19]. While some of these genes were later confirmed as foreign, many were rejected [20-22]. At the time of writing, it is difficult to assess all of these sequences because some early identifiers have not been maintained, but we have been able to confirm or reclaim 17 previously reported examples as foreign (some also confirmed by other studies; Additional file 4). Furthermore, we identified up to 128 additional foreign genes in the human genome (128 class C, of which 93 are class B and 33 class A), giving a total of 145 class C genes, of which 110 are class B and 39 class A.

Among these examples, we reclaim those encoding the hyaluronan synthases (HAS1-3). These were originally proposed as examples of prokaryote-to-metazoan HGT [19], but later rejected [20]; however, neither study considered foreign taxa other than bacteria. We were able to identify all three hyaluronan synthases as class A HGT, originating from fungi, an assessment supported by our phylogenetic analysis (Figure 3). The HAS genes appear in a wide variety of chordates, but not in non-chordate metazoans, suggesting they result from the transfer of a single gene around the time of the common ancestor of Chordata, before undergoing duplications to produce the three genes found in primates. As the original rebuttal paper [20] only focused on recent HGT, and did not look for eukaryotic matches outside Chordata, they could not detect this ancient HGT.

Figure 3 Phylogenetic tree for the human gene HAS1. For each branch the species name and UniProt accession is shown. The human gene under analysis is shown in orange, proteins from chordates are in red, other metazoa in black, fungi in pink, plants in green, protists in grey, archaea in light blue and bacteria in dark blue. Numbers indicate aLRT support values for each branch where higher than 0.75 (on short terminal branches the support values are not shown). Full size image

We also identify cases of HGT reported more recently that have not been analysed in detail despite the potentially interesting consequences of such a finding. For example, the fat mass and obesity associated gene (FTO, in Additional file 5: Figure S1A) seems to be present only in marine algae and vertebrates [27,28], which is a highly unusual distribution. Another gene proposed to have been horizontally transferred is the ABO blood group gene (ABO, in Additional file 5: Figure S1B), which is suggested to enhance mutualism between vertebrates and bacteria [29]. We identified both these genes as class A HGT with phylogenetic validation (Additional file 3).

In the invertebrates, Dunning Hotopp et al. [17] reported the horizontal transfer of nearly the entire Wolbachia genome into the D. ananassae genome and the expression of 44 formerly Wolbachia genes, as evidenced by PCR amplification of their transcripts in flies that had been cured of Wolbachia and partial re-sequencing of the D. ananassae genome. These genes are still not included in the official genome of D. ananassae, likely because eukaryote genome sequencing projects routinely screen for and exclude bacterial sequences on the assumption that these represent contamination. Consequently, they were not in the dataset tested in this study and therefore do not appear among the foreign genes identified in D. ananassae. However, we did find one gene in D. ananassae, GF19976, which has not been identified previously as foreign and that may originate from Wolbachia.

Parkinson and Blaxter [30] identified four horizontally acquired genes in C. elegans. We identified all of these, three as class B HGT and the fourth as class A (highlighted in yellow in Additional file 3), but we also discovered a further 135 class C genes, of which 113 were class B and 71 class A (Additional file 3). This discrepancy with Parkinson and Blaxter [30] arises largely because these authors aligned C. elegans sequences with only a single non-metazoan (S. cerevisiae). Accordingly, we identified three of the four known foreign genes as fungal in origin, with the fourth also aligning well to fungal proteins (although we find it originated from a protist). Overall, however, only 4% to 15% of C. elegans HGT (depending on class) is of fungal origin (Additional file 2), with rather more (52% to 72%) deriving from bacteria (not assessed in ref. [30]). As mentioned in the Background section, there is phylogenetic evidence that the Cysteine Synthase Like genes found in nematodes, including C. elegans (cysl1 - 4), may have been acquired from plants [9]. Our analysis supports this conclusion with all four genes being class B HGT of plant origin and three being phylogenetically validated. HGT also occurs in a number of other nematodes, in particular the parasitic root-knot nematodes, in which as many as approximately 3% of all genes may be horizontally acquired [3,24].

Many horizontally acquired genes code for enzyme activities

In prokaryotes, horizontally acquired genes tend to be operational, typically encoding enzymes, rather than informational, that is, genes involved in transcription and translation [31]. It has recently been suggested that network connectivity is a more important consideration than function [32], but nevertheless most identified foreign genes are concerned with metabolism. Consistent with this, 83% of foreign genes in the bdelloid rotifer encode enzymes [12]. To determine whether this applies to HGT throughout metazoans, we first inspected Gene Ontology (GO) terms that were found at unexpectedly high levels among class A foreign genes (‘enriched GO terms’), then determined which GO terms indicated enzyme activities, and finally calculated the percentage of enzyme activities for both enriched and un-enriched terms. In almost all Caenorhabditis and primate species, enriched GO terms were significantly more likely (chi-squared test: 3E-9 ≤ P ≤ 0.05; Additional file 2) to describe enzyme activities than un-enriched terms (on average, 42% of enriched terms relate to enzyme activities vs. 26% of un-enriched terms; Additional file 2). In Drosophila species, insufficient terms were enriched to perform the calculation. Enriched GO terms in class B genes were also more likely to relate to enzyme activities. The second largest group of foreign genes codes for membrane-bound proteins, another category of operational genes. Therefore, like in prokaryotes [22], HGT is biased towards operational genes in metazoans.

A possible alternative explanation for the genes suggested to result from HGT is that they are actually the product of vertical descent in the concerned species, but have been lost in all other animal lineages (as discussed above in ‘Identified foreign genes are unlikely to be explained by alternative hypotheses’). This explanation is more likely in the primates as their HGT is predominately ancient (see section ‘Horizontal gene transfer is both ancient and ongoing’ below), reducing the number of times the gene must be lost. To test this hypothesis, the same GO analysis was performed on native genes that are found in chordates and not in non-chordate metazoans (that is, genes that have been lost in all non-chordate metazoans; a possible alternative explanation for the putative foreign genes we identify). In all primate species, enriched GO terms for these genes (when compared to those from all other native genes) were significantly less likely (chi-squared test: P ≤0.05; Additional file 2) to describe enzyme activities than un-enriched terms (on average, 4% vs. 20%; Additional file 2). This is the opposite of the result for foreign genes, suggesting that an alternative hypothesis of gene loss does not explain our findings.

Foreign gene functions

Many foreign genes are, like many native genes, currently uncharacterised, even in intensively studied model organisms; for example, the human (foreign) gene ENSG00000136830 is annotated ‘family with sequence similarity 129, member B’, but there is no information on its role. Where foreign genes have meaningful annotation, it is clear they code for a wide variety of different functions across a broad range of categories, some of which may overlap. Here we describe the six most noteworthy categories, from largest to smallest, across C. elegans, D. melanogaster and the human (Additional file 3).

In C. elegans, the largest category includes genes connected to the innate immune response (16 genes), including genes that specifically target bacterial cell walls (lys family), genes targeting fungi (for example, endochitinases) and other more general immune system genes (for example, thn-3 and thn-4). The second largest category comprises eight genes involved in lipid metabolism, including the breakdown of fatty acids by beta-oxidation (for example, fatty acid CoA synthetase family, acs-11), as well fatty acid synthesis (for example, fatty acid desaturase, fat-1). The third category includes four genes involved in macromolecule modification, which encompasses activities such as phosphorylation, methylation and ubiquitination. The fourth category governs stress responses and includes a heat shock protein (dnj-16), an LEA protein (lea-1) and two genes involved in the trehalose synthesis pathway: trehalose-6-phosphate phosphatase (gob-1) and trehalose-phosphate synthase (tps-1). Trehalose production allows C. elegans dauer larvae to survive extreme desiccation [33], while LEA proteins are also linked to tolerance of water stress in C. elegans [34] and other invertebrates, as well as plants and microorganisms (reviewed in [35]). The fifth category consists of antioxidant activities (one gene; glutathione peroxidase, gpx-2) and the sixth category is amino-acid metabolism, also consisting of a single gene, coding for asparagine synthesis (asns-2).

There are far fewer foreign genes in D. melanogaster, but we do see genes belonging to some of the same categories as in C. elegans, namely macromolecule modification (two genes), the innate immune response (three genes) and stress response (three genes). The three D. melanogaster immune response genes all belong to the same family of proteins, which is involved in the phagocytosis of bacteria, while the three stress response genes are all involved in the trehalose synthesis pathway: two trehalose phosphatases (FBgn0031907, FBgn0031908) and a trehalose-phosphate synthase (Tps1). While this last gene has the same name as a C. elegans trehalose-phosphate synthase gene (Tps1/tps-1), alignment shows they are dissimilar, especially outside the catalytic domain, suggesting they do not originate from the same HGT event (in Additional file 5: Figure S2). Likewise the trehalose phosphatases are not conserved across species.

In the human we find genes in five of the six categories: amino-acid metabolism (two genes), macromolecule modification (15 genes), lipid metabolism (13 genes), antioxidant activities (five genes) and innate immune response (seven genes). The lipid metabolism genes include genes with similar functions to the C. elegans genes, such as the breakdown of fatty acids by beta-oxidation (for example, enoyl-CoA, hydratase/3-hydroxyacyl CoA dehydrogenase, EHHADH), as well as a wide variety of other functions including the formation of glycolipids via chain extension (for example, globoside alpha-1,3-N-acetylgalactosaminyltransferase 1, GBGT1) and transmembrane transporters required for lipid homestasis (for example, ATP-binding cassette, sub-family G (WHITE), member 5, ABCG5). The innate immune response genes include genes involved in the inflammatory response (for example, immunoresponsive 1 homolog, IRG1), genes for immune cell signalling (for example, phosphatidylinositol-4,5-bisphosphate 3-kinase, catalytic subunit gamma, PIK3CG) and antimicrobial genes (for example, epididymal peptidase inhibitor, EPPIN).

We do not find any of the same foreign genes in common across the three species because our method precludes this: such genes would have been present in a common ancestor and would be screened out as metazoan. However, we do find shared functions, such as the trehalose synthesis pathway in the invertebrates. Few genes are found in shared pathways. This may indicate that transfers happen one gene at a time, with each gene being separately integrated into the metabolic networks of the organism. Broadly speaking we do not see differences between the species in the functions encoded by foreign genes, except in the immune response category: the majority of the invertebrate genes encode enzymes that break down bacterial and fungal cell walls, which would seem to confer a clear adaptive advantage, while the human genes are more likely to code for signalling and regulation of the immune response and have less obvious advantages to the organism. This likely reflects the differences in age between the vertebrate and invertebrate HGT (see section ‘Horizontal gene transfer is both ancient and ongoing’, below), with the more recently acquired foreign genes in the invertebrates having a clearer role than the ancient foreign genes in the vertebrates, which have had longer to integrate into networks.

Foreign genes predominately originate from bacteria and protists

When calculating h, the likely taxon of origin of a foreign gene was taken to be the taxon of the best-matching protein. Bacteria and protists are the most common donors in all groups (Figure 4), which might reflect the relative abundance of the respective donor species in the environments of the recipient organisms. The phylogenetic validation of the foreign genes occasionally indicated a different origin than the original calculation (based on alignments and h index), but both methods agreed on average 92% of the time; performing the analysis shown in Figure 4 using phylogenetically predicted origins instead shows the same pattern of donors (data not shown). The identity of the actual donor species is much harder to determine, as the identified ‘donor’ is almost certainly just the most closely related species currently sequenced. This is especially the case for older HGT events where the same foreign gene appears in more than one species, that is, where horizontal transfer predates the divergence of species. However, we did find a number of recent transfers (present in only a single studied species) that were identified as originating specifically from Wolbachia, with one example each in D. ananassae, C. briggsae and C. japonica (GF19976, CBG07424 and Cjp-ubc-6, respectively).

Figure 4 Mean origin of class C foreign genes for each taxon. Numbers show percentage contribution within each taxon (row). The same analyses for Class B or A genes show very similar patterns. The colour scheme is as in Figure 3: origin from archaea is light blue, from bacteria is dark blue, from protists is grey, from plants is green and from fungi is pink. Full size image

Our method also identified putative HGT from viruses: while rare in both Drosophila and Caenorhabditis, up to 50 more foreign genes of viral origin per species were identified in the primates (‘Class V’: Additional files 2 and 3). The majority of such genes only align to viral and metazoan sequences, making the direction of transfer unclear, and therefore we excluded them from the rest of our analysis.

Foreign genes are as likely to contain introns as native genes

The many foreign genes that originate from bacteria would originally have lacked introns, but may have gained them while becoming adapted to the recipient species (domestication). To test this we looked at whether bacterial-origin foreign genes have introns. The Drosophila species generally have too few foreign genes to perform the analysis, but in three Caenorhabditis species (all except C. japonica) and all primates the percentage of bacterial-origin foreign genes with introns is around 95%. For all three classes of foreign gene (C, B and A), there was no significant difference between the proportion of bacterial-origin foreign genes with introns and the proportion of native genes with introns (as measured by a chi-squared test; Additional file 2). The same was true for foreign genes as a whole (all origins; Additional file 2). This observation also makes it unlikely that the detected HGT is actually contamination of the genome with bacterial sequences, as these would lack introns. The exception, C. japonica, has significantly fewer bacterial-origin foreign genes with introns than native genes in all three classes (P < 8E-6), averaging only 29% of bacterial-origin foreign genes with introns. It also has significantly fewer class A foreign genes with introns than native genes with introns (P < 0.001) as discussed below.

Horizontal gene transfer is both ancient and ongoing

To determine whether the detected HGT is ancient (prior to the divergence of the studied taxon), or has occurred throughout the evolution of a particular taxon, we mapped the foreign ortholog groups (representing founding HGT events) for each taxon onto the corresponding phylogenetic trees. In Drosophila species, there is a broad correspondence between length of branch (time) and the number of HGT events along each branch, suggesting that HGT has occurred throughout Drosophila evolution and is likely to be ongoing (Figure 1).

The same can be inferred for the Caenorhabditis species. Interestingly, a much larger number of HGT events have occurred in C. japonica than in the other studied Caenorhabditis species or their common ancestors, and its foreign genes also have different properties: it is the only species studied where significantly fewer multi-exon genes are found among foreign genes of prokaryotic origin than among native genes (Additional file 2). Transferred prokaryotic genes presumably require some time to acquire introns, and the lower proportion of intron-containing foreign genes is consistent with comparatively recent HGT events. An alternative explanation is that the C. japonica genome is contaminated, since around twice as many of its foreign genes are unlinked to native genes as in other species (Additional file 2). However, even if all unlinked genes are considered to be contamination and are discounted, there would still be more HGT events unique to C. japonica than unique to the other studied Caenorhabditis species.

The distribution of transfer events is different in the primates, with most foreign groups mapping to the base of the tree (a common ancestor of primates), suggesting that the majority of HGT in primates is ancient. In these cases we are not inferring that the HGT event occurred in the most recent common ancestor of all primates, but that it occurred sometime between the common ancestor of Chordata and the common ancestor of the primates, that is, prior to the time period shown in Figure 1. For example, in the case of HAS1 (Figure 3), which is found in a wide variety of chordates, the HGT event likely occurred soon after the common ancestor of Chordata arose.

Foreign genes undergo duplication and are distributed throughout the genome

Horizontally acquired genes can undergo duplication and diversification: for example, the three hyaluronan synthases in Homo sapiens belong to the same ortholog group and probably result from a single transfer event, followed by duplications. We observed the same scenario for other genes in H. sapiens (for example, the four peptidyl arginine deiminases and the nine PRAME family members; Additional file 3), and also in other species. In an extreme case (the O-acyltransferases belonging to the same ortholog group in C. elegans; Additional file 3) as many as 30 genes probably derive from a single HGT event.

To ask whether there are ‘hotspots’ undergoing more frequent HGT in the studied genomes, we plotted the locations of foreign genes on the chromosomes or scaffolds of the respective genomes (Additional file 5: Figure S3). We found no evidence for ‘hotspots’, but the limited number of HGT events per species and the frequent occurrence of chromosomal rearrangements during evolution, which complicate cross-species comparisons, make it difficult to draw reliable conclusions.

HGT is a general feature of chordate genomes

Because there is limited information on HGT in the chordates, we also identified foreign genes for 14 other vertebrate species (Additional file 2). We find 60 to 240 class C genes (approximately 0.4% to 1.3%) across all of these species, in line with our findings for Drosophila, Caenorhabditis and primates, suggesting that HGT is not restricted to a few animal groups. We did not try to identify class A and B genes, as our method does not produce reliable ortholog groups for species separated by large evolutionary distances.