Choice of model organisms and inference of orthogroups

To test the hypothesis that differences in diet between organisms can impact on the nucleotide composition of their genomes, a comparative genomic analysis was performed using bacterial (Mollicutes) and eukaryotic (Kinetoplastida) parasites that have adapted to different host niches (Fig. 1; Additional file 1: Figures S1 and S2). These parasites were chosen for analysis because none of the species fix nitrogen and so require nitrogenous compounds obtained from their environment [24, 25]. Furthermore, unlike opportunistic parasites or free-living organisms, these parasites are restricted to host niches that differ in the relative abundance of biologically available nitrogen. Specifically, parasites that colonise plant hosts are nitrogen limited in comparison to those that colonise animal hosts [21]. Additionally, the parasites’ pathways for ATP generation differ in liberation of biologically available nitrogen (Additional file 1: Figure S2) [23, 26–29]. The parasites that obtain energy through glycolysis obtain carbon skeletons and re-generate ATP, whereas the parasites that obtain energy through catabolism of arginine or amino sugars additionally obtain biologically available nitrogen (Additional file 1: Figure S2). Thus, the parasites were categorised into three groups depending on host type and whether their metabolism liberates nitrogen. Low nitrogen availability (L N ) parasites colonise plants and obtain energy through carbohydrate catabolism, medium nitrogen availability (M N ) parasites colonise animals and primarily catabolise carbohydrates, and high nitrogen availability (H N ) parasites colonise animals and obtain energy through amino acid or amino sugar catabolism. For further details on the metabolic properties of these parasites see Additional file 2.

Fig. 1 Phylogenetic trees of the parasites used in this study shaded according to their host and metabolic strategy. Green indicates plant parasites that obtain energy from carbohydrate catabolism (low nitrogen availability, L N ). Yellow indicates animal parasites that obtain energy from carbohydrate catabolism (medium nitrogen availability, M N ). Orange indicates animal parasites that obtain energy from amino acid and/or amino sugar catabolism (high nitrogen availability, H N ) Full size image

A set of orthologous gene groups (orthogroups) covering 15 kinetoplastid genomes (Fig. 1; Additional file 3: Table S1) and an independent set of orthogroups covering 17 Mollicute genomes (Fig. 1; Additional file 3: Table S1) were inferred. Both sets of orthogroups were subject to filtering such that orthogroups were retained for further analysis only if the orthogroup comprised a single-copy gene present in at least three species from each nitrogen availability group (i.e. three L N , three M N and three H N species). In this analysis, use of orthologous protein-coding genes allows direct investigation of the effect of adaptation to different metabolic strategies on nucleotide sequences that are derived from a common ancestral state. These genes may also be considered house-keeping genes as the organisms have only one tissue (unicellular) and these genes are conserved across all three groups. The same analysis cannot be done in intergenic regions where ambiguity of orthology prevents paired comparison of sites. Moreover, in the case of bacteria there are too few intergenic regions for robust statistical analyses. Of the 9526 orthogroups identified in kinetoplastids, 3003 satisfied the filtration criteria, encompassing ~40 % of all single-copy genes in these organisms. Similarly, of the 1280 orthogroups identified in the Mollicutes, 168 satisfy the filtration criteria, encompassing 28 % of all single-copy genes in these organisms.

Low nitrogen availability parasites have low nitrogen content sequences and vice versa

In the kinetoplastid parasites 878,193 orthologous codons in the 3003 conserved single-copy orthologous genes were compared. This revealed a significant difference in the nitrogen content of mRNA between the different nitrogen availability groups (Fig. 2a). On average the mRNAs in L N parasites cost one fewer nitrogen atom for every 15 codons compared to the same mRNAs in M N (p < 0.001) and one for every seven codons compared to H N (p < 0.001). This corresponds to nitrogen savings of ~0.6 % and ~1.3 %, respectively. Given a kinetoplastid cell has ~61,000 transcripts [30] with an average length of 630 codons, L N kinetoplastid parasites would use ~5.5 × 106 fewer nitrogen atoms than H N parasites to produce the same transcriptome. This is enough nitrogen atoms to make ~8700 average sized proteins. The kinetoplastids also exhibit an analogous difference in the nitrogen content of double-stranded DNA (dsDNA; Fig. 2b). Here genes in L N parasites cost one less nitrogen atom for every four codons compared to H N , saving roughly 157 nitrogen atoms per gene (~1.1 %). Considering that kinetoplastids are diploid with an average of 8000 genes, this difference in nitrogen cost means that L N parasites use ~2.5 × 106 fewer nitrogen atoms to encode the exact same cohort of genes.

Fig. 2 Nitrogen availability influences gene sequences. a The average mRNA nitrogen content per codon for 3003 orthologous genes in the Kinetoplastida. b The average nitrogen content per double stranded codon (dsDNA) for the same genes. c As in a but for 168 orthologous Mollicute genes. d As in b but for the Mollicutes. The y-axis is the probability density function (PDF) for the distributions Full size image

A similar phenomenon was observed when comparing the mRNA sequences in Mollicutes parasites (Fig. 2c). Here comparison of 38,255 orthologous codons in 168 orthologous genes revealed that L N parasites used one fewer nitrogen atom for every nine codons compared to the same mRNAs in M N (p < 0.001) and one for every five codons compared to H N (p < 0.001). This corresponds to nitrogen savings of ~1 % and ~1.8 %, respectively. Though the Mollicutes exhibit a nitrogen-dependent effect in their mRNAs, the same strong effect is not seen in their dsDNA (p = 0.025 when comparing L N and H N , p < 0.001 comparing M N with either L N or H N ; Fig. 2d). We propose that the absence of a clear nitrogen-dependent effect at the DNA level is due to a strong GC to AT mutation bias thought to be caused by a lack of dUTPase coupled with a reduced ability to correct erroneous dUTP incorporation in DNA [31, 32]. Thus, though the mRNA for the same genes has a lower nitrogen cost in nitrogen-limited species, the high AT nucleotide composition of the DNA reflects the mutational bias imposed by the lack of dUTPase.

For both the kinetoplastids and Mollicutes an analogous difference is also seen in the nitrogen content of the amino acid side chains of these orthologous sites. The L N parasites use amino acids whose side chains require less nitrogen than the M N and H N parasites (Additional file 1: Figure S3). The slight discrepancy between the M N and H N parasites can be explained by the reduced use of arginine in the H N species as they primarily obtain energy from arginine catabolism [23, 33, 34]. This is consistent with previous studies of plant and animal proteins that observed reduced nitrogen content of amino acid side chains in the nitrogen-limited plant species [17, 35].

Different metabolic strategies in the same host niche cause concomitant differences in gene sequence nitrogen content

To provide further insight into the relationship between metabolism and genome nucleotide composition, an additional analysis was conducted on Mollicutes parasites that occupy the same host niche but obtain energy through different metabolic strategies. Here, three Mollicutes species, Mycoplasma hominis, Mycoplasma genitalium and Ureaplasma parvum were analysed (note Ureaplasma parvum is also a Mollicute but a different species to Mycoplasma parvum used in the analyses above). Each of these three species reside in the same urogenital tract niche but obtain energy from catabolism of different biomolecules [23]. M. genitalium and U. parvum metabolise glucose and urea, respectively. However, M. hominis has lost the ability to generate ATP via glycolysis and instead generates ATP via nitrogen-liberating arginine catabolism [23].

Using the same methods outlined previously, 51,998 orthologous codons in 227 conserved single-copy orthologous genes (present in each of the three species) were compared (Fig. 3a). This revealed that despite inhabiting the same niche environment, there was a significant difference (p < 0.001) in the nitrogen cost of genes, equating to using one fewer nitrogen atom for every six codons in M. hominis (H N ) compared to M. genitalium (M N ) (~1.5 %). Since urea metabolism generates ammonia, one could expect U. parvum to be a H N parasite. However, U. parvum exports ammonia to drive ATP synthesis, meaning energy generation is linked with export of nitrogen from the cell. Thus, analogous to M. genitalium, U. parvum is a nitrogen-limited species and uses one fewer nitrogen atom for every five codons compared to M. hominis (~1.8 %). As before, the strong mutation bias in Mollicutes means that the same nitrogen-dependent effect is not seen in their dsDNA (Fig. 3b). Taken together, this comparison reveals that, in a common host niche, different metabolic strategies can result in concomitant differences in mRNA nitrogen content.

Fig. 3 Analysis of gene nitrogen content of three parasites that occupy the same host niche but utilise different metabolic strategies. a The average mRNA nitrogen content per codon for 227 single copy genes present in each species. b The average nitrogen content per double-stranded codon (dsDNA) for the same genes. The y-axis is the probability density function (PDF) for the distributions. Yellow indicates an animal parasite that obtains energy from carbohydrate catabolism (medium nitrogen availability, M N ). Orange indicates an animal parasite that obtains energy from amino acid and/or amino sugar catabolism (high nitrogen availability, H N ). Pink indicates an animal parasite that obtains energy by exporting ammonia from the cell Full size image

Differences in genome-wide patterns of synonymous codon use are explained by selection acting on codon nitrogen content

Given that there is a clear difference in the nitrogen content of genes between different nitrogen availability groups, it was assessed whether this phenomenon could be explained by differences in the nitrogen content of synonymous codons. To do this a novel model for genome-wide synonymous codon use was constructed that considers mutation bias and selection acting on the nitrogen content of codons (see the “A model for synonymous codon use under the joint pressures of selection and mutation bias” section in the “Methods”). Using this model, the value of the nitrogen-dependent selection bias (2N g s) and mutation bias (m) were found that best explained the real sequence data (see “Methods” for complete model description). Here a negative value for 2N g s indicates that selection is acting to decrease nitrogen content and vice versa.

For the kinetoplastids, application of this modelling approach was able to explain genome-wide patterns of synonymous codon use with >90 % accuracy across all nitrogen availability groups (Fig. 4). Moreover, sequences simulated using these fitted codon use frequencies recapitulated the observed patterns of nitrogen content in mRNA (Fig. 4d) and dsDNA (Fig. 4e (Additional file 1: Figures S4 and S5). Consistent with nitrogen availability, the value of the selection bias for incorporation of nitrogen atoms in gene sequences was most negative in L N parasites (2N g s = −0.09), intermediate in M N parasites (2N g s = −0.06) and least negative in H N parasites (2N g s = −0.03). The distribution of 2N g s parameters for individual species within each group were also significantly different between each group (ANOVA, p < 0.01). Thus, differences in nitrogen availability between species are reflected in the relative strengths of the selection bias on codon nitrogen content. Furthermore, mutation bias towards GC was lowest in L N parasites (m = 0.67) and highest in H N parasites (m = 0.31). Importantly, just considering selection acting on the nitrogen content of mRNA (Additional file 1: Figure S5b) or mutation bias (Additional file 1: Figure S5d) in isolation resulted in higher AIC values (Additional file 3: Table S1), indicating the dual parameter model is better. Thus, the pattern of codon use and gene nitrogen content is best explained by a model that considers both selection acting on the mRNA nitrogen content of genes and mutation bias (Fig. 4; Additional file 1: Figure S5e). Furthermore, the statistical significance of selection acting on the nitrogen content of coding sequences was assessed by a permutation test (see “Methods”). This showed that selection acting on the nitrogen content of the mRNA sequences was significant for L N (p = 0.004) and M N (p = 0.021) parasites but was not significant for the H N kinetoplastid parasites (p = 0.457). This is consistent with our findings that indicate H N kinetoplastids are not under selection to minimise the nitrogen content of their coding sequences. The change in codon bias also accounts for the majority of the difference in genome-wide GC content between species. Specifically, the coding regions constitute ~50 % of the genome in kinetoplastid parasites and thus changes in synonymous codon use account for 61 % of the observed difference in genome-wide GC content between H N and L N species (Additional file 3: Table S1).

Fig. 4 A selection-mutation model for synonymous codon use in kinetoplastids explains relative synonymous codon use with ~90 % accuracy and recapitulates the difference in the nitrogen cost of genes. a The average mRNA nitrogen content per codon for 3003 orthologous genes in the Kinetoplastida. b The average nitrogen content per double-stranded codon (dsDNA) for the genes in a. c Empirical codon use probabilities (expressed as percentages) plotted against themselves. d The average mRNA nitrogen content per codon for sequences simulated using synonymous codon use probabilities derived from fitting the selection-mutation model to the observed sequence data. e The average nitrogen content per double-stranded codon for the genes in d. f The synonymous codon use probabilities inferred using the selection-mutation model plotted against the empirical codon use probabilities expressed as percentages. Dot colour corresponds to nitrogen availability group. L N R2 = 0.92, M N R2 = 0.88, H N R2 = 0.90 Full size image

It should be noted that simulating sequences using perfect genome-derived codon use frequencies (i.e. using 61 constrained parameters; Additional file 1: Figure S5h) results in simulated sequences whose distributions are not significantly different to those obtained in our two-parameter selection-mutation model. Thus, the difference between the distributions of nitrogen content for the real (Fig. 4b) and simulated sequences (Fig. 4e) is a result of factors affecting codon bias in individual genes that are not encapsulated by our genome-wide model.

A similar phenomenon is observed for the Mollicutes, though the fitted mutation bias values are much larger (m > 3.5), indicative of a strong GC to AT mutation bias. This high value for m is consistent with the loss of dUTPase and a reduced ability to correct erroneous dUTP incorporation into the genome [31, 32]. The selection-mutation model is capable of explaining genome-wide patterns of codon use with 94 % accuracy across all nitrogen availability groups. Consistent with nitrogen availability, the value of the selection bias for incorporation of nitrogen atoms in gene sequences was most negative in L N parasites (2N g s = −0.24), intermediate in M N parasites (2N g s = −0.15) and least negative in H N parasites (2N g s = −0.13) (Fig. 5; Additional file 1: Figure S5m). The distribution of 2N g s parameters for individual species within each group was significantly different when comparing L N species with M N or H N (ANOVA, p < 0.01); however, the difference between M N and H N species failed to reach significance (ANOVA, p > 0.05) (Additional file 1: Figure S6). As for the kinetoplastids, the AIC values of the selection-mutation model were better than for the models that consider either selection or mutation bias individually (Additional file 1: Figure S5J, L; Additional file 3: Table S1). Furthermore, significance testing showed that selection acting on mRNA nitrogen content was significant for all Mollicutes groups (L N p = 0.001, M N p = 0.001, H N p = 0.04). As coding sequences comprise the majority of these Mollicutes genomes (~83 %) the difference in genome-wide GC content between M N and L N species is fully attributable to differences in synonymous codon use (Additional file 3: Table S1).

Fig. 5 Model for synonymous codon use for Mollicutes explains relative synonymous codon use with ~94 % accuracy and recapitulates the difference in the nitrogen cost of genes. a The average mRNA nitrogen content per codon for 168 orthologous genes in the Mollicutes. b The average nitrogen content per double-stranded codon (dsDNA) for the genes in a. c Empirical codon use probabilities (expressed as percentages) plotted against themselves. d The average mRNA nitrogen content per codon for sequences simulated using synonymous codon use probabilities derived from fitting the selection-mutation model to the observed sequence data. e The average nitrogen content per double-stranded codon (dsDNA) for the genes in d. f The synonymous codon use probabilities inferred using the selection-mutation model plotted against the empirical codon use probabilities expressed as percentages. Dot colour corresponds to codon use probabilities for each nitrogen availability group. L N R2 = 0.95, M N R2 = 0.97, H N R2 = 0.91 Full size image

To test whether the observed bias in codon use was also seen more broadly across the genome and not just in the conserved single copy genes, an additional analysis was conducted on all complete coding sequences (Additional file 3: Table S1). The pattern of codon bias was recapitulated for this larger gene set. However, the values obtained from the model when considering all complete coding sequences were less extreme than the values obtained when considering conserved orthologous sequences. This is expected as conserved sites in conserved genes have previously been shown to exhibit stronger codon bias [36].

Gene expression negatively correlates with selection acting on mRNA nitrogen content

Selection acting on coding sequences is typically considered weak, especially given the low effective populations of the parasites in this study. However, previous studies have shown that selection is detectable in highly expressed genes [37–39] and most theories of codon usage predict that the degree of bias due to selection should increase with gene expression [40]. Given that there is a clear signature of selection acting on nitrogen content genome-wide, it was assessed whether the magnitude of this selection was a function of mRNA abundance. Here, the magnitude of selection acting on the nitrogen content of each gene was compared to the mRNA abundance of that gene. For each species there was a negative correlation between mRNA abundance and the fitted 2N g s (Additional file 1: Figure S7). This shows that the strongest selection to minimise nitrogen content is observed in the most highly expressed genes. Moreover, the slope of the line was greatest for the L N species, intermediate for the M N species and weakest for the H N species. This gene-level analysis is consistent with the genome-wide analysis that showed that L N species have the greatest selective pressure to minimise nitrogen use.

Low nitrogen availability (L N ) parasites have ribosomal RNA sequences that use the lowest amount of nitrogen

Ribosomal RNA (rRNA) typically constitutes the majority of RNA within a cell. To investigate whether selection acting on nitrogen content extends beyond coding sequences, the total nitrogen content of rRNA per ribosome was calculated. Consistent with the analysis of coding sequences, L N parasite rRNAs require the lowest amount of nitrogen. In the Mollicutes, L N parasites used eight fewer nitrogen atoms compared to M N and 63 fewer atoms compared to H N parasites per 70S ribosome (Additional file 3: Table S1). This difference is lower than expected when compared to the analysis of protein-coding genes. Given the length of the rRNA sequence analysed, a difference of 77 and 140 nitrogen atoms would have been predicted. This reduced difference is most likely due to structural constraints on rRNA and the fact that it is not composed of codons and so may lack the flexibility provided by synonymous codons.

The same analysis of rRNA sequences was carried out for the kinetoplastids. Consistent with the analysis of the Mollicutes, the RNA component of the 80S ribosome required the least amount of nitrogen in the L N kinetoplastid parasites. However, due to large insertions in Trypanosoma cruzi rRNAs, the M N parasites required more nitrogen than the H N . These inserted regions increased the total nitrogen content in the T. cruzi rRNA by >1500 nitrogen atoms (~7 % more than the other M N species; Additional file 3: Table S1). Thus, with one exception, the analysis of rRNA genes is consistent with the analysis of protein-coding genes.

Nitrogen content of nucleotide sequences can predict metabolic capability

Given that the relative use of synonymous codons is affected by selection acting on nitrogen content, it was determined to what extent the selection-mutation model could predict the dietary nitrogen content of an organism. This was tested by analysing four additional Mollicute genomes not included in the original analysis. Each additional species was classified as H N by model selection through maximum likelihood estimation (Additional file 3: Table S1). To provide support for these classifications, the parasites’ genomes were searched for genes required for amino acid and amino sugar catabolism. This revealed that, in contrast to M N Mollicutes parasites, the genomes of the additional species each encoded complete metabolic pathways for catabolism of either arginine and/or amino sugars (Fig. 6; Additional file 3: Table S1). Moreover, the genes for these pathways were co-located in gene clusters, indicative of genes belonging to the same metabolic pathway (Additional file 1: Figure S8). These results demonstrate the utility of the model for providing information about the metabolic capabilities of an organism from raw nucleotide sequences.

Fig. 6 Selection-mutation model of codon use can predict the metabolic capacity of parasites from raw nucleotide sequences. L N (green), M N (yellow) and H N (orange) are species with low, medium and high nitrogen availability, respectively. Species denoted with an asterisk (grey) are the new species, each of which is predicted to be a H N parasite. The presence of named genes involved in arginine catabolism and amino sugar catabolism is indicated by orange boxes. Gene names are provided below the boxes and the abbreviations correspond to the following genes: tr. transporter, arcA arginine deiminase, argF ornithine carbamoyltransferase, arcC carbamate kinase, nanA N-acetylneuraminate lyase, nanK/nagC N-acetylmannosamine kinase, nanE N-acetylmannosamine-6-p epimerase, nagA N-acetyl-D-glucosamine-6-phosphate amidohydrolase, nagB D-glucosamine-6-phosphate aminohydrolase, HyL hylauronate lyase, ugl glucuronidase. Grey outlined boxes encode aguA, whose protein product catalyses the same reaction as the product of arcA on an equivalent substrate Full size image

Selection acting on nitrogen content is independent of selection acting on translational efficiency

Translational selection, which is a function of the number of iso-accepting tRNAs encoded in a genome, has long been considered a major driver of codon bias [8]. To determine how selection acting on nitrogen content acts in concert with selection acting on translational efficiency (tAI), the model was expanded to include tAI as an additional parameter (see “Methods”). For the Mollicutes, unlike the result above where selection acting on nitrogen content was significant for all three parasite groups, it was found that considering tAI values alone or in conjunction with mutation bias was not significantly better than when tAI was omitted (p > 0.05). However, when all three parameters (nitrogen content, mutation bias and tAI) were considered together, the model fits the data significantly better than when considering just selection acting on nitrogen content and mutation bias for the L N and H N parasites (p ≤ 0.02). Thus, selection acting on translational efficiency is independent of selection acting on nitrogen content and only provides a significant contribution to codon bias in L N and H N species (Additional file 1: Figure S5).

In contrast, for the kinetoplastids it was found that the fit to observed patterns of codon use was significantly better with the inclusion of tAI values in conjunction with mutation bias (L N p = 0.018, M N and H N p = 0). The contribution of tAI was also significant for all three kinetoplastid parasite groups when all three parameters (nitrogen content, mutation bias and tAI) were considered together (L N p = 0.006, M N p = 0.001, H N p = 0; Additional file 1: Figure S5). Thus, as for the Mollicutes, selection acting on translational efficiency is independent of selection acting on nitrogen content. Furthermore, inclusion of translational efficiency in the model improves overall fit by ~2 % to give an average accuracy of 94.3 %. This compares to a 3.1 % improvement in overall fit when selection acting on nitrogen content is added to the model that only considers mutation bias and translational efficiency.

Selection acting on nitrogen content can explain why the most translationally optimal codons are not always the codons that are most frequently used. For example, in Mollicutes parasites only 33 % of the most frequently used codons for each amino acid are the most translationally efficient while 66 % are those with the lowest nitrogen content (Additional file 1: Figure S9). A similar pattern occurs in the kinetoplastid parasites, although the most translationally efficient codon is the most frequently used codon more often than the most nitrogen-efficient codon (74 % compared to 30 %, respectively). This interplay between translation and nitrogen content is also seen when the relative order of all synonymous codons is analysed in these two parasite groups (Additional file 1: Figure S9). Furthermore, these observations are consistent with the global analysis of codon use presented above which showed that selection acting on nitrogen content was more important than selection acting on translation efficiency in determining patterns of codon bias in Mollicutes, while selection acting on nitrogen content and translation efficiency was required to explain patterns of codon use in kinetoplastids.