To describe variation within healthy gut microbiota across different human populations, we randomly selected 123 metagenomes of healthy individuals from the Human Microbiome Project (HMP, n=42) [13], controls in a study of type II diabetes (T2D, n=44) [33], and controls in a study of glucose control (GC, n=37) [34]. These span American, Chinese, and European populations, respectively (see the “Methods” section). We mapped these metagenomes to KEGG Orthology (KO) families with ShotMAP [35] and counted reads for 17,417 gene families.

Accurately normalizing gene read counts so that they are comparable across samples and studies is critical to our meta-analytical approach and any quantitative evaluation of shotgun metagenomes. We therefore quantified gene family abundance using reads per kilobase of genome equivalents (RPKG) [36]. This method of calculating abundances takes into account differences in the average genome size within different metagenomes, as well as factors such as gene length, that can also bias counts (long genes will generally have a greater proportion of reads).

Unadjusted calculation of gene variability yields misleading results

One straight-forward approach to determining gene family variability, which has previously been employed in the literature [13], would simply be to calculate the variance of gene family abundances across all datasets. The tails of this distribution—for example, the top and bottom 10%—could then be termed “variable” and “invariable” gene families. However, by this metric, the most “variable” gene families would actually be enriched for pathways such as the ribosome (FDR-corrected p value q=2.4×10−10), DNA replication (q=0.07), and aminoacyl-transfer RNA (tRNA) biosynthesis (q=1.2×10−6). These results contradict biological intuition: it would be very surprising for genes within the best-conserved “housekeeping” pathways to be among the most variable, since they appear in most microbial genomes. (Here, we define “housekeeping” gene families as those involved in fundamental, highly conserved cellular processes such as translation, DNA replication, and central metabolism). Indeed, out of a recent list of 74 protein-coding genes that were universally present and single-copy in bacterial genomes, 14 were ribosomal genes and 10 were tRNA synthetases or tRNA modification enzymes [37]; “housekeeping” pathways also dominated previous lists of bacterial universal and single-copy genes [38].

Furthermore, according to this same straight-forward metric, the least variable gene families would include those involved in disease signatures such as “salmonella infection” (q=0.027), “pertussis” (q=1.4×10−3), and “legionellosis” (q=4.9×10−3). The presence of genes in these disease signatures does not necessarily indicate the presence of that disease or an active infection. However, it seems unlikely for genes involved in pathogenicity to be among the most stable components of the healthy human microbiome.

The explanation for this counterintuitive result can be visualized by plotting the mean vs. variance for each measured gene family (Fig. 1): in shotgun metagenomic data, mean and variance are tightly correlated over the entire range of means. This phenomenon is robust to the number of samples assessed (Additional file 1: Figure S1). Similar mean-variance relationships are actually observed in other high-throughput sequencing applications, such as RNAseq [39, 40] (which is why standard hypothesis tests based on assuming normality are inappropriate for RNAseq data, if the correct variance-stabilizing transformations are not applied [40]).

Fig. 1 Shotgun metagenomic data show a very strong mean-variance relationship. The log10(mean) is plotted against log10(variance) for each gene family (points) in each study (headings). Bacterial ribosomal proteins (green), aminoacyl-tRNA charging genes (orange), and genes annotated to the T3SS-dependent Salmonella pathogenesis signature in KEGG (blue) are highlighted. Trend lines show a Poisson (dashed blue line) and a negative binomial (dashed red line) fit to the count data. Negative binomial provides a better fit in all three data sets Full size image

This mean-variance relationship means that gene families encoding, for example, the bacterial ribosome, which are among the most abundant in these metagenomes, will therefore have the highest sample variance as well. Meanwhile, gene families with low average abundance, such as those involved in the disease signatures listed above, will appear to be invariable when in reality they are simply very infrequently observed. For example, three out of five of the invariable gene families annotated to pertussis only have one read each in a single sample, which constitutes extremely weak evidence for their presence in the metagenome, let alone invariability. This approach also leaves us unable to detect gene families that are variable but relatively abundant, as well as the opposite (Fig. 2 a–d).

Fig. 2 The residual variance statistic captures variation in gene families after accounting for between-study variation. The left-hand panels (“original abundances") show filled circles representing log-RPKG abundances for gene families from the KEGG Orthology (KO), with per-study means shown in solid horizontal lines and the distance from these means shown as dashed vertical lines. The right-hand panels (“residuals") show the same gene families after fitting a linear model that accounts for these per-study means, with an accompanying density plot showing the distribution of these residuals. \(V_{g}^{\epsilon }\) values in bold underneath density plots are the calculated variances of these residuals. These gene families are sets of orthologs corresponding to the genes a tatA, b devR, c waaW, d thrC, e gspA, f tssB, g dctS, and h ecnB. Panels a,b show two invariable gene families with relatively high (a) and low (b) average abundance; similarly, panels c, d show two variable gene families with relatively low (c) and high (d) average abundances. Panels e, f show two gene families involved in secretion with similar abundances, but low (e) vs. high (f) variability. Finally, panels g, h show that both invariable (g) and variable (h) gene families can have substantial study-specific effects. (All gene families displayed were significantly (in)variable using CCoDA, FDR ≤5%) Full size image

Gene family abundances can also vary by study, because of both biological differences between populations and technical factors including library preparation, amplification protocol, and sequencing technology. However, gene families with large study effects may or not be variable within each study, and vice versa (see, e.g., Fig. 2 e–h). Our method should therefore also take this potential confounder into account.

Finally, to assess statistical significance, we need to assess the range of variances we would expect for a particular gene family given its mean abundance, requiring us to model the overall mean-variance relationship. Figure 1 shows that this mean-variance relationship cannot be adequately captured by a Poisson distribution (blue dashed line), in which the mean and variance are equal; however, a better fit can be obtained by using the negative binomial distribution (red dashed line), a count-based distribution that allows for overdispersion, i.e., variance that exceeds the mean. Indeed, simply based on this negative binomial best-fit, ribosomal proteins are likely less variable than expected since they fall well below the trend line in all three datasets (Fig. 1). The negative binomial is commonly used in other sequencing applications, such as RNAseq [21], which has similar overdispersion.

A new test, CCoDA, captures the variability of microbial gene families

We present a model that enables gene family abundance to be quantitatively compared across metagenomes for thousands of microbial genes. To account for study effects, we fit a linear model of log abundance D g,s for gene g in sample s as a function of the overall mean abundance μ g and a term β g,y that quantifies the offset for each study y:

$$ D_{g,s}=\mu_{g}+\sum_{y\in Y}I_{y,s}\beta_{g,y}+\epsilon_{g,s} $$ (1)

where I y,s is an indicator variable that is 1 if sample s belongs to study y and 0 otherwise. In this simple model, β g,y is simply the mean of gene g in study y after subtracting the overall mean μ g , and ε g,s are the residuals left after these study-specific means β g,y are subtracted out.

The residual ε g,s quantifies how much the abundance of gene g in sample s differs from the average abundance across samples in the same study as s. We denote the variance of the residuals across samples by \(V_{g}^{\epsilon }\). When this statistic is small, the gene has similar abundance across samples after accounting for study effects. A large value of \(V_{g}^{\epsilon }\) indicates that samples have very different abundances.

To assess the statistical significance of gene family variability, as suggested above, we compare the residual variance \(V_{g}^{\epsilon }\) to a data-driven null distribution based on the negative binomial distribution (see the “Methods” section and Additional file 2: Figure S2). This approach is necessary because there is no straight-forward formula for the p value of \(V_{g}^{\epsilon }\). Our method looks for deviations from the null hypothesis that gene families in the dataset have the same mean-variance relationship. This relationship is captured by the overdispersion parameters k y , such that the variance for a gene g in a study y is given by:

$$ \sigma^{2}_{g,y} = \beta_{g,y} + \frac{{\beta^{2}_{g,y}}}{k_{y}} $$ (2)

where β g,y are study-specific means for gene g as above.

Because this null distribution is generated stochastically per gene family from a count-based distribution matching the observed mean, i.e., by parametric bootstrapping, the null naturally accounts for the expected amount of noise based on the number of times a given gene family is observed. Gene families with low abundance or a high proportion of zeros are therefore more likely to be called non-significant than variable (Additional file 3: Figure S6 C–D).

We validated this approach further and assessed type I and type II error rates with simulated data (see the “Methods” section, Additional file 4: Figure S4), finding that CCoDA has high power and good control over the false positive rate when the overdispersion parameter k used in the null distribution is accurately estimated. To make the test more robust to factors affecting the estimation of k (Additional file 5: Figure S5), we also used simulation to control the false discovery rate empirically (Table 1).

Table 1 q value cutoffs to reach a given empirical FDR, estimated from simulation Full size table

CCoDA can be applied to shotgun metagenomes to sensitively and specifically identify variable genes in any environment without prior knowledge of factors that stratify relatively high versus low abundance samples.

Thousands of variable gene families in the gut microbiome

Using CCoDA, we found 2357 gene families with more variability than expected and 5432 with less (leaving 9628 non-significant) at an empirical FDR of 5% (Additional file 3: Figure S6A). Restricting the analysis to gene families with at least one annotated representative from a bacterial or archaeal genome in KEGG, we obtained 1219 significantly variable and 3813 significantly invariable gene families (and 2194 non-significant). The differences in the residual variation of these gene families can be visualized using a heatmap of the residual ε g,s values (Additional file 6: Figure S7 and Additional file 7: Figure S8). The large number of genes that were less variable than expected given their means supports the hypothesis of some functional redundancy in the gut microbiome, potentially due to selection for core functions that make microbes more successful in the gut environment. Notably, the HMP cohort tended to have overall lower variance in their metagenomes than the GC and T2D cohorts; this may be because the exclusion criteria for HMP, which explicitly studied only healthy individuals, were particularly stringent [41]. Nevertheless, our discovery of thousands of significantly variable genes across a range of abundance levels demonstrates that the gut microbiome is less invariable than prior work suggested.

This result highlights the importance of a quantitative, gene-level evaluation of functional stability. Importantly, the magnitude of the residual variance statistic \(V_{g}^{\epsilon }\) is not the sole determinant of significance, as illustrated by the overlap in distributions of \(V_{g}^{\epsilon }\) between the variable, invariable, and non-significant gene families. For example, both low-abundance gene families with many zero values and high-abundance but invariable gene families will tend to have low residual variance, but the evidence for invariability is much stronger for the second group. Our test accurately discriminates between these scenarios, tending to call the second group significantly invariable and not the first (Additional file 3: Figure S6A, inset), whereas an approach that simply thresholded \(V_{g}^{\epsilon }\) would be unable to distinguish between them.

Biological pathways contain both invariable and variable components

To test our hypothesis that the appearance of pathways and functional categories with similar abundance across samples can be explained by a subset of core components, we examined individual gene variability within KEGG modules. As expected, we observed an overall signal of stability at this broad level of gene groupings. Many of the pathways previously identified as invariable (e.g., aminoacyl-tRNA metabolism, central carbon metabolism) indeed have more invariable than variable genes. However, individual genes show a much more complex picture. Even the most invariable pathways also include significantly variable genes (Fig. 3). For example, the highly conserved KEGG module set “aminoacyl-tRNA biosynthesis, prokaryotes” included one variable gene at an empirical FDR of 5%, sepRS. sepRS is an O-phosphoseryl-tRNA synthetase, which is an alternative route to biosynthesis of cysteinyl-tRNA in methanogenic archaea [42]. Methanogen abundance has previously been noted to be variable between individual human guts: while DNA extraction for archaea may be less reliable than for bacteria, even optimized methods showed large standard deviations across individuals [43]. Another gene in this category was variable at a weaker level of significance (10% empirical FDR): poxA, a variant lysyl-tRNA synthetase. Recent experimental work has shown that this protein has a diverged, novel functionality, lysinylating the elongation factor EF-P [44, 45].

Fig. 3 Most pathways include a mixture of both variable and invariable gene families. a Stacked bar plots show the fraction of invariable (blue), non-significant (gray), and variable (red) gene families annotated to KEGG Orthology pathway sets (rows), at different false discovery rate (FDR) cutoffs (color intensity). Only gene families with at least one annotated bacterial or archaeal homolog were counted. b Fraction of strongly invariable, non-significant, and strongly variable gene families within the ribosomes of different kingdoms. Row labels with only one kingdom indicate gene families unique to that kingdom, and rows with multiple kingdoms (e.g., “Eukaryotes/archaea”) indicate gene families shared between these two kingdoms. As expected, the bacterial ribosome was completely invariable Full size image

By comparison, 77% of the tested prokaryotic gene families in the KEGG module set “central carbohydrate metabolism” were significantly invariable, and 5.6% (five genes) were significantly variable (Additional file 8: Figure S9) at an empirical FDR of 5%. In this case, the variable gene families highlight the complexities of microbial carbon utilization (see Additional file 9 for details).

One of the more variable pathways was the “bacterial secretion system.” We found that the majority of significantly variable gene families annotated to this pathway (16 out of 18) were involved in specialized secretion systems, especially the type III and type VI systems (Fig. 4). These secretion systems are predominantly found in Gram-negative bacteria and are often involved in specialized cell-to-cell interactions, between microbes and between pathogens or symbionts and the host. They allow the injection of effector proteins, including virulence factors, directly into target cells [46, 47]. Type VI secretion systems are also determinants of antagonistic interactions between bacteria in the gut microbiome [48, 49].

Fig. 4 Variable and invariable gene families involved in bacterial secretion separate by gene function. a Schematic diagram showing the type III (T3SS), type VI (T6SS), Sec, and Tat secretion system gene families measured in this dataset. Gene families are color-coded by whether they were variable (red), invariable (blue), or neither (gray), with strength of color corresponding to the FDR cutoff (color intensity). Insets show a summary of how many gene families in KEGG modules corresponding to a particular secretion system were variable or invariable and at what level of significance. b Heatmaps showing scaled residual log-RPKG for gene families (rows) involved in bacterial secretion. Variable (red) and invariable (blue) gene families were clustered separately, as were samples within a particular study (columns). log-RPKG values were scaled by the expected variance from the negative binomial null distribution. Genes in specific secretion systems are annotated with colored squares (T6SS: red-orange; T3SS: orange; Tat: yellow; Sec: grey) Full size image

In contrast, gene families in the Sec (general secretion) and Tat (twin-arginine translocation) pathways were nearly all significantly invariable at an empirical FDR of 5%, with only one gene in each being found to be significantly variable. This contradicts previous suggestions that the Sec and Tat pathways were some of the most variable in the human microbiome [13]. This discrepancy is probably due to our accounting for the mean-variance relationship in shotgun data. The Sec and Tat systems are abundant and phylogenetically diverse [50] and will therefore have greater variance than low-abundance genes just by chance. Our test adjusts for this feature of sequencing experiments and shows that these genes are in fact less variable than expected given their mean abundance.

Our results further demonstrate that analyzing functional variability at the level of pathways can obscure gene-family-resolution trends of potential biomedical importance. The variability of individual gene families involved in lipopolysaccharide (LPS) metabolism may exemplify such a case. LPS (also known as “endotoxin”) is a macromolecular component of the Gram-negative bacterial outer membrane, consisting of a lipid anchor called “lipid A,” a “core oligosaccharide” moiety, and a polysaccharide known as the “O-antigen” (which may be absent). Lipid A is sensed directly by the human innate immune system via the Toll-like receptor TLR4. Furthermore, lipid A variants with different covalent modifications (e.g., differentially acylated [51], phosphorylated [52], and palmitoylated [53] variants) have been shown to have different immunological properties (see Additional file 9: Supplementary information).

We found that all but one gene family involved in the biosynthesis of lipid A, as well as all gene families involved in the biosynthesis of the core oligosaccharide components ketodeoxyoctonate (Kdo) and glyceromannoheptose (GMH), were significantly invariable (16 out of 17; Fig. 5). The lone exception catalyzes the the final lipid A acylation step, adding a sixth acyl chain; this gene family was significantly variable (FDR ≤ 5%). Furthermore, we observe several variable gene families annotated as performing covalent modifications of LPS, including hydroxyl- (lpxO), palmitoyl- (pagP), and palmitoleoylation (lpxP), as well as deacylation and dephosphorylation. These modifications can lead to differential TLR4 activation [53, 54]. We also observe that gene families involved in O-antigen synthesis and ligation to lipid A tended to be variable (5 out of 6). These results suggest that healthy individuals may differ in the amount of hexa- vs. pentaacylated LPS, and in the amounts of other LPS chemical modifications, and thus in their baseline level of TLR4-dependent inflammation. Importantly, since the majority of gene families annotated to LPS biosynthesis were invariable, this result would have been missed by considering the pathway as a unit.

Fig. 5 Central Kdo and lipid A biosynthesis is invariable, but many genes involved in covalent modifications to LPS are variable. a Pathway schematic showing a selection of measured gene families involved in lipopolysaccharide metabolism. Gene families are color-coded by whether they were variable (red) or invariable (blue), with strength of color corresponding to the FDR cutoff (color intensity). Central Kdo and lipid A metabolism is highlighted in light gray. Abbreviated metabolites are (GlcNAc N-acetylglucosamine), (Kdo ketodeoxyoctonate), (R5P ribose-5-phosphate), (S7P sedoheptulose-7-phosphate), (GMH glyceromannoheptose), (aminoarabinose 4-amino-4-deoxy-L-arabinose). b Heatmaps showing scaled residual log-RPKG for gene families (rows) involved in lipopolysaccharide metabolism, as in Fig. 4 Full size image

Many invariable gene families are deeply conserved

Conservation of gene families across the tree of life is one factor we might expect to affect gene variability. For instance, ribosomal proteins should appear to be invariable merely because they are shared by all members of a given kingdom of life. To explore the relationship between gene family taxonomic distribution and variability in abundance across hosts, we constructed trees of the sequences in each KEGG family using ClustalOmega and FastTree. We then calculated phylogenetic distribution (PD), using tree density to correct for the overall rate of evolution (Dongying Wu, personal communication, 2015) (Fig. 6 a).

Fig. 6 Phylogenetic distribution (PD) of gene families partially explains gene family variability. Scatter plot shows log10 PD (x-axis) vs. log10 residual variance statistic (y-axis). Red points were significantly variable and blue points were significantly invariable. Gene families in specific functional groups are also highlighted in different colors, specifically the bacterial ribosome (green), the type VI secretion system (or “T6SS”; orange), the KinABCDE-Spo0FA sporulation control two-component signaling system (yellow), and hypothetical genes (tan squares). Gene families that were significantly invariable (ribosome and sporulation control) or significantly variable (hypothetical genes and the T6SS) at an estimated 5% FDR are outlined in black. The bacterial ribosome, as expected, had very high PD and was strongly invariable. The type VI secretion system genes, in contrast, were conserved but variable, and some genes involved in the Kin-Spo sporulation control two-component signaling pathway had low PD but were invariable. Only gene families with at least one annotated bacterial or archaeal homolog are shown Full size image

Overall, invariable gene families with below-median PD tended to be involved in carbohydrate metabolism and signaling. Specifically, these 2046 gene families were enriched for the pathways “two-component signaling” (q=1.5×10−15), “starch and sucrose metabolism” (q=1.8×10−3), “amino sugar and nucleotide sugar metabolism” (q=0.063), “ABC transporters” (q=2.4×10−5), and “glycosaminoglycan [GAG] degradation” (q=0.053), among others (Additional file 10). Enriched modules included a two-component system involved in sporulation control (q=0.018), as well as transporters for rhamnose (q=0.14), cellobiose (q=0.14), and α- and β-glucosides (q=0.14 and q=0.19, respectively). These results are consistent with the hypothesis that one function of the gut microbiome is to encode carbohydrate-utilization enzymes the host lacks [55]. Additionally, recent experiments showed that the major gut commensal Bacteroides thetaiotaomicroncontains enzymes adapted to the degradation of sulfated glycans including GAGs [56, 57] and that many Bacteroides species can in fact use the GAG chondroitin sulfate as a sole carbon source [58].

Out of the 298 significantly variable gene families with the above median PD, we found no pathway enrichments but three module enrichments. These included the archaeal (q=1.5×10−3) and eukaryotic (q=8.7×10−9) ribosomes, which reflects differences in the relative abundance of microbes from these domains of life across hosts (Fig. 3 b). The third conserved but variable module was the type VI secretion system (q=0.039). Intriguingly, specialized secretion systems were also observed to vary within gut-microbiome-associated species in a strain-specific manner, using a wholly separate set of data [59]. Finally, gene families described as “hypothetical” were enriched in the high-PD but variable gene set (p=2.4×10−8, odds ratio = 2.2) and depleted in the low-PD but invariable set (p=5.4×10−13, odds ratio = 0.41).

Transporters show strain-specific variation in copy number across different human gut microbiomes [59], and analyses by Turnbaugh et al. identified membrane transporters as enriched in the “variable” set of functions in the microbiome [12]. However, we mainly found transporters enriched among gene families with similar abundance across hosts, despite being phylogenetically restricted (low-PD but invariable genes; Additional file 11). Part of this difference is likely due to our stratifying by phylogenetic distribution, a step previous studies did not perform.

Proteobacteria are the major source of variable genes

To assess which taxa contributed these variable and invariable genes, we first computed correlations between phylum relative abundances (predicted using MetaPhlAn2 [60]) and gene family abundances. Specifically, we used a permutation test based on partial Kendall’s τ correlation. This test is rank-based and thus distribution-agnostic, handles ties (unlike Spearman’s ρ), and accounts for study-to-study variation by using partial correlation (see the “Methods” section). The resulting p values were corrected for multiple testing using the q value method and thresholded at an FDR of q≤0.05. Based on these results, we then determined whether phyla were enriched for variable or invariable genes by Fisher’s exact test (Bonferroni-corrected p≤0.05). This analysis revealed that the predicted abundance of Proteobacteria was strongly enriched for correlations with variable gene families (Bonferroni-corrected p≤10−8): Fig. 7 b). The abundance of the archaeal phylum Euryarchaeota was also enriched for correlations with variable gene families, to a lesser extent (Bonferroni-corrected p≤10−4).

Fig. 7 Variable gene families correlate with the predicted abundance of Proteobacteria. Bar plots give the fraction of gene families in each category (significantly invariable, non-significant, and significantly variable, 5% FDR) that were significantly correlated to predicted relative abundances of phyla, as assessed by MetaPhlAn2, using partial Kendall’s τ to account for study effects and a permutation test to assess significance. Asterisks give the level of significance by chi-squared test of non-random association between gene family category and the number of significant associations. (*** p≤10−8 by chi-squared test after Bonferroni correction; ** p≤10−4) Full size image

Proteobacteria were a comparatively minor component of these metagenomes (median = 1%), compared to Bacteroidetes (median=59%) and Firmicutes (median=33%: see Additional file 12: Figure S10), which were more associated with invariable genes (Bonferroni-corrected p≤10−8). Euryarchaeota comprised an even smaller fraction of the microbiome (median=0%) and was only detected in 33% of metagenomes (though this could potentially be explained by unreliable extraction efficiency for archaea, as mentioned above [43]). However, seven samples from the GC and T2D cohorts had ≥15% Proteobacteria, with four having ≥20% and one having 41%. Overgrowth of Proteobacteria has been associated with metabolic syndrome [32] and inflammatory bowel disease [61]. Also, Proteobacteria can be selected (over Bacteroidetes and Firmicutes) by intestinal inflammation as tested by TLR5-knockout mice [62], and some Proteobacteria can induce colitis in this background [63], potentially leading to a feedback loop. Thus, the variable gene families we discovered could be biomarkers for dysbiosis and inflammation in otherwise healthy hosts.

It has been proposed that a small number of “enterotypes” may exist in the human gut microbiome, each with distinct taxonomic composition [25, 26]. Most recently, abundances of the taxa Ruminococcaceae, Bacteroides, and Prevotella were found to explain the most taxonomic variation across individuals [28]. These enterotypes appeared to be linked to long-term diet, with Prevotella highest in individuals with the most carbohydrate intake and Bacteroides correlating with protein and animal fat. However, while these clades contributed most to taxonomic variation, in our study, all were actually depleted for associations with variable genes. In contrast, the Proteobacterial family Enterobacteriaceae (Additional file 13: Figure S12B), and to a lesser extent, Gammaproteobacteria in general (Additional file 13: Figure S12C) were much more likely to be associated with variable gene families. Similar results were also obtained using the centered log-ratio (clr) transform to correct potential compositionality artifacts (see Additional file 14: Figure S16). This suggests that compared to previously identified enterotype marker taxa, levels of Proteobacteria, and potentially Euryarchaeota, better explain person-to-person variation in gut microbial gene function. These less abundant phyla were missed in enterotype studies, likely because enterotypes were identified by methods that will tend to weight higher-abundance taxa more, and enterotypes were identified from taxonomic, not functional data.

Because Proteobacteria are a relatively well-annotated yet low-abundance phylum, we explored whether either of these characteristics explain their association with variable genes. Importantly, genes correlated with Actinobacteria did not tend to be variable, even though Proteobacteria and Actinobacteria had similar levels of abundance (Additional file 12: Figure S10). Also, while they were comparatively low abundance compared to Bacteroidetes or Firmicutes, Proteobacteria were also generally not close to the limit of detection when present: Proteobacterial relative abundance was more than 0.18 in 90% of samples, whereas MetaPhlAn2 was able to detect taxa with relative abundances of only 0.0004% in our data. Low abundance therefore does not appear to explain this association.

The number of phyla present in our data was not enough to determine whether there was any trend for low-prevalence or low-abundance taxa to be more correlated with variable genes. To answer this question, we conducted the same analysis with bacterial and archaeal taxa at the family level. However, when considering the 30 families with significant enrichments for (in)variable or non-significant gene families, there was no significant association between the degree of enrichment for variable genes and either prevalence (r=−0.07, p=0.72) or abundance (r=−0.1, p=0.58) (Additional file 13: Figure S12D-E). In fact, Enterobacteriaceae, a Proteobacterial family, was significantly enriched for correlations with variable genes despite a prevalence of 86%, in the top 25% of all families detected. Thus, prevalence and abundance do not explain the variability of Proteobacterial genes.

To investigate annotation bias, we first compared the numbers of genomes in KEGG for each phylum. There are 1111 Proteobacterial genomes compared to 575 for Firmicutes, 276 for Actinobacteria, 104 for Euryarchaeota, and only 97 for Bacteroidetes. Accordingly, Proteobacteria had the most gene families (1417) not annotated in any other phylum (“private” gene families), compared to 538 for Firmicutes, 342 for Euryarchaeota, 215 for Actinobacteria, and 21 for Bacteroidetes. Considering only these private gene families, Proteobacteria and Euryarchaeota were enriched for variable genes, as before, whereas variable genes were depleted in the other three phyla (Additional file 15: Figure S13A). This suggests that the level of annotation does not predict the amount of variable genes. In a further test, we repeated the entire statistical test on a subset of genes, sampling one part phylum-specific genes drawn equally from Proteobacteria, Actinobacteria, Firmicutes, and Euryarchaeota, and one part genes annotated to all four phyla (see the “Methods” section). Again, Proteobacteria- and Euryarchaeota-specific genes were significantly variable more often than those from either Actinobacteria or Firmicutes (Additional file 15: Figure S13B). We therefore concluded that phylum abundance and annotation bias do not explain the enrichment of variable genes in Proteobacteria.

Finally, variable genes also do not appear to be biomarkers for either taxonomic statistics or measured host characteristics. To explore this question, we used the same two-sided partial Kendall’s τ test as above. With regard to taxonomic statistics, we tested α-diversity (measured by Shannon entropy), the Bacteroidetes/Firmicutes ratio, and average genome size (AGS): however, all of these correlated more often with invariable gene families (see Additional file 9, Additional file 13: Figure S12A). For host characteristics, we selected body mass index, sex, and age, which were measured in all three studies we analyzed. None of these variables correlated significantly with any variable gene family abundances, even at a 25% false discovery rate.

One study (GC) measured blood levels of three inflammatory markers, TNF α, IL-1, and CD163, which did not correlate with Proteobacterial abundance in this study (Holm-corrected p value > 0.2, Kendall’s τ). However, other inflammatory markers directly linked to changes in Proteobacterial abundance (e.g., IgA, IL-10, and IL-17, reviewed in [32]) were not measured in this panel. These results suggest that major correlates of variation in microbiota gene levels, possibly including diet and specific inflammatory markers, remain to be measured.

Bacterial phyla have unique sets of variable genes

The variable gene families we identified seem to include both genes whose variance is explained by phylum-level variation (e.g., Proteobacteria) and genes that vary within fine-grained taxonomic classifications, such as strains within species. Also, some gene families may confer adaptive advantages in the gut only within certain taxa. To detect gene families that are variable or invariable within a phylum, we repeated the test, but using only reads whose best RAPSearch2 [64] alignments were to sequences from whole genomes of each of the four most abundant bacterial phyla (Bacteroidetes, Firmicutes, Actinobacteria, and Proteobacteria). Most (77%) gene families showed phylum-specific effects. Invariable gene families tended to agree, but the reverse was true for variable gene families: 19.4% of gene families that were invariable in one phylum were invariable in all, compared to just 0.34% (eight genes) in the variable set (Fig. 8 a, b). This trend was robust to the FDR cutoff (Additional file 16: Figure S14A–B). Gene families invariable in all four phyla were enriched for basal cellular machinery, as expected (Additional file 17: C–D).

Fig. 8 Phylum-specific tests reveal hidden variability in the most prevalent bacterial phyla. a, b Venn diagrams showing the number of significantly variable (a) and invariable (b) gene families across Proteobacteria, Bacteroidetes, and Firmicutes, FDR ≤5%. c Bars indicate the fraction of phylum-specific variable gene families that were also variable overall (yellow, “both tests”) or that were specific to a particular phylum (red, “phylum-specific test only”) Full size image

The relationship between phylum-specific and overall gene family abundance variability differed by phylum. Proteobacteria-specific variable gene families tended to be variable overall (59%), whereas the proportions of gene families that were also variable overall were much lower for Bacteroidetes- (12%), Firmicutes- (29%), and Actinobacteria-specific (18%) gene families (Fig. 8 c). This supports the hypothesis that Proteobacterial abundance is a dominant factor influencing functional variability in the human gut microbiome. It further suggests that many overall-variable gene families are not only merely markers for the amount of Proteobacteria (or some other phylum) but are also variable at finer taxonomic levels, such as the species or even the strain level [59, 65].

Comparing the two dominant phyla in the gut, Bacteroidetes and Firmicutes, we further observe that the overall proportions of variable and invariable families were similar across pathways, with some interesting exceptions. For example, LPS biosynthesis had more invariable gene families in Bacteroidetes than in Firmicutes, which we expected given that LPS is primarily made by Gram-negative bacteria. Conversely, both two-component signaling and the PTS system had many more invariable gene families in Firmicutes than in Bacteroidetes (Additional file 16: Figure S14C). However, phylum-specific variable gene families tended not to overlap (median overlap, 0%, compared to 46% for invariable gene families). This was even true for pathways where the overall proportion of variable and invariable gene families was similar, such as cofactor and vitamin biosynthesis and central carbohydrate metabolism (Additional file 16: Figure S14D). Thus, unique genes within invariable pathways vary in their abundance across microbiome phyla.

Furthermore, the enriched biological functions of the phylum-specific variable gene families differed by phylum (Additional file 18). For instance, Proteobacterial-specific variable gene families were enriched (Fisher’s test enrichment q=0.13) for the biosynthesis of siderophore group nonribosomal peptides, which may reflect the importance of iron scavenging for the establishment of both pathogens (e.g., Yersinia) and commensals (e.g., Escherichia coli) [66]. Another phylum-specific variable function appeared to be the type IV secretion system (T4SS) within Firmicutes (q=0.021). Homologs of this specialized secretion system are involved in a wide array of biochemical interactions, including the conjugative transfer of plasmids (e.g., antibiotic-resistance cassettes) between bacteria [67]. We conclude that our approach enables the identification of substantial variation within all four major bacterial phyla in the gut, much of which is not apparent when data are analyzed at broader functional resolution or without stratifying by phylum.