Study design and data characteristics

The study was divided into three parts (Fig. 1): (1) false positive rate (FPR) testing, (2) spike-in retrieval testing, and (3) beta-diversity optimization. We used seven large datasets: three for the FPR and spike-in retrieval tests (labeled A1–A3), one simulated set (labeled A4) for assessing the effect of spike-in-independent sample strata, and three for the beta-diversity optimization tests (labeled B1–B3). The datasets and their characteristics are presented in Additional file 1: Table S1. The datasets are characterized by high degrees of sparsity, large variation in library sizes, and an overdispersed mean-variance relationship (Additional file 2: Figure S1).

Fig. 1 Spike-in approach and analysis flowchart. Left, a theoretical sample where the count data for OTU A is multiplied by 3 before rescaling to the original sequencing depth. Right, flowchart of the simulation study, yielding FPR and AUC for each method, dataset, and set of variables, as well as R² values for all combinations of normalization, transformation, and distance in the beta-diversity optimization

False positive rates

We found striking differences in the FPR of the tested methods using identical permutations of the three large datasets A1–A3 (Additional file 3: Figure S2A). A total of 17,550 full DA tests were analyzed. Most methods were robust, with FPR close to or below 0.05, as expected under the null hypothesis. However, edgeR, metagenomeSeq ZIG (unfiltered, see below), and especially baySeq displayed very high FPRs, indicating that the models did not fit the data well. Intriguingly, baySeq, edgeR, and negative binomial GLMs performed worse under balanced conditions, i.e., 50% cases and 50% controls, than under unbalanced conditions with only 10% cases. Most methods had low variance of FPR across iterations, but metagenomeSeq ZIG and especially baySeq showed considerable variation within parameter sets. To ensure that observed differences in FPRs between balanced conditions were not due to inherent biological signals or sample structures in the datasets used, we repeated the analysis in an additional simulated dataset (A4, n = 5850), based on within-OTU count permutation, retaining the biological distribution of OTU count data but breaking within-sample characteristics. With the exception of baySeq, no major deviations were observed from the results obtained with dataset A3 (Additional file 3: Figure S2B).
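The FPR procedure described above can be sketched in miniature: assign random case/control labels to data with no true signal, test every OTU, and take the fraction of OTUs reaching p < 0.05. The sketch below is illustrative only, using synthetic negative binomial counts and a Wilcoxon (Mann-Whitney) test as a stand-in for the benchmarked methods:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Toy stand-in data: 200 OTUs x 60 samples of overdispersed, zero-rich counts
counts = rng.negative_binomial(0.5, 0.1, size=(200, 60))

def fpr_one_test(counts, case_fraction=0.5, alpha=0.05):
    """Assign random case/control labels (no true signal), test every OTU,
    and return the fraction of OTUs with p < alpha, i.e., the FPR."""
    n_samples = counts.shape[1]
    is_case = np.zeros(n_samples, dtype=bool)
    is_case[rng.choice(n_samples, int(n_samples * case_fraction), replace=False)] = True
    pvals = []
    for otu in counts:
        if otu.min() == otu.max():  # constant OTU: rank test undefined
            pvals.append(1.0)
            continue
        pvals.append(mannwhitneyu(otu[is_case], otu[~is_case],
                                  alternative="two-sided").pvalue)
    return (np.array(pvals) < alpha).mean()

# Median FPR over repeated label permutations, as reported per parameter set
fpr = float(np.median([fpr_one_test(counts) for _ in range(20)]))
```

A well-calibrated method should keep this median near the nominal 0.05.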

Next, we investigated the effect of OTU sparsity on test inference in the feces dataset A1 (Fig. 2) and observed that the sparsity of a given OTU affected the different methods differently. OTU-wise p values from non-spiked single DA runs with 50% cases, selected by the median FPR, depended to some extent on the percentage of zeroes in the OTU in question. The methods edgeR, negative binomial GLM, metagenomeSeq ZIG (unfiltered), and especially baySeq displayed biased results at high sparsity, meaning that many zeroes led to lower p values, irrespective of any signal in the data. For metagenomeSeq, this effect was to some extent ameliorated by its filtering step, which essentially removes most of the sparse OTUs after the model has been estimated, as demonstrated in Fig. 2. The feature model did not exhibit this inflation. DESeq2 in particular exhibited a conservative estimation of p values at high sparsity, whereas the t tests, Wilcoxon, and the permutation tests were all very robust across the range. The ALDEx2 method was very conservative and showed a narrow band of p values resembling a Gaussian distribution around 0.5, regardless of sparsity.

Fig. 2 OTU sparsity vs. p value. Scatterplots of OTU sparsity vs. p value with panels representing each differential relative abundance test method in feces dataset A1, with 50% cases. The colored line represents the LOESS regression on the data. False positive rate (FPR) is defined as the fraction of OTUs with p < 0.05. Each differential relative abundance test represents the median FPR for that method, out of all 150 permutations. Contour lines indicate point density and can be compared to a hypothetical null distribution of p values demonstrated in the final panel (“Random uniform”)
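The sparsity diagnostic can be reproduced on toy data: compute each OTU's fraction of zeroes and check whether the null FPR drifts with sparsity. In this sketch the data, test (Wilcoxon), and bin edges are all illustrative; an unbiased method should stay near 0.05 in every bin:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Toy data: 400 OTUs x 80 samples; roughly half of all counts are zero
counts = rng.negative_binomial(0.3, 0.15, size=(400, 80))
is_case = np.arange(80) < 40           # 50% cases, no true signal

sparsity = (counts == 0).mean(axis=1)  # fraction of zero samples per OTU

pvals = np.ones(len(counts))
for i, otu in enumerate(counts):
    if otu.min() < otu.max():          # skip constant OTUs (test undefined)
        pvals[i] = mannwhitneyu(otu[is_case], otu[~is_case],
                                alternative="two-sided").pvalue

# Null FPR per sparsity bin; drift toward low p at high sparsity signals bias
for lo, hi in [(0.0, 0.5), (0.5, 1.0)]:
    in_bin = (sparsity >= lo) & (sparsity < hi)
    if in_bin.any():
        print(f"sparsity [{lo}, {hi}): FPR = {(pvals[in_bin] < 0.05).mean():.3f}")
```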

Spike-in retrieval tests

A total of 175,500 spiked DA analyses from datasets A1–A3 were considered. The spike-ins were performed by increasing the raw counts of random OTUs in cases, either multiplicatively or additively (see Fig. 1), across a range of spike-in magnitudes, in order to measure the relative retrieval performance of the methods.
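A minimal sketch of the spike-in operation, following the Fig. 1 example (multiply an OTU's counts in case samples, then rescale each sample back to its original sequencing depth). The additive branch and the treatment of zero counts there are assumptions, since those details are not specified above:

```python
import numpy as np

def spike_in(counts, spiked_otus, case_samples, magnitude, mode="multiplicative"):
    """Raise selected OTUs in case samples, then rescale each sample back to
    its original library size (as in the Fig. 1 example). The additive branch
    simply adds `magnitude` to the raw counts, which is an assumption here."""
    spiked = counts.astype(float).copy()
    original_depths = spiked.sum(axis=0)
    idx = np.ix_(spiked_otus, case_samples)
    if mode == "multiplicative":
        spiked[idx] *= magnitude
    else:
        spiked[idx] += magnitude
    # Rescaling leaves non-spiked samples untouched (their ratio is 1)
    spiked *= original_depths / spiked.sum(axis=0)
    return spiked

rng = np.random.default_rng(2)
counts = rng.poisson(5, size=(50, 20))            # 50 OTUs x 20 samples
spiked = spike_in(counts, spiked_otus=[0, 1, 2],
                  case_samples=list(range(10)), magnitude=3)
```

Rescaling ensures the spike changes relative abundances without altering library sizes, so library-size effects cannot leak into the retrieval comparison.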

Additional file 4: Figure S3A shows the distributions of area under the curve (AUC) values of the receiver operating characteristic (ROC) curve when using p values to discriminate between multiplicatively spiked and non-spiked OTUs in the feces dataset A1, for all methods, case proportions, and magnitudes. We found that most methods improved in detection power as the spike-in magnitude increased, though neither Wilcoxon nor metagenomeSeq ZIG (filtered) exhibited this property to the same extent as the others. The multiplicative spike-in AUC distributions for the other two datasets A2 and A3 (Additional file 4: Figure S3B, S3C) showed very similar characteristics. Overall, the best performance in terms of AUC was exhibited by the sequencing-specific methods edgeR, metagenomeSeq ZIG, and baySeq, as well as the assumption-free permutation test. The mid-level of performance was represented by negative binomial GLM, DESeq2, and ALDEx2, whereas the t tests and Wilcoxon performed the worst. The robustness of these methods varied greatly, with some tests yielding AUC values below 0.5; in the case of the t tests and Wilcoxon, even the median fell below 0.5 in the unbalanced tests with case proportions of 10%. The most robust tests were the permutation test and the metagenomeSeq feature model, which only very rarely fell below an AUC of 0.5.
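The AUC used here has a convenient rank interpretation: it equals the probability that a randomly chosen spiked OTU receives a lower p value than a randomly chosen non-spiked one (ties counted half). A small sketch, with made-up p values:

```python
import numpy as np

def retrieval_auc(pvals, is_spiked):
    """ROC AUC for separating spiked from non-spiked OTUs by p value:
    the probability that a random spiked OTU gets a lower p value than
    a random non-spiked one, counting ties as 1/2."""
    p_spiked = pvals[is_spiked]
    p_null = pvals[~is_spiked]
    wins = (p_spiked[:, None] < p_null[None, :]).sum()
    ties = (p_spiked[:, None] == p_null[None, :]).sum()
    return (wins + 0.5 * ties) / (len(p_spiked) * len(p_null))

pvals = np.array([0.001, 0.2, 0.03, 0.8, 0.5])
is_spiked = np.array([True, False, True, False, False])
auc = retrieval_auc(pvals, is_spiked)   # → 1.0: both spiked OTUs rank first
```

An AUC of 0.5 means the p values carry no information about which OTUs were spiked, which is why tests falling below 0.5 indicate actively misleading rankings.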

We repeated these analyses with additive spiking (Additional file 5: Figure S4A–C), yielding very similar results, albeit with lower variance in the distributions. The methods exhibited the same hierarchy of performance across the three datasets as in the multiplicative spike-in tests. The best-performing models were already saturated at magnitude 10, rendering the magnitude of change of 20 unnecessary.

Finally, we considered a mixed spike-in setup (Additional file 6: Figure S5). In this setup, methods were not as clearly separated in AUC values, although the same general hierarchy was retrieved. The highest AUC values were found in the methods edgeR, metagenomeSeq ZIG, negative binomial GLMs, and the permutation test.

Figure 3 shows the median AUC values vs the median FPRs, illustrating the overall performance of the various methods in the three datasets, at the highest magnitude (20) with multiplicative spiking.

Fig. 3 Median test AUC vs median test FPR. Scatterplots of median test area under the curve (AUC) vs median test false positive rate (FPR) for the datasets A1–A3, three different compartments in the COPSAC2010 cohort. FPR is defined as the fraction of OTUs with p < 0.05. Dot color represents the differential relative abundance test method, while dot shape represents the experiment case/control balance

Subsets and simulated data

We repeated the FPR (n = 11,700) and spike-in tests (n = 117,000) to examine the relative performance of the various methods in small- and medium-sized datasets by subsetting datasets A1–A3, yielding datasets of lower sparsity (A1s–A3s and A1m–A3m) (Additional file 1: Table S1). We found that the performance hierarchy was very similar across all six subsets with regard to both FPR and spike-in AUC (Fig. 4). There were some deviations from the larger sets: in particular, metagenomeSeq ZIG and the permutation test performed worse in the subsets, whereas edgeR kept a high AUC but with a much lower FPR. The distributions of FPR and AUC under all parameter combinations can be seen in Additional file 7: Figure S6A–D.

Fig. 4 Median test AUC vs median test FPR, small and medium subsets. Scatterplots of median test area under the curve (AUC) vs median test false positive rate (FPR) for the datasets A1s–A3s and A1m–A3m, three different compartments in the COPSAC2010 cohort, with 50% case/control proportion. FPR is defined as the fraction of OTUs with p < 0.05. Dot color represents the differential relative abundance test method

In the simulated dataset A4, the AUC results (n = 58,500, Additional file 8: Figure S7A–C) were nearly identical to those from dataset A3, on which it was based, except for baySeq, which performed worse in dataset A4, but with very high variability.

Spike-in retrieval sensitivity to sparsity

We examined the effect of OTU sparsity on the ability of the various methods to retrieve spiked OTUs, expressed as the p value quantile of a spiked OTU within all p values from a dataset, for each method; this quantile should ideally be as low as possible. Additional file 9: Figure S8A–C shows that almost all methods had better detection power at low sparsity (many positive samples), but the patterns of saturation were quite different. Additional file 9: Figure S8A shows that most methods were primarily dependent on the number of positive samples, with the lines from the differently sized datasets following each other closely. Notably, the Wilcoxon tests were negatively influenced by zero-inflation, meaning that detection power decreased with dataset size for the same number of positive samples. For the smaller datasets, metagenomeSeq ZIG and baySeq did not show better performance with more positive samples. The filtered metagenomeSeq ZIG and the feature model had several OTUs with quantiles of 1, since p values were filtered out or not computed and therefore set to a value of 1. Additional file 9: Figure S8B shows sensitivity to case proportion, where almost all methods performed better across the range with more balanced group sizes, though the effect was greatest for the metagenomeSeq feature model and the t test. Finally, in Additional file 9: Figure S8C, we see that most methods saturated faster with high spike-in magnitude. Notably, the Wilcoxon test gained little from increased magnitudes. Overall, the metagenomeSeq feature model seemed to saturate at the lowest number of positive samples across these figures, although it required at least two positive samples in each group to compute a p value, most evident in Additional file 9: Figure S8B.
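The quantile metric above admits a one-line definition: the fraction of all p values in the run that are at most the spiked OTU's p value. How ties are counted is an assumption in this sketch:

```python
import numpy as np

def spike_quantile(pvals, spiked_index):
    """Quantile of a spiked OTU's p value within all p values of a run;
    lower means the spike ranks closer to the top of the results.
    Ties are counted as retrieved (<=), which is an assumption here."""
    return (pvals <= pvals[spiked_index]).mean()

pvals = np.array([0.01, 0.2, 0.4, 0.6, 0.8])
q = spike_quantile(pvals, spiked_index=0)   # → 0.2: the spike ranks in the top 20%
```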

Beta-diversity optimization

We studied the effects of library size normalization, count transformation, and distance metric on the ability to separate biologically relevant groups in beta-diversity analyses (Fig. 5). All analyses were significant with p < 0.001, but with large differences in R² values. In the feces dataset B1, the optimal separation was found for log-transformed counts using 10⁻⁵ as the pseudocount with weighted UniFrac, yielding an Adonis R² value of 0.367. In the HMP dataset B2, the optimum was untransformed counts with weighted UniFrac, with an R² value of 0.166. In the feces dataset B3, TMM normalization was not possible due to high sparsity, and this method was omitted from the analysis. The optimal separation was found for log-transformed counts using 10⁻⁵ as the pseudocount with weighted UniFrac, yielding an R² value of 0.145. As a sensitivity analysis to include TMM, we agglomerated closely related OTUs, thereby reducing the number of OTUs and the sparsity; this did not materially change the results (Additional file 10: Figure S9).
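The normalization, transformation, distance, and Adonis steps can be sketched end-to-end. In the sketch below, TSS normalization and a log₁₀ transform with pseudocount 10⁻⁵ follow the text; Euclidean distance stands in for weighted UniFrac (which would require a phylogeny), the PERMANOVA R² is computed directly from the distance matrix following Anderson's formulation, and the data and group structure are hypothetical:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def permanova_r2(dm_condensed, groups):
    """Adonis-style R²: among-group sum of squares over total sum of squares,
    computed directly from a distance matrix (Anderson 2001)."""
    groups = np.asarray(groups)
    n = len(groups)
    d2 = squareform(dm_condensed) ** 2
    ss_total = d2[np.triu_indices(n, 1)].sum() / n
    ss_within = 0.0
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        sub = d2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), 1)].sum() / len(idx)
    return 1 - ss_within / ss_total

rng = np.random.default_rng(3)
# Hypothetical data: 40 samples x 30 OTUs, two groups with a compositional shift
counts = rng.poisson(5.0, size=(40, 30)).astype(float)
counts[:20, :10] *= 4                  # group 0 enriched in the first 10 OTUs
groups = np.repeat([0, 1], 20)

tss = counts / counts.sum(axis=1, keepdims=True)   # TSS normalization
logged = np.log10(tss + 1e-5)                      # log transform, pseudocount 10^-5
r2 = permanova_r2(pdist(logged, metric="euclidean"), groups)
```

In practice the significance of R² would be assessed by permuting group labels, as Adonis does; only the effect-size computation is shown here.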

Fig. 5 The effect of normalization, transformation, and distance metric on beta-diversity separation. a Datasets B1–B3 (vertical panels) with all combinations of library size normalizations (horizontal panels) and count transformations (color) applied prior to the calculation of distances and use of the Adonis PERMANOVA model. Effect of the design variable in question (B1, age; B2, tongue versus palate; B3, age group) measured as the model R² value. The highest and lowest R² values (yielding the best and worst separation, respectively) are demonstrated in subplots b–g for each dataset as principal coordinate analysis plots, colored by design variable, with overlaid prediction ellipses (B1: subplots b, c; B2: subplots d, e; B3: subplots f, g). CSS cumulative sum scaling, TMM trimmed mean of M values, TSS total sum scaling

Across all three datasets, several characteristics were very similar. The most important factor was the choice of distance metric, with the weighted UniFrac metric scoring the highest in terms of separation power in all three datasets. The transformation applied to the (normalized) counts was of less importance. In the feces dataset B1, log transformations with very small pseudocounts were best, whereas untransformed counts were optimal in the HMP oral dataset B2; in both cases, the effects were large. The effect of library size normalization was by far the lowest: no normalization, CSS, DESeq2, and TMM normalization essentially did not differ in terms of separation in any of the three datasets. TSS was the normalization type with the highest impact, but it mostly changed the optimal transformation choice rather than improving or degrading the separation power as such. These effects are very apparent when comparing the optimal combination of normalization, transformation, and distance with the poorest for each dataset, as described above, visualized with principal coordinate analysis (PCoA) plots in the paired panels for each dataset (Fig. 5b–g).