Merging the geo-referenced 16S rRNA sequence data resulted in 844 valid soil samples, of which, 320 representative sampling sites were obtained after sample aggregation (Fig. 1a). Only bacterial diversity was analyzed, as the use of 16S rRNA sequences precludes the investigation of fungal diversity in the current study. Despite covering all 14 classified biomes of the world22, sampling was not even, and some biomes and continents were under- or overrepresented (e.g., deserts contribute to about 18.9% of the terrestrial surface, yet only 6.3% of samples originated from these environments). From a total of 256,620 amplicon sequence variants (ASV) detected, we removed Archaea and unassigned sequences (at kingdom level, 1.55%) leaving 98.45% of bacterial ASVs. For ease of communication, we refer to the designated bacterial ASVs as “species” throughout the text. The widest range of species richness was observed in deserts (Fig. 1b) and could be attributed to the wide span of variations in environmental conditions in such biomes23. The relatively low richness in montane grassland and tundra could be indicative of a non-monotonic relation between moisture availability and soil bacterial richness. Boreal forests (n = 11) exhibited lower richness compared to tropical (n = 23) and temperate forests (n = 122; p = 0.0311 and p = 0.0063, respectively, Wilcoxon rank sum test). This latitudinal shift in species richness17,20 suggests that temperature plays an important role in regulating bacterial richness. However, consideration of temperature alone provides no distinction between the richness observed in tropical and temperate forests (p = 0.6575, Wilcoxon rank sum test), suggesting more complex interactions and mechanisms.

Figure 1 Distribution of sites and representative samples obtained from three recent studies (EMBL17, EMP21, ZHOU20) used in this meta-analysis. (a) Geographical locations of sites (n = 320, by continent: AF = 27, AS = 42, AU = 30, EU = 55, NA = 104, SA = 62). Size of the points represents the number of samples that were aggregated within 0.1° × 0.1° cells. Colors orange, blue and green represent the three studies EMBL, EMP and ZHOU respectively. (b) Bacterial richness grouped by biomes (F. forest, G. grassland). Site values are shown in grey, while the red points represent mean values. Boxes show the inter quartile range (median as solid line) with bars indicating central 95%-range of values. Full size image

Univariate analysis of bacterial richness

We first evaluate trends of species richness considering climate and soil properties within univariate general additive modeling. Selected covariates were used that represent different aspects of the soil environment (Supplementary Table S1). Climatic water content (CWC) represents the soil water storage capacity and climatic water balance based on the number of consecutive dry days (DRY) and potential evapotranspiration (PET) (Supplementary Methods). It is a proxy for the soils wetness, its dynamics and aqueous phase connectivity. Both shape the number of distinct aqueous habitats and their connectedness in a soil. We found an optimal CWC in the range of 0.15 to 0.20 where bacterial richness peaks (Fig. 2a). A generally linear drop in richness seen towards low water availability is potentially due to nutrient limitations by the physically constrained diffusion processes and reduced carbon input. Soil pH exhibited a trend similar to the CWC with a peak near neutral values (pH 7, Fig. 2b) as reported in previous studies7,21. We note, however, a strong linear association between pH and climatic water content (R2 = 61%, n = 320, Fig. 2c). Climatically humid regions tend to be acidic and dry regions basic. Such trends have been attributed to the difference between mean annual precipitation (MAP) and PET that determine the climatic soil water balance for the region24. A net accumulation of salt in soil (e.g. in arid regions) directly results from a negative water balance with more evaporation than precipitation. This increase of mineral concentrations enhances the soil pH buffering capacity and can result in high soil pH. With an increase in ionic strength we would also expect effects on bacterial physiology (e.g. increased osmotic pressures11,12) and possible, specialized adaptations to these environments. A recent study attempted to disentangle the effect of salts and soil pH on bacterial community composition and revealed a strong effect of salinity25. This may also suggest that previously reported dependencies of bacterial diversity on soil pH7,8,17 could have been mediated by climatic soil water conditions via the accumulation of salts. Although pH is related to the suitability of bacterial habitats by increasing the tolerance (and competitive ability) of pH-adapted species25, it might not be the underlying driver of bacterial diversity. This reasoning is based on the idea that competitive exclusion can only occur with some degree of habitat overlap and interactions between species. Under most conditions in natural soils the aqueous phase is largely fragmented and the (micro-) environments experienced by bacteria are not necessarily the same. This fragmentation permits coexistence and suppresses the elimination of inferior competitors and, hence, promotes bacterial diversity. The distinct optimality of bacterial richness related to soil wetness could be attributed to (i) resource limitation for extremely dry soils and (ii) the increased habitat connectivity that suppresses diversity by promoting competitive exclusion in wet soils. In this context, pH represents the chemical niche environment, a variable under control of primary (physical) factors, i.e. resulting from a soil’s climatic water balance24. Temperature is another primary variable that might confound many processes. The mean annual temperature (MAT) is expected to alter species richness according to the metabolic theory of ecology20,26. This trend was manifested by a slight increase of richness with MAT peaking at 0–10 °C and 20–30 °C (Supplementary Fig. S1), in agreement with a previous study4. One explanation for the lack of clear patterns could be that temperature not only modifies growth rates of bacterial cells, but also affects habitat connectedness via effects on precipitation and water balance. This may counteract the enhancing effect of temperature on richness in wet and warm regions (e.g. the Tropics) where bacterial habitats are frequently connected. Furthermore, despite the strong variation of MAT near the soil surface, the effective range at the sampled depth of 10 cm might be narrower due to the damping effects of soil and leads to a limited range of conditions experienced in bacterial habitats. Additionally, bacteria could be able to tolerate a wide range of temperatures. Bacterial richness was found to be driven by temperature near geothermal springs only beyond 70 °C27; conditions that are not frequently found in soil. Nonetheless, changes in light intensity (solar radiation, RAD) are strongly correlated with temperature and latitude. A direct effect of light on bacterial richness would be expected by enabling growth of photoautotrophs and possible adaptation to high doses of UV light (or the lack thereof). Both effects could be masked by the presence of vegetation (e.g. NPP) that would intercept the solar radiation. We thus do not expect strong changes in the distribution of bacterial richness caused by light in vegetated environments and in sub surface soils (due to the strong attenuation of light). Nevertheless, the indirect effects of solar radiation should be well described by the used covariates (e.g. MAT and CWC) as light and water availability both shape the vegetation of an ecosystem. We used net primary productivity (NPP) to represent vegetation patterns at the ecosystem level and to characterize carbon input into subsurface bacterial habitats. NPP did not display a notable effect on species richness (slightly increasing richness up to 500 g C m−2 yr−1, constant richness beyond, Supplementary Fig. S1). Only in extreme environments, such as deserts and tundra, NPP seems to influence species richness.

Figure 2 Univariate general additive model (GAM) of soil bacterial richness. (a) Relation between climatic water content and bacterial richness. Bacterial richness peaks in soils with intermediate climatic water contents (0.15–0.2) and drops in dry and wet soils (R2 = 27.7%, RMSE = 298.1, AIC = 4557.5, EDF = 4.7). (b) Commonly observed trend of bacterial richness peaking at near neutral conditions (pH 7) and showing distinct drops in acidic and basic soils (R2 = 23.8%, RMSE = 306.0, AIC = 4574.0, EDF = 5.1). (c) A strong linear association (adjusted R2 = 60.8%, deviance explained 61.1%) is observed between climatic water contents and soil pH pointing to possible confounding effects of these covariates on bacterial richness. Shaded areas correspond to standard errors (n = 320). Full size image

Multivariate general additive model (GAM) of bacterial richness

The complexity of interactions among environmental factors, vegetation and soil microorganisms suggests that a single variable alone is not likely to explain the observed patterns of soil bacterial species richness. We therefore tested the robustness of the observed single-variable trends using a multivariate general additive model (GAM) with forward selection of covariates (Table 1, Supplementary Fig. S2). The ranking of the most influential covariates remained consistent with the results of univariate GAM, with CWC slightly outperforming pH. Interestingly MAT occupied the third rank, suggesting that we were able to successfully capture combined effects on soil bacterial species richness. The goodness-of-fit statistics of the multivariate GAM using only the six selected covariates (R2 = 35%, RMSE = 283.7) were better than the statistics of any univariate GAM, suggesting that soil and climatic covariates provide additional information on species richness. Although we observed significant associations between bacterial diversity and environmental factors in uni- and multivariate modeling, these associations do not necessarily imply causation.

Table 1 Ranking of covariates determined by forward selection for the multivariate general additive model (GAM). Full size table

To mitigate limitations of commonly used structural equation models (SEM) in discerning causal nonlinear effects, we have used a causal additive model (CAM)28 to explore potential causes of soil bacterial diversity. We used this novel approach to generate a graph of inferred structural dependencies between covariates and bacterial richness (Supplementary Fig. S3). By removing links between variables that are not considered significant (p ≤ 0.0005), we can distinguish direct from indirect relations between covariates and bacterial richness; as variables that remain linked to richness directly and variables that are connected to richness via others. Compared to the results of the multivariate GAM, we obtained a similar set of covariates with direct effect on species richness, i.e. CWC and DRY. Surprisingly, pH and MAT were not selected as potential direct causes, implying that they may have weaker effects on species richness or their associations with species richness were attributed to confounding effects. This approach enables further exploration of potential model structure. Nevertheless, care should be taken when interpreting inferred causal relationships as the method relies on the strong assumption of “no hidden variables” that are unknown in most natural environmental systems. Yet, it is noteworthy that no prior expectations or knowledge is imposed on the model structure, as is necessary with many SEM17. All direct and indirect links are deduced only from the observations with a given set of covariates. A drawback of this approach is that not all dependencies might be physically meaningful.

Varying proportions of low abundance species

Thus far, we have focused on explaining bacterial species richness without considering environmental effects on species with different levels of abundance. We evaluated the performance of the univariate (CWC, pH) and multivariate GAM for metrics of diversity other than species richness and found a consistent increase in R2 with increasing weight of species with low abundances (Supplementary Fig. S4). The observation indicates that species with low abundance show greater sensitivity to environmental conditions than the species dominating within samples. To further evaluate effects of environmental variables on rare and common fractions of the soil bacterial populations, we split the species in two groups by using a threshold (0.005%) of global relative abundance. For each sample, we computed the log-ratio of the number of rare and common species. A value of zero indicates that a sample contains the same number of rare and abundant species, and larger values indicate that the rare species are more numerous. We explored the dependence of the log-ratio on environmental covariates by univariate GAM (Supplementary Table S2). Interestingly we find similar, but complementary trends for CWC and pH (Fig. 3a,b). Most notably, a distinct drop in the number of rare species appears under elevated climatic soil water contents. This trend compares well with univariate and multivariate model results for species richness. The modeled dependencies of rare and common species diversity on climatic water content (Fig. 3c) demonstrate a higher susceptibility of rare species to increased aqueous phase connectivity associated with high water contents. While the common species remain abundant, the number of rare members of the soil bacterial community shows a steep decline towards wetter soil conditions. This discrepancy is weaker for soil pH where diversity of both rare and common species decrease at similar rates when approaching acidic conditions (Fig. 3d). The gradual increase in the proportion of globally rare species under drier conditions (low CWC) is likely due to the more fragmented aqueous phase that may shelter bacterial species in small but numerous isolated aqueous habitats13,14. Alternatively, one might argue that the emergence of rare species under basic (high pH) — and possibly also very dry— conditions is attributed to the presence of specialist phylotypes capable of coping with such an environment8,29. However, if neutral pH would be favored by most bacterial species (i.e. leading to more diversity) we would expect less balanced soil bacterial communities with more of the rare species present around pH 7. Interestingly, the log-ratio does not increase again towards acidic conditions. Hence, acidic environments reduce diversity of rare and common species to a similar extent, and rare (specialist) species that benefit from weaker competition with common species seem to be missing. Although, information on many additional factors that could affect the presence of rare and common species (e.g. nutrient status of the soil) could not be included in the analysis, general tendencies could be identified using the variables considered. We thus conclude that aqueous habitat connectivity largely dominates the soil bacterial richness picture and should be taken into account together with additional factors when data is available.

Figure 3 Dependence of the log-ratio of number of rare and common species per sample on the two main predictors of bacterial richness; (a) climatic water content (adjusted R2 = 24.5%, deviance explained 25.5%, AIC = 70.5, EDF = 4.3) and (b) soil pH (R2 = 23.0%, deviance explained 23.9%, AIC = 77.0, EDF = 4.1). The log ratio is calculated by splitting species into two groups based on a threshold of global relative abundance (0.005%). A log ratio of zero indicates a balanced population, where the number of rare species per sample equals the number of common species. The modeled curves of both groups richness (rare and common) are shown for (c) climatic water content and (d) soil pH. Shaded areas correspond to standard errors (n = 320). Full size image

Global patterns of soil bacterial richness

The GAM used in this study accounts only for independent and additive effects of covariates on species richness. This may not be a realistic depiction of processes in natural ecosystems with numerous connections and interdependencies. Tree-based statistical models seem better suited to account for (higher-order) interactions between variables. For prediction of global maps of species richness we therefore combined independently trained random forest and gradient boosting trees by simple averaging. The procedure was reinitialized and repeated ten times to stabilize the results and increase reproducibility. The tree-based model (R2 = 40%, RMSE = 261.5) performed better than the multivariate GAM (R2 = 35%, RMSE = 283.0) indicating that interactions between covariates are important for predicting species richness. Despite the considerably better performance, a large portion of variance remained unexplained. This is not unexpected, given the different sampling strategies and methodology of the studies. Additionally, covariates derived from remote sensing products and digital soil maps smooth the actual spatial variation of the respective characteristics and do not (yet) capture the full heterogeneity of natural soils. Another limitation of this study is the lack of fungal data. The data used does not permit analysis of fungal richness, and we can only speculate about potential, general trends. However, one study used in our dataset (EMBL)17 investigated fungal diversity across biomes and report that fungal diversity does not peak in temperate regions (unlike bacterial diversity). The authors further suggest niche differentiation lead to contrasting responses of fungal diversity with precipitation and soil pH compared to bacterial diversity17. We thus would expect fungi to play a dominant role in vegetated soils with lower pH and high C:N ratios17,30. Such regions (biomes) are represented by high NPP and high climatic water contents. In these environments the aqueous phase connectedness could additionally enhance competition; potentially also between bacteria and fungi. The global map of predicted bacterial richness shows distinct regions of varying bacterial richness (Fig. 4). Tropical regions (e.g. the Amazon and the Congo Basin rainforests) exhibit remarkably lower bacterial richness highlighting the adverse effects of high levels of soil wetness on bacterial diversity. Lowest richness values were also found in regions where resources are most limiting, such as in the Sahara or the Atacama deserts. “Hotspots” of species richness lie in temperate regions and climatic transition zones where resource availability is not limiting and the aqueous phase remains fragmented, such as in the northern regions of India or in the Sahel. Tree-based methods provide a complementary approach to GAM as they efficiently handle higher order interactions between covariates and provide an efficient interface for spatial mapping. The implicit representation of covariate dependencies and model averaging, however, do not offer as much insight into the model structure as is possible with GAMs.