Autocorrelation in language diversity

To weigh the relative explanatory power of the isolation and ecological risk hypotheses, we tested the association between language diversity and six climatic variables, four geographic variables, two human population variables and four biodiversity measures. Our analyses were based on global-scale datasets of the geographic distribution of 6425 languages1, high-resolution climatic and geographic data layers, and global biodiversity datasets. We used the number of languages whose distribution overlaps each cell of a global equal-area grid as the measure of language diversity. A grid-based approach eliminates variation in language diversity and other variables due to differences in land area. It also allowed us to repeat analyses under three different spatial resolutions, as previous studies have shown that spatial resolution can influence tests for latitudinal gradients in language diversity7,26.
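The grid-based diversity measure described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the overlap between each language range and the grid has already been computed, and the input format, the function name `languages_per_cell` and the toy language names are all hypothetical.

```python
from collections import Counter

def languages_per_cell(language_cells):
    """Count how many language ranges overlap each grid cell.

    language_cells maps a language name to the collection of grid-cell IDs
    that its range overlaps (a hypothetical, pre-computed input format).
    """
    counts = Counter()
    for cells in language_cells.values():
        counts.update(set(cells))  # each language counts at most once per cell
    return dict(counts)

# Toy data: three made-up languages on a small grid (cell 4 has no languages).
toy_ranges = {
    "lang_a": {1, 2},
    "lang_b": {2, 3},
    "lang_c": {2},
}
richness = languages_per_cell(toy_ranges)
print(richness)
```

Because every cell of an equal-area grid covers the same area, these counts can be compared across cells without a further correction for land area.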

In order to untangle causal connections from incidental associations, we need to account for sources of covariation in our data. In particular, we need to address spatial autocorrelation and phylogenetic non-independence32,33,34. Grid cells that are located near each other are likely to have similar values of climatic and landscape variables, contain related human cultures and languages, and share much of their flora and fauna. If there are any particular features of cultures or languages that correlate with language diversity, then they will tend to co-vary with environmental features and biodiversity measures, whether or not there is any causal connection between them.

Our analyses show that regions in close geographic proximity, and with a high degree of language relatedness, tend to be more similar in their language diversity (Fig. 1 and Supplementary Figure 1). Spatial autocorrelation and phylogenetic relatedness account for 15% of the variance in language diversity in analyses at low spatial resolution, 18% at medium resolution and 5% at high resolution, even after removing highly correlated grid cells (see Methods). This confirms the need to account for spatial proximity and phylogenetic relatedness between grid cells when testing for correlates of language diversity.

After correcting for spatial autocorrelation and phylogenetic relatedness, there is a latitudinal gradient in language diversity, with more languages near the equator than at higher latitudes (Fig. 1). The regression coefficient of absolute latitude against language diversity is significantly negative under all resolutions (low: t = −2.58; p = 0.011; medium: t = −2.49; p = 0.014; high: t = −5.22; p < 0.001).

Climatic effect on language diversity

We tested six climatic variables for associations with language diversity: mean annual temperature, mean annual precipitation, temperature seasonality, precipitation seasonality, net primary productivity, and mean annual growing season (Supplementary Figure 2). Among these climatic variables, precipitation seasonality has the strongest association with language diversity in low resolution analyses, and temperature seasonality has the strongest association with language diversity in medium and high-resolution analyses, independently of their covariation with other climatic variables (Table 1). The six climatic variables also provide sufficient explanation for the latitudinal gradient in language diversity, because adding latitude as an explanatory variable in the model in addition to the six climatic variables does not significantly increase the model fit under any resolution (low: LR = 0.94, p = 0.33; medium: LR = 1.15, p = 0.28; high: LR = 2.03, p = 0.15).
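As a worked check of the likelihood ratio comparison above, the sketch below converts an LR statistic into a p-value. It assumes the statistic is referred to a chi-square distribution with one degree of freedom, because a single predictor (latitude) is added; the section does not state the reference distribution, so this is an assumption, and the function name is hypothetical.

```python
import math

def lr_test_p_value_1df(lr_statistic):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)), so only the standard library is needed.
    """
    if lr_statistic < 0:
        raise ValueError("likelihood ratio statistic must be non-negative")
    return math.erfc(math.sqrt(lr_statistic / 2.0))

# LR statistics reported in the text for adding latitude to the climate model.
reported_lr = {"low": 0.94, "medium": 1.15, "high": 2.03}
p_values = {res: lr_test_p_value_1df(lr) for res, lr in reported_lr.items()}

for res, p in p_values.items():
    print(f"{res}: LR = {reported_lr[res]:.2f}, p = {p:.2f}")
```

Under this assumption the computed values come out close to the reported p-values, none of which falls below the conventional 0.05 threshold.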

Table 1 Climatic effects on language diversity

The ecological risk hypothesis may explain the association between language diversity and seasonality. It predicts higher language diversity in regions with longer periods of reliable food production, because reliable production allows smaller cultural groups to be self-sufficient16,17,18,19. This hypothesis makes testable predictions about the associations between climate, population size and density, and language diversity. We followed previous studies16,17,18,19 in using mean growing season (the number of days per year suitable for growing crops) as an indicator of ecological risk, although our results indicate that temperature seasonality may be a better predictor of the influence of environment on language diversity. The ecological risk hypothesis predicts that longer growing seasons will result in reduced area per language and smaller speaker population sizes. We find evidence to support both of these predictions. Longer growing seasons are associated with a greater number of languages per grid cell at all three resolutions (Table 2), consistent with a reduction in the range sizes of languages allowing tighter packing of languages (as language polygons are largely non-overlapping, a smaller average range size allows more languages to fit into a given area). The increase in language diversity is not simply a result of areas with long growing seasons supporting a greater number of people, because mean growing season has a significant positive association with language diversity beyond its covariation with population density (Table 2). Mean growing season is negatively associated with minimum speaker population size (the population size of the smallest language in a grid cell) under medium and high resolutions (Table 2), consistent with the prediction that smaller cultural groups are more able to persist in areas of longer growing season. There is no association between mean growing season and the average speaker population size of all the languages in a grid cell, so increased packing is primarily a result of a reduction in language range size in areas with longer growing seasons, rather than a reduction in the average size of cultural groups within those high-diversity areas.
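The two speaker population measures used in these tests (the minimum and the average speaker population size per grid cell) can be sketched as below. This is an illustration only: the input format, the function name and the population figures are hypothetical and are not taken from the study's data.

```python
from statistics import mean

def speaker_population_summary(cell_populations):
    """Summarise speaker populations for each grid cell.

    cell_populations maps a grid-cell ID to a list of speaker population
    sizes, one entry per language found in that cell (hypothetical format).
    Cells without any language are skipped.
    """
    summary = {}
    for cell, populations in cell_populations.items():
        if not populations:
            continue
        summary[cell] = {
            "n_languages": len(populations),
            "min_speakers": min(populations),    # size of the smallest language
            "mean_speakers": mean(populations),  # average over all languages
        }
    return summary

# Toy data with made-up population sizes.
toy_cells = {
    "cell_1": [500, 20000, 1500],
    "cell_2": [120000],
    "cell_3": [],
}
summary = speaker_population_summary(toy_cells)
print(summary["cell_1"])
```

Keeping the minimum and the mean separate matters here, because the text reports an association with growing season for the former but not for the latter.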

Table 2 Predictions of the ecological risk hypothesis

Our results are broadly consistent with the ecological risk hypothesis, because mean growing season is associated with both the minimum group size and the number of languages per grid cell. However, we find that seasonality in temperature and precipitation has additional associations with language diversity that are not attributable to mean growing season. This is consistent with a recent study that supports associations between language diversity and the average amount of precipitation in the wettest quarter and temperature in the warmest quarter36. Although growing season is defined by the number of days above a specific minimum temperature and moisture availability, seasonality reflects both the minimum and the maximum of temperature and moisture. Therefore, this result may suggest that climatic extremes across seasons shape language diversity, in addition to the average length of the growing season.

Landscape effect on language diversity

To examine the effect of isolation on language diversity, we tested four landscape variables that have been suggested to influence patterns of human movement and therefore contribute to the isolation of cultural groups: mean altitude, altitudinal range, landscape roughness, and river density (Supplementary Figure 2). Higher river density is associated with greater language diversity at low and medium resolutions, beyond its covariation with climatic variables and the other landscape variables (Table 3). Although this result is consistent with previous proposals that rivers act to isolate populations into smaller language groups13, we find little additional support for this hypothesis. Although river density is associated with smaller minimum speaker population size at medium resolution (Table 3), there is no association between river density and average speaker population size (controlling for the effects of population density). These observations suggest that the association between river density and language diversity is more akin to the ecological risk hypothesis than to the isolation hypothesis, because rivers seem to allow the persistence of smaller speaker populations, but not to divide human populations into smaller speaker populations. In this sense, rivers may act more as an ecological resource than a barrier to interaction.

Table 3 Landscape effects on language diversity and speaker population size

Similarly, although altitudinal range is associated with language diversity at high resolution with marginal significance, there is no evidence that this is caused by isolation, as altitudinal range does not result in reduction in speaker population size, even when controlling for population density (Table 3). Although landscape roughness is significantly associated with language diversity when altitudinal range is not included in the model (t = 2.87; p = 0.004), we find no significant association between landscape roughness and language diversity beyond its covariation with climatic variables and the other landscape variables under the three resolutions, and no statistically significant negative association between landscape roughness and speaker population size (Table 3).

In contrast to a previous study that described river density and landscape roughness as universal determinants of language diversity13, we find little evidence that landscape variables have a strong or consistent influence on language diversity. Although we use similar data to Axelsen & Manrubia13, there are a number of differences in our analytical approach. To compare our results to theirs, we reanalyzed our data using their method, fitting continent-specific parameter values and not including altitudinal range. Without correcting for spatial and phylogenetic non-independence among grid cells, we obtain similar results to Axelsen & Manrubia13, namely that river density and landscape roughness have significant associations with language diversity in most continents (Supplementary Table 1). But when we correct the data for non-independence among grid cells, neither river density nor landscape roughness has a significant association with language diversity in any continent (Supplementary Table 1). We therefore conclude that the previous result was driven primarily by spatial autocorrelation and phylogenetic non-independence, with the similarity in both landscape variables and language diversity between neighboring grid cells generating spurious correlations.

In conclusion, we find little consistent support for an effect of landscape factors on language diversity. Although we find associations between language diversity and river density, altitudinal range and landscape roughness, these landscape factors have much less influence on language diversity than climatic factors, and there is little indication that the associations are caused by the division of human populations into smaller, isolated cultural groups. Instead, previous results suggesting that river density and landscape roughness are universal determinants of language diversity13 may have been driven by autocorrelation among grid cells.

Link between language diversity and biodiversity

We now ask if biodiversity provides any additional explanation for language variation beyond covariation with climate and landscape factors. Adding mammal or bird diversity as additional predictors to the climatic and landscape variables significantly improves model fit, but adding vascular plant or amphibian diversity does not provide additional explanatory power (Table 4). Adding biome to the analysis increases model fit above climate variables at low resolution, suggesting that ecosystem structures may influence language diversity. However, it does not provide significant explanatory power above the effect of climate at medium and high resolutions (low: LR = 27.01, p = 0.02; medium: LR = 14.91, p = 0.38; high: LR = 11.83, p = 0.62).

Table 4 Association between biodiversity and language diversity

Why are bird and mammal diversity associated with language diversity? There is no evidence that this is due to a direct causal relationship between biodiversity and language diversity, because there is no consistent relationship between these biodiversity measures and residual variation in language diversity, above and beyond that explained by climate and landscape (Supplementary Table 2). Instead, the increase in model fit when bird and mammal diversity are added to the model of language diversity, climate and landscape seems to be driven primarily by regions that have both low language diversity and low species diversity, particularly the Sahara, the Arabian Peninsula, and the Tibetan Plateau (Fig. 2 and Supplementary Figure 3), which present harsh environmental conditions for birds and mammals (including humans). These are not the only regions of low diversity, but they seem to have a disproportionate influence on the relationship between mammal and bird diversity and language diversity (Supplementary Figure 4). Running the high-resolution analysis without these low diversity areas, we find that adding mammal or bird diversity as additional predictors to the climatic and landscape variables no longer increases model fit (n = 334, mammal: LR = 1.92, p = 0.17; bird: LR = 3.67, p = 0.07), although this reduced dataset still shows results for the climatic and landscape effects that are similar to those from the complete dataset. Even when these low diversity areas are excluded from the analysis, temperature seasonality remains the strongest predictor of language diversity among the climatic variables (t = −2.34, p = 0.02) and altitudinal range remains the strongest predictor of language diversity among the landscape variables (t = 2.27, p = 0.02). These results suggest that the low diversity areas have a significant effect on the association between biodiversity and language diversity, but they are not responsible for the broader association between language diversity and climatic and landscape effects.

Fig. 2 Global distribution of mammal diversity and bird diversity. The number of species is shown on a logarithmic scale for 200 × 200 km cells of an equal-area grid. For amphibian and plant diversity see Supplementary Figure 3

In conclusion, we find that the association between language diversity and biodiversity appears to be largely a result of their covariation with common climatic and landscape factors, and any additional increase in model fit between language diversity and mammal and bird diversity is likely due to the disproportionate effect of a few regions of harsh environment that reduce both biodiversity and language diversity.

Residual variation in language diversity

The six climatic variables and the four landscape variables together explain 45% of the variance in language diversity under low resolution, 31% under medium resolution, and 27% under high resolution (after correction for phylogenetic and spatial non-independence). About 80% of this explanatory power is contributed by the six climatic variables under the three resolutions. Measures of biodiversity do not appear to add explanatory power beyond their covariation with climatic factors, apart from the influence of several key areas of low diversity.
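To make these proportions concrete, the arithmetic can be sketched as below. The sketch treats "about 80%" as exactly 0.80 and attributes the remainder of the explained variance to the landscape variables, ignoring any variance shared between the two sets; both are simplifying assumptions, so the outputs are rough illustrations rather than reported results.

```python
# Share of variance explained by climate plus landscape, as reported per resolution.
total_explained = {"low": 0.45, "medium": 0.31, "high": 0.27}

# "About 80%" in the text, treated here as exact (an assumption for illustration).
CLIMATE_SHARE = 0.80

def split_explained_variance(total, climate_share=CLIMATE_SHARE):
    """Split explained variance into an approximate climate part and a remainder."""
    climate = total * climate_share
    return climate, total - climate

shares = {res: split_explained_variance(t) for res, t in total_explained.items()}

for res, (climate, remainder) in shares.items():
    unexplained = 1.0 - total_explained[res]
    print(f"{res}: climate ~{climate:.0%}, remainder ~{remainder:.0%}, "
          f"unexplained ~{unexplained:.0%}")
```

Even at low resolution, more than half of the variance is left unexplained under this breakdown, which motivates the analysis of residuals.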

What accounts for the remaining variation in language diversity? Fig. 3 shows the distribution of the residuals in language diversity after removing the climatic and landscape effects on language diversity. We can identify areas of high unexplained language diversity as the red grid cells with residuals ≥ 1.96 standard deviations higher than predicted by the climatic and landscape variables alone. These grid cells are concentrated in four regions—New Guinea, Eastern Himalaya, West Africa, and Mesoamerica. Language diversity in grid cells with residuals ≤ −1.96 (blue) is lower than we would predict based on the climatic and landscape variables, most notably in the lower Amazon Basin of South America.
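The ±1.96 threshold used to flag grid cells can be sketched as follows. This sketch assumes the residuals are standardised by subtracting their mean and dividing by their standard deviation; the section does not spell out the standardisation, and the function name, labels and toy values are hypothetical.

```python
from statistics import mean, pstdev

def flag_residuals(residuals, threshold=1.96):
    """Label each grid cell by its standardised residual.

    residuals maps a grid-cell ID to a raw model residual. Cells whose
    standardised residual is >= threshold are labelled "high", cells
    <= -threshold are labelled "low", and all others "expected".
    """
    values = list(residuals.values())
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        raise ValueError("residuals have zero variance")
    labels = {}
    for cell, value in residuals.items():
        z = (value - centre) / spread
        if z >= threshold:
            labels[cell] = "high"
        elif z <= -threshold:
            labels[cell] = "low"
        else:
            labels[cell] = "expected"
    return labels

# Toy residuals: eight unremarkable cells plus one high and one low outlier.
toy_residuals = {f"c{i}": 0.0 for i in range(8)}
toy_residuals["c8"] = 10.0
toy_residuals["c9"] = -10.0
labels = flag_residuals(toy_residuals)
print(labels["c8"], labels["c9"], labels["c0"])
```

In the figure, cells labelled "high" here would correspond to the red cells and cells labelled "low" to the blue cells.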

Fig. 3 Global distribution of residuals in language diversity. Residuals after accounting for the climatic and landscape effects on language diversity are shown for 200 × 200 km grid cells of an equal-area grid. Aggregations of grid cells with residuals ≥ 1.96 (red) are circled. These indicate four regions of higher than expected language diversity, compared with regions of similar climate and landscape (New Guinea, Eastern Himalaya, West Africa, and Mesoamerica). Areas of lower than expected language diversity with residuals ≤ −1.96 (blue) are distributed in South America, mostly in the Amazon basin. The figure only shows grid cells for which we have relevant data

There are several possible explanations for these areas of relative excess or paucity of languages, beyond that predicted by the climate and landscape variables. One is that they reflect the relative completeness of language documentation. For example, Amazonia is considered an area of high language diversity27, but incomplete documentation in the central areas of this region has led to it being described as the least known and least understood linguistic region37. Therefore, the true number of languages may be higher than the documented number of languages. However, it seems unlikely that the opposite effect (over-reporting of language diversity) would explain the areas of high unexplained language diversity.

Alternatively, it may be that other factors contribute significantly to shaping language diversity that are captured neither by climate variables (representing the ecological risk hypothesis) nor by landscape variables (representing isolation mechanisms). For example, regions of higher than expected language diversity may have had a longer period of in situ language diversification, or have undergone a higher rate of diversification, leading to a greater accumulation of languages in these regions than in other regions of similar climate. One way to investigate the influence of time or diversification rate on diversity is to use a phylogeny that contains information on the relative timing of diversification events in order to compare the timescale and rate of diversification in different regions24,38. Although phylogenies are available for the languages within some language families39,40,41,42,43,44, and a global distance-based phylogeny exists45, there is currently no dated phylogeny of the world’s languages, nor is there general agreement on the relationships or age of language families. Therefore, we lack the means to make a quantitative comparison of duration or rates of diversification between the majority of grid cells (those that contain languages from different families or languages not contained in comprehensive phylogenies).

Nevertheless, we can make a qualitative comparison of the relative depth of divergence represented in each grid cell if we make the simple assumption that languages from the same language family diverged more recently than languages from different families. The number of language families per grid cell is a significant predictor of residuals in language diversity under the three resolutions (low: t = 4.65, p < 0.001; medium: t = 6.27, p < 0.001; high: t = 8.83, p < 0.001; Supplementary Figure 5). However, we are hesitant to draw strong conclusions from this pattern. For example, although New Guinea has more language families per grid cell than most other regions, the other areas of high unexplained language diversity do not have unusually high language family diversity, and some areas with many language families do not have high language richness (Fig. 4; Supplementary Figure 5). Clearly, this is not an ideal analysis of variation in time for diversification, as we cannot standardize time or rate of language evolution across families without a global dated phylogeny. But it suggests that time for diversification may be a profitable area of enquiry once complete language phylogenies become available.

Fig. 4 Global distribution of the number of language families. Numbers of language families are shown for 200 × 200 km cells of an equal-area grid. Language family is defined by the World Language Mapping System taxonomy1. Language isolates are treated as distinct families. Number of language families within a grid cell is calculated as the number of language families that include at least one language distributed in the grid cell. The figure only shows grid cells for which we have relevant data
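The family count described in this caption can be sketched as below, with each isolate treated as its own family. The input format, the function name and the language and family names are hypothetical placeholders; the actual analysis uses the World Language Mapping System taxonomy.

```python
def families_per_cell(cell_languages, language_family):
    """Count distinct language families represented in each grid cell.

    cell_languages maps a grid-cell ID to the languages found in that cell.
    language_family maps a language to its family name; a language that is
    missing or mapped to None is treated as an isolate, i.e. as a family
    of its own.
    """
    counts = {}
    for cell, languages in cell_languages.items():
        families = set()
        for language in languages:
            family = language_family.get(language)
            if family is None:
                # An isolate counts as a distinct one-language family.
                families.add(("isolate", language))
            else:
                families.add(("family", family))
        counts[cell] = len(families)
    return counts

# Toy data with made-up languages and families.
toy_family = {"lang_a": "fam_1", "lang_b": "fam_1", "lang_c": "fam_2", "lang_d": None}
toy_cells = {
    "cell_1": ["lang_a", "lang_b"],            # one family
    "cell_2": ["lang_a", "lang_c", "lang_d"],  # two families plus one isolate
    "cell_3": [],
}
family_counts = families_per_cell(toy_cells, toy_family)
print(family_counts)
```

A cell can therefore hold many languages but few families, which is why family counts and language richness need not coincide.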

In addition to factors relating to data completeness and time for diversification, we expect a large number of other factors to have influenced patterns of language diversity, which were not included in this study owing to our focus on the influence of climate, landscape, and biodiversity. Some of these additional factors may have global patterns of influence. For example, it has been suggested that the relative explanatory power of climate on language diversity is stronger for foraging and pastoral societies, and less so for agricultural societies5,35. Subsistence strategy is also strongly influenced by climatic factors46. The areas of greater language diversity than would be predicted using environmental factors alone predominantly coincide with areas where the dominant subsistence strategy is plant-based agriculture46. However, it is unlikely that this provides a strong explanation for these hotspots because there are many more regions dominated by agriculture that do not have higher than expected language diversity.

It is also important to acknowledge that our analysis is based on a contemporary snapshot of language diversity and uses only current climate information; we are therefore unable to capture the influence of past environmental variation. Nor can we account for the influence of changing patterns of cultural diversity, political complexity, or subsistence patterns over time or space35. The identification of environmental factors associated with patterns of language distribution and diversity does not deny the role of historically contingent events unique to each culture. Human history is influenced by a great diversity of factors, including conflict, political structures, and patterns of human migration. But, on top of these influences, we detect consistent influences of environmental factors that add to the milieu of factors affecting patterns of human diversity.

The overall picture supported by our analyses is that environmental factors are a significant determinant of global variation in the diversity of human languages, as they are for global variation in biodiversity. Associations between global patterns of language diversity and climate are consistent with the ecological risk hypothesis, which holds that stable, productive climates allow human cultures to persist in smaller, more localized groups. Our results offer less support for isolation mechanisms as global drivers of language diversity. Although there are significant associations between language diversity and river density, altitudinal range and landscape roughness, landscape factors have less explanatory power than climate, and the patterns are not indicative of a mechanism that divides human populations into smaller, isolated cultural groups. The association between biodiversity and language diversity is likely due to an incidental association between language and species richness driven by shared causal factors such as climate and landscape. The importance of macroevolutionary influences, such as the time available to accumulate diversity or the rate of language diversification, is yet to be explored in detail.