Overview

Using a Bayesian model-based geostatistical framework and synthesizing geolocated data from 528 household and census datasets, this analysis provides subnational estimates of mean numbers years of education and the proportion of the population who attained key levels of education for women of reproductive age (15–49 years), women aged 20–24 years, and equivalent male age bins between 2000 and 2017 in 105 countries across all low- and middle-income countries (LMICs). Countries were selected for inclusion in this analysis using the socio-demographic index (SDI) published in the Global Burden of Disease (GBD) study27. The SDI is a measure of development that combines education, fertility, and poverty. Countries in the middle, lower-middle, or low SDI quintiles were included, with several exceptions. Albania, Bosnia, and Moldova were excluded despite middle SDI status due to geographical discontinuity with other included countries and lack of available survey data. Libya, Malaysia, Panama, and Turkmenistan were included despite higher-middle SDI status to create better geographical continuity. We did not analyse American Samoa, Federated States of Micronesia, Fiji, Kiribati, Marshall Islands, Samoa, Solomon Islands, or Tonga, where no available survey data could be sourced. Analytical steps are described below, and additional details can be found in the Supplementary Information.

Data

We compiled a database of survey and census datasets that contained geocoding of subnational administrative boundaries or GPS coordinates for sampled clusters. These included datasets from 528 sources (see Supplementary Table 2). These sources comprised at least one data source for all but two countries on our list of LMICs: Western Sahara and French Guiana. We chose to exclude these two countries from our analysis; 42 of 105 included countries have only subnational administrative level data. We extracted demographic, education, and sample design variables. The coding of educational attainment varies across survey families. In some surveys, the precise number of years of attainment is not provided, with attainment instead aggregated into categories such as ‘primary completion’ or ‘secondary completion’. In such cases, individuals who report ‘primary completion’ may have gone on to complete some portion of secondary education, but these additional years of education are not captured in the underlying dataset. Previous efforts to examine trends in mean years of education have either assumed that no additional years of education were completed (that is, primary education only) or have used the midpoint between primary and secondary education as a proxy28. Trends in the single-year data, however, demonstrate that such assumptions introduce bias in the estimation of attainment trends over time and space, as differences in actual drop-out patterns or binning schema can lead to biased mean estimates29.

For this analysis, we used a recently developed method that selects a training subset of similar surveys across time and space to estimate the unobserved single-year distribution of binned datasets29. In comprehensive tests of cross-validation that leveraged data for which the single-year distributions are observed, this algorithmic approach significantly reduces bias in summary statistics estimated from datasets with binned coding schemes compared to alternatives such as the standard-duration method28. The years in all coding schemes were mapped to the country- and year-specific references in the UNESCO International Standard Classification of Education (ISCED) for comparability30. We used a top coding of 18 years on all data; this is a common threshold in many surveys that have a cap and it is reasonable to assume that the importance of education for health outcomes (and other related SDGs) greatly diminishes after what is the equivalent of 2 to 3 years of graduate education in most systems.

Data were aggregated to mean years of education attained and the proportions achieving key levels of education. The levels chosen were proportion with zero years, proportion with less than primary school (1–5 years of education), proportion with at least primary school (6–11 years of education), and proportion achieving secondary school or higher (12 or more years of education). A subset of the data for a smaller age bin (20–24 years) was also examined to more closely track temporal shifts. Equivalent age bins were aggregated for both women and men to examine disparities in mean years of attainment by sex. Where GPS coordinates were available, data were aggregated to a specific latitude and longitude assuming a simple-random sample, as the cluster is the primary sampling unit for the stratified design survey families, such as the Demographic and Health Survey (DHS) and Multiple Indicator Cluster Survey (MICS). Where only geographical information was available at the level of administrative units, data were aggregated with appropriate weighting according to their sample design. Design effects were estimated using a package for analysing complex survey data in R31.

Spatial covariates

To leverage strength from locations with observations to the entire spatiotemporal domain, we compiled several 5 × 5-km2 raster layers of possible socioeconomic and environmental correlates of education (Supplementary Table 5 and Supplementary Fig. 6). Acquisition of temporally dynamic datasets, where possible, was prioritized to best match our observations and thus predict the changing dynamics of educational attainment. We included nine covariates indexed at the 5 × 5-km2 level: access to roads, nighttime lightstv, populationtv, growing season, ariditytv, elevation, urbanicitytv, irrigation, and yeartv (tv, time-varying covariates). More details, including plots of all covariates, can be found in the Supplementary Information.

Our primary goal is to provide educational attainment predictions across LMICs at a high (local) resolution, and our methods provide the best out-of-sample predictive performance at the expense of inferential understanding. To select covariates and capture possible nonlinear effects and complex interactions between them, an ensemble covariate modelling method was implemented32. For each region, three submodels were fitted to our outcomes using all of our covariate data: generalized additive models, boosted regression trees, and lasso regression. Each submodel was fit using fivefold cross-validation to avoid overfitting and the out-of-sample predictions from across the five folds were compiled into a single comprehensive set of predictions from that model. Additionally, the same submodels were also run using 100% of the data and a full set of in-sample predictions were created. The five sets of out-of-sample submodel predictions were fed into the full geostatistical model as predictors when performing the model fit. The in-sample predictions from the submodels were used as the covariates when generating predictions using the fitted full geostatistical model. This methodology maximizes out-of-sample predictive performance at the expense of the ability to provide statistical inference on the relationships between the predictors and the outcome. A recent study has shown that this ensemble approach can improve predictive validity by up to 25% over an individual model32. More details on this approach can be found in the Supplementary Information.

The primary goal of using the stacking procedure in our analyses was to maximize the predictive power of the raster covariates by capturing the nonlinear effects and complex interactions between covariates to optimize the model performance. It has previously been suggested32 that the primary purpose of the submodel predictions is to improve the mean function of the Gaussian process. Although we have determined a way to include the uncertainty from two of our submodels (lasso regression and generalized additive models (GAM)), we have not determined a way to include uncertainty from the boosted regression tree (BRT) submodel into our final estimates. Whereas GAM and lasso regression seek to fit a single model that best describes the relationship between response variable and some set of predictors, BRT method fits a large number of relatively simple models for which the predictions are then combined to give robust estimates of the response. Although this feature of the BRT model makes it a powerful tool for analysing complex data, quantifying the relative uncertainty contributed by each simple model as well as uncertainty from the complex interactions of the predictor variables is challenging33,34. It is worth noting, however, that our out-of-sample validation indicates that the 95% coverage is fairly accurate (for example, closely ranges around 95%) as shown in the figures and table of Supplementary Information section 4.3.2. This indicates that we are not misrepresenting the uncertainty in our final estimates.

Analysis

Geostatistical model

Gaussian and binomial data are modelled within a Bayesian hierarchical modelling framework using a spatially and temporally explicit hierarchical generalized linear regression model to fit the mean number years of education attainment and the proportion of the population who achieved key bins of school in 14 regions across all LMICs as defined in the GBD study (Extended Data Fig. 1). This means we fit 14 independent models for each indicator (for example, the proportion of women with zero years of schooling). GBD study design sought to create regions on the basis of three primary criteria: epidemiological homogeneity, sociodemographic similarity, and geographical contiguity27. Fitting our models by these regions has the advantage of allowing for some non-stationarity and non-isotropy in the spatial error term, compared to if we modelled one spatiotemporal random-effect structure over the entire modelling region of all LMICs.

For each Gaussian indicator, we modelled the mean number of years of attainment in each survey cluster, d. Survey clusters are precisely located by their GPS coordinates and year of observation, which we map to a spatial raster location i at time t. We model the mean number of years of attainment as Gaussian data given fixed precision τ and a scaling parameter s d (defined by the sample size in the observed cluster). As we may have observed multiple data clusters within a given location i at time t, we refer to the mean attainment, μ, within a given cluster d by its indexed location i, and time t as μ i(d),t(d) .

$${{\rm{edu}}}_{d}|{\mu }_{i(d),t(d)},{s}_{d},\tau \, \sim \,{\rm{Normal}}({\mu }_{i(d),t(d)},\tau {s}_{d})\,\forall \,{\rm{observed}}\,{\rm{clusters}}\,d$$

$${\mu }_{i,t}=\,{\beta }_{0}+{{\bf{X}}}_{i,t}{\boldsymbol{\beta }}+{Z}_{i,t}+{{\epsilon }}_{{\rm{ctr}}(i)}+{{\epsilon }}_{i,t}\,\forall \,i\in {\rm{spatial}}\,{\rm{domain}}\,\forall \,t\in {\rm{time}}\,{\rm{domain}}$$

For each binomial indicator, we modelled the number of individuals at a given attainment level in each survey cluster, d. We observed the number of individuals reporting a given attainment level as binomial count data C d among an observed sample size N d . As we may have observed multiple data clusters within a given location i at time t, we refer to the probability of attaining that level, p, within a given cluster d by its indexed location i and time t as p i(d),t(d) .

$${C}_{d}|{p}_{i(d),t(d)},\,{N}_{d}\sim {\rm{Binomial}}({p}_{i(d),t(d)},\,{N}_{d})\,\forall \,{\rm{observed}}\,{\rm{clusters}}\,d$$

$${\rm{logit}}({p}_{i,t})=\,{\beta }_{0}+{{\bf{X}}}_{i,t}{\boldsymbol{\beta }}+{Z}_{i,t}+{{\epsilon }}_{{\rm{ctr}}(i)}+{{\epsilon }}_{i,t}\,\forall \,i\in {\rm{spatial}}\,{\rm{domain}}\,\forall \,t\in {\rm{time}}\,{\rm{domain}}$$

We used a continuation-ratio modelling approach to account for the ordinal data structure of the binomial indicators35. To do this, the proportion of the population with zero years of education was modelled using a binomial model. The proportion with less than primary education was modelled as those with less than primary education of those that have more than zero years of education. The same method followed for the proportion of population completing primary education. The proportion achieving secondary school or higher was estimated as the complement of the sum of the three binomial models.

The remaining parameter specification was consistent between all indicators in both binomial and Gaussian models:

$$\mathop{\sum }\limits_{h=1}^{3}{\beta }_{h}\,=1$$

$${{\epsilon }}_{{\rm{ctr}}}\sim {\rm{iid}}\,{\rm{Normal}}(0,\,{\gamma }^{2})$$

$${{\epsilon }}_{i,t}\sim {\rm{iid}}\,{\rm{Normal}}(0,\,{\sigma }^{2})$$

$${\bf{Z}}\sim {\rm{GP}}(0,{\Sigma }^{{\rm{space}}}\otimes {\Sigma }^{{\rm{time}}})$$

$${\Sigma }^{{\rm{space}}}=\,\frac{{\omega }^{2}}{\varGamma (

u ){2}^{v-1}}\times {(\kappa D)}^{

u }\times {{\rm K}}_{

u }(\kappa D)$$

$${\Sigma }_{j,\,k}^{{\rm{time}}\,}={\rho }^{|k-j|}$$

For indices d, i, and t, *(index) is the value of * at that index. The probabilities p i,t represent both the annual proportions at the space–time location and the probability that an individual had that level of attainment given that they lived at that particular location. The annual probability p i,t of each indicator (or μ i,t for the mean indicators) was modelled as a linear combination of the three submodels (GAM, BRT, and lasso regression), rasterized covariate values X i,t , a correlated spatiotemporal error term Z i,t , country random effects \({{\epsilon }}_{{\rm{ctr}}(i)}\) with one unstructured country random effect fit for each country in the modelling region and all sharing a common variance parameter, γ2, and an independent nugget effect \(\,{{\epsilon }}_{i,t}\) with variance parameter σ2. Coefficients β h in the three submodels h = 1, 2, 3 represent their respective predictive weighting in the mean logit link, while the joint error term Z i,t accounts for residual spatiotemporal autocorrelation between individual data points that remains after accounting for the predictive effect of the submodel covariates, the country-level random effect \({{\epsilon }}_{{\rm{ctr}}(i)}\), and the nugget independent error term, \(\,{{\epsilon }}_{i,t}\). The purpose of the country-level random effect is to capture spatially unstructured, unobserved country-specific variables, as there are often sharp discontinuities in educational attainment between adjacent countries due to systematic differences in governance, infrastructure, and social policies.

The residuals Z i,t are modelled as a three-dimensional Gaussian process (GP) in space–time centred at zero and with a covariance matrix constructed from a Kronecker product of spatial and temporal covariance kernels. The spatial covariance Σspace is modelled using an isotropic and stationary Matérn function36, and temporal covariance Σtime as an annual autoregressive (AR1) function over the 18 years represented in the model. In the stationary Matérn function, Γ is the Gamma function, K v is the modified Bessel function of order v > 0, κ > 0 is a scaling parameter, D denotes the Euclidean distance, and ω2 is the marginal variance. The scaling parameter, κ, is defined to be \(\kappa =\sqrt{8v}/\delta \) where δ is a range parameter (which is about the distance for which the covariance function approaches 0.1) and v is a scaling constant, which is set to 2 rather than fit from the data37,38. This parameter is difficult to reliably fit, as documented by many other analyses37,39,40 that set this parameter to 2. The number of rows and the number of columns of the spatial Matérn covariance matrix are equal to the number of spatial mesh points for a given modelling region. In the AR1 function, ρ is the autocorrelation function (ACF), and k and j are points in the time series where |k − j| defines the lag. The number of rows and the number of columns of the AR1 covariance matrix are equal to the number of temporal mesh points (18). The number of rows and the number of columns of the space–time covariance matrix, Σspace ⊗ Σtime, for a given modelling region are equal to: the number of spatial mesh points × the number of temporal mesh points.

This approach leveraged the residual correlation structure of the data to more accurately predict estimates for locations with no data, while also propagating the dependence in the data through to uncertainty estimates41. The posterior distributions were fit using computationally efficient and accurate approximations in R-integrated nested Laplace approximation (INLA) with the stochastic partial differential equations (SPDE) approximation to the Gaussian process residuals using R project version 3.5.142,43,44,45. The SPDE approach using INLA has been demonstrated elsewhere, including the estimation of health indicators, particulate air matter, and population age structure10,11,46,47. Uncertainty intervals were generated from 1,000 draws (that is, statistically plausible candidate maps)48 created from the posterior-estimated distributions of modelled parameters. Additional details regarding model and estimation processes can be found in the Supplementary Information.

To transform grid cell-level estimates into a range of information that is useful to a wide constituency of potential users, these estimates were aggregated from the 1,000 candidate maps up to district, provincial, and national levels using 5 × 5-km2 population data49. This aggregation also enabled the calibration of estimates to national GBD estimates for 2000–2017. This was achieved by calculating the ratio of the posterior mean national-level estimate from each candidate map draw in the analysis to the posterior mean national estimates from GBD, and then multiplying each cell in the posterior sample by this ratio. National-level estimates from this analysis with GBD estimates can be found in Supplementary Table 44.

To illustrate how subnational progress has contributed differentially to national progress (Fig. 3), we decomposed the improvement in the national rate of secondary completion since 2000 for each country into the additive contributions of rate changes at the second administrative level, where C is the national secondary rate change, N is the total number of second-level administrative units, c i is the population proportion in administrative unit i, and r i is the rate of secondary attainment in administrative unit i.

$$C=\,\mathop{\sum }\limits_{i=1}^{N}({c}_{i,2017}{r}_{i,2017})-({c}_{i,2000}{r}_{i,2000})$$

Although the model can predict at all locations covered by available raster covariates, all final model outputs for which land cover was classified as ‘barren or sparsely vegetated’ were masked, on the basis of the most recently available Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data (2013), as well as areas in which the total population density was less than 10 individuals per 1 × 1-km2 pixel in 201550. This step has led to improved understanding when communicating with data specialists and policy-makers.

Model validation

Models were validated using source-stratified fivefold cross-validation. To offer a more stringent analysis by respecting some of the source and spatial correlation in the data, holdout sets were created by combining sets of data sources (for example, entire survey- or census-years). Model performance was summarized by the bias (mean error), total variance (root-mean-square error) and 95% data coverage within prediction intervals, and the correlation between observed data and predictions. All validation metrics were calculated on the predictions from the fivefold cross-validation. Where possible, estimates from these models were compared against other existing estimates. Furthermore, measures of spatial and temporal autocorrelation pre- and post-modelling were examined to verify correct recognition, fitting, and accounting for the complex spatiotemporal correlation structure in the data. All validation procedures and corresponding results are provided in the Supplementary Information.

Limitations

Our analysis is not without several important limitations. First, almost all data collection tools conflate gender and sex and we therefore do not capture the full distribution of sex or gender separately in our data. We refer throughout to the measurement of ‘gender (in)equality’, following the usage in SDG 5. Second, it is extremely difficult to quantify quality of education on this scale in a comparable way. Quality is ultimately a large part of the SDG agenda and of utmost importance to achieving equity in opportunity for social mobility. However, many studies across diverse low- and middle-income settings have linked attainment, even very low levels, to measurable improvement in maternal and child health17. As our analysis highlights with the proportional indicators, there are still many subnational regions across the world where large proportions do not complete primary school. A third limitation is that we are unable to measure or account for migration. A concept note released from the forthcoming Global Education Monitoring Report 2019 focuses on how migration and displacement affects schooling51. Our estimates of the modelled outcome, educational attainment for a particular space–time–age–sex, are demonstrated to be statistically unbiased (Supplementary Information section 4.3); however, interpretation of any change in attainment as a change in the underlying education system could potentially be biased by the effects of migration. It is possible that geographical disparities reflect changes in population composition rather than changes in the underlying infrastructure or education system. Pathways for this change are complex and may be voluntary. Those who manage to receive an education in a low-attainment area may have an increased ability to migrate and choose to do so. This change may also be involuntary, particularly in politically unstable areas where displacement may make geographical changes over time difficult to estimate. A shifting population composition is a general limitation of many longitudinal ecological analyses, but the spatially granular nature of the analyses used here may be more sensitive to the effects of mobile populations.

Our analysis is purely predictive but draws heavily in its motivation from a rich history of literature on the role of education in reducing maternal mortality, improving child health, and increasing human capital. Studies have also demonstrated complex relationships between increased education and a myriad of positive health outcomes, such as HIV risk reductions and spillover effects to other household members52,53. The vast majority of these studies are associational and recent attempts at causal analyses have provided more-mixed evidence54,55,56. Although causal analyses of education are very difficult and often rely on situational quasi-experiments, associational analyses using the most comprehensive datasets demonstrate consistent support for the connection between education and health17,57. Looking towards future analyses, it will be important to study patterns of change in these data and how they overlap with distributions of health. Lastly, our estimates cannot be seen as a replacement for proper data collection systems, especially for tracking contemporaneous change. Our analysis of uncertainty at a high-resolution may be used to inform investment in more robust data systems and collection efforts, especially if the ultimate goal is to measure and track progress in the quality of schooling.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.