No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.

Overview

Our study follows the Guidelines for Accurate and Transparent Health Estimates Reporting (GATHER)29. We use a multi-stage model to estimate the average years of schooling, and the single-year distribution of educational attainment, for 1970 to 2018, and create projections to 2030. These models draw on a database of 3,180 nationally representative censuses and surveys. Estimates are created for the 195 counties and territories examined in the Global Burden of Disease 2017 study30. In the first stage, we model mean years of schooling and the proportion of the population without any formal schooling from 1970 to 2018. This is performed using a cohort extrapolation model and a subsequent age period model with Gaussian process regression to synthesis all data and create final estimates with uncertainty. The second stage entails an ensemble K-nearest neighbours algorithm to estimate the distribution of education from 1970 to 2018, drawing on previously estimated quantities. Finally, trends in these distributions are projected to 2030 using a rate of change approach, and mean years of schooling values for 2019–2030 are calculated from the resulting distributions. All analyses are run using 1,000 draws to propagate model and data uncertainty through to subsequent steps. All estimation steps are validated, and all hyper-parameters are optimized, using out of sample predictive validity.

Data sources

We compiled a database of 3,180 nationally representative surveys and censuses describing the distribution of years of schooling by age and sex. Data sources providing single years of schooling are used directly, while those providing larger bins of educational attainment, for example ‘some primary attainment’ are probabilistically split into single-year proportions using a previously published crosswalk model31. Data are top-coded to 18 years, as it is a common choice among providers of single-year education data32, and it is reasonable to assume that the importance of education for health or social capital diminishes greatly after the completion of 18 years, which represents 2 to 3 years of post-university education in most educational systems.

Data adjustment model

Data are adjusted for systematic biases between data providers in a regional and location-specific fashion. Gold-standard data are identified using expert knowledge of the high-volume data providers that have robust processes in place to ensure data quality. In almost all cases, census data obtained from the IPUMS data repository are considered as the gold standard, or Demographic Health Survey data where IPUMS are not available. Supplementary Table 3 lists the location-specific gold-standard data providers. Regional effects are applied to all data to adjust them to the gold standard available in that region. Subsequently, in countries that had gold-standard data available, country-specific effects are used to adjust for within-county biases between data sources. This has the benefit of being able to correct for biases in all countries, even when gold-standard data are not available in that country, using regional effects. Country-specific effects ensure consistent time trends with minimal discontinuities.

We use a mixed-effects regression model with random effects for data provider and nested random effects for data provider within country. This model is run separately for each region, and is formulated as follows:

$${\rm{l}}{\rm{o}}{\rm{g}}{\rm{i}}{\rm{t}}({P}_{Q,A,S,Y,L})={\beta }_{0}+{\beta }_{1}\times {\rm{a}}{\rm{g}}{\rm{e}}+{\beta }_{2}\times {\rm{s}}{\rm{e}}{\rm{x}}+{\beta }_{3}\times {\rm{l}}{\rm{o}}{\rm{c}}{\rm{a}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}}+{\beta }_{4}\,\times {\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}}+{u}_{{\rm{d}}{\rm{a}}{\rm{t}}{\rm{a}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{v}}{\rm{i}}{\rm{d}}{\rm{e}}{\rm{r}}}+{u}_{{\rm{l}}{\rm{o}}{\rm{c}}{\rm{a}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}}:{\rm{d}}{\rm{a}}{\rm{t}}{\rm{a}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{v}}{\rm{i}}{\rm{d}}{\rm{e}}{\rm{r}}}$$

in which P Q,A,S,Y,L is the quantity of interest, either proportion of the population with no education or \(\frac{{\rm{mean}}\,{\rm{years}}\,{\rm{of}}\,{\rm{educational}}\,{\rm{attainment}}}{18}\) for a given age, sex, year, and location. u data provider is a region-specific random effect that captures the average bias between data providers across all countries within that region, and u location:data provider is a nested location-specific random effect that captures the additional bias between a location-specific gold standard (where applicable) and the other sources present in that location.

To calculate source adjustments for each data provider, the u data provider value for each data provider is compared with the regional gold standard, and the difference is applied to all surveys. Subsequently, in locations that have gold-standard data present, u location:data provider effects are applied in the same fashion.

Cohort extrapolation

We use an age-cohort modelling process to project cohorts through time, leveraging the stability of cohort-specific educational attainment after age 25. To model the changes by age within cohorts, we use data from all available cohorts with multiple observations at or after age 25. For each quantity being modelled, we calculated \({y}_{Q,L,S,C,{A}_{x}}\), which is the logit difference of the P Q,A,S,Y,L ′ (the adjusted input data) at time x and at time y, for all possible combinations of repeat cohort observations. We restrict repeat cohort observations to those that are less than or equal to 10 years apart and to those where both observations occur after 1990 to avoid the attribution of differences in measurements to mortality as opposed to advances in survey and census design. In addition, we normalize all repeat cohort observation pairings so that the average change at 65 years of age is 0 to account for systematic bias between survey iterations (such as improvements in sampling). This is similar to other previously described approaches33, in which only excess mortality beyond the age of 65 is considered. This calculation is shown below:

$${y}_{Q,L,S,C,{A}_{x}}={\rm{logit}}({P}_{L,S,C,A,{{\rm{Src}}}_{x}}{}^{{\prime} })-{\rm{logit}}({P}_{L,S,C,A,{{\rm{Src}}}_{y}}{}^{{\prime} })-{{\rm{bias}}}_{L,P,S,C,{\rm{Src}}}$$

in which Q is the quantity being modelled, L is location, S is sex, C is cohort, A is age, Src is data provider, and bias L,P,S,C,Src is the average change for cohorts as they age from 60 to 70 between the two surveys. This is the age period for which we expect the educational attainment of a cohort to be least prone to changes due to migration and mortality, and any changes observed during this period are therefore used as a measure of inherent bias between multiple waves of a survey or census.

These logit differences were examined with respect to several predictor variables. We then modelled the logit difference using a number of linear mixed-effects models, which were evaluated using out-of-sample predictive validity (see Supplementary Information). The best performing model specification is displayed here:

$${y}_{Q,L,S,C,{A}_{x}}=I+{u}_{{\rm{location}}:{\rm{super}}{\rm{region}}}$$

in which I is a natural spline with a knot at age 70 intended to capture the potential nonlinearity in the rate of change of differential mortality by education over age. u location:super region are random intercepts on location, nested within super-regional random intercepts.

Age-period model

Age-period models were fit on all values of \({P}_{Q,A,S,Y,L}{\prime\prime} \), which reflect the adjusted input data after cohort extrapolation, to interpolate data for observed cohorts, and to extrapolate to all parts of the desired time series, producing \({P}_{Q,S,Y,L}{}^{\prime\prime\prime }\), single-year estimates of attainment from 1970 to 2018. Several linear mixed-effects models were used and evaluated using out-of-sample predictive validity (see Supplementary Information). Separately for each sex, and region grouping used in the GBD study, the quantity of interest of the country–age–year-specific population,\({P}_{Q,A,S,Y,L}{}^{\prime\prime\prime }\) was estimated:

$${\rm{l}}{\rm{o}}{\rm{g}}{\rm{i}}{\rm{t}}({P}_{Q,A,S,Y,L}{\prime\prime} )=\,{\beta }_{s,r}+{\delta }_{s,r}{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}}+{I}_{s,r}+\,{\alpha }_{c,a,s}$$

in which \({\beta }_{s,r}\) is a sex- and region-specific intercept; δ s,r captures the linear secular trend for each sex and region; I s,r is a natural spline on age to capture the nonlinear age pattern by sex and region, with knots at 45 and 65 years of age; and \({\alpha }_{c,a,s}\) is a country-sex-specific random intercept.

Gaussian process regression

Gaussian process regression (GPR) was used to ensure final model results are consistent with input data and incorporate model and data uncertainty to produce uncertainty intervals. GPR has been used extensively as a data synthesis tool34. GPR uses a covariance function to smooth the residuals from the age-period model, taking into account the uncertainty in each data point. GPR also synthesizes both data and model uncertainty, in order to produce estimate uncertainty intervals. GPR assumes that the trend in the underlying data follows a Gaussian process, which is defined using a mean function m(∙) and a covariance function Cov(∙). Therefore, separately for each Q quantity being estimated, the location–sex–age–year-specific outcome measures are defined:

$${\rm{logit}}({y}_{Q,L,S,C,A})=\,{g}_{Q,L,S,A,Y}\,+\,{{\epsilon }}_{Q,L,S,A,Y}\,$$

Where the error term is normally distributed:

$${{\epsilon }}_{Q,L,S,A,Y}={\rm{normal}}(0,{\sigma }_{p}^{2})$$

The error variance, \({\sigma }_{p}^{2}\) is composed of the squared standard error of the observed data point, as well as the prediction errors from the age-cohort imputation process. The mean function of the model is defined as the age-period model predictions, as detailed above. The covariance function of the model is derived using a Matérn covariance function, consistent with prior applications of GPR:

$$M(y,y{\prime} )={\sigma }^{2}\frac{{2}^{1-

u }}{\varGamma (

u )}\,{\left(\frac{d(y,y{\prime} )\sqrt{2

u }}{l}\right)}^{

u }{K}_{{\rm{

u }}}\left(\frac{d(y,y{\prime} )\sqrt{2

u }}{l}\right)$$

where d(∙) is a distance function, σ2 is the marginal variance, ν is a smoothness hyper parameter defining the differentiability of the function, l is a link-scale parameter approximately equivalent to the number of years at which two points are no longer correlated, Κ ν is the Bessel function, and Γ(∙) is the gamma function. Similar to previous applications of GPR, we approximate \({\sigma }_{p}^{2}\) as the super-region and sex-specific residual from the mean function, with ν set to 2 and l to 40, to reflect the inherent smoothness of educational attainment trends over time.

Ensemble K-nearest neighbours distribution model

To create a full time-series of distributions of single-years of educational attainment to 2018, we used a K-nearest neighbours algorithm to reconstruct an ensemble distribution for each location–age–sex–year (LASY) combination. To pick K candidate distributions for each LASY combination, we used two modelled entities produced by the above methods, mean educational attainment and proportion of the LASY population with 0 years of schooling, to find the most similar distributions in our database of 3,180 surveys and censuses. The metric used to find the most similar distributions was the Mahalanobis distance:

$${D}_{M}^{i}({H}^{i})=\sqrt{{({H}^{i}-{I}^{{\rm{LASY}}})}^{T}{S}^{-1}({H}^{i}-\,{I}^{{\rm{LASY}}})}$$

in which \({H}^{i}\) is a multivariate vector \(({\rm{logit}}\left(\frac{{{\rm{mean}}}^{i}}{18}\right),\,{\rm{logit}}({{\rm{prop}}}_{0}^{i}))\) corresponding to a survey–age–sex–year entry in our educational database, I LASY is a multivariate vector \(({\rm{logit}}\left(\frac{{{\rm{mean}}}^{{\rm{LASY}}}}{18}\right),\,{\rm{logit}}({{\rm{prop}}}_{0}^{{\rm{LASY}}}))\) repre-senting the modelled entities described above, and S −1 is the covariance matrix between vectors \({\rm{logit}}\left(\frac{{{\rm{mean}}}^{i}}{18}\right)\) and \({\rm{logit}}({{\rm{prop}}}_{0}^{i})\).

For each I LASY, K distributions with the smallest Mahalanobis distances are chosen as candidate distributions for the final ensemble distribution. To collapse K distributions to a final ensemble distribution, we use a weighted average of the candidate distributions based on a location, age, and cohort distance defined as:

$${{\rm{D}}{\rm{i}}{\rm{s}}{\rm{t}}{\rm{a}}{\rm{n}}{\rm{c}}{\rm{e}}}^{i}={({P}_{{\rm{a}}{\rm{g}}{\rm{e}}}\times {{\rm{D}}{\rm{i}}{\rm{s}}{\rm{t}}{\rm{a}}{\rm{n}}{\rm{c}}{\rm{e}}}_{{\rm{a}}{\rm{g}}{\rm{e}}}^{i})}^{\psi }\,+{({P}_{{\rm{c}}{\rm{o}}{\rm{h}}{\rm{o}}{\rm{r}}{\rm{t}}}\times {{\rm{D}}{\rm{i}}{\rm{s}}{\rm{t}}{\rm{a}}{\rm{n}}{\rm{c}}{\rm{e}}}_{{\rm{c}}{\rm{o}}{\rm{h}}{\rm{o}}{\rm{r}}{\rm{t}}}^{i})}^{\psi }+{({P}_{{\rm{s}}{\rm{p}}{\rm{a}}{\rm{c}}{\rm{e}}}\times {{\rm{D}}{\rm{i}}{\rm{s}}{\rm{t}}{\rm{a}}{\rm{n}}{\rm{c}}{\rm{e}}}_{{\rm{l}}{\rm{o}}{\rm{c}}{\rm{a}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}}}^{i})}^{\psi }$$

All values of P and Distance are rescaled to lie between 0.001 and 1. \({{\rm{Distance}}}_{{\rm{location}}}^{i}\) is 0.001 for same country, 0.33 for same region, 0.66 for same super-region, and 1 otherwise.

$${{\rm{Weights}}}^{i}=\frac{1}{{{\rm{Distance}}}^{i}}$$

ψ is a hyperparameter controlling how sharply weights decrease as Distancei increases. To collapse K distributions to a final ensemble distribution for each LASY combination we calculated:

$${{\rm{Proportion}}}_{{\rm{eduyrs}}}^{{\rm{LASY}}}=\frac{{\sum }_{i=1}^{K}{{\rm{Weights}}}^{i}\times {{\rm{proportion}}}_{{\rm{eduyrs}}}^{i}}{{\sum }_{i=1}^{K}{{\rm{Weights}}}^{i}}$$

in which Proportion eduyrs is the proportion in each educational bin, 0–18.

Final ensemble distributions were then smoothed by bin using a Loess smoother with a span of η over time to ensure plausible time series for each draw. All hyperparameters were optimized using out-of-sample predictive validity (detailed in the Supplementary Information), and chosen values include: K = 80; P age = 0.25; P cohort = 0.85; P space = 0.7; ψ = 2.5; η = 0.5.

Rate of change distribution forecasting model

To forecast the distribution of education and mean years of schooling, we use a rate of change (ROC) model at the single-year bin level. This has the benefit of producing projections of mean attainment that respect the nonlinear dynamics of distributional growth. The model is fit in a timeseries-specific fashion, separately by sex and country. For each single-year bin, we derive a ROC using a weighted average of the ROC for the last 15 years:

$${{\rm{ROC}}}_{{\rm{eduyr}}}^{{\rm{LAS}}}=\,\mathop{\sum }\limits_{i=2004}^{2018}\frac{{\rm{logit}}{({{\rm{proportion}}}_{{\rm{eduyr}}})}^{{{\rm{LAS}}}^{i}}-\,{\rm{logit}}{({{\rm{proportion}}}_{{\rm{eduyr}}})}^{{{\rm{LAS}}}^{i-1}}}{15}$$

Where \({{\rm{ROC}}}_{{\rm{eduyr}}}^{{\rm{LAS}}}\) is the average rate of change over the last 15 years within each location–age–sex (LAS) combination for each single-year bin of education (0–18).

The ROC model was leveraged only where the cohort extrapolation model could not inform our estimates. This begins in 2019 for 25–29-year-olds, 2024 for 30–34-year-olds, and 2029 for 35–40-year-olds. For the results presented in the main text, for 25–29-year-olds, this method was used for 2019 onwards.

SDG progress and inequality metrics

Drawing on these estimates of the distribution of years of schooling, we calculate several metrics detailing global progress towards the SDG 4 targets. We calculate the proportion of the population of individuals age 25–29 who have completed primary, secondary, and tertiary education, defined as completing at least 6, 12, and 15 years of schooling, respectively. We describe gender equality using the ratio of female to male attainment of primary and secondary education, as well as the gap in mean years of schooling between men and women. Aggregate measures at the national level for both sexes, and at the regional level were calculated, using projected population estimates drawn from the World Population Prospects dataset35. We also present a novel index of educational inequality among young people in each country, the average AID. This index is defined as the average value of the absolute differences between all possible pairs of individuals in the population. The AID is also mathematically equivalent to the Gini coefficient, multiplied by two times the mean of the distribution36.

Predictive validity

The main aims of this analysis are predictive in nature, and we therefore assessed each stage of our model, and each model selection decision, with respect to predictive capacity. We focused mainly on ‘out-of-sample’ predictive ability, which reflects how well the model predicts data that was not directly available. This most mimics the true task that we want our model to accomplish, that is, to make accurate predictions for the geographies and time periods that do not have input data available. To assess out-of-sample predictive validity, we followed the general strategy of dividing our database into ‘training’ and ‘testing’ data. The model was fit on the training data, and the results were compared with the testing data. The ‘error’ of the model represents the average amount that our model was incorrect compared with the ‘true’ data that was held out. Each step of the modelling process was assessed for how well it predicted (out-of-sample) the mean years of schooling for a given population, as well as other aspects of the distribution, such as the proportion with 0 years of schooling. We also assessed the degree to which predictive validity varied by time period, across regions, and by which type of data source was held out. There were small differences in predictive validity across these dimensions, for example, models tended to perform slightly better in the 2000–2018 period where the most data are available; however, they were generally modest. Furthermore, we found that the best performing models tended to perform optimally across almost all geographies/time periods, so it was not necessary to use multiple models for a single step. All predictive validity results, and a discussion of their implications, can be found in the Supplementary Information.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.