Probabilistic forecast system

To make climate predictions, we developed a PRObabilistic foreCAST system (PROCAST) based on transfer operators. This method has been successfully used in a wide range of scientific studies, from statistical physics6 to geophysical fluid mechanics7,8,9. The basic principle behind the use of transfer operators is a statistical approach to rationalizing the chaotic behavior of the system. The transfer operators gather the information from all known previous state transitions (or trajectories in phase space), allowing the evolution of the system from its current state to new states to be computed in a probabilistic manner. In this sense, it can be related to analog methodologies10,11,12. Here the climate state is evaluated through the one-dimensional phase space defined by either GMT or SST, whereas the state transitions are based on GMT or SST evolutions simulated by climate models from the CMIP5 database (see Methods for further details).

The evolution of the annual GMT and SST anomalies for the 10 climate models is estimated from the temperature anomaly relative to the ensemble mean in each individual climate model ensemble (see Methods for further details). This procedure separates internal variability from the forced signal. The anomaly of GMT or SST in the observational record is computed by removing the part that can be attributed to external forcing (see Methods for further details).
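The separation of forced signal and internal variability described above can be sketched in a few lines. This is an illustrative sketch only: the array shapes and the function name are our own assumptions, not taken from the original study.

```python
import numpy as np

def internal_variability(ens):
    """Split ensemble members into a forced signal plus internal variability.

    ens: hypothetical array of shape (n_members, n_years), temperature in K,
    with all members coming from one climate model's ensemble. The ensemble
    mean across members is taken as the forced response; subtracting it from
    each member leaves that member's internal variability.
    """
    forced = ens.mean(axis=0)   # forced component, shape (n_years,)
    anomalies = ens - forced    # internal variability of each member
    return forced, anomalies
```

By construction, the anomalies average to zero across members in every year, so the internal variability carries no residue of the forced trend.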

The CMIP5 GMT and SST anomalies consist of centered distributions with standard deviations of the annual mean of 0.1 and 0.07 K, respectively. For GMT, the modeled standard deviation is slightly weaker than the observed one (0.12 K), but remains in good agreement, with a relative difference of <9%. For SST, on the other hand, the standard deviation of the distribution in CMIP5 is significantly weaker than in the observations (0.13 K), with a relative difference of 43%. Hence, for SST, the modeled distribution is renormalized to fit the standard deviation of the observations.
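The renormalization step amounts to a single rescaling. The numbers in the comment are those quoted above (modeled SST standard deviation of 0.07 K vs. an observed 0.13 K); the function name is ours, used for illustration.

```python
import numpy as np

def renormalize(model_anom, obs_std):
    """Rescale modeled anomalies so their standard deviation matches the
    observed one (here used for SST: 0.07 K rescaled to 0.13 K)."""
    return model_anom * (obs_std / model_anom.std())
```

The rescaling changes only the amplitude of the modeled distribution, not its shape or the relative ordering of the trajectories.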

To define the transfer operators, we split the phase space into 24 states and computed the probability of transition between these states based on the ~10 000 transitions provided by the CMIP5 database (see Fig. 2 and Methods for further details). The transfer operators are thus a natural tool for conditional probability forecasts: we evaluate the probability of future system states given that the system is in its current state. To gain insight into the method, we applied the transfer operators to the 24 possible states. This shows that for all possible initial conditions the probability density function slowly converges to the total density distribution with a timescale of ~10 years (see Fig. 3 for an example of this asymptotic convergence). This is accomplished through the slow convergence of the mean and of the spread of the probability distribution. However, the computed probability density distribution is more intricate than can be described by the evolution of its mean and standard deviation alone; within the information restricted to the reduced phase space of GMT or SST, the method allows us to follow the evolution of the complete probability density function over time (Fig. 3).
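The asymptotic convergence described above can be illustrated with a toy operator: repeatedly applying a row-stochastic transition matrix to a delta initial condition drives the probability density toward the climatological (stationary) distribution. The 6-state matrix below is synthetic, not one of the operators actually used in PROCAST.

```python
import numpy as np

# Build a synthetic, ergodic 6-state transfer operator.
rng = np.random.default_rng(0)
P = rng.random((6, 6)) + 0.1       # strictly positive entries -> ergodic chain
P /= P.sum(axis=1, keepdims=True)  # rows sum to 1: transition probabilities

p = np.zeros(6)
p[0] = 1.0                         # start with certainty in one state
for _ in range(50):                # apply the lag-1 operator repeatedly
    p = p @ P                      # forward evolution of the pdf

# Stationary (climatological) distribution: left eigenvector of P
# associated with eigenvalue 1, normalized to sum to 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()
```

After enough iterations, `p` becomes indistinguishable from `pi`, mirroring the ~10-year convergence of the real operators toward the climatological distribution.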

Fig. 2 Schematic of the Transfer Operator method. The computation of the transfer operators follows four steps: first, the predicted metric (GMT or SST) is split into a number of different states based on its intensity (8 in the schematic, 24 for the real forecast system); then, the number of trajectories from the CMIP5 database in each state is counted (N_i, blue numbers); third, for each state, the number of corresponding trajectories found after the prediction delay is counted (n_i,j, red numbers); finally, the probability of transition from the ith state to the jth is given by n_i,j/N_i. This sequence is repeated for 10 different prediction delays and 10 trajectory-averaging times (both varied from 1 to 10 years, in 1-year steps), building the range of transfer operators required for prediction

Fig. 3 Example of a probabilistic interannual prediction for 5-year GMT anomalies using the transfer operators. Evolution of the predicted density distribution of GMT for hindcast lags of a 1 year, b 2 years, c 5 years, and d 10 years (red histogram). Vertical blue lines indicate the initial condition; the gray histograms in the background represent the asymptotic, climatological distribution; vertical black lines correspond to the mean, ±1, and ±2 standard deviations of the asymptotic, climatological distribution

This method is valid under four assumptions: GMT or SST information is sufficient for GMT or SST prediction; all model trajectories are statistically equivalent; the common component of each model ensemble reflects its forced component; and the statistics of the anomalies are stationary under a changing climate (these assumptions are further described in the Methods). In particular, the severe truncation of the phase space to a single variable implies that different climate states with equivalent GMT or SST are all aggregated in the probabilistic approach of the transfer operators. However, as will be fully described later, PROCAST is skillful for interannual GMT and SST prediction despite these four assumptions. This confirms, a posteriori, that although the assumptions are not strictly true, they are reasonable given the aims of our study.

Evaluation metrics and perfect model approach

Now that the transfer operators are defined and their general behavior described, we first estimate the predictive skill of our method in a perfect model framework. This means that we predict trajectories of the numerical models rather than real observations. This allows us to estimate the predictive skill in the best possible scenario, since it avoids the intrinsic bias or error between models and observations. The probabilistic distribution is computed for up to 10-year lags (1–10 years ahead) and for annual to 10-year averaged data.

To estimate the validity of our probabilistic predictions we use two different measures: the coefficient of determination—R2, which shows the skill of the mean prediction; and the reliability, which measures the accuracy of the spread in the prediction. These two measures can be mathematically expressed as:

$$R^2 = 1 - \frac{\overline{\left(\overline{x_i p_i(t)}^{\,i} - o(t)\right)^2}^{\,t}}{\overline{o(t)^2}^{\,t}},$$ (1)

$$\mathrm{Reliability} = \sqrt{\overline{\left\{\frac{\left(\overline{x_i p_i(t)}^{\,i} - o(t)\right)^2}{\overline{\left[x_i - \overline{x_i p_i(t)}^{\,i}\right]^2 p_i(t)}^{\,i}}\right\}}^{\,t}},$$ (2)

where t is time, i is the possibility or state index, o(t) is the observation, and x_i are the predicted possibilities with probability p_i(t). The bar denotes an average over time or a sum over possibilities, depending on the superscript. (Our equation for the reliability is an extension to non-stationary statistics of the previously suggested definition13.) The coefficient of determination, when multiplied by 100, gives the percentage of the variance of the observations explained by the prediction. Since the system is chaotic (there is a degree of uncertainty around the mean prediction), the prediction cannot be expected to represent the observation perfectly, even if the model represents reality perfectly. Hence, the reliability measures the accuracy of this prediction error. When a reliable prediction system has high skill (~1), we expect the prediction uncertainty to be small. On the other hand, when a reliable prediction system has low skill (~0), we expect the prediction uncertainty to be as big as the observed variance. In this context, and regardless of its skill, a reliable prediction system always needs to have a reliability close to 1. Hence, although a high value of R2 is preferable for a skillful prediction, the reliability is arguably more important for estimating the usefulness of the prediction system. Indeed, a reliable prediction system can be used for probabilistic forecasts and risk assessments, even if it has low skill14,15.
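Eqs. (1) and (2) translate directly into array operations. The variable names below mirror the symbols in the text (x for the state values x_i, p for the probabilities p_i(t), o for the observations o(t)); the array shapes are our own convention for this sketch.

```python
import numpy as np

def r2_and_reliability(x, p, o):
    """Compute Eqs. (1)-(2).

    x: state values, shape (n_states,)
    p: forecast probabilities, shape (n_times, n_states), rows sum to 1
    o: verifying observations (anomalies), shape (n_times,)
    """
    mean_pred = p @ x                          # \overline{x_i p_i(t)}^i
    err2 = (mean_pred - o) ** 2                # squared error of the mean
    r2 = 1.0 - err2.mean() / (o ** 2).mean()   # Eq. (1)
    # probability-weighted variance of the forecast around its mean
    spread2 = ((x[None, :] - mean_pred[:, None]) ** 2 * p).sum(axis=1)
    reliability = np.sqrt((err2 / spread2).mean())  # Eq. (2)
    return r2, reliability
```

As a sanity check, a forecast with no skill but a correct spread (mean prediction of zero, spread equal to the observed variance) returns R² ≈ 0 and reliability ≈ 1, matching the discussion above.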

To evaluate the quality of our prediction system, skill and reliability are computed for all lags and averaging times. For comparison, we use persistence as our null hypothesis (i.e., initial values in our hindcast for all hindcast times). Within the perfect model approach, all the trajectories in the selected models of the CMIP5 database have been tested. This reveals two main results: PROCAST is able to surpass persistence for all averaging timescales and hindcast lags for GMT and SST; and prediction skills are often larger than 0.5 on annual to interannual timescales (Fig. 4). It also reveals the excellent reliability of PROCAST (as expected within a perfect model approach), with a value of 1 (within 6% and 3% error for GMT and SST, respectively) for all averaging timescales and hindcast lags (Fig. 4c, d).
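The persistence benchmark (the initial value held fixed over the hindcast lag) can be scored with the same R² definition of Eq. (1). This minimal version assumes a single observed anomaly series as input; the function name is ours.

```python
import numpy as np

def persistence_skill(o, lag):
    """R^2 of the persistence null hypothesis for a given hindcast lag.

    The persistence forecast for time t is simply the observed anomaly
    at the initialization time t - lag, held fixed.
    """
    pred = o[:-lag]     # anomaly at each initialization time
    obs = o[lag:]       # verifying observation 'lag' steps later
    return 1.0 - ((pred - obs) ** 2).mean() / (obs ** 2).mean()
```

A prediction system only adds value where its R² exceeds this persistence score, which is the comparison shown by the hatching in Fig. 4.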

Fig. 4 Interannual hindcast skills of GMT and SST within a perfect model approach. a, b Skill of the prediction measured by the coefficient of determination—R2—between model observations and mean prediction for different hindcast lags and averaging times. The coefficient of determination between model observations and persistence (i.e., the null hypothesis of prediction) is also computed to give a benchmark. Thick contour lines represent values of 0.5, thin dashed—lower skill, and thin solid—higher skill, with contour intervals of 0.1; hatching shows skill lower than persistence. c, d Reliability of the prediction for different hindcast lags and averaging times. Note the absence of hatched regions in a and b, denoting better skill than persistence for all hindcast lags and averaging times. Also note the flat pink color in c and d, corresponding to a good reliability close to 1 for all hindcast lags and averaging times, as expected in a perfect model approach. e, f Difference between the coefficient of determination for the hindcasts with observations and within a perfect model approach (a and b vs. Fig. 5c, d for GMT and SST). For all hindcast lags and averaging times, the skill is better for observations than for the perfect model approach. Thick black lines are for zero values, thin black lines are positive values with a contour interval of 0.05

To further test the predictive skill and reliability of PROCAST, we have assessed them in an imperfect model approach (i.e., removing the outputs of one model from the transfer operator computation and using them as pseudo-observations). We find that PROCAST still performs at the same level of accuracy as within the perfect model approach, with a slight decrease of the coefficient of determination of less than 0.01 for all lags and averaging times tested.

Hindcast skills and predictions of the post-1998 hiatus

After having tested PROCAST in a perfect model setting, we now test the exact same system with real observations. (Note that no retuning was applied before going to observations.) We reproduce the skill analysis with the observed internal variability, estimated as anomalies from the forced component in GMT and SST (Fig. 1). For this purpose we computed retrospective predictions of the past, or hindcasts, from 1880 to 2016. This procedure allows a full estimate of the predictive ability of our prediction system in the most realistic and operational setting. (Examples of hindcasts for 5-year averages are shown in Fig. 5a, b.)

Fig. 5 Interannual hindcast skills of the observed GMT and SST. Example for 5-year average data of a, b GMT and SST observations (blue) with mean prediction for 1- to 10-year hindcast lags (from black to yellow lines, respectively). c, d Skill of the prediction measured by the coefficient of determination—R2—between observations and mean prediction for different hindcast lags and averaging times. The coefficient of determination between observations and persistence (i.e., the null hypothesis of prediction) is also computed to give a benchmark. Thick contour lines represent values of 0.5, thin dashed—lower skill, and thin solid—higher skill, with contour intervals of 0.1; hatching shows skill lower than persistence. e, f Reliability of the prediction for different hindcast lags and averaging times. Thick red, black, and white contour lines represent values of 0.8, 1, and 1.2, respectively; thin black dashed and solid lines represent values lower and higher than 1, respectively, with contour intervals of 0.1. Note the better skill than persistence (almost no hatched region) and the good reliability close to 1 for all hindcast lags and averaging times

As before, the probabilistic distribution is computed for up to 10-year lags and for annual to 10-year averaged data. Similar to the perfect model approach, skill values decrease from 1 to 0 depending on the lag (Fig. 5). Focusing on a 5-year mean prediction of SST, the skill at a 5-year lag is still ~30% (Fig. 5d), suggesting our ability to accurately forecast part of the SST variations over the next 5 years. More generally, the skill is always better than persistence (except for a few averaging times and hindcast lags of SST). The reliability also remains close to 1, even when the skill is low, suggesting that even for low skills we are sampling an accurate range of possible future states. The same holds for predictions of GMT, but with slightly reduced skill compared to SST. This demonstrates the usefulness of PROCAST for probabilistic predictions.

The root mean square error averaged over hindcast lags from 1 to 9 years for annual GMT is 0.105 K. This is strictly identical to the value reported for DePreSys in 2007 (the operational Decadal Prediction System of the Met-Office)16. When averaged over hindcast lags from 1 to 5 years, the root mean square error of PROCAST is 0.104 K, whereas it is 0.151 K for the latest version of DePreSys (DePreSys3, D. Smith, personal communication). This indicates that PROCAST is 37% more accurate than DePreSys3 for interannual predictions of GMT. This comparison is slightly biased, however, since DePreSys predicts the absolute temperature, whereas we only predict the anomaly from the forced part. However, the forced part of the variability is arguably the most predictable, since the external forcing is imposed accurately during hindcasts. Furthermore, the accurate probabilistic approach of PROCAST also differs from DePreSys, which can only diagnose a probabilistic prediction from a relatively small ensemble (~10 members), limiting its statistical accuracy and overall reliability. More specifically, the PROCAST reliability for annual GMT is almost perfect, with an average value over hindcast lags from 1 to 5 years of 1 (the prediction spread is, on average, as big as the prediction error), whereas it is 2.3 for DePreSys3 (the prediction spread is, on average, 2.3 times smaller than the prediction error). This relatively weak reliability of DePreSys3 is a sign of the under-dispersion of the ensemble in comparison with its prediction error, and of an over-confident prediction system. Hence, PROCAST appears to be better suited for probabilistic predictions and risk assessments of extremes. Finally, and probably most importantly, the numerical costs are not comparable: a 10-year forecast using PROCAST takes 22 ms and can be run on a laptop almost instantaneously.
On the other hand, a 10-year forecast using DePreSys3 (which corresponds to a 10-member ensemble) takes a week on the Met-Office supercomputer, accessible only to a small number of scientists. This difference in numerical cost has to be put into perspective, though. PROCAST takes advantage of the freely available CMIP5 database, which is an incredibly expensive numerical exercise of the worldwide climate science community. Also, unlike PROCAST, DePreSys3 is not specifically trained for a single-variable prediction, so that the entire climate state is predicted in one forecast. This is obviously beneficial.

To further identify the usefulness of PROCAST, we tested its prediction skill for the recent global warming hiatus17. We define this recent hiatus as the post-1998 decade of cooling seen in the GMT anomaly (Fig. 1e). This cooling totally offsets the forced warming (Fig. 1c), leading to a plateau in the observed, total warming (Fig. 1a). PROCAST is indeed able to reproduce the decade-long cooling anomaly (Fig. 6) for all averaging timescales tested (1–10 years). The coefficient of determination is 0.52 on average (for averaging timescales ranging from 1 to 5 years) and can be as high as 0.6 or 0.64 for 1- and 5-year time averages, respectively. This suggests that 60% and 64% of the annual and 5-year variations, respectively, are accurately predicted a decade ahead. For these two examples the correlation coefficient is also high, with values of 0.63 and 0.64. Despite some error in its exact intensity (especially when focusing on the mean prediction) or in the details of its annual variations, this shows that an event such as the post-1998 hiatus could not have been missed using PROCAST (especially when acknowledging the predictive spread). In particular, our probabilistic forecast framework shows that a decade-long hiatus was always a likely outcome (always well within 1 standard deviation of our prediction), even if not the most likely one, especially after 7 years. Because the predicted amplitude is somewhat lower than observed, this would be consistent with a small part of the hiatus being caused by external forcing, while the main part would be due to internal variability18,19,20,21. This is a significant achievement, since the recent hiatus can be considered a statistical outlier22,23, and only a few of the CMIP5 models simulated such a strong and long pause in global warming24. This places PROCAST among the state-of-the-art prediction systems that have been able to retrospectively predict the recent global warming hiatus25,26.
Other starting dates have been tested (such as 2002); PROCAST always captures the long-term hiatus, but to a lesser degree the exact interannual variations within the decade.

Fig. 6 Predictions of the post-1998 hiatus observed through GMT anomalies. Observation and prediction of GMT anomalies averaged over a 1 year, b 2 years, and c 5 years. Hiatus is defined as the post-1998 decade (blue line) showing a decrease of GMT anomalies partially or totally offsetting the forced trend. Decadal prediction means and standard deviations (red circles and vertical lines, respectively) are obtained using PROCAST initialized in 1998 (red squares). Prediction skills are computed through the coefficient of determination (R2) and the correlation coefficient (CC) over the post-1998 decade. d–f are equivalent to a–c with the addition of the observed component attributed to forcing (purple lines)

The predictive skill of PROCAST is consistently better for real-world predictions (Fig. 4e, f) than in the perfect model approach. In particular, the skill is improved by up to 30% on interannual timescales, and is especially better for SST. This behavior has been previously reported for other variables and prediction systems27,28,29 and seems to be related to a weaker signal-to-noise ratio in models than in observations.

It is also interesting to note that skill and reliability are improved by the addition of information from more models rather than by selecting a subset of the best models (i.e., models giving the best skill when used alone to train the Transfer Operator). This means that the transfer operators built with only the best models have lower skill than the transfer operators built with all 10 climate models. Moreover, removing any single model from the set of 10 does not significantly lower the skill, suggesting that convergence has been reached when using 10 climate models.

Our analysis also shows that SST has better skill than GMT for all tested hindcast lags and averaging times. This suggests that the ocean is improving the hindcast skill and is more predictable than the continental surface temperature encompassed in GMT. This result is consistent with previous analyses suggesting the ocean as a source of predictability and limiting the continental predictability to marine-influenced regions30,31.

Forecasting the future

After the skill of the method has been assessed, we compute a probabilistic forecast of GMT and SST from 2018 to 2027 (Figs. 7 and 8), focusing on three averaging times: 1-, 2-, and 5-year averages. We also considered longer averaging times, such as a 10-year averaging period, but these all show forecast probabilities almost equivalent to the climatological probability (Table 1). Similarly, for long enough lags or forecast times (i.e., a decade) the predictions always converge to the climatological values (Figs. 7 and 8a–c).

Fig. 7 Interannual probabilistic forecast of the GMT anomalies. a–c Mean prediction of GMT anomaly for the next 5 years for three different averaging times: annual, 2, and 5 years; horizontal black lines correspond to the mean and ±1 standard deviations of the climatological distribution. Circles and squares represent mean predictions with coefficients of determination larger and smaller than 0.2, respectively (note that we still have good reliability even for skills smaller than 0.2); the colorscale represents the mean prediction of the GMT anomaly; the vertical colored lines represent ±1 standard deviation of the prediction distributions. Stars denote the most likely state from the distributions. d–f Prediction of GMT distribution (in %) for 1, 2, and 5 years in advance with respect to 2017. Gray histograms in the background represent the asymptotic, climatological distribution; vertical blue lines represent the current position used to initialize the forecast system; vertical black lines correspond to the mean, ±1, and ±2 standard deviations of the climatological distribution. g–i Distribution of probability anomaly (in %), i.e., probability changes with respect to the climatological distribution, for the 1, 2, and 5 years in advance predictions. The background colorscale represents ±0–1, 1–2, and more than 2 standard deviations, respectively, consistent with moderate, intense, and extreme events

Fig. 8 Interannual probabilistic forecast of the SST anomalies. a–c Mean prediction of SST anomaly for the next 5 years for three different averaging times: annual, 2, and 5 years; horizontal black lines correspond to the mean, ±1, and ±2 standard deviations of the climatological distribution. Circles and squares represent mean predictions with coefficients of determination larger and smaller than 0.2, respectively (note that we still have good reliability even for skills smaller than 0.2); the colorscale represents the mean prediction of the SST anomaly; the vertical colored lines represent ±1 standard deviation of the prediction distributions. Stars denote the most likely state from the distributions. d–f Prediction of SST distribution (in %) for 1, 2, and 5 years in advance with respect to 2017. Gray histograms in the background represent the asymptotic, climatological distribution; vertical blue lines represent the current position used to initialize the forecast system; vertical black lines correspond to the mean, ±1, and ±2 standard deviations of the climatological distribution. g–i Distribution of probability anomaly (in %), i.e., probability changes with respect to the climatological distribution, for the 1, 2, and 5 years in advance predictions. The background colorscale represents ±0–1, 1–2, and more than 2 standard deviations, respectively, consistent with moderate, intense, and extreme events

With shorter time averages, our probabilistic forecast for 2018 (based on data up to 2017) suggests a higher likelihood of warm events for both GMT and SST (Figs. 7 and 8d), with a 58% and 75% probability, respectively, of temperatures higher than predicted by the forcing alone. This corresponds to an expected warm anomaly of 0.02 and 0.07 K for GMT and SST, which would reinforce the forced warming trend. To describe the expected warm event in greater detail, we can classify the temperature anomaly as moderate (lower than 1 standard deviation), intense (between 1 and 2 standard deviations), or extreme (bigger than 2 standard deviations). This classification suggests that moderate warm events are the most likely for 2018 GMT and SST. This can be further diagnosed by looking at the relative changes of probability from the climatological probability (Figs. 7 and 8g). This suggests that intense and extreme cold events have the lowest risk in 2018 (i.e., the largest change of occurrence compared to climatology), with a decrease in occurrence of more than 60% and 100% for GMT and SST, respectively.
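The moderate/intense/extreme classification is a simple thresholding on the climatological standard deviation; a minimal sketch (function name ours):

```python
def classify(anomaly, clim_std):
    """Classify an anomaly by its magnitude relative to the climatological
    standard deviation: moderate (<1 sigma), intense (1-2 sigma),
    extreme (>2 sigma)."""
    s = abs(anomaly) / clim_std
    if s < 1.0:
        return "moderate"
    if s < 2.0:
        return "intense"
    return "extreme"
```

The sign of the anomaly then distinguishes warm from cold events of each class.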

On longer timescales, for the 2018–2019 average, anomalous warm events remain the most likely event (Figs. 7 and 8e), with an expected intensity of 0.03 and 0.07 K for GMT and SST. For this period, intense GMT warm anomalies show the maximum changes in likelihood compared to climatology (Fig. 7h). For SST the maximum change in likelihood is an increase of extreme warm events (Fig. 8h).

For the forecasted 5-year averaged temperatures (i.e., for the period 2018–2022), the predictions differ. For GMT, the predictions suggest a balanced probability between warm and cold events (Fig. 7f) on top of the forced trend, with a small relative reduction of the expected occurrence of extreme cold events and a small relative increase of the expected occurrence of extreme warm events (Fig. 7i). For SST, the forecast still suggests an anomalous warm event for the period 2018–2022, with an expected value of 0.05 K (Fig. 8f). The forecast also suggests a relative increase of the probability of extreme warm events for SST over this period, by up to 400% (Fig. 8i).