Climate models provide an important way to understand future changes in the Earth's climate. In this paper we undertake a thorough evaluation of the performance of various climate models published between the early 1970s and the late 2000s. Specifically, we look at how well models project global warming in the years after they were published by comparing them to observed temperature changes. Model projections rely on two things to accurately match observations: accurate modeling of climate physics and accurate assumptions around future emissions of CO2 and other factors affecting the climate. The best physics‐based model will still be inaccurate if it is driven by future changes in emissions that differ from reality. To account for this, we look at how the relationship between temperature and atmospheric CO2 (and other climate drivers) differs between models and observations. We find that climate models published over the past five decades were generally quite accurate in predicting global warming in the years after publication, particularly when accounting for differences between modeled and actual changes in atmospheric CO2 and other climate drivers. This research should help resolve public confusion around the performance of past climate modeling efforts and increase our confidence that models are accurately projecting global warming.

Retrospectively comparing future model projections to observations provides a robust and independent test of model skill. Here we analyze the performance of climate models published between 1970 and 2007 in projecting future global mean surface temperature (GMST) changes. Models are compared to observations based on both the change in GMST over time and the change in GMST over the change in external forcing. The latter approach accounts for mismatches in model forcings, a potential source of error in model projections independent of the accuracy of model physics. We find that climate models published over the past five decades were skillful in predicting subsequent GMST changes, with most models examined showing warming consistent with observations, particularly when mismatches between model‐projected and observationally estimated forcings were taken into account.

1 Introduction
Physics‐based models provide an important tool to assess changes in the Earth's climate due to external forcing and internal variability (e.g., Arrhenius, 1896; IPCC, 2013). However, evaluating the performance of these models can be challenging. While models are commonly evaluated by comparing “hindcasts” of prior climate variables to historical observations, the development of hindcast simulations is not always independent from the tuning of parameters that govern unresolved physics (Gettelman et al., 2019; Mauritsen et al., 2019; Schmidt et al., 2017). There has been relatively little work evaluating the performance of climate model projections over their future projection period (referred to hereafter as model projections), as much of the research tends to focus on the latest generation of modeling results (Eyring et al., 2019).
Many different sets of climate projections have been produced over the past several decades. The first time series projections of future temperatures were computed using simple energy balance models in the early 1970s, most of which were solely constrained by a projected external forcing time series (originally, CO2 concentrations) and an estimate of equilibrium climate sensitivity from single‐column radiative‐convective equilibrium models (e.g., Manabe & Wetherald, 1967) or general circulation models (e.g., Manabe & Wetherald, 1975). Simple energy balance models have since been gradually sidelined in favor of increasingly high resolution and comprehensive general circulation models, which were first published in the late 1980s (e.g., Hansen et al., 1988; IPCC, 2013; Stouffer et al., 1989). Climate model projections are usefully thought about as predictions conditional upon a specific forcing scenario. We consider these to be projections of possible future outcomes when the intent was to use a realistic forcing scenario and where the realized forcings were qualitatively similar to the projection forcings.
Evaluating model projections against observations subsequent to model development provides a test of model skill, and successful projections can concretely add confidence in the process of making projections for the future. However, evaluating future projection performance requires a sufficient period of time postpublication for the forced signal present in the model projections to be differentiable from the noise of natural variability (Hansen et al., 1988; Hawkins & Sutton, 2012). Researchers have previously evaluated prior model projections from the Hansen et al. (1988) National Aeronautics and Space Administration Goddard Institute for Space Studies model (Hargreaves, 2010; Rahmstorf et al., 2007), the Stouffer et al. (1989) Geophysical Fluid Dynamics Laboratory model (Stouffer & Manabe, 2017), the IPCC First Assessment Report (FAR; IPCC, 1990; Frame & Stone, 2012), and the IPCC Third and Fourth Assessment reports (IPCC, 2001; IPCC, 2007; Rahmstorf et al., 2012). However, to date there has been no systematic review of the performance of past climate models, despite the availability of warming projections starting in 1970. This paper analyzes projections of global mean surface temperature (GMST) change, one of the most visible climate model outputs, from several generations of past models. GMST plays a large role in determining climate impacts, is tied directly to internationally agreed‐upon mitigation targets, and is one of the climate variables that has the most accurate and longest observational records. GMST is also the output most commonly available for many early climate models run in the 1970s and 1980s.
Two primary factors influence the long‐term performance of model GMST projections: (1) the accuracy of the model physics, including the sensitivity of the climate to external forcings and the resolution or parameterization of various physical processes such as heat uptake by the deep ocean and (2) the accuracy of projected changes in external forcing due to greenhouse gases and aerosols, as well as natural forcing such as solar or volcanic forcing. While climate models should be evaluated based on the accuracy of model physics formulations, climate modelers cannot be expected to accurately project future emissions and associated changes in external forcings, which depend on human behavior, technological change, and economic and population growth. Climate modelers often bypass the task of deterministically predicting future emissions by instead projecting a range of forcing trajectories representative of several plausible futures bracketed by marginally plausible extremes. For example, Hansen et al. (1988) consider a low‐emissions extreme Scenario C with “more drastic curtailment of emissions than has generally been imagined,” a high‐emissions extreme Scenario A wherein emissions “must eventually be on the high side of reality,” as well as a middle‐ground Scenario B, which “is perhaps the most plausible of the three.” More recently, the Representative Concentration Pathways (RCPs) used in CMIP5 and the IPCC AR5 report similarly include a number of plausible scenarios bracketed by a low‐emissions extreme Scenario RCP2.6 and a high‐emissions extreme Scenario RCP8.5 (van Vuuren et al., 2011). Thus, an evaluation of model projection performance should focus on the relationship between the model forcings and temperature change, rather than simply assessing how well projected temperatures compare to observations, particularly in cases where projected forcings differ substantially from our best estimate of the subsequently observed forcings.
This approach—comparing the relationship between forcing and temperatures in both model projections and observations—can effectively assess the performance of the model physics while accounting for potential mismatches in projected forcing that climate modelers did not address at the time. In this paper we apply both a conventional assessment of the change in temperature over time and a novel assessment of the response of temperature to the change in forcing to assess the performance of future projections by past climate models compared to observations. Climate modeling efforts have advanced substantially since the first modern single‐column (Manabe & Strickler, 1964) and general circulation models (Manabe et al., 1965) of Earth's climate were published in the mid‐1960s, resulting in continually improving model hindcast skill (Knutti et al., 2013; Reichler & Kim, 2008). While these improvements have rendered virtually all of the models described here operationally obsolete, they remain valuable tools as they are in a unique position to have their projections evaluated by virtue of their decades‐long postpublication projection periods.

2 Methods
We conducted a literature search to identify papers published prior to the early‐1990s that include climate model outputs containing both a time series of projected future GMST (with a minimum of two points in time) and future forcings (including both a publication date and future projected atmospheric CO2 concentrations, at a minimum). Eleven papers with 14 distinct projections were identified that fit these criteria. Starting in the mid‐1990s, climate modeling efforts were primarily undertaken in conjunction with the IPCC process (and later, the Coupled Model Intercomparison Projects, CMIPs), and model projections were taken from models featured in the IPCC FAR (1990), Second Assessment Report (SAR; IPCC, 1996), Third Assessment Report (TAR; IPCC, 2001), and Fourth Assessment Report (AR4; IPCC, 2007). The specific model projections evaluated were Manabe, 1970 (hereafter Ma70), Mitchell, 1970 (Mi70), Benson, 1970 (B70), Rasool & Schneider, 1971 (RS71), Sawyer, 1972 (S72), Broecker, 1975 (B75), Nordhaus, 1977 (N77), Schneider & Thompson, 1981 (ST81), Hansen et al., 1981 (H81), Hansen et al., 1988 (H88), and Manabe & Stouffer, 1993 (MS93). The energy balance model projections featured in the main text of the FAR, SAR, and TAR were examined, while the CMIP3 multimodel mean (and spread) was examined for the AR4 (multimodel means were not used as the primary IPCC projections featured in the main text prior to the AR4). Details about how each individual model projection was digitized and analyzed as well as assessments of individual models included in the first three IPCC reports can be found in the supporting information. The AR4 projection was excluded from the main analysis in the paper as both the observational uncertainties and model projection uncertainties are too large over the short 2007–2017 period to draw many useful conclusions, and its inclusion makes the figures difficult to read.
However, analyses including the AR4 projection can be found in the supporting information. We assessed model projections over the period between the date the model projection was published and the end of 2017 or when the model projection ended in cases where model runs did not extend through 2017. An end date of 2017 was chosen for the analysis because the ensemble of observational estimates of radiative forcings we used only extends through that date. Five different observational temperature time series were used in this analysis: National Aeronautics and Space Administration GISTEMP (Lenssen et al., 2019), National Oceanic and Atmospheric Administration GlobalTemp (Vose et al., 2012), Hadley/UEA HadCRUT4 (Morice et al., 2012), Berkeley Earth (Rohde et al., 2013), and Cowtan and Way (Cowtan & Way, 2014). The observational temperature records used do not present a completely like‐to‐like comparison with models, as models provide surface air temperature (SAT) fields while observations are based on SAT fields over land and sea surface temperature (SST) fields over the ocean. This means that the trends in the models used here are likely biased high compared to observations, as model blended field trends are about 7% (±5%) lower than model global SAT fields over the 1970–2017 period (Cowtan et al., 2015; Richardson et al., 2016). However, the absence of SST fields from the models analyzed here prevents a comparison of blended SAT/SST against observations.
We compared observations to climate model projections over the model projection period using two approaches: change in temperature versus time and change in temperature versus change in radiative forcing (“implied TCR”). We use an implied TCR metric to provide a meaningful model‐observation comparison even in the presence of forcing differences.
Implied TCR is calculated by regressing temperature change against radiative forcing for both models and observations, and multiplying the resulting values by the forcing associated with doubled atmospheric CO2 concentrations, $F_{2x}$ (following Otto et al., 2013):
$$\mathrm{TCR}_{\mathrm{implied}} = F_{2x} \, \frac{\Delta T}{\Delta F_{\mathrm{anthro}}}$$
We express implied TCR with units of temperature using a fixed value of $F_{2x}$ = 3.7 W/m² (Vial et al., 2013). $\Delta F_{\mathrm{anthro}}$ includes only anthropogenic forcings and excludes volcanic and solar changes to avoid introducing sharp interannual changes in forcing that would complicate the interpretation of TCR over shorter time periods. For the observational record, $\Delta F_{\mathrm{anthro}}$ is based on a 1,000‐member ensemble of observationally informed forcing estimates (Dessler & Forster, 2018). Model forcings are recomputed from published formulas and tables when possible and otherwise digitized from published figures (see supporting information section S2 for details). Instantaneous forcings rather than effective or efficacy‐adjusted forcing are used, as those are all that is available for some early models (Hansen et al., 2005; Marvel et al., 2016; see supporting information section S1.0). Details on the approach used to calculate implied TCR can be found in supporting information section S1.2. Comparing models and observations via implied TCR assumes a linear relationship between forcing and warming, an approach that has been widely used in prior analyses (Gregory et al., 2004; Otto et al., 2013). If forcing varies sufficiently slowly in time and deep ocean temperatures remain approximately constant, then a linear relationship is expected to hold with a constant of proportionality that depends on the strength of radiative feedbacks and ocean heat uptake (Held et al., 2010).
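The implied TCR calculation described above can be illustrated with a minimal sketch, assuming an ordinary least-squares regression of temperature change on anthropogenic forcing. The function names (`ols_slope`, `implied_tcr`) and the input series are hypothetical and made up for illustration; they are not taken from the authors' code, and the procedure actually used is documented in supporting information section S1.2.

```python
F_2X = 3.7  # W/m^2, forcing for doubled CO2 (fixed value stated in the text)


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def implied_tcr(delta_t, delta_f_anthro, f_2x=F_2X):
    """Implied TCR (deg C): slope of warming on forcing, scaled by F_2x."""
    return f_2x * ols_slope(delta_f_anthro, delta_t)


# Made-up illustrative anomalies, not data from the paper:
forcing = [0.0, 0.5, 1.0, 1.5, 2.0]    # W/m^2
warming = [0.0, 0.25, 0.5, 0.75, 1.0]  # deg C
tcr = implied_tcr(warming, forcing)    # slope of 0.5 deg C per W/m^2 times 3.7
```

Because the slope is scaled by a fixed forcing per CO2 doubling, the same function can be applied to a model projection and to each observational estimate, which puts both on a common footing even when their forcing histories differ.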
In this regime, our implied TCR metric provides information about model physics and is unaffected by the time rate of change of forcing; moreover, previous studies have suggested that the temperature response to twentieth century anthropogenic forcing falls within this regime (Gregory & Forster, 2008; Gregory & Mitchell, 1997; Held et al., 2010). However, sudden increases or decreases such as those associated with volcanic eruptions will not engender an equivalent immediate temperature response. For this reason, only anthropogenic forcings were used in estimating implied TCR, as all models evaluated lacked additional volcanic events during their projection periods with the exception of Scenarios B and C of H88. Similarly, thermal inertia in the climate system can affect the relationship between temperature and external forcing if forcing increases sufficiently rapidly (Geoffroy et al., 2012). Scenarios where forcing is rapidly increasing will, all things being equal, tend to be further away from an equilibrium state than scenarios with more gradual increase after a given period of time (Rohrschneider et al., 2019) and thus have a lower implied TCR. With a few exceptions (e.g., RS71, H88 Scenarios A and C), however, most models evaluated had a rate of external forcing increase in the projection period within 1.3 times of the mean estimate of observational forcings and thus likely fall into the regime where implied TCR depends largely on radiative feedbacks and ocean heat uptake.
In this analysis we refer to model projections as consistent or inconsistent with observations based on a comparison of the differences between the two. Specifically, if the 95% confidence interval in the differences between the modeled and observed metrics includes 0, the two are deemed consistent; otherwise, they are inconsistent (Hausfather et al., 2017).
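The consistency criterion above can be sketched as follows, assuming the 95% interval is taken as the 2.5th to 97.5th percentile range of the differences between one model value and an ensemble of observational estimates. That percentile-based construction, the function names, and the example ensemble are assumptions for illustration only; the authors' procedure is described in supporting information section S1.3.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def is_consistent(model_value, observed_ensemble, level=95.0):
    """Return (flag, (lower, upper)) for the interval of model minus observed.

    flag is True when the central `level` percent interval contains zero.
    """
    diffs = [model_value - obs for obs in observed_ensemble]
    tail = (100.0 - level) / 2.0
    lower = percentile(diffs, tail)
    upper = percentile(diffs, 100.0 - tail)
    return lower <= 0.0 <= upper, (lower, upper)


# Made-up observational ensemble of a metric (e.g., implied TCR in deg C):
observed = [1.5 + 0.01 * i for i in range(101)]
ok, interval = is_consistent(1.8, observed)  # model value inside the spread
too_high, _ = is_consistent(3.0, observed)   # model value far above the spread
```

The same check applies to either metric (temperature trend or implied TCR), since it only looks at the distribution of model-minus-observation differences.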
Additionally, we follow the approach of Hargreaves (2010) in calculating a skill score for each model for both temperature versus time and implied TCR metrics. This skill score is based on the root‐mean‐square errors of the model projection trend versus observations compared to a zero‐change null‐hypothesis projection. See supporting information section S1.3 for details on calculating consistency and skill scores.
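A minimal sketch of a skill score of this kind is given below, assuming it takes the form of one minus the ratio of the projection's root‐mean‐square error to that of a zero‐change projection. This specific formula, the function names, and the trend values are assumptions for illustration; the exact definition used in the analysis is given in supporting information section S1.3.

```python
import math


def rmse(prediction, observations):
    """Root-mean-square error of one predicted value against many observations."""
    if not observations:
        raise ValueError("observations must be non-empty")
    return math.sqrt(
        sum((prediction - obs) ** 2 for obs in observations) / len(observations)
    )


def skill_score(model_trend, observed_trends):
    """Skill relative to a zero-change null projection.

    1 means perfect agreement, 0 means no better than predicting no change,
    and negative values mean worse than predicting no change.
    """
    null_error = rmse(0.0, observed_trends)
    if null_error == 0.0:
        raise ValueError("skill is undefined when observed trends are all zero")
    return 1.0 - rmse(model_trend, observed_trends) / null_error


# Made-up trends in deg C per decade, not values from the paper:
perfect = skill_score(0.18, [0.18])          # model matches observations
null_like = skill_score(0.0, [0.17, 0.19])   # model predicts no change
overshoot = skill_score(0.45, [0.18])        # error larger than the null's
```

Under this form, a projection that warms more than twice as fast as observed scores below zero, because its error exceeds that of assuming no change at all.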

3 Results
A direct comparison of projected and observed temperature change during each historical model's projection period can provide an effective test of model skill, provided that model projection forcings are reasonably in‐line with the ensemble of observationally informed estimates of radiative forcings. In about 9 of the 17 model projections examined, the projected forcings were within the uncertainty envelope of the observational forcing ensemble. However, the remaining eight models (RS71, H81 Scenario 1, H88 Scenarios A, B, and C, FAR, MS93, and TAR) had projected forcings significantly stronger or weaker than observed (Figure 1). For the latter, an analysis comparing the implied TCR between models and observations may provide a more accurate assessment of model performance.
Figure 1. Rate of external forcing increase (in watts per meter squared per decade) in models and observations over model projection periods.
Comparisons between climate models and observations over model projection periods are shown in Figure 2 for both temperature versus time and implied TCR metrics (differences between models and observations are shown in Figure S2). Overall the majority of model projections considered were consistent with observations under both metrics. Using the temperature versus time metric, 10 of the 17 model projections show results consistent with observations. Of the remaining seven model projections, four project more warming than observed (N77, ST81, and H88 Scenarios A and B), while three project less warming than observed (RS71, H81 Scenario 2a, and H88 Scenario C).
Figure 2. Comparison of trends in temperature versus time (top panel) and implied TCR (bottom panel) between observations and models over the model projection periods displayed at the bottom of the figure. Figure S1 shows a variant of this figure with the AR4 projections included.
When mismatches between projected and observed forcings are taken into account, a better performance is seen. Using the implied TCR metric, 14 of the 17 model projections were consistent with observations; of the three that were not, Mi70 and H88 Scenario C showed higher implied TCR than observations, while RS71 showed lower implied TCR (Schneider, 1975; see supporting information Text S2 for a discussion of the anomalously low equilibrium climate sensitivity (ECS) model used in RS71). A number of model projections were inconsistent with observations on a temperature versus time basis but are consistent once mismatches between modeled and observed forcings are taken into account. For example, while N77 and ST81 projected more warming than observed, their implied TCRs are consistent with observations despite forcings within (though on the high end of) the ensemble range of observational estimates. Similarly, while H81 Scenario 2a projects less warming than observed, its implied TCR is consistent with observations. A number of 1970s‐era models (Ma70, Mi70, B70, B75, and N77) show implied TCR on the high end of the observational ensemble‐based range. This is likely due to their assumption that the atmosphere equilibrates instantly with external forcing, which omits the role of transient ocean heat uptake (Hansen et al., 1985). However, despite this high implied TCR, a number of the models (e.g., Ma70, Mi70, B70, and B75) still end up providing temperature projections in‐line with observations as their forcings were on the lower end of observations due to the absence of any non‐CO2 forcing agents in their projections. In principle, the same underlying model should show consistent results for modestly different forcing scenarios under the implied TCR metric.
However, the inconsistency of the H88 Scenario C is illustrative of the limitations of the implied TCR metric when the model forcings differ dramatically from observations, as Scenario C has roughly constant forcings after the year 2000. The H88 model provides a helpful illustration of the utility of an approach that can account for mismatches between modeled and observed forcings. H88 was featured prominently in congressional testimony, and the recent thirtieth anniversary of the event in 2018 focused considerable attention on the accuracy of the projection (Borenstein & Foster, 2018; United States. Cong. Senate, 1988). H88's “most plausible” Scenario B overestimated warming experienced subsequent to publication by around 54% (Figure 3). However, much of this mismatch was due to overestimating future external forcing, particularly from CH4 and halocarbons (Figure S3). When H88 Scenario B is evaluated based on the relationship between projected temperatures and projected forcings, the results are consistent with observations (Figures 2 and 3).
Figure 3. Hansen et al. (1988) projections compared with observations on a temperature versus time basis (top) and temperature versus external forcing (bottom). The dashed gray line in the top panel represents the start of the projection period. The transparent blue lines in the lower panel represent 500 random samples of the 5,000 combinations of the five temperature observation products and the 1,000 ensemble members of estimated forcings (the full ensemble is subsampled for visual clarity). The dashed blue lines show the 95% confidence intervals for the 5,000‐member ensemble (see supporting information Text S1.4 for details). Anomalies for both temperature and forcing are shown relative to a 1958–1987 preprojection baseline.
Skill score median estimates and uncertainties for both temperature versus time and implied TCR metrics are shown in Table 1 (see supporting information Text S1.3).
A skill score of one represents perfect agreement between a model projection and observations, while a skill score of less than 0 represents worse performance than a no‐change null‐hypothesis projection.
Table 1. Model Skill Scores Over the Projection Period, Where 1 Represents Perfect Agreement With Observations and Less Than 0 Represents Worse Performance Than a No‐Change Null Hypothesis
| Model | Timeframe | ΔT/Δt skill | ΔT/ΔF skill |
| --- | --- | --- | --- |
| Ma70 | 1970–2000 | 0.84 [0.57 to 0.99] | 0.51 [−0.11 to 0.94] |
| Mi70 | 1970–2000 | 0.91 [0.69 to 0.99] | 0.41 [−0.26 to 0.90] |
| B70 | 1970–2000 | 0.78 [0.45 to 0.97] | 0.63 [0.06 to 0.96] |
| RS71 | 1971–2000 | 0.19 [0.16 to 0.25] | 0.42 [0.28 to 0.59] |
| S72 | 1972–2000 | 0.83 [0.49 to 0.99] | 0.83 [0.43 to 0.98] |
| B75 | 1975–2010 | 0.85 [0.64 to 0.98] | 0.72 [0.31 to 0.97] |
| N77 | 1977–2017 | 0.67 [0.44 to 0.84] | 0.79 [0.48 to 0.98] |
| ST81 | 1981–2017 | 0.76 [0.53 to 0.94] | 0.82 [0.52 to 0.98] |
| H81(1) | 1981–2017 | 0.93 [0.81 to 0.99] | 0.74 [0.59 to 0.93] |
| H81(2a) | 1981–2017 | 0.77 [0.66 to 0.91] | 0.87 [0.69 to 0.99] |
| H88(A) | 1988–2017 | 0.38 [0.01 to 0.68] | 0.81 [0.63 to 0.98] |
| H88(B) | 1988–2017 | 0.48 [0.08 to 0.77] | 0.79 [0.41 to 0.98] |
| H88(C) | 1988–2017 | 0.66 [0.48 to 0.89] | 0.28 [−0.46 to 0.84] |
| FAR | 1990–2017 | 0.63 [0.29 to 0.87] | 0.86 [0.68 to 0.99] |
| MS93 | 1993–2017 | 0.71 [0.20 to 0.97] | 0.87 [0.61 to 0.99] |
| SAR | 1995–2017 | 0.73 [0.58 to 0.95] | 0.66 [0.49 to 0.91] |
| TAR | 2001–2017 | 0.81 [0.15 to 0.98] | 0.76 [−0.13 to 0.98] |
| AR4 | 2007–2017 | 0.56 [0.35 to 0.92] | 0.60 [0.37 to 0.93] |

The average of the median skill scores across all the model projections evaluated is 0.69 for the temperature versus time metric. Only three projections (RS71, H88 Scenario A, and H88 Scenario B) had skill scores below 0.5, while H81 Scenario 1 had the highest skill score of any model, 0.93. Using the implied TCR metric, the average projection skill of the models was also 0.69. Models with implied TCR skill scores below 0.5 include Mi70, RS71, and H88 Scenario C, while MS93 had the highest skill score at 0.87.
H88 Scenarios A and B and the IPCC FAR all performed substantially better under an implied TCR metric, reflecting the role of misspecified future forcings in their high‐temperature projections. It is important to note that the skill score uncertainties for very short future projection periods, as in the case of the TAR and AR4, are quite large and should be treated with caution due to the combination of short‐term temperature variability and uncertainties in the forcings.
A number of model projections had external forcings that poorly matched observational estimates due to the exclusion of non‐CO2 forcing agents. However, all models included projected future CO2 concentrations, providing a common metric for comparison, and these are shown in Figure S4. Most of the historical climate model projections overestimated future CO2 concentrations, some by as much as 40 ppm over current levels, with projected CO2 concentrations increasing up to twice as fast as actually observed (Meinshausen, 2017). Of the 1970s climate model projections, only Mi70 projected atmospheric CO2 growth in‐line with observations. Many 1980s projections similarly overestimated CO2, with only the H88 Scenarios A and B projections close to observed concentrations.
The first three IPCC assessments included projections based on simple energy balance models tuned to general circulation model results, as relatively few individual model runs were available at the time. From the AR4 onward IPCC projections were based on the multimodel mean and model spread. We examine individual models from the first three IPCC reports on both a temperature versus time and implied TCR basis in Figure S5.

4 Conclusions and Discussion
In general, past climate model projections evaluated in this analysis were skillful in predicting subsequent GMST warming in the years after publication. While some models showed too much warming and a few showed too little, most models examined showed warming consistent with observations, particularly when mismatches between projected and observationally informed estimates of forcing were taken into account. We find no evidence that the climate models evaluated in this paper have systematically overestimated or underestimated warming over their projection period. The projection skill of the 1970s models is particularly impressive given the limited observational evidence of warming at the time, as the world was thought to have been cooling for the past few decades (e.g., Broecker, 1975; Broecker, 2017).
A number of high‐profile model projections (H88 Scenarios A and B and the IPCC FAR in particular) have been criticized for projecting higher warming rates than observed (e.g., Michaels & Maue, 2018). However, these differences are largely driven by mismatches between projected and observed forcings. H88 A and B forcings increased 97% and 27% faster, respectively, than the mean observational estimate, and FAR forcings increased 55% faster. On an implied TCR basis, all three projections have high model skill scores and are consistent with observations.
While climate models have grown substantially more complex than the early models examined here, the skill that early models have shown in successfully projecting future warming suggests that climate models are effectively capturing the processes driving the multidecadal evolution of GMST. While the relative simplicity of the models analyzed here renders their climate projections operationally obsolete, they may be useful tools for verifying or falsifying methods used to evaluate state‐of‐the‐art climate models.
As climate model projections continue to mature, more signals are likely to emerge from the noise of natural variability and allow for the retrospective evaluation of other aspects of climate model projections.

Acknowledgments
Z. H. conceived the project, Z. H. and H. F. D. created the figures, and Z. H., H. F. D., T. A., and G. S. helped gather data and wrote the article text. A public GitHub repository with code used to analyze the data and generate figures and csv files containing the data shown in the figures is available online (https://github.com/hausfath/OldModels). Additional information on the code and data used in the analysis can be found in the supporting information. We would like to thank Piers Forster for providing the ensemble of observationally‐informed radiative forcing estimates. None of the authors received dedicated funding for this project.
