1 Introduction

The purpose of this paper is to isolate a measurable feature of the climate that can serve as a testable index of the major hypothesis of the atmospheric component of general circulation models (GCMs). By major hypothesis we refer to the process of central interest both to most modelers and to the many users of GCM output: the parameterized representation of moist thermodynamics and convection that, in combination with the underlying model structure, yields amplified warming of the atmosphere from greenhouse gases consistent with mainstream magnitudes of equilibrium climate sensitivity (ECS), namely 3.0 ± 1.5 °C per doubling of carbon dioxide equivalent. GCMs embed countless minor hypotheses subject to continual testing and revision. Owing to the sheer complexity of the climate itself, and of climate models, any number of observed discrepancies between projections from an individual model and some local feature of the real world can be accommodated, rationalized, or ignored without calling the model itself into question, since the rejected component could be removed without the rest of the model ceasing to be a member of the class of GCMs. We start from the assumption that there must, in principle, be at least one testable major hypothesis whose rejection would constitute failure of the model itself, in the sense that were the failed component to be removed, what remains would no longer be a GCM. We also start from the assumption that the measure we seek represents an emergent behavior of models based on both physical theory and modeler judgment, so that model integrations are genuinely expressions of a hypothesis (as opposed to computations of a known constant). The ensemble mean thus represents the central tendency of modeler assumptions and is itself a testable quantity.

There are many identifiable predictions generated by climate models that could serve as test targets, but we propose four conditions that narrow the field down to a truly informative one: measurability, specificity, independence, and uniqueness. The first condition requires that the prediction refer to a target that is well measured over a long time span. This rules out testing ECS directly, since neither it nor its transient counterpart is observable. It also limits the choice of temperature fields. Remote places such as the polar surfaces are poorly sampled, creating known problems for assembling complete, homogeneous long‐term temperature estimates. Many regions of the ocean are likewise poorly sampled or have only recently come under measurement. Behavior of a target over a relatively short time span may be strongly affected by internal variability of the climate system or by exceptional events. An example of the latter is the influence of major volcanic eruptions on the stratospheric temperature record in 1983 and 1992, each of which led to a temporary spike twice as large as the multidecadal trend, making trend identification difficult over the post‐1979 satellite record. Similar problems limit the usefulness of many other potential targets for furnishing testable predictions.

The second condition is that it must be a specific prediction, namely, one that reliably emerges across runs and across all models, on a specific temporal scale. To the extent the governing mechanisms in models reflect shared hypotheses among independent modeling teams, we should expect to see coherent behavior in the target variable across independent runs, varying in magnitude but not in sign. The issue of timescale is equally important. One could endlessly shield a GCM from testing by arguing that while the magnitude of a projected change is precisely forecast, the timing is unknown to within several decades or centuries, so the failure to observe an expected change even in a lengthy data set means only that it is delayed. To avoid this dead end we confine attention to large, well‐measured atmospheric regions where GCMs predict, more or less in unison, not only specific magnitudes of change but also a specific (and reasonably rapid) timescale.

Third, the independence criterion means that the target of the prediction must not be an input to the empirical tuning of the model. Once a model has been tuned to match a target, its reproduction of the target is no longer a test of its validity. In the case of GCMs, this rules out using the global average surface temperature record for testing, since during development models are often adjusted to broadly match its evolution over time. If the model structure is otherwise valid, such tuning practices should improve empirical fidelity, and the result should be that the model now makes more accurate predictions about other features of the atmosphere, measurements of which were not inputs to the tuning process. A good test ought therefore to focus on those other measures.

Finally, uniqueness refers to the causality behind the observed change. If the model predicts that greenhouse gases (GHGs) will cause the target to warm, but also predicts that many other factors could cause the target to warm, an observed warming would be less informative, since it is consistent both with a successful prediction and with a failed prediction coupled with the coincidental action of other causes. Ideally, then, we look for a prediction uniquely tied to the underlying causal mechanism of interest.

Air temperature in the 200‐ to 300‐hPa layer of the tropical troposphere meets all four test conditions and, as far as we are aware, does so nearly uniquely in the climate system. First, homogenized measurements from more than one independent source are available over the 60‐year span from 1958 to 2017. This is twice the 30‐year interval customarily thought necessary for identifying a climatological phenomenon and ample relative to the response timescale in GCMs. The time span encompasses several major volcanoes, strong El Niño events, and the Pacific climate shift (PCS) of the late 1970s, yet is long enough to allow distinct identification of an underlying smooth trend, if one exists. Also, since the layer is part of the well‐mixed free troposphere, there are fewer problems in obtaining a credible tropical‐scale sample than is the case with surface measurements. For instance, Figure 17 in Christy et al. (2018) compares a variety of trends in tropical midtroposphere data products over the 1979–2016 interval. The radiosonde products, covering large parts of the tropical grid, yield results nearly identical to reanalysis products, which cover the entire grid.

Second, as noted in the IPCC Fourth Assessment Report (Meehl et al., 2007, Ch. 10), GCMs unanimously project that warming will reach a global maximum in the tropics near the 200‐ to 300‐hPa layer, owing to the so‐called negative lapse rate feedback (National Academy of Sciences, 2003), and that the warming will occur rapidly in response to increased greenhouse forcing. Figure 1 shows the simulated 1958–2017 warming rates from the IPCC AR5 Canadian model, with the target zone visible as the red bullseye in the middle. Similar figures from models developed in the United States, the UK, and Germany are shown in supporting information Figures S1–S3. Model representations of this layer's annual temperature series over our sample span are very coherent: 94% of the possible cross‐correlations among model runs exceed 0.5, and 77% exceed 0.6. The first principal component (PC) explains 73% of the variance across all 102 runs; what remains in the data is largely model‐specific noise. Figure 2 shows the scree plot of the first 30 PCs. After PC1 the next four each explain 2% or less, and the remainder taper off quickly to very small levels, indicating that there is only one dominant signal common across models and across all runs of each model. The timescale is also well constrained. The average projected warming rate over 1958–2017 in the target layer is 0.33 °C per decade, with a range of 0.18–0.51 °C per decade. Hence, models project on average that total warming in the target zone since 1958 should by now be about 2 °C, a magnitude well within observational capability, and that the trends should be well established, thus specifying both a magnitude and a timescale.

Figure 1. Warming pattern in the Canadian model, 1958–2017. The horizontal axis shows latitude, the vertical axis altitude, and color the warming trend magnitude.

Figure 2. Scree plot of the first 30 principal components of 102 modeled temperature simulations over 1958–2017.
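The ensemble-coherence diagnostics described above (the share of pairwise cross-correlations exceeding 0.5 and the variance explained by the first principal component) can be sketched as follows. The data here are synthetic placeholders, a common trend plus run-specific noise standing in for the 102 CMIP5 annual series; the numerical results will therefore differ from the paper's.

```python
# Sketch of the ensemble-coherence diagnostics on synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(0)
n_years, n_runs = 60, 102
years = np.arange(n_years)

# Hypothetical ensemble: shared ~0.33 °C/decade signal + independent noise
signal = 0.033 * years
runs = signal[:, None] + rng.normal(0.0, 0.25, size=(n_years, n_runs))

# Fraction of pairwise cross-correlations exceeding 0.5
corr = np.corrcoef(runs.T)
pairs = corr[np.triu_indices(n_runs, k=1)]
frac_above = np.mean(pairs > 0.5)

# Variance share of PC1, via SVD of the column-centered data matrix
centered = runs - runs.mean(axis=0)
svals = np.linalg.svd(centered, compute_uv=False)
pc1_share = svals[0] ** 2 / np.sum(svals ** 2)

print(f"share of cross-correlations > 0.5: {frac_above:.2f}")
print(f"variance explained by PC1: {pc1_share:.2f}")
```

A single dominant PC1 share with a rapidly decaying remainder is the scree-plot signature referred to in the text.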

Third, by focusing on the 200‐ to 300‐hPa layer we avoid contaminating the test by searching for a signal to which the models were already tuned. The surface temperature record is ruled out for this reason, but satellite‐based lower‐ and middle‐troposphere composites are also somewhat contaminated since they include the near‐surface layer in their weighting functions. Radiosonde samples measure each layer of the atmosphere independently, not simply as a gradient against the surface.

Fourth, simulations in IPCC AR4 Chapter 9 (Hegerl et al., 2007) indicate that, within the framework of mainstream GCMs, greenhouse forcing provides the only explanation for a strong warming trend in the target region. AR4 Figure 9.1 shows 20th‐century climate reconstructions applying individual forcings one at a time: observed solar, volcanic, GHG, stratospheric ozone, and sulfate aerosol changes. Solar forcing yields an amplified warming aloft in the tropics, but the magnitude of the change is very small, and the IPCC elsewhere emphasizes that actual historical trends in solar output have been too small to cause much atmospheric warming (AR4 Sect. 2.7, Forster et al., 2007). Only GHG forcing yields a large modeled warming pattern in the tropical 200‐ to 300‐hPa layer, which accords with the finding above that the PC decomposition identifies only one common signal. Such a warming trend in the atmosphere, were it to be observed, would thus have only one explanation; likewise, its absence would conflict with only one major hypothesis of the model, namely, the set of parameterizations that yields amplified GHG‐induced warming.

We make use herein of the latest releases of three radiosonde data sets: the U.S. National Oceanic and Atmospheric Administration's Radiosonde Atmospheric Temperature Products for Assessing Climate (RATPAC‐A v2; Durre & Yin, 2011), the University of Vienna's RAdiosonde OBservation COrrection using REanalyses (RAOBCORE v1.5), and Radiosonde Innovation Composite Homogenization (RICH v1.5; Haimberger et al., 2012). All data begin in 1958, when radiosonde coverage expanded around the globe for the International Geophysical Year, and continue to the end of 2017. These series are compared against the complete ensemble of 102 model runs prepared for the Coupled Model Intercomparison Project Phase 5 (CMIP5) used in the most recent IPCC report (see Flato et al., 2013). The model output was obtained and used as is from the Koninklijk Nederlands Meteorologisch Instituut (KNMI) Climate Explorer site (van Oldenborgh, 2016). Since autocorrelation structures in climate data can be complex and may differ among data types (see, e.g., Varotsos et al., 2013), we use a variance estimation methodology robust to general forms of heteroskedasticity and autocorrelation (McKitrick & Vogelsang, 2014; Vogelsang & Franses, 2005). We also allow for a possible break term at 1979 associated with the Pacific climate shift (see Seidel & Lanzante, 2004; Tsonis et al., 2007; Powell & Xu, 2011, and references therein). What we refer to as the general trend model allows for a step change at 1979, while the restricted model does not.
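The general trend model just described regresses an annual series on an intercept, a linear trend, and a step dummy at 1979; the restricted model drops the dummy. The following minimal sketch produces the point estimates on synthetic placeholder data. It shows the regression design only; the paper's inference uses the Vogelsang-Franses HAC-robust variance estimator, which is not reproduced here.

```python
# Sketch of the "general" trend model: y = a + b*t + c*step(1979) + e,
# fit by OLS on synthetic data (true trend 0.1 °C/decade, step 0.3 °C).
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1958, 2018)
y = 0.01 * (years - 1958) + 0.3 * (years >= 1979) + rng.normal(0, 0.1, years.size)

t = years - years[0]
step = (years >= 1979).astype(float)        # Pacific climate shift dummy
X = np.column_stack([np.ones(t.size), t, step])

# OLS point estimates; drop the step column for the restricted model
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b, c = beta
print(f"trend: {10 * b:.3f} °C/decade, step at 1979: {c:.3f} °C")
```

Comparing the fitted general and restricted models is then a standard nested-model exercise once a robust covariance estimate is in hand.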

Our test is directed at the size of the response rather than its sign. The observed series exhibits a change in the predicted direction, so we focus on whether the magnitude of the change is consistent with the model prediction. As we will show, all 102 model runs warm more rapidly than the observations, whether or not we allow for a break term, and most of the divergences are individually significant. We reject the hypothesis that the average model trend matches the average observed trend, with or without a break term. Thus, the observed data are inconsistent with the major hypothesis of GCMs as represented by the selected target variable.
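The comparison just described can be sketched as follows: fit a linear trend to each run and to the observed series, then compare the ensemble-mean trend with the observed trend. The data below are synthetic stand-ins (runs trending faster than the "observations" by construction), and the formal significance testing with HAC-robust standard errors is omitted.

```python
# Sketch of the model-versus-observation trend comparison on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n_years, n_runs = 60, 102
t = np.arange(n_years)

# Hypothetical runs trending ~0.33 °C/decade; "observations" ~0.17 °C/decade
runs = 0.033 * t[:, None] + rng.normal(0, 0.15, (n_years, n_runs))
obs = 0.017 * t + rng.normal(0, 0.15, n_years)

def decadal_trend(y):
    """OLS slope converted to °C per decade."""
    return 10 * np.polyfit(t, y, 1)[0]

run_trends = np.array([decadal_trend(runs[:, i]) for i in range(n_runs)])
obs_trend = decadal_trend(obs)

print(f"ensemble mean trend: {run_trends.mean():.2f} °C/decade")
print(f"observed trend:      {obs_trend:.2f} °C/decade")
print(f"runs warming faster than obs: {np.mean(run_trends > obs_trend):.0%}")
```

In the paper the analogous tally over the actual CMIP5 ensemble, with proper robust inference on each divergence, is what supports the rejection reported above.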