by Ross McKitrick

Ben Santer et al. have a new paper out in Nature Climate Change arguing that with 40 years of satellite data available they can detect the anthropogenic influence in the mid-troposphere at a 5-sigma level of confidence. This, they point out, is the “gold standard” of proof in particle physics, even invoking for comparison the Higgs boson discovery in their Supplementary information.

FIGURE 1: From Santer et al. 2019

Their results are shown in the above Figure. It is not a graph of temperature, but of an estimated “signal-to-noise” ratio. The horizontal lines represent sigma units which, if the underlying statistical model is correct, can be interpreted as points where the tail of the distribution gets very small. So when the lines cross a sigma level, the “signal” of anthropogenic warming has emerged from the “noise” of natural variability by a suitable threshold. They report that the 3-sigma boundary has a p value of 1/741 while the 5-sigma boundary has a p value of 1/3.5 million. Since all signal lines cross the 5-sigma level by 2015, they conclude that the anthropogenic effect on the climate is definitively detected.
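The two reported p values can be checked against the standard Normal distribution. The short sketch below assumes they are one-sided upper-tail probabilities, which is the convention that reproduces the quoted figures. The function name is just illustrative.

```python
import math

def upper_tail_p(sigma: float) -> float:
    """One-sided upper-tail probability of a standard Normal beyond `sigma`."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))

for s in (3, 5):
    p = upper_tail_p(s)
    print(f"{s}-sigma: p = {p:.3e}, about 1 in {1.0 / p:,.0f}")
```

These probabilities only mean what they appear to mean if the statistic really does follow a Normal distribution under the null hypothesis.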

I will discuss four aspects of this study which I think weaken the conclusions considerably: (a) the difference between the existence of a signal and the magnitude of the effect; (b) the confounded nature of their experimental design; (c) the invalid design of the natural-only comparator; and (d) problems relating “sigma” boundaries to probabilities.

(a) Existence of signal versus magnitude of effect

Suppose you are tuning an old analog receiver to a weak signal from a far-away radio station. By playing with the dial you might eventually get a good enough signal to realize they are playing Bach. But the strength of the signal tells you nothing about the tempo of the music: that’s a different calculation.

In the same way the above diagram tells us nothing about the magnitude of the temperature effect of greenhouse gases on the climate. It only shows the ratio of two things: a measure of the rate of improvement over time of the correlation between observations and models forced with natural and anthropogenic forcings, divided by a measure of the standard deviation of the same measure under a “null hypothesis” of (allegedly) pure natural variability. In that sense it is like a t-statistic, which is also measured in sigma units. Since there can be no improvement over time in the fit between the observations and the natural-only comparator, any improvement in the signal raises the sigma level.

Even if you accept Figure 1 at face value, it is consistent with there being a very high or very low sensitivity to greenhouse gases, or something in between. It is consistent, for instance, with the findings of Christy and McNider, also based on satellite data, that sensitivity to doubled GHG levels, while positive, is much lower than typically shown in models.

(b) Confounded signal design

According to the Supplementary information, Santer et al. took annually-averaged climate model data based on historical and (RCP8.5) scenario-based natural and anthropogenic forcings and constructed mid-troposphere (MT) temperature time series that include an adjustment for stratospheric cooling (i.e. “corrected”). They averaged all the runs and models, regridded the data into 10 degree x 10 degree grid cells (576 altogether, with polar regions omitted) and extracted 40 annual temperature anomalies for each gridcell over the 1979 to 2018 interval. From these they extracted a spatial “fingerprint” of the model-generated climate pattern using principal component analysis, aka empirical orthogonal functions. You could think of it as a weighted average over time of the anomaly values for each gridcell. Though it’s not shown in the paper or the Supplement, this is the pattern (it’s from a separate paper):

FIGURE 2: Spatial fingerprint pattern

The gray areas in Figure 2 over the poles represent omitted gridcells since not all the satellite series cover polar regions. The colors represent PC “loadings” not temperatures, but since the first PC explains about 98% of the variance, you can think of them as average temperature anomalies and you won’t be far off. Hence the fingerprint pattern in the MT is one of amplified warming over the tropics with patchy deviations here and there.
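To make the fingerprint extraction concrete, here is a minimal sketch of a leading EOF computed by singular value decomposition. Everything in it is an assumption for illustration: the data are synthetic, the "tropical amplification" pattern is made up, and the grid is simply 16 latitude bands by 36 longitude bands to match the 576 gridcells mentioned above. It is not the procedure or data used by Santer et al.

```python
import numpy as np

rng = np.random.default_rng(0)

n_years, n_lat, n_lon = 40, 16, 36          # 16 x 36 = 576 gridcells
lats = np.linspace(-75, 75, n_lat)          # hypothetical band centres, poles omitted

# Hypothetical built-in pattern: warming amplified near the equator.
true_pattern = np.repeat(1.0 + 0.5 * np.cos(np.deg2rad(lats)) ** 2, n_lon)

# Synthetic model-mean field: pattern x linear trend + small noise.
trend = np.linspace(0.0, 1.0, n_years)
noise = 0.02 * rng.standard_normal((n_years, n_lat * n_lon))
field = np.outer(trend, true_pattern) + noise

# Anomalies relative to each gridcell's time mean.
anom = field - field.mean(axis=0)

# EOFs are the right singular vectors of the (time x space) anomaly matrix.
u, s, vt = np.linalg.svd(anom, full_matrices=False)
fingerprint = vt[0]                          # leading EOF (spatial loadings)
explained = s[0] ** 2 / np.sum(s ** 2)       # share of variance in the first PC

# The sign of an EOF is arbitrary; orient it to align with the built-in pattern.
if fingerprint @ true_pattern < 0:
    fingerprint = -fingerprint

pattern_corr = np.corrcoef(fingerprint, true_pattern)[0, 1]
print(f"variance explained by PC1: {explained:.3f}")
print(f"correlation with the built-in pattern: {pattern_corr:.3f}")
```

When one PC dominates the variance like this, its loadings are close to a rescaled map of the average anomalies, which is why the loadings in Figure 2 can be read roughly that way.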

This is the pattern they will seek to correlate with observations as a means of detecting the anthropogenic “fingerprint.” But it is associated in the models with both natural and anthropogenic forcings together over the 1979-2018 interval. They refer to this as the HIST+8.5 data, meaning model runs forced up to 2006 with historical forcings (both natural and anthropogenic) and thereafter according to the RCP8.5 forcings. The conclusion of the study is that observations now look more like the above figure than the null hypothesis (“natural only”) figure, ergo anthropogenic fingerprint detected. But HIST+8.5 is a combined fingerprint, and they don’t actually decompose the anthropogenic portion.

So they haven’t identified a distinct anthropogenic fingerprint. What they have detected is that observations exhibit a better fit to models that have the Figure 2 warming pattern in them, regardless of cause, than those that do not. It might be the case that a graph representing the anthropogenic-only signal would look the same as Figure 1, but we have no way of knowing from their analysis.

(c) Invalid natural-only comparator

The above argument would matter less if the “nature-only” comparator controlled for all known warming from natural forcings. But it doesn’t, by construction.

The fingerprint methodology begins by taking the observed annual spatial layout of temperature anomalies and correlates it to the pattern in Figure 2 above, yielding a correlation coefficient for each year. Then they look at the trend in those correlation coefficients as a measure of how well the fit increases over time. The correlations themselves are not reported in the paper or the supplement.
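The two steps just described (a pattern correlation for each year, then a trend in those correlations) can be sketched as follows. This is a toy version under stated assumptions: the fingerprint and "observations" are synthetic, there is no area weighting, and the function name is hypothetical. It follows the description in this post rather than the paper's exact statistic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_cells = 40, 576

# Hypothetical fingerprint and synthetic "observations": in this toy setup the
# fingerprint's amplitude grows linearly on top of spatial noise.
fingerprint = rng.standard_normal(n_cells)
amplitude = np.linspace(0.0, 1.0, n_years)
obs = np.outer(amplitude, fingerprint) + rng.standard_normal((n_years, n_cells))

def pattern_correlation(field: np.ndarray, pattern: np.ndarray) -> float:
    """Centered spatial correlation between one year's field and the pattern."""
    return float(np.corrcoef(field, pattern)[0, 1])

years = np.arange(1979, 1979 + n_years)
corr = np.array([pattern_correlation(obs[i], fingerprint) for i in range(n_years)])

# OLS trend in the yearly correlations (slope per year).
slope, intercept = np.polyfit(years, corr, 1)
print(f"correlation in {years[0]}: {corr[0]:.2f}, in {years[-1]}: {corr[-1]:.2f}")
print(f"trend in pattern correlation: {slope:.4f} per year")
```

The slope is the "signal" in this setup. It measures how quickly the resemblance grows, not how large the underlying temperature change is.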

The authors then construct a “noise” pattern to serve as the “nature-only” counterfactual to the above diagram. They start by selecting 200-year control runs from 36 models and gridding them in the same 10×10 format. Eventually they will average them all up, but first they detrend each gridcell in each model, which I consider a misguided step.

Everything depends on how valid the natural variability comparator is. We are given no explanation of why the authors believe it is a credible analogue to the natural temperature patterns associated with post-1979 non-anthropogenic forcings. It almost certainly isn’t. The sum of the post-1979 volcanic+solar series in the IPCC AR5 forcing series looks like this:

FIGURE 3: IPCC NATURAL FORCINGS 1979-2017

This clearly implies natural forcings would have induced a net warming over the sample interval, and since tropical amplification occurs regardless of the type of forcing, a proper “nature-only” spatial pattern would likely look a lot like Figure 2. But by detrending every gridcell Santer et al. removed such patterns, artificially worsening the estimated post-1979 natural comparator.
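The arithmetic of gridcell-by-gridcell detrending can be seen in a toy example. The sketch below uses synthetic data with a made-up trend of 0.01 units per year in every gridcell. It does not reproduce the paper's control runs and says nothing about what trends they contain. It only shows that linear detrending removes a linear trend whatever its cause.

```python
import numpy as np

rng = np.random.default_rng(2)
n_years, n_cells = 200, 576
t = np.arange(n_years)

# Synthetic series: a hypothetical trend of 0.01 units/year in every gridcell
# plus white-noise "internal variability". Purely illustrative numbers.
assumed_trend = 0.01
series = assumed_trend * t[:, None] + 0.1 * rng.standard_normal((n_years, n_cells))

# Fit and remove a straight line from each gridcell (columns are gridcells).
coeffs = np.polyfit(t, series, 1)                 # shape (2, n_cells): slope, intercept
fitted = t[:, None] * coeffs[0] + coeffs[1]
detrended = series - fitted

slope_before = np.polyfit(t, series, 1)[0].mean()
slope_after = np.polyfit(t, detrended, 1)[0].mean()
print(f"mean slope before detrending: {slope_before:.4f}")
print(f"mean slope after detrending:  {slope_after:.2e}")
```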

The authors’ conclusions depend critically on the assumption that their “natural” model variability estimate is a plausible representation of what 1979-2018 would have looked like without greenhouse gases. The authors note the importance of this assumption in their Supplement (p. 10):

“Our assumption regarding the adequacy of model variability estimates is critical. Observed temperature records are simultaneously inﬂuenced by both internal variability and multiple external forcings. We do not observe “pure” internal variability, so there will always be some irreducible uncertainty in partitioning observed temperature records into internally generated and externally forced components. All model-versus-observed variability comparisons are affected by this uncertainty, particularly on less well-observed multi-decadal timescales.”

As they say, every fingerprint and signal-detection study hinges on the quality of the “nature-only” comparator. Unfortunately by detrending their control runs gridcell-by-gridcell they have pretty much ensured that the natural variability pattern is artificially degraded as a comparator.

It is as if a bank robber were known to be a 6 foot tall male, and the police put their preferred suspect in a lineup with a bunch of short women. You might get a confident witness identification, but you wouldn’t know if it’s valid.

Making matters worse, the greenhouse-influenced warming pattern comes from models that have been tuned to match key aspects of the observed warming trends of the 20th century. While less of an issue in the MT layer than would be the case at the surface, there will nonetheless be partial enhancement of the match between model simulations and observations due to post hoc tuning. In effect, the police are making their preferred suspect wear the same black pants and shirt as the bank robber, while the short women are all in red dresses.

Thus, it seems to me that the lines in Figure 1 are based on comparing an artificially exaggerated resemblance between observations and tuned models versus an artificially worsened counterfactual. This is not a gold standard of proof.

(d) t-statistics and p values

The probabilities associated with the sigma lines in Figure 1 are based on the standard Normal tables. People are so accustomed to the Gaussian (Normal) critical values that they sometimes forget that they are only valid for t-type statistics under certain assumptions that need to be tested. I could find no information in the Santer et al. paper that such tests were undertaken.

I will present a simple example of a signal detection model to illustrate how t-statistics and Gaussian critical values can be very misleading when misused. I will use a data set consisting of annual values of weather-balloon measured global MT temperatures averaged over RICH, RAOBCORE and RATPAC, the El-Nino Southern Oscillation Index (ESOI – pressure based version), and the IPCC forcing values for greenhouse gases (“ghg” comprising CO2 and other), tropical ozone (“o3”), aerosols (“aero”), land use change (“land”), total solar irradiance (“tsi”) and volcanic aerosols (“volc”). The data run from 1958 to 2017 but I only use the post-1979 portion to match the Santer paper. The forcings are from IPCC AR5 with some adjustments by Nic Lewis to bring them up to date.

A simple way of investigating causal patterns in time series data is using an autoregression. Simply regress the variable you are interested in on its own value lagged once plus lagged values of the possible explanatory variables. Inclusion of the lagged dependent variable controls for momentum effects, while the use of lagged explanatory variables constrains the correlations to a single direction: today’s changes in the dependent variable cannot cause changes in yesterday’s values of the explanatory variables. This is useful for identifying what econometricians call Granger causality: when knowing today’s value of one variable significantly reduces the mean forecast error of another variable.
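As a rough illustration of this kind of regression, the sketch below fits a lagged model by ordinary least squares and reports conventional t-ratios. The data are synthetic and stationary by construction, the coefficients 0.5 and 0.8 are made up, and `ols_t` is a hypothetical helper, not a library function.

```python
import numpy as np

def ols_t(y: np.ndarray, X: np.ndarray):
    """OLS coefficients and conventional t-ratios (X must include any constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, beta / np.sqrt(np.diag(cov))

rng = np.random.default_rng(3)
n = 400

# Synthetic data: x is stationary noise; y depends on its own lag and on lagged x.
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.5 * rng.standard_normal()

# Regress y[t] on a constant, y[t-1] and x[t-1].
X = np.column_stack([np.ones(n - 1), y[:-1], x[:-1]])
beta, t_ratios = ols_t(y[1:], X)
for name, b, tr in zip(["const", "l.y", "l.x"], beta, t_ratios):
    print(f"{name:>5}: coef = {b:6.3f}, t = {tr:6.2f}")
```

Because both series here are stationary, comparing these t-ratios with standard tables is reasonable. That is exactly the condition that has to be checked on real data.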

My temperature measure (“Temp”) is the average MT temperature anomaly in the weather balloon records. I add up the forcings into “anthro” (ghg + o3 + aero + land) and “natural” (tsi + volc + ESOI).

I ran the regression Temp = a1 + a2*l.Temp + a3*l.anthro + a4*l.natural where a lagged value is denoted by an “l.” prefix. The results over the whole sample length are:

The coefficient on “anthro” is more than twice as large as that on “natural” and has a larger t-statistic. Its p-value implies that, if there were no effect, a t-statistic this large would arise with a probability of about 1 in 2.4 billion. So I could conclude based on this regression that anthropogenic forcing is the dominant effect on temperatures in the observed record.

The t-statistic on anthro provides a measure much like what the Santer et al. paper shows. It represents the marginal improvement in model fit based on adding anthropogenic forcing to the time series model, relative to a null hypothesis in which temperatures are affected only by natural forcings and internal dynamics. Running the model iteratively while allowing the end date to increase from 1988 to 2017 yields the results shown below in blue (Line #1):

FIGURE 4: S/N ratios for anthropogenic signal in temperature model

It looks remarkably like Figure 1 from Santer et al., with the blue line crossing the 3-sigma level in the late 90s and hitting about 8 sigma at the peak.
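The expanding-sample exercise behind a line like this can be sketched in a few lines. The series below are synthetic stand-ins, not the balloon temperatures or IPCC forcings used in this post, and `t_ratio_last` is a hypothetical helper. The point is only the mechanics of re-estimating the regression as the end date moves from 1988 to 2017.

```python
import numpy as np

def t_ratio_last(y: np.ndarray, X: np.ndarray) -> float:
    """Conventional OLS t-ratio on the last column of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])
    return float(beta[-1] / se)

rng = np.random.default_rng(5)
years = np.arange(1979, 2018)
n = len(years)

# Synthetic stand-ins (NOT the data used in the post): a steadily rising
# "forcing" and a "temperature" that responds to its lagged value plus noise.
forcing = np.linspace(0.0, 2.0, n)
temp = np.zeros(n)
for t in range(1, n):
    temp[t] = 0.3 * temp[t - 1] + 0.2 * forcing[t - 1] + 0.1 * rng.standard_normal()

# Re-estimate on samples ending in 1988, 1989, ..., 2017.
end_years, t_path = [], []
for end in range(int(np.searchsorted(years, 1988)), n):
    y = temp[1:end + 1]                                    # temp[t]
    X = np.column_stack([np.ones(end), temp[:end], forcing[:end]])  # lags
    end_years.append(int(years[end]))
    t_path.append(t_ratio_last(y, X))

for yr, tr in zip(end_years[::5], t_path[::5]):
    print(f"sample ending {yr}: t-ratio on lagged forcing = {tr:5.2f}")
```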

But there is a problem. This would not be publishable in an econometrics journal because, among many other things, I haven’t tested for unit roots. I won’t go into detail about what they are. I’ll just point out that time series with unit roots are nonstationary, and you can’t use them in an autoregression of this kind because the t-statistics follow a nonstandard distribution, so Gaussian (or even Student’s t) tables will give seriously biased probability values.
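The textbook demonstration of this problem is the "spurious regression" experiment, sketched below as a small Monte Carlo. It is a simpler setup than the autoregression in this post: two series that are unrelated by construction are regressed on each other, once as white noise and once as random walks (which have unit roots). The sample size of 40 and the 1.96 cutoff are illustrative choices.

```python
import numpy as np

def slope_t_ratio(y: np.ndarray, x: np.ndarray) -> float:
    """Conventional OLS t-ratio on the slope in y = a + b*x + e."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return float(beta[1] / se)

rng = np.random.default_rng(4)
n_obs, n_sims = 40, 2000          # 40 observations, echoing a 40-year annual sample

reject_noise = 0
reject_walks = 0
for _ in range(n_sims):
    e1, e2 = rng.standard_normal((2, n_obs))
    # Independent stationary series: no true relationship.
    reject_noise += abs(slope_t_ratio(e1, e2)) > 1.96
    # Independent random walks (unit roots): still no true relationship.
    reject_walks += abs(slope_t_ratio(np.cumsum(e1), np.cumsum(e2))) > 1.96

rate_noise = reject_noise / n_sims
rate_walks = reject_walks / n_sims
print(f"rejection rate, white noise:  {rate_noise:.2f}")
print(f"rejection rate, random walks: {rate_walks:.2f}")
```

With stationary data the nominal 5% test rejects at close to 5%. With random walks it rejects far more often, even though there is no relationship to find.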

I ran Phillips-Perron unit root tests and found that anthro is nonstationary, while Temp and natural are stationary. This problem has already been discussed and grappled with in some econometrics papers (see for instance here and the discussions accompanying it, including here).

A possible remedy is to construct the model in first differences. If you write out the regression equation at time t and also at time (t-1) and subtract the two, you get d.Temp = a2*l.d.Temp + a3*l.d.anthro + a4*l.d.natural, where the “d.” means first difference and “l.d.” means lagged first difference. First differencing removes the unit root in anthro (almost – probably close enough for this example) so the regression model is now properly specified and the t-statistics can be checked against conventional t-tables. The results over the whole sample are:

The coefficient magnitudes remain comparable but (oh dear) the t-statistic on anthro has collapsed from 8.56 to 1.32, while those on natural and lagged temperature are now larger. The problem is that the t-ratio on anthro in the first regression was not a true t-statistic; it followed a nonstandard distribution with much larger critical values. When compared against t tables it gave the wrong significance score for the anthropogenic influence. The revised model is more likely to be properly specified, so comparing its t-ratio against t tables is appropriate.
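The differencing step behind the second regression can be written out explicitly. This is a sketch of the algebra with an error term e_t added as an assumption, since the regression equations in the text leave it implicit:

```latex
\begin{aligned}
\mathrm{Temp}_t &= a_1 + a_2\,\mathrm{Temp}_{t-1} + a_3\,\mathrm{anthro}_{t-1} + a_4\,\mathrm{natural}_{t-1} + e_t \\
\mathrm{Temp}_{t-1} &= a_1 + a_2\,\mathrm{Temp}_{t-2} + a_3\,\mathrm{anthro}_{t-2} + a_4\,\mathrm{natural}_{t-2} + e_{t-1} \\
\Delta\mathrm{Temp}_t &= a_2\,\Delta\mathrm{Temp}_{t-1} + a_3\,\Delta\mathrm{anthro}_{t-1} + a_4\,\Delta\mathrm{natural}_{t-1} + \Delta e_t
\end{aligned}
```

The constant a_1 cancels in the subtraction, which is why the differenced equation has no intercept, while a_2, a_3 and a_4 are the same coefficients as in the levels equation.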

The corresponding graph of t-statistics on anthro from the second model over varying sample lengths is shown in Figure 4 as the green line (Line #2) at the bottom of the graph. Signal detection clearly fails.

What this illustrates is that we don’t actually know what the correct probability values are to attach to the sigma values in Figure 1. If Santer et al. want to use Gaussian probabilities they need to test that their regression models are specified correctly for doing so. But none of the usual specification tests were provided in the paper, and since it’s easy to generate a vivid counterexample we can’t assume the Gaussian assumption is valid.

Conclusion

The fact that in my example the t-statistic on anthro falls to a low level does not “prove” that anthropogenic forcing has no effect on tropospheric temperatures. It does show that in the framework of my model the effects are not statistically significant. If you think the model is correctly-specified and the data set is appropriate you will have reason to accept the result, at least provisionally. If you have reason to doubt the correctness of the specification then you are not obliged to accept the result.

This is the nature of evidence from statistical modeling: it is contingent on the specification and assumptions. In my view the second regression is a more valid specification than the first one, so faced with a choice between the two, the second set of results is more valid. But there may be other, more valid specifications that yield different results.

In the same way, since I have reason to doubt the validity of the Santer et al. model I don’t accept their conclusions. They haven’t shown what they say they showed. In particular they have not identified a unique anthropogenic fingerprint, or provided a credible control for natural variability over the sample period. Nor have they justified the use of Gaussian p-values. Their claim to have attained a “gold standard” of proof is unwarranted, in part because statistical modeling can never do that, and in part because of the specific problems in their model.
