The early twentieth-century warming (EW; 1910–45) and the mid-twentieth-century cooling (MC; 1950–80) have been linked to both internal variability of the climate system and changes in external radiative forcing. The degree to which either of the two factors contributed to EW and MC, or both, is still debated. Using a two-box impulse response model, we demonstrate that multidecadal ocean variability was unlikely to be the driver of observed changes in global mean surface temperature (GMST) after AD 1850. Instead, virtually all (97%–98%) of the global low-frequency variability (>30 years) can be explained by external forcing. We find similarly high percentages of explained variance for interhemispheric and land–ocean temperature evolution. Three key aspects are identified that underpin the conclusion of this new study: inhomogeneous anthropogenic aerosol forcing (AER), biases in the instrumental sea surface temperature (SST) datasets, and inadequate representation of the response to varying forcing factors. Once the spatially heterogeneous nature of AER is accounted for, the MC period is reconcilable with external drivers. SST biases and imprecise forcing responses explain the putative disagreement between models and observations during the EW period. As a consequence, Atlantic multidecadal variability (AMV) is found to be primarily controlled by external forcing too. Future attribution studies should account for these important factors when discriminating between externally forced and internally generated influences on climate. We argue that AMV must not be used as a regressor and suggest a revised AMV index instead [the North Atlantic Variability Index (NAVI)]. Our associated best estimate for the transient climate response (TCR) is 1.57 K (±0.70 at the 5%–95% confidence level).

Since attempts to estimate the magnitude of internal variability by means simple climate models are plagued from dissatisfying low correlations with observations ( Aldrin et al. 2012 ; Skeie et al. 2014 ), here we use a refined two-box impulse response model framework that accounts for fast and slow responses to forcing perturbations in the climate system. To constrain the complexity of the model, we introduce a novel transient climate response (TCR) adjustment factor for different forcing agents that is governed by robust physical factors. Apart from Northern Hemisphere (NHem) and GMST (Global) data, we also analyze Southern Hemisphere (SHem), land surface air temperature (Land) and SSTs (Ocean), expanding on previous GMST-only analyses ( Mann et al. 2014 ; Dong and McPhaden 2017 ) to better understand the impact of radiative forcing changes on surface temperatures. We recommend all impulse response or energy balance model studies use land, ocean, and hemispheric temperature records with our dedicated set of model parameters as separate benchmark tests to robustly evaluate model performance.

We argue that any attribution exercise that does not sufficiently account for the spatiotemporal AER changes will invariably produce unreliable and erroneous results. Incorporating now better quantifiable biases in the instrumental SST record, we demonstrate that a carefully designed analysis (that avoids overfitting) yields a surprisingly high level of agreement between our model and observations without the need to infer additional unexplained internal variability. We endeavor to highlight the pitfalls associated with attributing and identify the shortcomings in representing the externally forced temperature responses.

The results are shown in Fig. 1d for HadOST and BE and in Fig. 1e for ERSSTv4 and GHCNv3 as used in GISTEMP ( Hansen et al. 2010 ). The scaling factors are provided in the figure legend. In both cases, the two coastal records (derived from HadISST2 and ERSSTv4) show excellent agreement during the calibration period. As expected, the land scaling factor is lower in agreement with amplified warming trends over land. The land and ocean proxies agree after 1920 and show only minor deviations before 1920 ( Fig. 1d ). The HadOST proxies (land and ocean) suggest that HadISST2 is reliable with marginal biases between 1880 and 1940. We find much less agreement between GHCNv3 and ERSSTv4 before 1980. Our analysis further suggests that ERSSTv4 has a substantial cold bias between 1900 and 1980 as well as a spurious warm bias during WWII ( Fig. 1e ). While by no means perfect, this straightforward analysis is at least indicative that SSTs in general and ERSST in particular (versions 4 and 5 are almost identical throughout the period of coverage) are still impacted by substantial unresolved inhomogeneities. In our main analysis we discard GISTEMP and apply the following correction factors to HadOST, Cru4CW, and BE during four of the WWII years (1942–45): NHem = −0.04°C, Global = −0.08°C, SHem = −0.12°C, and Ocean = −0.18°C. The remaining years in the time series remain unchanged. We discuss the implications in section 4 .

Because of continuous problems in currently available SST datasets, mainly manifest as warm bias as a result of changing SST sampling methods (from bucket to engine-room intake measurements) during World War II, associated with changing fleet composition ( Karl et al. 2015 ; Hansen et al. 2016 ; Kent et al. 2017 ), Cowtan et al. (2018) have recently proposed a novel method to address the World War II (WWII) bias using island and coastal weather stations only. Inspired by the idea, we replicate their analysis with a slightly simplified methodology. We use a mask where grid boxes over land (adjacent to ocean) and over ocean (along coastlines) are selected, including islands. The global average of all such subsampled ocean grid boxes establishes our new SST proxy. The two coastal time series are scaled to match the 1980–2016 global SST trend [see Cowtan et al. (2018) for details on the scaling method] as it is deemed the most reliable period in the marine instrumental record ( Rahmstorf et al. 2017 ).

(a) Global radiative forcing components used in our study, (b) decomposition of the four AER components including indirect aerosol effects, and (c) spatial decomposition of the effective and noneffective AER. Scaled coastal HadOST (blue) and coastal BE anomalies (red) in comparison with (d) 60°N–60°S HadOST (black) and (e) as in (d), but with coastal ERSSTv4, coastal GISTEMP, and 60°N–60°S ERSSTv4.

(a) Global radiative forcing components used in our study, (b) decomposition of the four AER components including indirect aerosol effects, and (c) spatial decomposition of the effective and noneffective AER. Scaled coastal HadOST (blue) and coastal BE anomalies (red) in comparison with (d) 60°N–60°S HadOST (black) and (e) as in (d), but with coastal ERSSTv4, coastal GISTEMP, and 60°N–60°S ERSSTv4.

Following the method introduced in earlier work (Otto et al. 2015; Haustein et al. 2017) [vaguely similar to the analysis presented in Lean and Rind (2008) and Lean (2018)], we employ a two-box impulse response model framework that accounts for fast and slow temperature T changes in response to external forcing factors [components (comp): WMGHGs, anthropogenic aerosols (AER), and volcanic eruptions (VOL)]. The fast component can be associated with the ocean mixed layer response whereas the slow component approximates the response of the deep ocean (Li and Jarvis 2009):

d T j d t = q j × F − T j d j ; T comp = ∑ j = 1 2 T j , (1)

TCR comp = F 2 × CO 2 { q 1 [ 1 − d 1 70 ⁡ ( 1 − e − 70 / d 1 ) ] + q 2 [ 1 − d 2 70 ⁡ ( 1 − e − 70 / d 2 ) ] } , and (2)

ECS = F 2 × CO 2 ⁡ ( q 1 + q 2 ) . (3)

The differential warming requires dedicated TCR calibration factors for the WMGHG, VOL, and AER induced temperature responses. To obtain a plausible and robust set of such calibration factors, we use observed transient warming ratios (TWR) between NHem and SHem as well as Land and Ocean. In Fig. 2, the temperature responses to total anthropogenic (Figs. 2a,b), WMGHG (Figs. 2c,d), AER (Figs. 2e,f), and VOL (Figs. 2g,h) are shown. Decadally averaged warming ratios are provided above or under each graph. All data are low-pass filtered with a smoothing radius of 5 years. The TWR is obtained during the 30-yr period of strongest transient warming.

Fig . 2. View largeDownload slide TWR to estimate the TCR adjustment factors for WMGHGs, AER, and VOL for (left) NHem/SHem and (right) Land/Ocean: (a),(b) Allforcing warming contributions in CMIP5, HadCM3, HadOST, and the response model, (c),(d) WMGHG only, (e),(f) AER only, and (g),(h) VOL only contributions in HadCM3 and the response model. Modeled VOL (negative) temperature response is shifted by +10 and +25 years merely for better readability. The timeline of volcanic eruptions (scaled radiative forcing) is shown in black in (g) and (h). All data are low-pass filtered to remove interannual variability. The boxes at the bottom show the inferred (diagnosed) warming ratios (TWRD) for WMGHG and AER using the product of the ratios of observed (red) and modeled (orange) allforcing TWRs in (a) and (b), multiplied by the modeled TWRs (orange) for WMGHG in (c) and (d) and AER in (e) and (f). The estimated warming ratios (TWRE) refer to the simulated response model TWR using TWRD. Both values are given in light purple. Only the 30-yr period of strongest differential warming is used for the central TWR estimates. VOL TWR is only a function of the fast response. Fig . 2. View largeDownload slide TWR to estimate the TCR adjustment factors for WMGHGs, AER, and VOL for (left) NHem/SHem and (right) Land/Ocean: (a),(b) Allforcing warming contributions in CMIP5, HadCM3, HadOST, and the response model, (c),(d) WMGHG only, (e),(f) AER only, and (g),(h) VOL only contributions in HadCM3 and the response model. Modeled VOL (negative) temperature response is shifted by +10 and +25 years merely for better readability. The timeline of volcanic eruptions (scaled radiative forcing) is shown in black in (g) and (h). All data are low-pass filtered to remove interannual variability. The boxes at the bottom show the inferred (diagnosed) warming ratios (TWRD) for WMGHG and AER using the product of the ratios of observed (red) and modeled (orange) allforcing TWRs in (a) and (b), multiplied by the modeled TWRs (orange) for WMGHG in (c) and (d) and AER in (e) and (f). The estimated warming ratios (TWRE) refer to the simulated response model TWR using TWRD. Both values are given in light purple. Only the 30-yr period of strongest differential warming is used for the central TWR estimates. VOL TWR is only a function of the fast response.

Given that TWRs for WMGHG, AER, and VOL can only be inferred from GCMs, we apply a scaling factor that represents the difference between observed and all-forcing TWRs. Assuming that the observed TWR (red shaded area in Figs. 2a and 2b) is our target ratio, the responses in HadCM3 are scaled accordingly. HadCM3 is used because it provides a small ensemble of simulations (mainly drawn from the Euro500 experiment) that is consistent across the experiments. The TWR in the historical HadCM3 ensemble is 1.7, compared to 2.8 in HadOST (Fig. 2a). Hence a scaling factor of ~1.6 is applied to the TWR deduced from the WMGHG and AER ensemble of the same model in order to correct for the underestimated TWR in the historical HadCM3 simulations. The resulting inferred TWR (hereinafter referred to as TWRD; D = diagnosed), which is then used in the response model, is provided in the boxes at the bottom of Fig. 2.

Since the bulk of the VOL response takes place on the fast time scale (1–10 yr) and thus differs from WMGHG-related responses (Ding et al. 2014), we refrain from scaling and use the TWR from HadCM3 directly [consistent with above-mentioned findings in Boucher and Reddy (2008) regarding HadCM3’s fast response time]. Note that the VOL responses in Figs. 2g and 2h are shown for the full 1500–1999 period in contrast to the shorter 1850–1999 (1850–2017) period for all other scenarios.

In addition, since we do not know the resulting warming ratios in the response model a priori when we impose the inferred TWRD, we compare them with the posteriori TWRs (hereinafter referred to as TWRE; E = estimated) in order to validate our approach. We find that, for example, the TWRD for WMGHGs (TCR of 2.65 K over Land and 1.11 K over Ocean) of ~2.4 results in a TWRE of ~2.2 (see Fig. 2d). We therefore argue that our method is reasonably well constrained to provide a robust answer.

All TCR calibration factors based on the deduced TWRDs (and shown at the bottom of Fig. 2) are summarized in the upper box in Fig. 3. We would like to point out that these calibration factors modulate the TCR/ECS ratio and are used for the full range of TCR and ECS values, respectively, not only the best estimate. The latter is provided at the top of Fig. 3 as well, together with the TCR for AER effective forcing, which is ~40% higher (best estimate = 2.2K) than that of WMGHGs (best estimate = 1.6 K), consistent with findings in Rotstayn et al. (2015), to reflect the higher aerosols load over land.

Fig . 3. View largeDownload slide Summary panel for all the necessary response model parameters, including their justification. (top left) Global, hemispheric, and Land/Ocean TCR scaling factors for WMGHGs, AER, and VOL based on the findings shown in Fig. 1. (bottom left) Forcing response time estimates and sensitivities used in this analysis are provided, including their source. Color codes are used for better readability. The pink labels in the lower box refer to the original AER-TWRD. In gray are the associated coupling factors. Surface temperature trends in (a) HadOST, (b) CMIP5, and (c) HadCM3 for 1978–2017. Fig . 3. View largeDownload slide Summary panel for all the necessary response model parameters, including their justification. (top left) Global, hemispheric, and Land/Ocean TCR scaling factors for WMGHGs, AER, and VOL based on the findings shown in Fig. 1. (bottom left) Forcing response time estimates and sensitivities used in this analysis are provided, including their source. Color codes are used for better readability. The pink labels in the lower box refer to the original AER-TWRD. In gray are the associated coupling factors. Surface temperature trends in (a) HadOST, (b) CMIP5, and (c) HadCM3 for 1978–2017.

To estimate the TCR calibration factor for AER, hemispheric and land–ocean coupling factors need to be determined. They reflect the above-mentioned fact that interhemispheric energy exchanges in response to the heterogeneous distribution of AER need to be balanced. Conveniently, the coupling factors are an emergent property and as such a function of the hemispheric area weighting factors, which are strictly interlinked and hence constrained as follows (example for WMGHGs):

T GHG Global = 0.5 T GHG NHem + 0.5 T GHG SHem = 0.32 T GHG Land + 0.68 T GHG Ocean (4)

Note that the Land fraction is marginally >30% because areas covered with sea ice are treated as land throughout the analysis. Apart from the area-weighted constraint, the coupling factors are also dependent on the emission ratio, that is, the ratio between the hemispheric (and land/ocean) and the total global aerosol emission strength, which in turn determines the appropriate fractional contribution to match the inferred AER-TWRD (see appendix A for more details). The resulting coupling factors are 3.9 (ratio of 1.47 and 0.38, which corresponds to 85% NHem and 15% SHem AER contribution for NHem AER and vice versa for SHem AER) and 2.1 (ratio of 1.46 and 0.7, which corresponds to 70% Land and 30% Ocean AER contribution for Land AER and vice versa for Ocean AER). These factors are also provided in the bottom box of Fig. 3, together with all other parameters used in the response model. Global temperature trends for the 1978–2017 period in HadOST (Fig. 3a), CMIP5 (Fig. 3b), and HadCM3 (Fig. 3c) are also shown. The spatial distribution of the trend highlights why observed and modeled TWR do not agree, which is primarily caused by delayed Southern Ocean warming (Armour et al. 2016), and partly by an accelerated Arctic amplification (Serreze and Barry 2011). Both physical processes are not satisfactorily reproduced in most GCMs.

Last, as apparent from the discrepancy between the AER factor provided at the bottom of Fig. 2 (3.5 and 2.4 for NHem/SHem and Land/Ocean, respectively) and that shown in the top box of Fig. 3 (5.1 and 2.9 for NHem/SHem and Land/Ocean, respectively), we increased the inferred AER-TWRD slightly. While the adjustment of the AER-TWRD does not change our conclusions (see Figs. S1 and S2 in the online supplemental material for the same result without AER-TWRD tuning), it does lead to better agreement between HadOST and the response model during the period of strongest AER cooling between 1960 and 1980. Given that the HadCM3 AER ensemble is not a strict AER-only simulation rather than the difference between the allforcing and a nonaerosol ensemble of HadCM3, the results likely do not reflect the full extent of the aerosol-induced TWR. Therefore we think it is a defensible decision and well within the realm of the uncertainty of our AER-TWR estimate. The resulting AER time series is shown in Fig. 1c.

For the uncertainty analysis, response model, radiative forcing and internal variability uncertainty is considered. Apart from the TCR (1.1, …, 2.1 K) and ECS (2.0, …, 4.0 K) range, we also include a range of fast response times (3, …, 13 years) in our response model uncertainty estimate. For the forcing, 200 total radiative forcing realizations are used (Forster et al. 2013) and converted into response model temperature equivalents to estimate the associated error range. The resulting σ (32nd–68th percentiles) of the fractional uncertainties is shown in Fig. 4a (response model error in green and radiative forcing error in blue). If we assume that potential internally generated, low-frequency variability adds linearly to the externally forced response, we need an estimate of (modeled) unforced multidecadal variability. As introduced in Haustein et al. (2017), we use equidistant intervals of selected CMIP5 preindustrial control simulations that do not drift (Knutson et al. 2013) and possess a similar range of unforced variability as our response model-based estimate of the residual observational variability. In Fig. 4b, the low-pass filtered residuals for HadOST, Cru4CW, and BE between 1850 and 2017 and low-pass filtered sample intervals of 168 years from selected CMIP5 models are shown together with their standard deviation σ. The obtained 5th–95th percentiles of their internal variability span ±0.17°C (⁠ σ ¯ = 0.1 ° C and σ ¯ 2 = 0.01 ° C 2 ⁠) as shown in Fig. 4a in gray.

Fig . 4. View largeDownload slide (a) Fractional variance (square of the model error) for impulse response model uncertainty (green), total radiative forcing uncertainty (blue), and internal variability uncertainty (gray). The 1σ (32nd–68th percentiles) range is shown. We note that internal variability is no response model uncertainty in a strict sense as it is added post hoc (i.e., onto the calculated temperature). The peaks in the response model uncertainty coincide with volcanic eruptions (e.g., Tambora in 1816). (b) The internal variability from selected CMIP5 piControl runs is contrasted with the unforced residuals from the GMST datasets used in this study. Observed and modeled time series are low-pass filtered with a 30-yr smoothing radius. The standard error is provided in parentheses. Fig . 4. View largeDownload slide (a) Fractional variance (square of the model error) for impulse response model uncertainty (green), total radiative forcing uncertainty (blue), and internal variability uncertainty (gray). The 1σ (32nd–68th percentiles) range is shown. We note that internal variability is no response model uncertainty in a strict sense as it is added post hoc (i.e., onto the calculated temperature). The peaks in the response model uncertainty coincide with volcanic eruptions (e.g., Tambora in 1816). (b) The internal variability from selected CMIP5 piControl runs is contrasted with the unforced residuals from the GMST datasets used in this study. Observed and modeled time series are low-pass filtered with a 30-yr smoothing radius. The standard error is provided in parentheses.