A unique challenge for technical validation of paleoclimatological datasets is that the target, here, site-local temperature over the Common Era, is unknown. Addressing this issue is an important objective of the current study. Our approach to validation includes comparison with the instrumental data for annually-resolved records, subsampling the dataset to assess reproducibility among proxy types and other subsets based on different screening criteria, and coarse-graining the time series to different extents to address issues related to combining records of different resolution and age certainty. Evidence that the records in the database reflect past temperature variability can be found in the original publications associated with each record. In addition, each series incorporated in the dataset was examined by one or more regional experts, who certified that each proxy record included in the database was accurate and related to temperature (Supplementary Table 1). This level of expert elicitation is unique among existing paleoclimate syntheses covering the last two millennia, and is a key value proposition of the PAGES2k crowd-curation process.

Quality control

To facilitate quality control of individual records within the database, dashboards displaying raw data, their annualized version and the extent to which they may be informative of annual, JJA or DJF temperature were created. These figures are grouped by region or globally and included on the FigShare repository associated with this publication (Global_QCfig_bundle.pdf, Data Citation 1).

Fundamentally there are two ways to infer past temperatures from paleoclimate records. They can be calibrated using either:

1. direct (in-time) calibration; or
2. indirect (space-for-time) calibration.

In the first approach, the record must overlap with the instrumental period (here: 1850–2014), and this period of overlap must contain enough points for a statistical calibration to an instrumental temperature product such as HadCRUT4 to be meaningful. In the second approach, one often uses transfer functions or laboratory-based culture experiments. Accordingly, summary plots for all records are divided into two categories, described below. The instrumental overlap threshold is set at n=20 based on sensitivity tests (not shown). This parameter may be changed in the code associated with this dataset (see Code Availability).
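As an illustration of this screening criterion, counting overlap with the instrumental window can be sketched as follows (the function name and defaults are ours for illustration; see Code Availability for the actual implementation):

```python
def calibratable_in_time(years, min_overlap=20, start=1850, end=2014):
    """Return True if a record has enough samples within the
    instrumental era to attempt a direct (in-time) calibration."""
    n_overlap = sum(start <= y <= end for y in years)
    return n_overlap >= min_overlap

# A record sampled every 50 years over the Common Era fails the test,
# whereas an annual record covering 1900-2014 passes it.
coarse = list(range(0, 2001, 50))
annual = list(range(1900, 2015))
```

Records failing this test fall into the second category and are evaluated against neighboring proxies instead.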

Records that can be calibrated in time

Record Ocn_114 (ref. 42) (Fig. 4) is one such example. The top panel shows the (monthly) raw data as gray circles and an annualized curve whose color code matches that of Fig. 1a. The three bottom plots depict correlations with temperature grid points taken within a 2,000 km radius. The bottom left plot shows correlations with mean annual temperature (MAT), with insignificant correlations (as per an isospectral test43, 1,000 surrogates, 5% level) denoted by hatching. The local correlation is −0.59 and its bold font weight indicates statistical significance, also at the 5% level. Similar plots are shown for boreal summer (JJA, center) and boreal winter (DJF, right). Essential textual metadata are displayed on the right-hand side. All subsequent plots of this type follow identical conventions.

Figure 4: Quality-control plot for record Ocn_114 (Ocean_QCfig_bundle.pdf, Data Citation 1). See text for details; an example of a record that can be calibrated in time.

Records that cannot be calibrated in time

If a record has too coarse a resolution, or ends too early, to contain 20 points over the 1850–2000 interval, it belongs to this category. One such record is Ocn_015, a foraminifera Mg/Ca record from the Caribbean35 (Fig. 5). This record was independently calibrated to temperature, as reflected in the ordinate of the time series plot (top left). As before, the right side of the page displays metadata, including the calibration method and its reported uncertainties. Since a comparison to an instrumental temperature series is neither possible nor meaningful for such a record, the bottom left panel displays its correlation to the 10 nearest high-resolution records, coded by lines whose color represents the absolute correlation. Thick, solid lines represent significant correlations (again, as per ref. 43 at the 5% level), and thinner dashed lines represent correlations that did not pass the test. The bottom central panel stratifies these correlations by distance; the color corresponds to the proxy code (i.e., corals in orange, sclerosponges in red, c.f. Fig. 1a), with significant correlations circled in black. The size of the symbol scales with the number of years of overlap, as shown at the bottom right.

Figure 5: Quality-control plot for record Ocn_015 (Ocean_QCfig_bundle.pdf, Data Citation 1). See text for details; an example of a record that cannot be calibrated in time.

Relationship to temperature

Here we examine the extent to which the database as a whole captures the observed temperature variability at local and regional scales. We do so via correlation analysis, which makes the common assumption that the relation between the proxy value and temperature over the twentieth century is representative of the entire record (stationarity). Unstable or multivariate associations between proxies and local temperature would represent a significant challenge to this assumption; however, this problem is not unique to paleoclimatology within the Earth sciences. The approach also assumes that the observational temperature time series itself is accurate and unbiased for each proxy site, which may not be true in areas of sparse coverage or complex topography.

The relationship between the current proxy database and the global temperature field is quantified via Pearson’s linear correlation coefficient (R) between proxy values and temperature averages (ANN, JJA, DJF). Statistical significance is established via a non-parametric, isospectral test43, which accounts for the loss of degrees of freedom due to large serial correlations common to proxy time series. Again, we restricted correlation analyses to records comprising a minimum of 20 samples over the instrumental era (CE 1850–2014), which limits the pool of proxies that may be evaluated in this way (n=597).
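The surrogate-based significance test can be sketched via phase randomization, which preserves each series' power spectrum while scrambling its Fourier phases. This is an illustrative implementation in the spirit of the isospectral test43, not the code used in the study:

```python
import numpy as np

def phase_randomize(x, rng):
    """Generate one surrogate series with (approximately) the same
    power spectrum as x but randomized Fourier phases."""
    n = len(x)
    fx = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0, 2 * np.pi, len(fx))
    phases[0] = 0.0  # keep the mean component real
    return np.fft.irfft(np.abs(fx) * np.exp(1j * phases), n)

def isospectral_pvalue(x, y, n_surr=1000, seed=0):
    """Compare |r(x, y)| against correlations between isospectral
    surrogates of x and y, returning a one-sided p-value."""
    rng = np.random.default_rng(seed)
    r_obs = abs(np.corrcoef(x, y)[0, 1])
    exceed = 0
    for _ in range(n_surr):
        xs = phase_randomize(x, rng)
        ys = phase_randomize(y, rng)
        if abs(np.corrcoef(xs, ys)[0, 1]) >= r_obs:
            exceed += 1
    return (exceed + 1) / (n_surr + 1)
```

Because the surrogates retain the serial correlation of the originals, strongly autocorrelated series do not spuriously pass the test as easily as they would under a naive t-test.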

Regional screening

First, we search for significant correlations (P<0.05) within a search radius r_s, ensuring that correlations are local to regional. Compared to a global search, this limits the extent to which spurious correlations may arise, for instance, due to strong trends. Since spatial correlations are non-uniform and highly anisotropic, using a distance-based criterion that is uniform over the globe represents an oversimplification. No single distance is likely to be globally optimal, so its choice reflects a compromise between various factors: autocorrelation in land versus ocean temperatures, annual versus longer resolution, or seasonal biases. With r_s=2,000 km (ref. 44), 411 records show significant correlations with annual temperature; their absolute values are shown in Fig. 6a and their locations in Supplementary Fig. S3. Results change modestly depending on the value used for r_s.
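A distance-based search of this kind reduces, for each proxy site, to a great-circle distance computation; a minimal sketch (function names are ours for illustration):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    R = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def grid_points_within(site, grid, r_s=2000.0):
    """Indices of (lat, lon) grid points within r_s km of a proxy site."""
    return [i for i, (lat, lon) in enumerate(grid)
            if haversine_km(site[0], site[1], lat, lon) <= r_s]
```

Each retained grid point would then be correlated against the proxy series and tested for significance as described above.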

Figure 6: Relationships with temperature and other records. Median absolute correlations (a) of each record with mean annual temperature within a 2,000 km radius; (b) of each record with mean annual temperature at the nearest grid point; (c) of each low-resolution record with its 10 closest high-resolution proxy neighbors; (d) of temperature at each grid point and its proxy neighbors within a 2,000 km radius. Proxy-centric correlations (a–c) are reported in color if significant; as small black symbols if insignificant or not applicable. Grid-centric correlations (d) are reported in color if significant; in grey if insignificant or not computable (i.e., no proxy neighbor within 2,000 km).

Regional screening adjusted for the false discovery rate

Searching for potentially hundreds of suitable correlations within such a search radius runs the risk of false discoveries45. The problem of multiple hypothesis tests has long been known to statisticians and several solutions exist46. We use a method based on the false-discovery rate (FDR)47, adapted to the climate context48,49. In all, 277 records passed this test with annual data (Supplementary Fig. 4).
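The generic Benjamini-Hochberg step-up procedure underlying FDR control can be sketched as below; the climate-adapted variants48,49 used in the study include refinements (e.g., for spatially dependent tests) that are omitted here:

```python
def fdr_pass(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: return the (sorted)
    indices of tests whose p-values pass FDR control at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # largest rank k such that p_(k) <= q * k / m
        if pvals[i] <= q * rank / m:
            k_max = rank
    return sorted(order[:k_max])
```

All tests ranked at or below the largest passing rank are declared discoveries, which controls the expected fraction of false positives among them.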

Local screening

The search may be further restricted to the nearest HadCRUT4.2 grid point. The results of this evaluation are mapped in Fig. 6b. In some cases this may be problematic because sites may sit at the boundary between grid cells. For sites located in the vicinity of frontal zones with large spatial temperature gradients, choosing the most appropriate neighbor can be particularly difficult. Gridded temperature data may represent observations from a range of elevations or environments, and therefore may not be representative of the archive’s actual location. Furthermore, the nearest grid point can in some instances be located thousands of kilometers from a site, because of the incomplete coverage of HadCRUT4. This limitation is particularly acute for Antarctic records, because of poor instrumental coverage over the Southern Ocean and Antarctic continent. A total of 181 records passed this test with annual data, including 5 from Antarctica, versus 9 in the regional+FDR case (Supplementary Fig. 5).
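Choosing the nearest grid point again reduces to a great-circle distance computation; a minimal sketch (illustrative, not the study's code), which makes no attempt to resolve the boundary and frontal-zone ambiguities discussed above:

```python
import math

def nearest_grid_point(site, grid):
    """Index of the (lat, lon) grid point closest to a proxy site by
    great-circle distance. Where coverage is incomplete, the nearest
    point may lie far from the site."""
    def dist(p, q):
        R = 6371.0
        p1, p2 = math.radians(p[0]), math.radians(q[0])
        dp = math.radians(q[0] - p[0])
        dl = math.radians(q[1] - p[1])
        a = (math.sin(dp / 2) ** 2
             + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
        return 2 * R * math.asin(math.sqrt(a))
    return min(range(len(grid)), key=lambda i: dist(site, grid[i]))
```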

Seasonal effects

The extent to which proxies are informative of annual temperature depends, sometimes very strongly, on the portion of the annual cycle which they preferentially sample7. Thus, before using seasonally-dependent proxies to reconstruct mean-annual temperature, one must ascertain the relationship between seasonal averages and the annual mean.

Supplementary Fig. S6 explores how much of the mean annual temperature (MAT) signal can be explained by boreal summer (JJA, top) versus winter (DJF, bottom) averages in the HadCRUT4.2 dataset. Correlations are generally very high (>0.8) in the tropics, where the MAT range is small, and low in the extra-tropics, particularly over northern hemisphere continental interiors for JJA, where the MAT range is large and dominated by winter synoptic variability. This means that proxies that preferentially record summer conditions may be adequate predictors of the annual mean if they are located in the tropics, but (all other things being equal) less so if they are located on Northern Hemisphere continents. Extratropical winter variability is known to dominate the annual average44, so correlations to the MAT primarily reflect winter conditions in those regions.

Table 2 summarizes the result of the aforementioned correlation-based screening for the three approaches (regional, regional with FDR control, and local), as well as the part of the year that goes into the annual average: ANN (calendar year), DJF, JJA, or April-March (AMA). The results make it clear that some proxy records are sensitive to JJA temperature, but not to DJF or annual temperature. The vast majority are tree-ring records from the Northern Hemisphere.

Table 2 Number of records retained in the correlation-based screening depending on the method used and the part of the annual cycle selected.

Relationship to other proxies

A total of 95 proxies could not be directly correlated to MAT, either because they ended prior to 1850 or because they featured too few samples after this date. As an alternate validation method, we searched for significant correlations among the 10 closest neighbors from within the proxy dataset that can be correlated with HadCRUT4 (colored dots on Fig. 6a), and reported these significant correlations, along with their magnitude, in Fig. 6c. To minimize issues related to correlating time series with very different resolutions, the time series of proxy neighbors were smoothed to match the resolution of the target proxy. In regions where proxy-record density is high, this is a reasonable approach to assess the mutual consistency between various series; in sparsely sampled regions, this approach implies that proxy series that cannot be directly correlated to instrumental temperature must look like those that can, even if they belong to different climate settings. Moreover, despite the precautions taken with the isospectral test, correlations between a low-resolution record and its high-resolution neighbors are often driven by trends, even if no geophysical connection is present. Correlations between high- and low-resolution records must therefore be interpreted with caution. With these caveats in mind, 54 of the 95 proxies showed a significant correlation to neighboring high-resolution proxies.
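Matching resolutions before correlating can be sketched by averaging the high-resolution neighbor within a window around each sample time of the coarser record (the window half-width is a free parameter of this illustration, not a value specified in the text):

```python
import numpy as np

def coarsen_to_target(t_hr, x_hr, t_lr, half_width):
    """Average a high-resolution series within +/- half_width years of
    each low-resolution sample time, so both series can be correlated
    at the coarser record's native resolution. Windows with no
    high-resolution samples yield NaN."""
    t_hr, x_hr = np.asarray(t_hr, float), np.asarray(x_hr, float)
    out = np.full(len(t_lr), np.nan)
    for k, t0 in enumerate(t_lr):
        mask = np.abs(t_hr - t0) <= half_width
        if mask.any():
            out[k] = x_hr[mask].mean()
    return out
```

The coarsened neighbor series would then be correlated with the target proxy, with significance again assessed by surrogates.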

Grid-based spatial correlations

An alternate way to evaluate the extent to which variability in the global temperature field is captured by a heterogeneous network of paleoclimate proxies is to quantify the correlation between each grid cell’s instrumental temperature time series and that of all the proxy series within the radius r_s (ref. 50). Viewed in this manner, the statistical relationship between the regular grid and the irregular proxy network provides information about the extent to which different regions of the global temperature target field are represented by the paleoclimate data, and the strength of that relationship. The results of this evaluation are shown in Fig. 6d. It shows that surface temperature over 73% of the planet is significantly correlated to a proxy time series within a 2,000 km radius, about twice the area covered by the previous PAGES2k compilation11.
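The coverage statistic behind such a figure can be sketched as an area-weighted fraction of grid cells with at least one proxy neighbor within r_s; the correlation and significance testing applied in the study are omitted in this illustration:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    R = 6371.0
    p1, p2 = math.radians(a[0]), math.radians(b[0])
    dp = math.radians(b[0] - a[0])
    dl = math.radians(b[1] - a[1])
    h = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(h))

def covered_area_fraction(grid, sites, r_s=2000.0):
    """Cos-latitude-weighted fraction of grid cells lying within r_s km
    of at least one proxy site (a proximity sketch only; the study
    additionally requires a significant correlation)."""
    w_tot = w_cov = 0.0
    for cell in grid:
        w = math.cos(math.radians(cell[0]))  # area weight by latitude
        w_tot += w
        if any(haversine_km(cell, s) <= r_s for s in sites):
            w_cov += w
    return w_cov / w_tot
```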

Global trends

Having quantified the degree to which proxy records from this dataset respond to temperature, we now synthesize the largest-scale thermal signal embedded therein. We do so by use of composite time series, which efficiently summarize the global trends captured in this large and diverse collection. Composites allow us to readily compare signals contained in various subsets of the database; these comparisons, in turn, are an essential check on the mutual consistency of the temperature proxies across regions, geographic settings, and proxy archives.

Our focus here is purposefully general, centered on multidecadal to centennial time scales and ignoring the spatial features. This simple approach is intended as a preliminary estimate of global mean temperature fluctuations over the past 2000 years, and sets the stage for future community endeavors. Indeed, several PAGES2k working groups are currently working to generate spatially resolved reconstructions of annual or seasonal temperature fluctuations at regional to global scales, as well as cross-validated estimates of global mean surface temperature using a variety of statistical approaches. The composites allow this database to be placed in the context of past reconstruction efforts, and to serve as a benchmark for future ones.

Following recent compilations11–13, we average all records (scaled to unit variance) into a composite. We do so at a coarse resolution by applying a simple binning procedure. Compositing makes two implicit assumptions:

(i) all proxy records are linearly related to global, mean-annual temperature;
(ii) all proxy records are equally representative of global mean-annual temperature at any given time, and are thus given equal weight.

Given suitable transformations, (i) may be satisfied for a broad class of proxy records, even very nonlinear ones51. Assumption (ii) is more problematic, for three reasons. First, as Fig. 1a shows, the network is dominated by tree rings from the northern midlatitudes, whose temperature sensitivity is primarily linked to the growing season (boreal summer), representing only a fraction of the annual variance. Second, the mix of proxies is also non-stationary (Fig. 1c). Coral records are relatively abundant over the instrumental era but practically absent prior to 1600 CE. Tree rings dominate the network after around 1400 CE but less so prior to that. Finally, the majority of records have annual (or better) resolution, but some records have median resolution on the order of 100 years (Fig. 1b). The information density per unit time of such records is thus quite different. Furthermore, dating uncertainties in low-resolution (non-layer-counted) proxies are not quantified, but are mitigated to some extent by multidecadal binning.

Despite assumptions (i) and (ii) above, and their potential violation, we suggest that a simple treatment of the data constitutes an informative appraisal of the largest-scale thermal signal embedded in the dataset. We emphasize, however, that the above concerns are all legitimate, and that more rigorous treatment of these assumptions should and will be applied in formal temperature reconstructions. Compositing involved the following processing steps:

Sign adjustment

Records were multiplied by −1 if their values decrease with increasing temperature (i.e., if their interpDirection parameter is negative); by +1 otherwise. This step ensures that all proxy values point upward (downward) in response to warming (cooling).
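A minimal sketch of this step, assuming the interpDirection metadata field takes the string values 'negative' or 'positive' (an assumption about the encoding, made for illustration):

```python
def adjust_sign(values, interp_direction):
    """Flip records whose values decrease with increasing temperature,
    so that all series point upward in response to warming."""
    sign = -1.0 if interp_direction == "negative" else 1.0
    return [sign * v for v in values]
```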

Normalization

Records were mapped to a standard normal distribution via inverse transform sampling51, resulting in zero mean and unit variance.
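Inverse transform sampling can be sketched as a rank-based mapping through the standard normal quantile function; this illustration yields exactly zero mean (absent ties) and approximately unit variance:

```python
import numpy as np
from statistics import NormalDist

def gaussianize(x):
    """Map a series to a standard normal distribution via inverse
    transform sampling of its empirical CDF (rank-based), preserving
    the rank ordering of the original values."""
    x = np.asarray(x, float)
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1   # ranks 1..n
    probs = (ranks - 0.5) / n               # plotting positions in (0, 1)
    nd = NormalDist()
    return np.array([nd.inv_cdf(p) for p in probs])
```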

Binning

Since the main focus of this composite is on low-frequency (decadal and longer) variability, all records were averaged in bins of 25, 50, and 100 years. Binning also mitigates the effect of age uncertainties, as it is known that even small age offsets between annual records could otherwise cause large spurious trends in composites made from them32.
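A minimal sketch of the binning step, assuming bins are laid out edge to edge over the Common Era and empty bins are flagged as missing (function name and defaults are ours):

```python
import numpy as np

def bin_average(t, x, bin_width=50, t_start=1, t_end=2000):
    """Average samples of a record within consecutive bins of
    bin_width years; bins containing no samples yield NaN."""
    edges = np.arange(t_start, t_end + bin_width, bin_width)
    t, x = np.asarray(t, float), np.asarray(x, float)
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (t >= lo) & (t < hi)
        out.append(x[mask].mean() if mask.any() else np.nan)
    return edges, np.array(out)
```

Because each bin averages over all samples it contains, small age offsets between records shift values between adjacent bins at most, rather than producing spurious trends.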

Scaling

Standardized composites were scaled to temperature over identical bins.
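The text does not fully specify the scaling procedure; one minimal reading, sketched here, matches the composite's mean and variance to the instrumental target over their common bins:

```python
import numpy as np

def scale_to_target(comp, target):
    """Scale a standardized composite so its mean and variance match an
    instrumental target over their common (non-NaN) bins. This is one
    plausible scaling; regression-based scalings are also common."""
    comp, target = np.asarray(comp, float), np.asarray(target, float)
    ok = ~np.isnan(comp) & ~np.isnan(target)
    c, t = comp[ok], target[ok]
    return (comp - c.mean()) / c.std() * t.std() + t.mean()
```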

Screening

For high-resolution records (HR: median resolution finer than 5 years), we applied either no screening (none), regional temperature screening (regional), or regional screening adjusted for the false discovery rate (regionalFDR). For low-resolution records (LR: median resolution coarser than or equal to 5 years), basicFilter denotes records that comprise at least 20 values over the Common Era (Supplementary Fig. 2), while hrNeighbors denotes records with at least one significantly correlated HR neighbor (see above for the caveats of this approach).

Bootstrap

Uncertainties in the composite are quantified via a bootstrap approach52. This assumes exchangeability, and primarily measures sampling uncertainty. We plot 95% confidence intervals derived from an ensemble of 1,000 bootstrap samples; in general, such intervals widen with proxy attrition, as expected.
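Resampling records with replacement and recomputing the composite in each replicate gives the bootstrap intervals; a minimal sketch (the resampling unit and percentile interval construction are one plausible reading of ref. 52):

```python
import numpy as np

def bootstrap_ci(record_matrix, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a composite built
    by averaging rows (records) of a records-by-bins matrix,
    resampling records with replacement (exchangeability assumed)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(record_matrix, float)
    n = X.shape[0]
    stats = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)          # resample records
        stats[b] = np.nanmean(X[idx], axis=0)
    alpha = (1 - level) / 2
    return (np.quantile(stats, alpha, axis=0),
            np.quantile(stats, 1 - alpha, axis=0))
```

Because only the records are resampled, the interval reflects sampling uncertainty across the proxy network, not measurement or dating error within individual records.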

Sensitivity analysis

Figure 7 presents the composites (HR in gray, LR in blue) and the HadCRUT4.2 target (red) scaled to temperature. Cases presented in the left column applied no screening, while the right column explores combinations of screening and binning interval. A striking feature is that in all cases, both HR and LR composites display a long-term cooling trend until the 19th century, after which an abrupt warming takes place, consistent with a very large body of literature5,8,11,12. We also note that temperature variability decreases with increasing bin size, as would be expected for data with random and independent errors.

Figure 7: Global composites for various binning intervals and screening criteria, as indicated in subplot titles. The composites are scaled to temperature for comparison, and the shading denotes 95% bootstrap confidence intervals with 500 replicates, to constrain uncertainties. The cutoff between high-resolution (HR) and low-resolution (LR) records is defined as a median resolution of 5 years. Screening options comprise: no screening (none), regional temperature screening (regional), or regional screening adjusted for the false discovery rate (regionalFDR). For low-resolution records, basicFilter denotes records that comprise at least 20 values over the Common Era (Supplementary Fig. 2), while hrNeighbors denotes records with at least one significantly correlated HR neighbor.

We find the main results robust to screening choices, with the exception of the case in Fig. 7f (regionalFDR, hrNeighbors). The latter shows the largest discrepancy between HR and LR composites, mainly because the number of LR proxies is very low (n=22) and they have little overlap with the instrumental era, making their temperature calibration unstable. In all cases the HR composites display slightly shallower variations than the LR composites. There are two non-exclusive explanations for this. First, it is known that some HR records, particularly the tree-ring chronologies that form the majority of this subset, can be limited in their ability to capture low-frequency variability beyond the mean segment length53. Second, LR records are known to redden climate signals, often exaggerating low-frequency variability at the expense of high frequencies54. Our analysis cannot distinguish between these two possibilities.

It is important to consider whether any of the primary features of the composite series are strongly controlled by a particular subset of proxies, or if they are shared among archive types. There are many potential ways to analyze this dataset. We give but one example in Fig. 8, gathering composites from individual archive types that include 5 or more records among the proxy collection: corals, documentary archives, glacier ice, lake and marine sediments, as well as trees. For this case we apply regional HR screening and basic LR filtering before averaging the records within each archive type.

Figure 8: 50-year binned composites stratified by archive type, for all types comprising 5 or more series. Composites with fewer than 10 available series are shown by a dotted curve, while solid lines indicate more than 10 series. Shading indicates 95% bootstrap confidence intervals with 500 replicates. Gray bars indicate the number of records per bin. The composites are expressed in standard deviation units, not scaled to temperature.

Most composites show a strong twentieth century warming trend that emerges above the variability of comparable centennial trends over the last two millennia. This is clearest in the tree- and coral-based composites, despite very large uncertainties in the latter during the seventeenth century, due to the paucity of records (Figs 1 and 8). An exception to this pattern is in the marine sediment composite (Fig. 8), which shows a cooling trend through most of the Common Era. This may be explained by the low resolution of marine sediment records noted earlier, and the process of bioturbation of the sediment archive. These factors diffuse and damp changes occurring over years and decades (e.g., ref. 12), including the most recent warming. Local oceanographic factors may also play a role12,55.

Uncertainties in these composites include changes in sample size and available data network over time, the potential for non-climatic or non-temperature influences to bias these smaller subsets of the dataset, and the high spatial heterogeneity of subsample networks (Fig. 8). In general, uncertainty bands widen back in time (cf. the tree composite), with the notable exception of the marine sediment and documentary composites, which show widening bootstrap intervals in the last 2 bins, coincident with a drop in observational coverage in these archives. Note that multidecadal trends present in coral δ18O records from the eastern tropical Pacific may not be driven by temperature13,56,57, possibly biasing the trend of this coral composite. The network of lake records is regionally constrained (Fig. 1), and that composite may contain multiple environmental influences beyond temperature. As a result, from the lake subset only, we cannot exclude the possibility of above-modern levels of warmth in the third century CE, though uncertainty bands for early centuries are wide, and the recent rate of warming is clearly unprecedented over the Common Era.

The global composites derived from this dataset, despite their simplicity, supersede the composite-of-opportunity published in the last synthesis11, which was an average of regional indices obtained by very different means (hence not statistically homogeneous) and did not include the majority of the marine records gathered here. Nevertheless, the present composites share many similarities and some of the same caveats; namely, that a composite tends to give more weight to numerically abundant records (e.g., tree rings), and regions with more abundant observations (e.g., the Northern Hemisphere continents). An in-depth analysis of these composites, along with their climatic interpretation, will be the subject of a companion paper.