This is a complex guest post by Nic L. It represents early results of a method he and Ryan have been working on for global temperatures. This post uses the concept of regularized least squares, similar to the RegEM method of Steig et al, where you have two datasets, one with more trend information (ground stations) and the other with more spatial information (satellite). There is a huge amount of work here and the concept is fully publishable, especially when you see the huge warming result at the end ;). Because of that, this post would probably be one which would make the cut for Tamino or RealClimate; however, this post makes no effort to correct for surface station problems but rather is about the uniqueness of the method and the issues involved with a complex problem.

——NicL 11 February 2010

Introduction

My earlier post on long-record GHCN stations sprang from my interest in using them to create an improved global gridded temperature history from 1850 on, by combining the data from a set of good quality long record surface stations, augmented by ocean data, with the gridded satellite TLT (lower troposphere) data produced by UAH and/or RSS since 1978.

A well known example in the peer reviewed literature (Steig et al, Nature, January 2009) applied this sort of approach to reconstructing a 50 year gridded temperature history of Antarctica using a relatively sparse set of ground stations, many of which had incomplete records and some of which did not cover the full 50 year period, in conjunction with AVHRR satellite TIR (thermal infra-red) data. Steve McIntyre started an investigation of Steig’s Antarctica reconstruction, and Ryan O, Jeff Id and I took this up, spending a lot of time and effort looking into Steig’s work and finding ways to improve on it.

Following on from that work, I have been seeking to apply some of the same techniques on a global scale. Ryan has also begun exploring this area; we have been exchanging ideas, and the plan is to work on it together, hopefully along with Jeff. What is presented here is very much a “proof of concept” first cut approach.

The rationale for combining in-situ surface temperature data with satellite data is that the surface data is available going back much further than satellite data, but is spatially very incomplete, whilst the satellite data is spatially complete (or nearly so) but is available for only the last thirty years. Using mathematical techniques, spatial information from the satellite data can be combined with the temporal information from the surface data to produce a gridded reconstruction over the entire period that surface temperature measurements are available.

Combining data from satellite and surface measurements in the proposed way relies on the assumption that the pattern of spatial correlations of temperature fluctuations shown by the satellite data was the same in the pre-satellite era. Over the relatively limited time period involved in this case (160 years), that seems a reasonable assumption. It can be tested to an extent, but I have not attempted to do so at this stage.

There is more than one way of using the satellite data. Because of the very large number of grid cells, it is necessary first to reduce the dimensionality of the satellite data. This is generally done by performing a singular value decomposition (SVD) of the matrix of monthly satellite temperature data. This is performed after adjusting the data to anomalies by subtracting from each calendar month the mean thereof, and possibly also then normalising the data to unit monthly standard deviation. SVD is used to convert the data into a set of principal component time series (PCs) and their corresponding spatial eigenvectors (weights), together defining EOFs (Empirical Orthogonal Functions), ordered according to their contribution to the total covariance in the data, such that each sequential EOF accounts for the maximum possible amount of the total (remaining) covariance. This procedure is also known as Principal Components Analysis (PCA).
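
By way of illustration, here is a minimal R sketch of the anomaly and SVD steps just described. The input matrix “tlt” (months x grid cells) and all the other names are illustrative only, not the actual code used for this post, and missing values are assumed absent.

### minimal sketch: anomalies per calendar month, then SVD
### "tlt" is an assumed (months x grid cells) matrix of TLT temperatures,
### rows in calendar order starting December 1978, no missing values
make.anomalies = function(x, start.month=12) {
  cal = ((seq_len(nrow(x)) + start.month - 2) %% 12) + 1   # calendar month of each row
  for (m in 1:12) {
    idx = which(cal==m)
    x[idx,] = sweep(x[idx,,drop=FALSE], 2, colMeans(x[idx,,drop=FALSE], na.rm=TRUE))
  }
  x   # optionally, also scale each column to unit monthly standard deviation here
}
anom = make.anomalies(tlt)
udv = svd(anom)                  # anom = U D V'
pcs = udv$u %*% diag(udv$d)      # principal component time series (PCs)
eofs = udv$v                     # spatial eigenvectors (weights), one column per EOF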

Only a limited number of PC/spatial eigenvector pairs are retained for the reconstruction, on the basis that higher order ones mainly represent noise. Steig only retained 3 PCs for Antarctica – almost certainly too few to properly capture the spatial distribution of temperature fluctuations there. Mann, in his latest 1500+ year proxy global reconstruction, retained a maximum of 7 PCs from the gridded instrumental data. But in order to obtain a reconstruction with reasonably detailed spatial information it is necessary to use rather more PCs than that.
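
The effect of truncation can be seen by reconstructing the data from only the first k EOFs; continuing the illustrative names from the sketch above:

### rank-k approximation of the anomalies, keeping only the first k EOFs
k = 7                                       # illustrative truncation point
recon.k = pcs[, 1:k] %*% t(eofs[, 1:k])     # higher-order EOFs treated as noise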

Steig used the longer record ground station data to complete the satellite record for the pre-satellite period using the TTLS (truncated total least squares) version of RegEM, an iterative regularized expectation maximization algorithm devised by Tapio Schneider and much used by Mann et al for “hockey stick” paleoclimate reconstructions. With TTLS RegEM, the regularization (involving a reduction in the effective number of free parameters in the model) takes the form of including only a fixed number of EOFs in the regression used to impute missing values.
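
For concreteness, here is a schematic of the single TTLS regression step at the heart of that algorithm (not Schneider’s full iterative EM procedure, and not code from this post): missing variables are imputed from available ones using only the first k eigenvectors of the current covariance estimate.

### schematic TTLS regression step: impute missing columns "m" of X from
### available columns "a", truncating at k EOFs (assumes k <= length(a));
### RegEM re-estimates the covariance C and iterates this to convergence
ttls.step = function(X, a, m, k, C) {
  V = svd(C)$v                               # eigenvectors of covariance estimate C
  Va = V[a, 1:k, drop=FALSE]                 # eigenvector rows for available variables
  Vm = V[m, 1:k, drop=FALSE]                 # eigenvector rows for missing variables
  B = Va %*% solve(t(Va) %*% Va) %*% t(Vm)   # truncated regression matrix
  X[, a] %*% B                               # imputed values for the missing variables
}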

The choices made by Steig in his procedure seemed unsatisfactory, albeit partly constrained by limitations of the original RegEM program. But in principle it may be possible to produce a satisfactory reconstruction using a purely RegEM approach, at least where the temporal evolution of the satellite data is not only reasonably correlated with the corresponding surface data but broadly consistent with it as to trends.

Another approach to producing a gridded reconstruction is to perform RLS (regularized least squares) regression of the surface station data against the satellite spatial (eigenvector) structure, as pioneered by Ryan O in relation to Antarctica. This gives a set of estimated PCs for the entire time period that are then applied to the corresponding satellite spatial eigenvectors to produce a gridded reconstruction. The surface station data is generally infilled using RegEM before the RLS regression is performed. This method essentially involves the same equations as used in RegEM, save that the spatial correlations between satellite grid cells are used rather than the correlations between the surface data and the satellite data PCs.
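
A minimal sketch of the RLS step itself may help (again with illustrative names, and using plain ridge/Tikhonov regularization, whereas an actual implementation may regularize in other ways): the infilled station anomalies are regressed on the satellite eigenvector rows at the stations’ grid cells, and the estimated PCs are then mapped back onto the full grid.

### RLS sketch: "surf" = (months x stations) infilled surface anomalies,
### "eofs" = (grid cells x k) satellite spatial eigenvectors,
### "cell" = grid-cell index of each station, "lambda" = ridge penalty
rls.recon = function(surf, eofs, cell, lambda) {
  A = eofs[cell, , drop=FALSE]                         # eigenvectors at station locations
  M = solve(t(A) %*% A + lambda*diag(ncol(A)), t(A))   # regularized least squares solve
  pcs.hat = surf %*% t(M)                              # estimated PCs for every month
  pcs.hat %*% t(eofs)                                  # gridded reconstruction
}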

The RLS method has an advantage over a pure RegEM approach in that the regularization can be more finely controlled and optimised to suit the data, with regularization taking place both in the initial RegEM surface data infilling and in the RLS regression. On the other hand, it assumes that all surface station data is equally representative of the mean temperature for the grid cell occupied by the station, which (whilst a reasonable approximation in the case of Antarctica) is not necessarily the case.

I initially intended to concentrate first on the pure RegEM approach, which is simpler. However, much of the work involved is common to both methods, so I decided to try both.

Satellite data

The TIR AVHRR data used in Antarctica measures skin temperature, which one would expect to be reasonably close to near-surface air temperature as measured by the ground stations. TIR measurements can only be made with fairly clear skies, but despite the need for some heavy cloud masking the AVHRR data generally shows good correlations with the ground station data on a monthly basis. Unfortunately, over time the satellite data exhibited substantial drift and inhomogeneities. Further, AVHRR data is only available pole-wards of about 50 degrees, so is not suitable for a global reconstruction.

In contrast, microwave satellite TLT data is available globally and does not require cloud masking, although close to the poles and over high terrain its accuracy is poor. Further, the TLT data is thought to suffer relatively little from drift and inhomogeneities (although RSS’s version does appear to have had some inhomogeneities in the early 1990s). In case you aren’t familiar with the picture the TLT data gives of temperature trends worldwide since coverage started, here is a map of those trends, using UAH’s data. The concentration of warming in the higher northern latitudes is very evident, although the projection used over-represents the area affected.

It is also relevant to look at the distribution of variability of the TLT temperature data. The mean monthly standard deviation of the TLT data over land is about 25% higher than over sea. However, the bulk of this difference is due to the strong positive correlation between standard deviation and distance from the equator, as shown in the following map.
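
This comparison is straightforward to compute from the gridded data. A sketch, with “is.land” an assumed logical vector marking land cells and “lat.cell” the assumed centre latitude of each cell (and using the per-cell standard deviation over the whole series as a simplification of the mean monthly standard deviation quoted above):

sd.cell = apply(anom, 2, sd, na.rm=TRUE)           # per-cell standard deviation
mean(sd.cell[is.land]) / mean(sd.cell[!is.land])   # land/sea ratio, roughly 1.25
cor(sd.cell, abs(lat.cell))                        # positive: SD grows away from the equator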

In contrast to the AVHRR data, the TLT data does not measure temperature at ground level (although, like the ground stations, it does measure air temperature rather than ground skin temperature). This means that there is a fundamental difference between the reconstructions produced by the pure RegEM and RLS methods. The former results in a reconstruction of TLT temperatures, as it is the TLT data (as represented by its lower EOFs) that is being imputed, with the regression coefficients reflecting the correlations between changes in TLT data and in surface data. But as the RLS method involves correlations between satellite data in different places (used, where ground stations are located, as a proxy for correlations between surface temperatures in those locations) and not between satellite and ground level data, the reconstruction is of surface temperatures. (These are my views; Ryan doesn’t necessarily agree with me on this – yet. :) )

Whichever type of reconstruction is intended, the first step is to perform an SVD of the satellite TLT data and to examine the results.

I only used UAH’s TLT data, as their data seems to have been more homogeneous than RSS’s and I have more confidence in their methodology. The UAH TLT data is available as mean monthly anomalies from December 1978 onwards at http://vortex.nsstc.uah.edu/data/msu/t2lt/tltmonamg.xxxx_5.2, xxxx being the year involved. I took data only up to November 2009, so as to use only complete years. The whole globe is covered in 2.5 x 2.5 degree grid cells, but UAH defines global as being from -85 to +85 latitude. As mentioned previously, data near the poles is of limited use.
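
The yearly files can be fetched programmatically; a small sketch (the parsing of each file’s internal grid layout is not shown here):

years = 1978:2009
urls = sprintf("http://vortex.nsstc.uah.edu/data/msu/t2lt/tltmonamg.%d_5.2", years)
### e.g. for (u in urls) download.file(u, destfile=basename(u))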

The further from the equator one gets, the smaller each grid cell is. It is appropriate to correct for this before taking the SVD; otherwise areas towards the poles have a disproportionate influence. Ryan and I have together looked at how best to do this. One solution is to make each grid cell cover a larger range of longitude as latitude increases (N or S). However, I opted here to weight the grid cells by the square root of their areas. This has virtually the same effect as merging data into longitudinally wider grid cells (square root since SVD minimises an RMS function), once the spatial eigenvectors have been reweighted by the same factors. Weighting grid cells retains the conventional format and also has the advantage of preserving the finer spatial resolution available in the original data at higher latitudes. However, these are relatively minor advantages and come at the expense of a considerably increased computational load and the need to worry about weighting, so I may well in future instead use longitudinally wider grid cells as latitude increases. Ryan has written some helpful code for that purpose and for plotting data in that format on a more realistic global projection than the rectangular format employed here.
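
Since the area of a 2.5 x 2.5 degree cell is proportional to the cosine of its latitude, the weighting amounts to the following (illustrative names as before):

w = sqrt(cos(lat.cell * pi/180))    # square root of relative cell area
anom.w = sweep(anom, 2, w, "*")     # weight each grid-cell column
udv = svd(anom.w)
eofs = sweep(udv$v, 1, w, "/")      # reweight spatial eigenvectors by the same factors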

There is one other weighting issue. In Antarctica, Steig scaled all the satellite data to unit variance before taking the SVD and deriving the principal components and spatial eigenvectors. That means that the SVD was calculated using the correlation matrix of the data rather than its covariance matrix. The strong dependence of standard deviation on latitude could be seen as a reason for preferring to use a correlation basis. However, as Steve M pointed out in his reply to Huybers (http://climateaudit.org/?p=405) in the equivalent context of PCA, the statistical authorities favour using covariance where (as here) the data are measured in commensurate units. Accordingly, I have not scaled the TLT data to unit variance before taking the SVD. I may investigate the effects of doing so at a later stage.
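
In code terms the choice is simply whether the columns are standardised before the SVD; a sketch:

udv.cov = svd(anom.w)           # covariance basis: cells keep their own variances (used here)
udv.cor = svd(scale(anom.w))    # correlation basis: each cell scaled to unit SD (Steig's choice)
### note that scaling to unit SD would also cancel the area weighting above,
### since correlations are invariant to rescaling of the columns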

SVD eigenvectors of global TLT data

The first six of the spatial eigenvectors resulting from taking the covariance SVD of the thus weighted global TLT data from -85 to +85 latitude are as follows. Their signs are of course arbitrary.

The correlations between adjacent TLT grid cells are generally high. But, as the generally horizontal orientation of the first six spatial eigenvectors suggests, the correlation between longitudinally adjacent (E-W) TLT grid cells is considerably closer to unity than that of latitudinally adjacent (N-S) cells.
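
This can be checked directly by correlating each cell with its neighbours. A sketch, assuming the columns of “anom” are ordered with longitude varying fastest within each 2.5 degree latitude band (144 cells per band), and ignoring wraparound at the dateline:

nlon = 144
grid = matrix(seq_len(ncol(anom)), nrow=nlon)              # column indices as a lon x lat grid
cor.pair = function(i, j) cor(anom[,i], anom[,j])
cor.ew = mapply(cor.pair, grid[-nlon,], grid[-1,])         # east-west neighbour pairs
cor.ns = mapply(cor.pair, grid[,-ncol(grid)], grid[,-1])   # north-south neighbour pairs
mean(cor.ew) ; mean(cor.ns)                                # E-W mean is the closer to unity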

The graph below shows the pattern of the SVD eigenvalues, each approximately proportional to the mean standard deviation of that part of the fluctuations in the data accounted for by the corresponding EOF. It will be seen that, apart from the decline over the first seven eigenvalues, there are no steep drops in the curve that would suggest an obvious point at which to truncate.
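
The spectrum and the cumulative share of total power (discussed next) fall straight out of the SVD; continuing the illustrative names above:

d2 = udv$d^2                                   # squared eigenvalues
plot(udv$d, type="l", xlab="EOF", ylab="eigenvalue")
cumvar = cumsum(d2)/sum(d2)                    # cumulative share of total power
which(cumvar >= 0.9)[1]                        # number of EOFs needed for 90%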

The first seven EOFs only account for 40% of the total variability power in the data (the sum of the squared eigenvalues); to get to 90% thereof requires 58 EOFs, and to get to 95% requires 88. However, virtually all of the mean global trend is likely to be accounted for by a fairly small number of the lower EOFs.

SVD eigenvectors of land only TLT data

It may be better to reconstruct global temperatures separately for land and sea, as the surface data available and its characteristics are quite different for land and for sea. If so, using a land only and a sea only SVD might provide a more suitable representation for the corresponding reconstruction. It is relevant in particular to see what the land only SVD of the TLT data looks like, as there is likely to be more spatial detail over land than over sea. The first six spatial eigenvectors thereof are shown below.

Surface data – Land

The next step is to build suitable sets of surface data, both land and sea. I will start with land data, where there are many more choices to be made.

I used GHCN land station monthly mean raw data up to December 2009. I processed it in the same way as described in my previous post (see https://noconsensus.wordpress.com/2010/01/19/long-record-ghcn-analysis/). That gave 7280 stations, excluding stations with no post-1800 data.

Raw data was used in preference to adjusted data mainly because of the greater extent of raw data, particularly outside the USA and for earlier dates, but also partly because of concerns that the adjustment process could contain biases and did not in any event deal with urban heat island (UHI) and other effects of changing land use. In the case of the USA, adjustment should arguably be made for two widespread cooling biases. These are the change in time of observation, about which there is little debate, and possibly the change from liquid-in-glass thermometers to electronic sensors (MMTS) – although the effect of the latter is not entirely clear, and in any event seems small. But my view is that, overall, UHI and other land use change effects are likely to result in any bias in the raw data being in the direction of warming.

I wanted to obtain as much data as possible going back at least to 1900, ideally 1850, but there is very limited availability of such data for much of the world outside the 48 contiguous USA states, and data for many stations ceased at some point in the last twenty years. The number of stations with data extending from 1850 to the late twentieth century is very limited (only 100 or so with fairly complete records), and most of these are included in the set of stations with good data from 1900 to 1999. I therefore included stations with data at least from 1900 to 1999 and, outside the contiguous USA, stations with data at least from 1950 to 2005. Initially I required not only that there be some data in the start and end year but that no more than 20% of data be missing over the whole qualifying period. However, I relaxed this to 30% for the 1900-1999 stations. Whilst doing so increases the number of 1900-1999 stations only modestly (from 1034 to 1060), it noticeably improves coverage in some areas of the world.

My preference was to include only rural stations, with a view to avoiding UHI effects, but the coverage of long record rural stations outside the USA is too patchy to permit this. Additionally, there are instances of stations marked at GHCN as being rural that have in fact become urbanised; this may be a particular problem outside the USA.
The final selection was of all 1900-1999 stations, save for non-rural USA stations, and of all 1950-2005 stations that were either outside the USA or were rural stations in Alaska. Coverage of the USA from rural 1900-1999 stations (of which there were 463, most of which also have post-1999 data) was generally excellent save for in Alaska, where the nine rural stations meeting the 1950-2005 record criteria were included. The final data set contained 1123 stations.

The precise selection criteria are as per the following lines of R code, “araw” being the matrix of monthly temperature anomalies for the 7280 stations from January 1850 to December 2009 and “tinv” the list of those stations with their metadata:

### create logical indexes of stations with at least one month's data in the year concerned: 1900, 1950, 1999 or 2005
araw.1900.idx=!is.na(colMeans(window(araw,st=1900,en=c(1900,12)),na.rm=TRUE))
araw.1950.idx=!is.na(colMeans(window(araw,st=1950,en=c(1950,12)),na.rm=TRUE))
araw.1999.idx=!is.na(colMeans(window(araw,st=1999,en=c(1999,12)),na.rm=TRUE))
araw.2005.idx=!is.na(colMeans(window(araw,st=2005,en=c(2005,12)),na.rm=TRUE))

### create vectors of number of months with missing data for each station in periods 1900-1999 and 1950-2005
araw.55.mis=colSums(is.na(window(araw,st=1950,en=c(2005,12))))
araw.100.mis=colSums(is.na(window(araw,st=1900,en=c(1999,12))))

### make index of all GHCN stations with raw data for some month in each of 1900 and 1999 and <30% missing data
araw.100.30.idx=araw.1900.idx&araw.1999.idx&(araw.100.mis<360) ; sum(araw.100.30.idx) # 1060

### make index of all GHCN stations with raw data for some month in each of 1950 and 2005 and <20% missing data
araw.55.20.idx=araw.1950.idx&araw.2005.idx&(araw.55.mis<132) ; sum(araw.55.20.idx) # 1808

### make indexes of rural stations, of all non-USA stations and of the 9 rural Alaskan stations included in araw.55.20.idx
araw.exUSA.idx=(!(tinv[,1]==425)) ; sum(araw.exUSA.idx) # 5359
araw.rural.idx=(tinv[,9]=="R") ; sum(araw.rural.idx) # 3912
araw.55.r.idx=araw.1950.idx&araw.2005.idx&(araw.55.mis<132)&araw.rural.idx ; sum(araw.55.r.idx) # 897
araw.55.20.alaska.idx=rep(FALSE, times=7280)
araw.55.20.alaska.idx[c(3361,3364,3370,3375,3403,3404,3405,3410,3419)]=TRUE

### make index of all 1900-1999 stations save USA non-rural, plus non-USA (and rural Alaskan) 1950-2005 stations
araw.rls.idx=(araw.100.30.idx&(!araw.exUSA.idx&araw.rural.idx|araw.exUSA.idx)) | (araw.55.20.idx&araw.exUSA.idx) | araw.55.20.alaska.idx ; sum(araw.rls.idx) # 1123

[Note that I have not included other R code used to generate the results given in this post; there is much code and it is not currently in a sufficiently organised state for posting.]

The two following maps show respectively the locations and 1950-2005 trends of the 1123 stations selected for the global reconstructions; stations with small trends show up much more clearly on the location map. [Ignore the number 1132 in the headings of these maps; I made a transposition error in my typing of 1123.]