Fall et al. 2011: The Statistics

By John Neilsen-Gammon (from his blog Climate Abyss – be sure to bookmark it, highly recommended – Anthony)

As I mentioned in my last post, I did a lot of the statistical analysis in the recent paper reporting on the effect of station siting on surface temperature trends. For those who are curious or extremely bored, here’s how I did the testing:

I was invited to participate after the bulk of the analysis was completed. I decided to confirm the analysis by doing my own independent analysis. It showed some differences, and we concluded that the technique I was using was better, so after some more testing we went ahead and used it in the paper.

Trend Generating

One subtle point: we didn’t assess the differences in individual station measurements. Because the accuracy of US climate trends was the original motivation, we assessed the differences in estimates of US trends using different subsets of the USHCN data.

There are two basic requirements for getting a robust trend estimate over a geographical area. First, you have to work with anomalies or changes over time (first differences) rather than the raw temperatures themselves. This is because individual temperatures are very location-specific, whereas anomalies are more uniform. If it was a cold year in Amarillo, it was probably a cold year in Lubbock too by about the same amount, even though the average temperatures might be 2-3 C different.

The second requirement is to take account of the uneven distribution of stations. For example, suppose you have climate stations in El Paso, Corpus Christi, and Dallas. An average of the anomalies at these three stations might be a good approximation to the statewide anomaly. But if another station gets added near El Paso, you wouldn’t want to do a straight four-station average because it would be too strongly influenced by weather goings-on near El Paso. A more reasonable approach might be to average the two El Paso stations together first. The more general principle is that a station should matter more in the overall average if it is far from other stations, and matter less if lots of other stations are nearby.

We chose to meet the first requirement by taking 30-year averages (we tested different periods and different ways of averaging and it didn’t matter much) and averaging stations within the nine climate regions (see Fig. 2 of the paper) before computing a US average. There are plenty of other approaches; for example, NCDC’s preliminary analysis of siting quality used a gridded analysis, but we checked and our numbers weren’t very different.

So, for example, the CRN 1&2 trend was computed by computing the anomalies at each CRN 1&2 (well-sited) station, averaging the anomalies within each climate region, then averaging nationally (using the size of each region as a relative weight), then computing the ordinary least-squares trend of those US averages.

Difference Testing: Monte Carlo

The next task was to determine whether trends from different groups of stations were significantly different from each other. The standard statistical tests for this compare the difference in slopes with the scatter of points about the trend lines. But this isn’t appropriate for our data because of a crucial problem: the scatter about the trend line is not uncorrelated noise. There’s a bit of autocorrelation, but more importantly, the scatter in one set of points is always going to be highly correlated with the scatter in another set of points. If a particular year was cold, it was cold no matter what quality class of station you use to measure it.

Whatever test we used had to reflect the correlation between different station classes as well as the autocorrelation within a station class. It also, ideally, would take into account that the distribution of stations among climate regions was uneven so some regions might only have two stations within a class, with each station therefore having a big influence on the overall trends.

No standard test can deal with all that, so I used a Monte Carlo approach. Ritzy name, simple concept. In fact, it’s so simple you don’t need to know statistics to understand it. Given two classes of stations whose trends needed comparing, I randomly assigned stations to each class, while making sure that the total number of stations in each class stayed the same and that each climate region had at least two stations of each class. I then computed and stored the difference in trends. I then repeated this process a total of 10,000 times.

The result is 10,000 trend differences obtained from random sets of stations. The conventional criterion for statistical significance is that there be a less than 5% chance that a trend difference so large could have come about randomly. So all you do is look at the random trend differences and see what percentage of them are larger than the one you computed using the real classification. Since you don’t know ahead of time which trend should be larger, you use the absolute value of the trends, or, equivalently, require that only 2.5% of the random trend differences be more positive (or more negative) than the observed trend difference.

Difference Testing: Proxy Stations

One assumption of our Monte Carlo approach is that the station locations are random. Now, random does not mean evenly spaced. But, as a reviewer pointed out, the good stations were often concentrated on one side or another of a climate region, moreso than would seemingly be expected randomly, and maybe some of the differences were due to the peculiar geographical arrangement of stations.

To test this possibility, I identified “proxy stations”. For each CRN 1&2 station, I found the nearest CRN 3 or CRN 4 station to serve as its proxy. I then compared the trends calculated using the real CRN 1&2 stations to the trends calculated using the proxy CRN 1&2 stations. The test is as follows: if the trend estimates from the proxy stations match those from the larger CRN 3&4 group, then the trend isn’t sensitive to that particular station distribution. If, instead, the trend estimate from the proxies match the trend estimates from the CRN 1&2 stations, then I can’t discard the possibility that the CRN 1&2 trends are due to the station distribution rather than the siting.

Because of the small number of CRN 5 stations, I also created proxies for them and performed a similar test.

The proxy test didn’t affect our trend results much, but it did matter a lot with Section 4, where we tried to look at temperature differences directly. So I’m very grateful to the reviewer for insisting on more proof.

With the proxies, we were also able to do a neat little attribution analysis. Consider a little algebra:

CRN 1&2 – CRN 5 = (CRN 1&2 – CRN 1&2 Proxies) + (CRN 1&2 Proxies – CRN 5 Proxies) + (CRN 5 Proxies – CRN 5)

The temperature difference between the best and worst sited stations can be broken down into three terms: the first term shows how the best stations differ from their (typically-sited) neighbors, the second term shows how the difference in station distribution contributes, and the third term shows how the worst stations differ from their neighbors.

By plotting these differences over time (Fig. 8 in the paper) we were able to show that most of the minimum temperature trend difference between best and worst comes from the third term, while most of the maximum temperature trend difference comes from the first term. There’s some info in there about the relative importance of different types of siting deficiencies on the maxes and mins, and we intend to explore this issue in more detail in a subsequent paper.

The same figure showed that the trend differences arises during the mid to late 1980s, when many stations underwent simultaneous instrumentation and siting changes.

The software I used for my analyses is going to be publicly posted by Anthony Watts once he gets all our supplementary information assembled. With a topic having such lay community interest, we thought it important to make it as easy as possible to duplicate (and go beyond) our results. I did my coding in Python, but it’s only the second Python program package I’ve ever written. I hope critical software engineers overlook the many fortranisms that are undoubtedly embedded in the code.

===============================================================

Note: I hope to have the SI completed later today or tomorrow at the latest. A separate announcement will be made here and also on surfacestations.org – Anthony

UPDATE 5/13 : The SI has been posted on surfacestations.org main page, see the link on the main page