Nic L has done what I think you’ll find to be a very interesting analysis of GHCN data. All of these analyses should be considered preliminary because of the huge amount of data and unusual QC issues in the data. I’ve checked his code briefly and his version of the getstation should work fine with the mod parameter set. My older version had a problem in it that resulted in it ignoring the WMO version number. There are a couple of other QC issues with the data that this getstation function doesn’t attempt to repair, but they should not affect the results.

In the meantime, It’s very interesting what Nic found. Consider that old GISS had 1934 vs 1998 ratios of 1.4 : 0.9 – Nic’s results mirror their original. Gissmatic.

—————-

Long Record GHCN Stations – an Analysis of the Data

At Jeff’s request, I present here some findings from work I have carried out using long-record GHCN stations. I have defined these as stations with temperature data in both 1900 and 1999 and fewer than 240 months (20%) in that period with no data. The reason for looking at long record stations is primarily that one can have some confidence that trends over the last 100 years or so reflect actual changes in recorded temperature and are not affected by changes over time in the set of stations that are combined to produce an average.

The GHCN database contains temperature records from 7280 separate stations; 4495 of these are WMO stations and the remainder are non-WMO stations, whose station number includes the WMO number of the closest WMO station but also a non-zero modifier to distinguish them from that WMO station. The map below shows the location of all 7280 GHCN stations. (Please ignore the large black spot in the centre, which I have not yet figured out how to get rid of. J)

Although the geographical coverage of the 7280 GHCN stations is impressive, unfortunately many of them have short records, with only a minority having data before 1950 or after 1990, and fewer still having data before 1900.

The map below shows the location of the 1034 GHCN stations (not all of which are WMO stations) that meet my long record criteria in respect of their raw data.

It can be seen that the long record station set is dominated by the USA, but that there are a fair number of stations elsewhere with 100 year raw data records, with quite a wide geographical spread.

The variation over time in the total number of stations with raw data at GHCN is shown below.





Most of the long record stations have data for some years after 1999, but the number of stations reporting data collapsed by over 1000 in April 2006, when many Indian stations ceased reporting, following a previous sharp fall in the early 1990s. Therefore, I have measured trends in the long record stations over the period 1900 to 2005, rather than to 2009.

I have been using a simpler QC method than Jeff but, unlike Jeff, analysing stations with the same WMO number but different modifiers completely separately. Where there are duplicate series for a station, I have taken their simple mean but marked as NA any months where the standard deviation between points in duplicate data series exceeded 4. This may be too high a tolerance, but bear in mind that GHCN has already carried out quite detailed QC on the data and thrown out all data that it considers likely to be bad. In many cases, there is little disagreement between the various duplicate data series, with data common to particular months showing exact or near exact matches, and different series covering different periods (overlapping or otherwise). For instance, there is actually very little disagreement between the 8 duplicate series for WMO station 25563, featured in Jeff’s recent post “GHCN, Sorting a box of sox“.

Having thus obtained a single temperature data series for each station, I then converted it to temperature anomalies by deducting from all data point s for each calendar month the mean of the post-1899 data for that calendar month. I then took the unweighted average of the anomaly series for all relevant long record stations. I used the same methods for adjusted temperature data, where present, as for raw temperature data.

Raw Data

So what story does the raw data from the 1034 long record GHCN stations tell? Shown below is the unweighted mean of the temperature anomalies for all data from those stations for each month from 1999 to 2005. The mean is slightly smoothed, hence the high Lag-1 serial correlation. The trend is fairly low, at 0.0269 Deg. C /Decade – which are the units I hereafter use for all trends. The confidence interval is over three times as high as the trend, which is accordingly far from being statistically significant. There is no pronounced peak in 1998, and the peak temperature occurred in the 1930s. As this data set largely represents temperatures in the USA, that is perhaps unsurprising.

In order to confirm that combining the long record station data had resulted in the trend of their mean temperature accurately reflecting the temperature trends of the individual stations, I also calculated the mean of those trends. It was 0.0256, very close to the 0.0269 trend of the mean temperature. The distribution of these trends is shown in the below plot.

Adjusted Data

For many of the long record stations, GHCN also presents homogeneity-adjusted temperature series, of which there may again be duplicates. Where there are duplicate series, the GHCN files do not indicate which of the duplicated raw series each adjusted series is based on, although often it will be possible to work this out.

Peterson’s 1997 Bulletin of the AMS paper “An overview of the Global Historical Climatology Network Temperature Database, available at the GHCN website, gives a very useful summary of the adjustment process, as well as of many other aspects of the GHCN temp data. However, the adjustments appear to be primarily, if not exclusively, aimed at correcting for discontinuities. So far as I can tell, they would not correct for gradual distortions to data caused by, for instance, an increasing Urban Heat Island effect as a conurbation encroached on a previously rural station or the size of a town in which a station was located grew.

The number of GHCN stations with adjusted data peaked at just over 4,000, compared with just under 6,000 with raw data, but has fallen even more sharply in the last twenty years, with recent adjusted data only being available for about 250 stations. This is illustrated below.

Of the 1034 long record GHCN stations, 764 also have adjusted temperature data satisfying the same completeness requirements. As can be seen from the below map, the dominance of USA stations is even greater than for the long record station raw data.

One can gain some idea of the effects of the homogeneity adjustments by comparing the mean trend for all stations with adjusted data with the mean trend for all stations with raw data. The graph below shows that the 1900-2005 trend in the mean anomaly temperature from the 764 stations using adjusted data was 0.0536, again not statistically significant. The mean of the trends of the individual series is almost identical at 0.0523. These trends are double those for the 1034 stations with long raw data records, which seems a little surprising.

A plot of the trends of the adjusted station data is given below. There is perhaps slightly less scatter than for the raw data, but then there are fewer stations.

In order to get a clearer picture of the effects of the adjustment process, I calculated the trend in the combined raw data from all 764 stations which had long records of adjusted data, thereby ensuring a like-for-like comparison. The mean raw temperature anomaly record of those 764 stations is shown below.

What this shows is that on average the adjustment process more than quadrupled the trend in raw temperatures, increasing the trend of the mean from 0.0113 for raw data to 0.0536. Indeed, if the mean trend change of 0.0423 resulting from adjustments to the long record stations were typical of the effect of adjustments to station data generally, the adjustment process would account for a substantial proportion of the recorded global mean temperature increase over the twentieth century.

It is not obvious why the mean adjustment should be positive, and of size that swamps the (admittedly very small) mean raw trend. Maybe there is some further metadata information that explains what accounted for all the discontinuities for which adjustments were made, but as the process appears to be largely driven by statistical analysis it may be that the cause of the discontinuities is unknown. In the Brohan et al (JGR 2006) paper describing the HadCRUT3 global surface temperature dataset, it is stated that station moves were often to a (cooler) out of town airport, so homogeneity adjustments were more likely than not to increase the trend. However, only 106 of the 764 GHCN stations with long records of adjusted data are shown as located at an airport, and in many cases that is unlikely to follow a move from a nearby town, so I find this explanation unconvincing here.

The homogeneity adjustments affect the GHCN gridded temperature database, which is based on the adjusted data. GISS uses raw data, and performs adjustments of its own. It is unclear to me whether the homogeneity-adjustments that CRU uses for its HadCRUT3 global surface temperature dataset are entirely their own or whether they use some adjusted data from GHCN.

One can envisage circumstances in which adjusting for discontinuities could artificially increase the true trend, such as where a station had been encroached upon by a city or gradually affected by some other non-climatic warming and was then moved (possibly on more than one occasion) in order to restore its original type of surroundings. That would produce a gradual warming that would not trigger any adjustment followed by a sharp drop that would be detected as a discontinuity and adjusted for. Does this sort of problem occur often in practice? Dunno.

Examination of the of the effect of the homogeneity adjustments on the trends of individual stations are by no means uniformly positive, as the below scatter plot shows. Nevertheless, the mean effect on trend of the adjustments is definitely greater than zero (t=15.6 – highly significant).

Rural Data

In an attempt to avoid possible inflation in station trends resulting from UHI effects, I also screened out from the 1034 GHCN stations with long raw data records all but stations marked as rural. That left 484 rural stations, of which unfortunately only 28 were outside the USA, located as per the below map.

The graph below shows that the mean 1900-2005 trend in the anomaly temperature raw data from the 484 rural stations was 0.0117, statistically completely insignificant. The mean of the trends of the individual series, as shown in the scatter plot below the temperature graph, is almost identical at 0.0102. These trends are under half those for all 1034 stations with long raw data records.

USA vs Rest of the World

Finally, I thought it worthwhile to divide the 1034 long record stations between USA stations (of which there are 832) and the 202 non-USA stations. The mean temperature anomaly of the USA stations, and the1900-2005 trend thereof (being 0.0144), is shown in the next graph.

By comparison, the mean temperature anomaly of the non-USA stations has a much higher 1900-2005 trend, of 0.0805, as shown in the below graph. This is the only one of the graphs that shows a statistically significant trend.

Why should the non-USA long record stations show a mean 1900-2005 decadal trend of 0.0805 whilst the USA ones show a mean trend of only 0.0144? Perhaps the USA has warmed by far less than other areas. But as the non-USA stations are also very largely northern hemisphere, with in many cases similar latitudes to those of USA stations, it is not obvious to me why that should be. However, there is one obvious possible explanation that is worth investigating further here: the UHI effect.

A majority, 456 out of 832, of the long record USA stations are classified as rural. Any many of the remainder may be in cities that had by 1900 already reached the size (relatively small, I believe) by which most of the UHI effect occurs. By contrast, only 28 of the 202 long record non-USA stations are classified as rural, and it may be that relatively more of the urban stations are in towns that only became sizeable post 1900. Could the bulk of the difference in trend (amounting to 0.7 deg. C over 1900-2005) between the long record USA and non-USA stations could be due to the UHI effect? At first sight, It seems conceivable. Having said that, the pattern of temperature movements over 1900-2005 is not the same for the USA and non-USA stations, and there is so much weather noise in the data (even when taking averages of hundreds of stations) that it is difficult to draw any firm conclusions.

In an attempt to estimate how much of the difference between the trends of long record USA and non-USA stations might be due to the UHI effect, I shortened the qualifying period to 1900-1990. Doing so increased the total number of rural stations from 484 to 574. More importantly, it increased the number of non-USA rural stations from 28 to 86, albeit these are rather dominated by Australia, Canada and Siberia, as shown on the below map (the final graphic, you will be pleased to know J).

I took the mean of the individual station trends over 1900-1995 rather than the trend of the mean, as the anomalisation period is more variable with only a 90 year qualifying period. For the USA stations, the mean trend was -0.0024 (yes, a negative trend, albeit a completely insignificant one).

By comparison, the mean 1900-1995 trend of the 86 non-USA rural stations with 90+ year records was 0.0548. The mean trend over the same period for the 297 non-USA non-rural stations was only modestly higher than this, at 0.0606, although the geographical distribution of the rural and non-rural non-USA data sets is different. These results suggest that the UHI effect may account for part of the difference between twentieth century mean recorded temperature trends in the USA and elsewhere, but not the bulk of it.

However, these rural station results may not be representative of the rest of the world, and are almost certainly not statistically significant. Further, non-USA stations marked as rural may be more likely to be affected by non-climatic warming, or warming affected by their type of environment, than are rural stations in the USA. For instance, only 12% of USA rural stations are near the coast, next to a large lake or on a small island, whereas 41% of non-USA rural stations are. It is conceivable that such environments could be are associated with greater (or lesser) twentieth century warming than land bound ones, although I am not aware of any evidence to that effect.

In conclusion, the raw data from long record GHCN stations shows little apparent warming in the USA and moderate warming (on land) elsewhere, only part of which seems likely to be due to the UHI effect. The adjusted data shows much higher trends than the raw data for the same stations, and it is not clear why the homogeneity adjustments should on balance be significantly positive.

I don’t have time to post a full turnkey R script, but I give in the Appendix my version of Jeff’s getstation function, amended to include the modifier part of the station number, and the other key parts of the code used for generating the various long record station sets and the temperature graphs. That should enable an interested reader who has loaded all the functions needed for running the script in Jeff’s CRU #2 post to generate the temperature graphs in this post.

I should end by noting that I cribbed functions and other bits of R code from work by Steve M, Ryan O and Jeff Id in order to carry out the work outlined in this post.

——————-

UPDATE: Per #2

——————–

Appendix – skeleton R code for extracting data for the various sets of long record GHCN stations.

(Thanks to everyone who told me how to put code up such that it’s formatted correctly – Jeff Id)

### getstation: returns station information (from ghmean) in collated fashion getstation=function(staid=89050,mod=0) { alladj=NA allraw=NA #raw data mask=(ghmean[,2]==staid) mask[!(ghmean[,3]==mod)]=FALSE data=ghmean[mask,] noser=levels(factor((data[,4]))) for(i in noser) { mask2=data[,4]==i startyear=min(data[mask2,5]) endyear=max(data[mask2,5]) dat=array(dim=c(endyear-startyear+1,12)) dat[data[mask2,5]-startyear+1,]=as.matrix(data[mask2,6:17]) dat=t(as.matrix(dat)) dim(dat)=c(nrow(dat)*ncol(dat),1) dat[dat==-9999]=NA dat=dat/10 rawd=ts(dat,start=startyear,deltat=1/12) if(!is.ts(allraw)) { allraw=rawd } else { allraw=ts.union(allraw,rawd) } } #colnames(allraw)=as.character(1:dim(allraw)[[2]]) mask=(ghmeanadj[,2]==staid) mask[!(ghmeanadj[,3]==mod)]=FALSE data=ghmeanadj[mask,] noser=levels(factor((data[,4]))) print(noser) if(length(noser)!=0) { for(i in noser) { mask2=data[,4]==i startyear=min(data[mask2,5]) endyear=max(data[mask2,5]) dat=array(dim=c(endyear-startyear+1,12)) dat[data[mask2,5]-startyear+1,]=as.matrix(data[mask2,6:17]) dat=t(as.matrix(dat)) dim(dat)=c(nrow(dat)*ncol(dat),1) dat[dat==-9999]=NA dat=dat/10 adjd=ts(dat,start=startyear,deltat=1/12) if(!is.ts(alladj)) { alladj=adjd print("res") } else { alladj=ts.union(alladj,adjd) print(ncol(alladj)) print ("R") } } }else(alladj=NA) station=list(allraw,alladj) station } loc="C:/agw/GHCN/v2.mean" # loc is the path and filename of the v2.mean file downloaded from GHCN wd=c(3,5,3,1,4,5,5,5,5,5,5,5,5,5,5,5,5) ghmean=read.fwf(loc,widths=wd) loc="C:/agw/GHCN/v2.mean_adj"# loc is the path and filename of the v2.mean_adj file downloaded from GHCN wd=c(3,5,3,1,4,5,5,5,5,5,5,5,5,5,5,5,5) ghmeanadj=read.fwf(loc,widths=wd) loc="C:/agw/GHCN/v2.temperature.inv" # loc is the path and filename of the v2.temperature.inv file downloaded from GHCN wd=c(3,5,3,31,7,9,4,5,1,5,2,2,2,2,1,2,16,1) tinv=read.fwf(loc,widths=wd,comment.char="") ##### find WMO stations no's present in GHCN and their coordinates sid=levels(factor(ghmean[,2])) rawcoord=array(0, dim=c(length(sid),2) ) j=1 for (i in sid) { mask=tinv[,2]==i rawcoord[j,2]=tinv[mask,5] rawcoord[j,1]=tinv[mask,6] j=j+1 } ##### construct time series for all ghch stations (WMO or not) ln=nrow(tinv) allrawar=array(NA,dim=c(210*12,ln)) alladjar=array(NA,dim=c(210*12,ln)) tol=4 # max acceptable standard deviation of multiple data for a month for(i in 1:ln) { sta = getstation(tinv[i,2],tinv[i,3]) if(max(time(sta[[1]]))>1800) { rawsta=rowMeans(sta[[1]],na.rm=TRUE) # could use sqrt wtd average instead of mean, to reduce effect of outliers? sta.sds=sd(t(sta[[1]]),na.rm=TRUE) sta.sds[is.na(sta.sds)]=0 rawsta[sta.sds>tol]=NA rawsta=window(ts(rawsta,start=min(time(sta[[1]])),deltat=1/12),start=1800,end=c(2009,12)) if( length(sta[[2]])>1 ) { adjsta=rowMeans(sta[[2]],na.rm=TRUE) sta.sds=sd(t(sta[[2]]),na.rm=TRUE) sta.sds[is.na(sta.sds)]=0 adjsta[sta.sds>tol]=NA adjsta=window(ts(adjsta,start=min(time(sta[[2]])),deltat=1/12),start=1800,end=c(2009,12)) }else{ adjsta=NA } }else{ rawsta=NA adjsta=NA } index=as.integer(((time(rawsta)-1800)*12+.02)+1) allrawar[index,i]=rawsta index=as.integer(((time(adjsta)-1800)*12+.02)+1) alladjar[index,i]=adjsta print(i) } allraw=ts(allrawar,start=1800,deltat=1/12) alladj=ts(alladjar,start=1800,deltat=1/12) araw = calc.anom(allraw) aadj = calc.anom(alladj) dimnames(araw)[[2]]=dimnames(aadj)[[2]]=round(tinv[,2]+0.1*tinv[3,],1) rm(allrawar,alladjar,index,i,rawsta,adjsta,sta.sds,ln,tol,sta) To create various long record station index vectors for applying to araw and aadj: araw.2000.idx=!is.na(colMeans(araw[2389:2400,],na.rm=TRUE)) # find all GHCN stations in monthly anomaly dataset of all post 1800 stations that have data for some month in 1999 sum(araw.2000.idx) # 2741 (c/f 2079 for WMO stations only) araw.1900.idx=!is.na(colMeans(araw[1201:1212,],na.rm=TRUE)) # find all GHCN stations in monthly anomaly dataset of all post 1800 stations that have data for some month in 1900 sum(araw.1900.idx) # 1697 (c/f 1146 for WMO stations only) araw.100.mis=colSums(is.na(araw[1201:2400,])) araw.100.idx=araw.1900.idx&araw.2000.idx&(araw.100.mis<240) # all stations with data in 1900 and 1999 and fewer than 240 points missing in 1900-1999 araw.rural.idx=which(tinv[,9]=="R") # find all rural stations #length(araw.rural.idx) [1] 3912 araw.100.r.idx=araw.1900.idx&araw.2000.idx&(araw.100.mis<240)&araw.rural.idx # all rural stations with data in 1900 and 1999 and fewer than 240 points missing in 1900-1999 sum(araw.100.r.idx) # 484 araw.exUSA.idx=(!(tinv[,1]==425)) araw.100.exUSA.idx=araw.1900.idx&araw.2000.idx&(araw.100.mis<240)&araw.exUSA.idx # all non-USA stations with data in 1900 and 1999 and fewer than 240 points missing in 1900-1999 sum(araw.100.exUSA.idx) # 202 To plot a graph of mean 1900-2005 raw long-record temperatures use, e.g.: araw.100=araw[1201:2520,araw.100.idx]; dim(araw.100) # 1320 1034 araw.100=ts(araw.100,st=1900,freq=12) plt.avg(ff(ts(rowMeans(araw.100,na.rm=TRUE),st=1900,freq=12)), main.t= paste("All 1034 GHCN Stations with largely complete data from 1900 on

Raw Temperatures - Unweighted Average"),st=1900,y.pos=0.9) and likewise for subsets of the 1034 long record stations, substituting the appropriate index vector for araw.100.idx and aadj for araw to use adjusted temperatures. wd=c(3,5,3,31,7,9,4,5,1,5,2,2,2,2,1,2,16,1) tinv=read.fwf(loc,widths=wd,comment.char="") # loc is the path and filename of the v2.temperature.inv file downloaded from GHCN



