When you analyse your data, you usually assume that you know what the data really represent. Or do you? This question has long marred studies of solar activity and climate, and more recently of cosmic rays and clouds. And yet again, the issue pops up in two recent papers: one by Feulner (‘The Smithsonian solar constant data revisited’) and another by Legras et al. (‘A critical look at solar-climate relationships from long temperature series’). Both papers show how easy it is to be fooled by your data if you don’t know what they really represent.

First of all, I really think these papers are worth reading, because papers are sometimes published that fail to appreciate the importance of meta-data (information about the data) and never question what the data really represent.

Feulner demonstrates how a failure to adequately correct for seasonal variations and volcanic eruptions can lead to spurious results in measurements of the brightness of the direct solar beam (pyrheliometry), the brightness of the sky in a ring around the Sun (pyranometry), and the precipitable water content.

Such mistakes can easily give the impression that cosmic rays induce aerosol formation. Feulner’s work is a reanalysis of Weber (2010), which strongly suggested that cosmic rays cause a large part of atmospheric aerosol formation. We have already discussed cosmic rays and aerosols (here), and similar claims have been made before. What is new here is the analysis of pyranometry and pyrheliometry.

It is important to note that Feulner’s analysis focused on a relatively short period, 1923-1954, and that he only addressed parts of the analysis presented in the Weber (2010) paper (Weber examined the periods 1905 to 1954 and 1958 to 2008). I’m told that the Smithsonian (SAO; Smithsonian Astrophysical Observatory) data from 1905-23 are generally considered somewhat problematic due to instrument changes and calibration issues (I admit, I’m no expert on this issue).

Feulner also informs me that he has looked at the automatic measurements from Mauna Loa (1958-2008), but these were apparently difficult to analyse. They differ from the old SAO data in that measurements taken during bad weather may be ‘contaminated’, and there may have been instrument failures. There is also a spurious drift over the whole period – possibly caused by anthropogenic water vapour and/or aerosols – and the period saw frequent episodes of active volcanism.

One observation is that low sunspot numbers (less than 50) tend to coincide with winter (December-February) and spring (March-May), while high sunspot numbers tend to coincide with summer (June-August) and autumn (September-November). This alignment between the seasons and the sunspot numbers is purely coincidental. Feulner points out that two effects act simultaneously: (i) the precipitable water content, pyranometry and pyrheliometry measurements exhibit pronounced regular seasonal variations, and (ii) the seasonal distribution of sunspot numbers can then give the false impression of a change with solar activity.

Feulner also identified the years 1928-1931, 1932-1933, 1951-1952, and 1953-1955 as years with active volcanism (there is some discussion of such forcings here and here, though not for the same period). To account for these effects, he subtracted from each measurement (precipitable water content, etc.) the median of all measurements for the corresponding calendar month, and excluded the years with volcanic activity. After correcting for the seasonal bias and the volcanic eruptions, Feulner finds no significant trend with sunspot number, and concludes that the influence of solar activity is “comparatively small”.
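The deseasonalization step described above is straightforward to sketch in code. The function below is a minimal illustration of the idea (subtract the monthly median, then drop volcanically active years), not Feulner’s actual code; the function name and data layout are my own assumptions.

```python
import numpy as np

def deseasonalize(values, months, years, volcanic_years):
    """Remove the seasonal cycle by subtracting, from each measurement,
    the median of all measurements taken in the same calendar month,
    then exclude years flagged as volcanically active.

    values, months, years: equal-length 1-D arrays of measurements,
    calendar months (1-12) and calendar years. volcanic_years: a set
    of years to exclude. (Illustrative sketch, not Feulner's code.)"""
    values = np.asarray(values, dtype=float)
    months = np.asarray(months)
    years = np.asarray(years)
    anomalies = values.copy()
    for m in range(1, 13):
        mask = months == m
        # Subtract the monthly median to remove the seasonal cycle.
        anomalies[mask] -= np.median(values[mask])
    # Drop the volcanically active years before any trend analysis.
    keep = ~np.isin(years, list(volcanic_years))
    return anomalies[keep], years[keep]
```

Any trend fitted against sunspot number would then use the returned anomalies rather than the raw, seasonally biased measurements.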

Legras et al. refute recent papers by Le Mouel et al. (2010) – referred to as ‘LMKC’ in their paper – and Kossobokov et al. (2010) on three counts: (1) they demonstrate that a correlation with solar forcing alone is meaningless unless other relevant factors are accounted for too (something Gavin and I also demonstrated in a JGR article from 2009); (2) they show why long climate series must be homogeneous if one wants to study long-term variability; and (3) they show that incorrect application of statistical tests produces misleading results. One sentence in their abstract – “sunspot counting is a poor indicator of solar irradiance” – may be read as a strong statement that some people will find problematic, although I too have some queries regarding the sunspot record. They also provide their data and methods as a Mathematica notebook in the paper’s supplement, which I find commendable (I have not checked it, though), as sharing data and open source code makes their arguments more convincing.

The analysis of Legras et al. focused on the maximum, minimum and mean daily temperature records from the ECA&D dataset for Praha (since 1775) and Bologna (since 1814) – data from Uccle, they say, were not available due to policy changes at the Belgian Met Office. They also refer to an analysis by Wijngaard, according to which more than 94% of the [ECA&D] stations are flagged as ‘doubtful’ or ‘suspect’ over the period 1900-1999. Whereas the LMKC paper suggests that the chosen station records have the highest quality code in the ECA&D data set, Legras et al. present a table in which all the stations are listed as ‘suspect’. Furthermore, they argue that the Bologna series exhibits a clear artefact: a large positive anomaly in daily maximum temperature, greater than 2°C between 1865 and 1880, that is not seen in the daily minimum temperature. Hence, the LMKC and related papers are all based on raw, inhomogeneous data, contrary to what is claimed.

When it comes to the methodological flaws, Legras et al. argue that LMKC does not properly account for the real number of degrees of freedom in the data. Their estimate of the effective number of degrees of freedom is nine times smaller than the one used in LMKC; fewer degrees of freedom imply wider confidence intervals, so the intervals in LMKC were far too narrow. Legras et al. reach the same conclusion with a so-called ‘non-parametric’ permutation test (this gets a bit technical/statistical). In other words, the results in LMKC appear far more significant than they really are; the variations they report are consistent with random statistical fluctuations.
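The idea behind a permutation test of this kind can be sketched briefly. The function below is my own minimal illustration of the general technique, not the code from Legras et al.’s Mathematica notebook: shuffle one series many times to build a null distribution for the correlation, and note that a plain shuffle destroys serial correlation, which is exactly why autocorrelated data (with few effective degrees of freedom) demand extra care, such as permuting in blocks.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=10000, seed=0):
    """Two-sided permutation test for the correlation between x and y:
    shuffle y many times and count how often the shuffled correlation
    is at least as large (in absolute value) as the observed one.
    Caveat: a plain shuffle assumes independent samples; strongly
    autocorrelated series would need a block permutation instead.
    (Illustrative sketch, not the test used by Legras et al.)"""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_perm):
        r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r) >= abs(r_obs):
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

A correlation that looks impressive against the naive null distribution can become entirely unremarkable once the test respects the true number of independent samples.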

The results of Feulner and Legras et al. are convincing because of their careful analysis of the data and of what they represent. They also explain why the previous results and analyses are wrong, with clear demonstrations of how the methods work. Furthermore, they bring up well-established knowledge about the pitfalls of statistical analysis, such as trend analysis applied to inhomogeneous data series. In essence, it is common sense that it is important to know what signals are hiding in your data, and how these can affect your analysis.