Tamino’s realclimate post re-states points that I’ve discussed at length in the past. Here is a re-posting of a 2008 post on Tamino that deals with most of the issues in his realclimate post.

Tamino has recently re-iterated the climate science incantation that Mann’s results have been “verified”. He has done so in the face of the fact that one MBH98 claim after another has been shown to be false. In some cases, the claim has not only been shown to be false, but there is convincing evidence that adverse results were known and not reported.

Today I’m going to look at what constitutes verification of a relationship between proxies and temperature, assessing MBH results in such a context, trying as much as possible to emphasize agreed facts.

Verification

One thing that Tamino and I agree on is that a proposed reconstruction should “pass verification”. Tamino says:

… frankly, that’s the real test of whether or not a reconstruction may be valid or not. If it passes verification, that’s evidence that the relationship between proxies and temperature is a valid one, and that therefore the reconstruction may well reflect reality. If it fails verification, that’s evidence that the reconstruction does not reflect reality. It has the drawback that the data we set aside for verification we must omit from calibration; with less data, the calibration is less precise. But without verification, we can’t really test whether or not the reconstruction has a good chance of being correct.

and later

… it’s the verification statistics that are the real test of whether or not a reconstruction may be valid. Pass verification: probably valid. Fail verification: probably wrong.

While we strongly disagree on what constitutes “verification” and whether the MBH reconstruction “passes” verification, I’m prepared to stipulate to a verification standard.

If the MBH reconstruction can be shown to pass thorough verification testing, including, at a minimum, the steps described below, then, however implausible the notions may seem, I will advise readers to get used to the idea that bristlecones are magic trees, that their tune is a secret recording of world climate history and that Donald Graybill had a unique method of detecting their tune. However, these alleged magic properties should be subjected to (and withstand) scrupulous scientific investigation and verification and I do not agree with Tamino that these magic properties have been “verified”.

Without limiting the range of scientific investigation that any claim of a magical relationship might be subject to, the following verification tests seem to me to be a minimum that any scientist should require prior to grudgingly acquiescing in the view that a magical relationship exists between Graybill’s bristlecone ring width chronologies and world climate. (Similar considerations apply to any reconstruction heavily dependent on a very small number of “key” series.)

(1) the reconstruction passes verification tests described in standard dendro texts (Fritts, 1991; Cook et al 1994), plus any more advanced econometric tests that can be relevantly applied. In particular, the reconstruction passes the tests that it was said to pass in the original article;

(2) An even better test is whether the “relationship between the proxies and temperature” can be verified in out-of-sample testing, in this case for the period after 1980;

(3) Related to (2), but distinct, is whether the “key” chronologies have been re-sampled and verified by independent researchers.

Failure in any one of these should result in Tamino rejecting the MBH reconstruction according to the verification standard. I submit that MBH has failed every one of these tests. Indeed, it’s hard to imagine a more dismal verification failure than what we’ve seen with MBH. Worse, efforts to verify their work have been contested and obstructed at every turn, leaving a very unsavory impression of the people involved.

Standard Verification Tests

First, the MBH AD1400 reconstruction failed standard dendroclimatic verification tests (Fritts 1976, 1991; Cook et al 1994; see NAS Panel Box 9.1): verification r2 (0.02 MM2005a; 0.018 Wahl and Ammann); CE ( -0.26 MM2005a; -0.21 Wahl and Ammann). These are not immaterial or irrelevant failures: for example, Eduardo Zorita said that his attitude towards the MBH reconstruction changed when he learned of the verification r2 failure.

Second, while Wahl and Ammann now (after the failure was exposed) argue that these failures don’t “matter”, that it’s all about low-frequency versus high-frequency, these are subtle issues where Wahl and Ammann hardly constitute high statistical authority (or even low authority). Readers are entitled to full disclosure of the adverse results and then judge for themselves whether they are persuaded by the Wahl and Ammann high frequency-low frequency argument. MBH readers were not given this alternative. MBH claimed that their reconstruction had “highly significant reconstructive skill”, not just in the RE statistic, but also in the verification r2 statistic, illustrating this claim in their Figure 3 excerpted below:

Figure 1: MBH98 Figure 3 panels b, c. The running text in MBH98 stated: “Figure 3 shows the spatial patterns of calibration β, and verification β and the squared correlation statistic r2, demonstrating highly significant reconstructive skill over widespread regions of the reconstructed spatial domain [emphasis added]” and later: “β [or RE] is a quite rigorous measure of the similarity between two variables … For comparison, correlation (r) and squared-correlation (r2) statistics are also determined. [emphasis added]”

These claims of statistical “skill” were not an idle puff by MBH, but were relevant to the widespread view that MBH methods represented a new level of sophistication, separating their work from Lamb’s prior work purporting to show a Medieval Warm Period. These claims of statistical skill were relied on by IPCC TAR, which made extensive use of the MBH reconstruction stating:

[MBH] estimated the Northern Hemisphere mean temperature back to AD 1400, a reconstruction which had significant skill in independent cross-validation tests.

The failure of important verification statistics should have been reported in MBH98, but wasn’t. It should have been reported in the 2004 Corrigendum, but wasn’t. Mann told Marcel Crok of Natuurwetenschap & Techniek that his reconstruction passed the verification r2 test:

Our reconstruction passes both RE and R^2 verification statistics if calculated correctly.

Later, Mann was reduced to telling a nonplussed NAS panel, well aware of Figure 3 shown above, that he had never calculated the verification r2 statistic, as that would be “foolish and incorrect reasoning”.

Perhaps Tamino can try, like Wahl and Ammann, to make a strained argument that the verification r2 (and CE) statistics don’t “matter”, but please – no more of this talk that MBH claims of statistical skill in the verification r2 statistic have been vindicated. They haven’t. And if you don’t believe me, look at Table 1S of Wahl and Ammann 2007 (which required a long and unsalubrious history prior to its inclusion in this article).

All of this discussion pertains to separation of in-sample calibration and verification periods – a separation which is complicated by the fact that you already know the results. The relevant test really comes from out-of-sample testing and verification scores, which I’ll discuss below.
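For readers who want to see how the three verification statistics named above differ, here is a minimal Python sketch using their standard textbook definitions: RE benchmarks the reconstruction against the calibration-period mean, CE against the verification-period mean, and r2 is the squared correlation over the verification period. The function name and the toy numbers are hypothetical and chosen only for illustration; none of this is MBH or Wahl and Ammann code or data.

```python
def verification_stats(obs_ver, est_ver, calib_mean):
    """Return (RE, CE, r2) for a verification period.

    obs_ver    : observed temperatures in the verification period
    est_ver    : reconstructed temperatures for the same years
    calib_mean : mean of the observed temperatures in the calibration period
    """
    n = len(obs_ver)
    ver_mean = sum(obs_ver) / n
    est_mean = sum(est_ver) / n

    # Sum of squared reconstruction errors in the verification period
    sse = sum((o - e) ** 2 for o, e in zip(obs_ver, est_ver))

    # RE: skill relative to simply guessing the calibration-period mean
    re = 1.0 - sse / sum((o - calib_mean) ** 2 for o in obs_ver)
    # CE: skill relative to guessing the verification-period mean
    ce = 1.0 - sse / sum((o - ver_mean) ** 2 for o in obs_ver)

    # r2: squared Pearson correlation over the verification period
    cov = sum((o - ver_mean) * (e - est_mean) for o, e in zip(obs_ver, est_ver))
    var_o = sum((o - ver_mean) ** 2 for o in obs_ver)
    var_e = sum((e - est_mean) ** 2 for e in est_ver)
    r2 = cov ** 2 / (var_o * var_e)
    return re, ce, r2


# Hypothetical toy data: the estimate gets the change in level right
# (the calibration mean is 0, the verification values sit near 1) but
# carries no information about the year-to-year variation.
obs = [0.9, 1.1, 0.9, 1.1]
est = [1.1, 1.1, 0.9, 0.9]
re, ce, r2 = verification_stats(obs, est, calib_mean=0.0)
print(f"RE={re:.3f}  CE={ce:.3f}  r2={r2:.3f}")
```

In this toy case RE is close to 1 while r2 is essentially zero and CE is negative, which is the general pattern at issue: a high RE can coexist with failing r2 and CE scores when the only thing captured is a shift in mean between the two periods.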

“Robustness” to Dendroclimatic Indicators

Third, another important MBH claim that has not been verified, and is in fact untrue, is its supposed “robustness” to the presence/absence of all dendroclimatic indicators. Various issues related to dendroclimatic indicators had been cited in IPCC Second Assessment Report; one of the main selling points of MBH was its multiproxy approach which seemed to offer some protection against potential dendro problems. MBH98 stated:

the long-term trend in NH is relatively robust to the inclusion of dendroclimatic indicators in the network, suggesting that potential tree growth trend biases are not influential in the multiproxy climate reconstructions. (p. 783, emphasis added.)

Mann et al 2000 stated:

We have also verified that possible low-frequency bias due to non-climatic influences on dendroclimatic (tree-ring) indicators is not problematic in our temperature reconstructions…

These claims have been demonstrated to be untrue. If a sensitivity analysis is done in which the Graybill bristlecone chronologies are excluded from the AD1400 network, then a materially different reconstruction results – a point made originally in the MM articles [note: also Cook’s old Gaspe chronology which has its own serious issues – see below], confirmed by Wahl and Ammann 2007 and noted by the NAS panel. In addition to failing the verification r2 test, a reconstruction without bristlecones fails even the RE test. Wahl and Ammann argue that this is evidence that the bristlecones should be included in the reconstruction; this argument has not been accepted by any third party statistician. However, for the present point, the issue is quite different and has never been confronted by Mannians: the discrepancy between reconstructions with bristlecones and without bristlecones means that the representation that the reconstruction was “robust” to the presence/absence of all dendroclimatic indicators is untrue. This non-robustness was recognized by the NAS panel, which actually cited Wahl and Ammann on this point (STR, 111):

some reconstructions are not robust with respect to the removal of proxy records from individual regions (see, e.g., Wahl and Ammann in press)

There is convincing evidence that Mann et al knew of the impact of Graybill bristlecone chronologies on their reconstruction, as the notorious CENSORED directory shows the results of principal components calculations in which the Graybill chronologies have been “censored” from the network. Long before we identified the non-robustness to bristlecones, this non-robustness was known to Mann et al. While some comments in MBH99 can be construed as somewhat qualifying the robustness claims in MBH98, any such qualifications were undone in Mann et al 2000, which re-iterated the original robustness claims in even stronger terms than MBH98.

Some defenders of the Mann corpus have argued that the claims in Mann et al 2000 were narrowly constructed and referred only to the AD1730 network, which was the one illustrated in the graphic. In my opinion, the robustness claims were not limited to the AD1730 network, but included all networks [“our temperature reconstructions” is the phrase used.] But regardless, if Mann et al knew that the AD1400 network was not robust to the presence/absence of dendroclimatic indicators (which they did), then they had an obligation not to omit this fact (just as they had an obligation not to omit reporting the failed verification r2 statistics for networks prior to AD1820).

Fifth, there is an important claim about the relative importance of the HS pattern in the North American network that not only has not been verified, but has been refuted. This particular issue has more resonance in terms of our personal experience than to others, but, as the people most directly involved, it was an extremely important matter. In response to MM2003, Mann et al argued that the HS shape of the North American PC1 represented the “dominant component of variance” or “leading component of variance” in the North American tree ring network and that the emulation in MM2003 had omitted this “dominant” component of variance. This was played out pretty loudly at the time. As readers will now recognize, this “dominant” or “leading” pattern was nothing of the sort. It was merely the shape of the Graybill bristlecone chronologies promoted into a far more prominent position in the PC rankings than they deserved, by reason of the erroneous Mann PC methodology.

In Mann’s first Nature reply, he was still holding to the “dominant component of variance” position. However, by the time of his revised Nature reply, he’d realized that the problem was deeper and conceded that the bristlecone shape had been demoted to the PC4 (an observation noted in MM 2005 (GRL, EE)). Instead of continuing to argue that the HS was the “dominant” or “leading” component of variance, he now argued that he could still “get” an HS shape with the bristlecones in the 4th PC if the number of retained PCs was increased to 5, invoking Preisendorfer’s Rule N as a rationale for expanding the roster to include the PC4. Of course, MBH98 had indicated a somewhat different rationale for PC retention in tree ring networks, but the description was vague.

My calculations indicate that it is impossible to obtain observed PC retention patterns using Rule N, with notable discrepancies in some networks. Was Rule N actually used in MBH98 or was it an after the fact effort to rationalize inclusion of the PC4? Wahl and Ammann didn’t touch the issue. With 20-20 hindsight, Mann et al might wish that they had used Rule N in MBH98, but no one’s verified that they did.



Graybill and Gaspé Chronologies

Given the acknowledged dependence of the MBH reconstruction on a very small number of tree ring chronologies, any engineering-quality verification for policy reliance, would inevitably include a close examination and assessment of the reliability of these chronologies, including re-sampling if necessary.

The key bristlecone chronologies were taken over 20 years ago. They were all taken by one researcher (Donald Graybill), who was trying to prove the existence of CO2 fertilization. Graybill may well have been eminent in his field but it is ludicrous that major conclusions should be drawn from unreplicated results from one researcher, particularly when there are also extremely important and unexplained differences in the behavior of Graybill’s chronologies from those of all other North American chronologies. The graphic on the left is a scatter plot comparing the weights of the Graybill chronologies (red) in the MBH PC1 to those of all authors, relative to the difference between the 20th century mean and overall mean. You can tell visually that the Graybill chronologies have a far larger difference in mean than the majority of chronologies (unsurprisingly, this difference in mean is statistically significant under a t-test).

Figure 2. Comparison of MBH98 NOAMER PC1 weights to difference in mean, showing Graybill in red. Left – unsquared; right – squared weights.

Aside from every other issue pertaining to MBH, any examination of this data requires an explanation of why the Graybill chronologies have a difference in mean that is not present in the other chronologies. This issue has nothing to do with PC1 or PC4. It’s really a question of whether there is an “instrumental drift” in the Graybill chronologies.

Let’s suppose that you have 70 satellites, using 8 different instruments, and that one instrument type has a drift relative to the others. If you do a Mannian pseudo-PC analysis on the network, the Mannian PC1 will pick out the instruments with the drift as a distinct pattern. Obviously, that would only be the beginning of the analysis, not the end of it. You then have to analyze the reasons for the drift of one set of instruments relative to the others – maybe the majority of instruments are wrong. But neither Spencer and Christy on the one hand nor Mears and Wentz on the other would simply say that Preisendorfer’s Rule N shows that the instrumental drift is a “distinct pattern” and terminate the analysis at that point. They’d get to the bottom of the problem.

Unfortunately, nothing like that has happened here. Mann and his supporters have paralyzed the debate on esoteric issues like Preisendorfer’s Rule N and “proper” or “correct” or “standard” rules of PC retention, and most climate scientists seem to be content with this and have failed to inquire as to the validity of the Graybill chronologies, both as tree ring chronologies and as tree-mometers capable of acting as unique antennae for world temperature.

Updating the Graybill Chronologies

An obvious way of shedding light on potential problems with the Graybill chronologies would simply be to bring them up-to-date and show whether or not they are valid. Mann (and this argument is repeated by supporters) justified the failure to verify the Graybill chronologies on the basis that it is too “expensive” and that the sites are too “remote” – a justification conclusively refuted by our own “Starbucks Hypothesis” in Colorado.

Aside from our own efforts at Almagre in 2007, there is one other reported (but not archived) update, one which happened to be at the most important Graybill chronology – Sheep Mountain, a site which is not merely the most important in the AD1400 network, but one which becomes progressively more important in the longer PCs (especially the Mann and Jones 2003 PC1.) The Sheep Mt chronology was updated by Linah Ababneh, then a PhD student at the University of Arizona in 2003: see Ababneh 2006 (Ph. D. Thesis), 2007 (Quat Int). However, as previously reported at CA here (and related posts), Ababneh failed to replicate the distinctive HS shape of Graybill’s Sheep Mountain chronology, a shape that imprints the MBH reconstruction and, in particular, failed to verify the difference between the 20th century mean and long-term mean that led to the heavy weighting in the PC1. Her reconstruction was based on a far larger sample than Graybill’s. The differences are illustrated below:



Figure 3. Sheep Mountain Chronologies, Graybill versus Ababneh.

Linah Ababneh’s work has definitely not verified the most critical Graybill bristlecone chronology. Quite the contrary. Until the differences between her results and Graybill’s results are definitively reconciled, I do not see how any prudent person can use the Graybill chronologies, regardless of the multivariate method.

In our own work at Almagre, we identified issues related to ring widths in trees with strip bark that compromise statistical analysis, but have nothing to do with CO2 fertilization or previously identified issues. We found (see here and here) that strip bark forms can result in enormous (6-7 standard deviation) growth pulses in one portion of the core that are totally absent from other sections of the core, as illustrated below.



Figure 4. Almagre Tree 31 core samples, showing difference between cores taken only a few cm apart. Black (and red) show 2007 samples.

In a small collection (and “small” here can be as high as 30 or 50 cores), the presence/absence of a few such almost “cancerous” pulses would completely distort the average. The NAS panel said that “strip bark” forms should be “avoided”, although they seem to have had in mind the more traditional concerns of CO2 fertilization rather than what seem to Pete Holzmann and myself to be the problematic “mechanical” issues. Here there are some worrying aspects about the Graybill chronologies that should be of concern to more people than ourselves. First, Graybill and Idso (1993) said that cores were selected for the presence of strip bark, so the possibility of a bias is latent in the original article. Second, at Almagre, we identified trees with tag numbers where cores had been taken and are located at the University of Arizona, but Graybill’s archiving was incomplete. Why were cores excluded from the archive? Given that the Graybill chronologies underpin the entire MBH enterprise, these missing invoices are, to say the least, disquieting, given Graybill’s seemingly unique ability to detect 20th century differences.
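To put the effect of growth pulses on a small collection in rough numbers, here is a back-of-envelope Python sketch. It assumes standardized cores (unit standard deviation) that are independent of one another, and the count of three pulsed cores is a hypothetical figure chosen for illustration; only the collection size of 30 and the 6-7 standard deviation pulse size come from the discussion above.

```python
import math

def mean_shift(n_cores, n_pulsed, pulse_sd):
    """Shift in the site average (in SD units) caused by n_pulsed cores
    each carrying a pulse of pulse_sd standard deviations."""
    return n_pulsed * pulse_sd / n_cores

def standard_error(n_cores):
    """Standard error of the mean of n_cores independent unit-variance cores."""
    return 1.0 / math.sqrt(n_cores)

n_cores = 30     # a "small" collection, per the text
n_pulsed = 3     # hypothetical number of cores with a pulse
pulse_sd = 6.5   # midpoint of the 6-7 SD range mentioned above

shift = mean_shift(n_cores, n_pulsed, pulse_sd)
se = standard_error(n_cores)
print(f"shift in average: {shift:.2f} SD; standard error of average: {se:.2f} SD")
print(f"shift is {shift / se:.1f} times the standard error")
```

Under these assumptions, three pulsed cores out of thirty move the average by several times its own standard error, so whether those particular cores were sampled or archived determines the apparent result.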

Gaspé

As noted elsewhere, there are issues about whether the Gaspé reconstruction has been included in the AD1400 network only through ad hoc, undisclosed and unjustified accounting methods.

But aside from such issues, there is the important problem that, like Sheep Mountain, an update of the Gaspé chronology failed to yield the HS shape of the reconstruction used in MBH98. In this case, the authors of the update (Jacoby and d’Arrigo) failed to report or archive their update and it is through sheer chance that I even know about the update (which has not been reported anywhere other than CA). Again the “key” chronology used in MBH98 has not been verified.

The Bristlecone Divergence Problem

Ultimately the most relevant test of the “relationship between proxies and temperature” is whether updated proxies can reconstruct the temperature history of the 1980s, 1990s and 2000s. Here I mean the exact MBH98-99 proxies used in the AD1400 (and AD1000) networks; not a bait-and-switch. In the AD1400 (and AD1000) MBH case, a few key chronologies have been updated and so we have some insight on how the supposed “relationship” is holding up.

In our own sampling at Almagre, we found that ring widths in the 2000s were not at the record levels predicted by the Mannian relationship – and in fact had declined somewhat – one more instance of the prevalent “divergence problem”, though this example is not limited to high latitudes and affects one of the MBH PC1 proxies. Likewise, the Mann “relationship” at Sheep Mt would call for record ring widths there, but not only did Ababneh not observe such records, as noted above, she raised serious questions about the original Graybill chronology in the first place.



RE Statistic

In the face of all of this, how can Tamino (or anyone else) claim that the MBH reconstruction has been “verified”? Other than uncritical reliance on realclimate?

The main sleight of hand involves the RE statistic. The AD1400 reconstruction with old Sheep Mt and Gaspe chronologies has a high RE statistic. This appears to be the beginning and end of what Tamino (and realclimate) regards as “verification”. No need to verify the individual proxies. No need to pass other verification tests – even ones said to have been used in MBH98. No need to prove the validity of the relationship out-of-sample. All you need is one magic statistic – the RE statistic.

The trouble with the RE statistic, as we observed long ago, is that, meritorious or not, it’s not used in conventional statistics and little is known about its properties. In MM2005 (GRL) we showed that you could get high RE statistics using Mannian methodology on red noise. However, the problem with the RE statistic can be illustrated far more easily than occurred to us at the time. As noted on CA, I checked RE statistics for “reconstructions” using two of the most famous examples of spurious regression in econometrics: 1) Yule (1926) which shows a relationship between mortality and proportion of Church of England marriages; 2) Hendry (1980) which shows a relationship between cumulative U.K. rainfall and inflation). Both classic spurious regressions yield extremely high RE statistics – even higher than MBH98.

So although Mann characterizes the RE test as “rigorous”, it isn’t. It will fail to flag virtually any spurious regression (between co-trending unrelated series). I’m not saying that the RE test shouldn’t be run: I see no harm in using this test, but it’s only one test and is not in itself anywhere near sufficient to constitute verification of a supposed relationship between proxies and temperature. For Mann, Wahl and Ammann or Tamino to argue that passing an RE test is some sort of accomplishment merely sounds goofy to anyone familiar with Yule 1926 or Hendry 1980. You’d think that third party climate scientists would catch onto this by now.
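The point about co-trending series can be illustrated with a minimal Python sketch. It does not use the Yule or Hendry data: the two series below are hypothetical, each built as its own linear trend plus its own unrelated oscillation, and the calibration/verification split (late half versus early half) is likewise an assumption made for illustration.

```python
import math

def ols(x, y):
    """Ordinary least squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def re_statistic(obs_ver, est_ver, calib_mean):
    """RE: 1 - SSE / sum of squares about the calibration-period mean."""
    sse = sum((o - e) ** 2 for o, e in zip(obs_ver, est_ver))
    return 1.0 - sse / sum((o - calib_mean) ** 2 for o in obs_ver)

def r_squared(x, y):
    """Squared Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov ** 2 / (sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))

def first_differences(s):
    return [s[i + 1] - s[i] for i in range(len(s) - 1)]

# Two hypothetical series that share nothing except an upward trend.
t = range(100)
proxy = [0.02 * i + 0.3 * math.sin(1.7 * i) for i in t]
temp = [0.01 * i + 0.3 * math.sin(2.9 * i + 1.0) for i in t]

# Calibrate on the late half, verify on the early half.
cal, ver = slice(50, 100), slice(0, 50)
a, b = ols(proxy[cal], temp[cal])
est_ver = [a + b * p for p in proxy[ver]]
calib_mean = sum(temp[cal]) / len(temp[cal])

re = re_statistic(temp[ver], est_ver, calib_mean)
r2_diff = r_squared(first_differences(est_ver), first_differences(temp[ver]))
print(f"verification RE = {re:.2f}")
print(f"r2 of year-to-year changes = {r2_diff:.3f}")
```

The RE comes out comfortably positive even though the year-to-year changes in the two series are essentially uncorrelated: the statistic is rewarding the shared trend, which is exactly what a spurious regression supplies.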

I don’t think that anything useful can be shown by more and more calculations on the MBH network. At this point, the only relevant testing is the out-of-sample re-sampling, showing that the supposed “relationships between proxies and temperature” can be confirmed. Available information on MBH proxies has not verified these relationships.



Anything Else?

Is there anything else that remotely constitutes verification of MBH? I’d be happy to consider and respond to any suggestions or inquiries.

In the above discussion, I haven’t talked about principal components very much and there’s a reason for that. In our articles, we observed that the Mannian pseudo-PC methodology was severely biased towards picking out HS-shaped series. In the critical NOAMER network, the relationship between the difference in 20th century mean and PC1 weighting is so strong that the MBH PC1 could be described as follows:

Construct the following linear combination of chronologies: assign a weight to each chronology equal to the difference between the 20th century mean and overall mean (with negative weights assigned to negative differences.)

This methodology will regularly deliver HS shaped series from red noise. Mannian pseudo-PC methodology is a poor methodology in that its efforts to locate a HS shape interfere with the operation of the PC algorithm. If there is a very strong “signal” or if the true signal actually is HS-shaped, then the poor methodology doesn’t matter much relative to conventional PC methodology. In the practical situation of the NOAMER network, the net result of the flawed methodology was to deliver a high weight to bristlecones.
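The weighting rule described above can be tried on red noise in a few lines of Python. To be clear about the assumptions: this sketch applies the simplified linear-combination description of the PC1 given above, not the actual MBH principal components code, and the network size (70 series), the period lengths (581 years with the last 79 treated as the modern period) and the AR(1) coefficient (0.9) are hypothetical choices for illustration.

```python
import random
import statistics

def ar1_series(n, rho, rng):
    """Generate n values of AR(1) red noise with coefficient rho."""
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def mean(v):
    return sum(v) / len(v)

def hs_weights(series_list, n_modern):
    """Weight for each series: mean of its last n_modern values minus its overall mean."""
    return [mean(s[-n_modern:]) - mean(s) for s in series_list]

def weighted_combination(series_list, weights):
    """Linear combination of the series using the given weights."""
    n = len(series_list[0])
    return [sum(w * s[t] for w, s in zip(weights, series_list)) for t in range(n)]

rng = random.Random(0)          # fixed seed so the run is repeatable
n_years, n_modern = 581, 79     # assumed: AD1400-1980, last 79 years as "modern"
noise = [ar1_series(n_years, 0.9, rng) for _ in range(70)]

w = hs_weights(noise, n_modern)
combo = weighted_combination(noise, w)

# "Blade" of the combination: modern mean minus overall mean
blade = mean(combo[-n_modern:]) - mean(combo)
print(f"blade = {blade:.2f} ({blade / statistics.pstdev(combo):.2f} SD of the combined series)")
```

The upward blade is not an accident of the seed. Because each series is weighted by its own difference in mean, the difference in mean of the combination equals the sum of the squared weights, which is positive for any non-degenerate noise; series that happen to rise are added and series that happen to fall are flipped.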

If the bristlecones are magic trees, then the methodology might be flawed, but, at the end of the day, that wouldn’t “matter”.

If (1) bristlecones are not magic trees and/or the Graybill chronologies have some sort of “instrumental drift” resulting in a spurious regression relationship to world temperature, (2) the Mannian pseudo-PC methodology is flawed and (3) there is some other methodology that avoids the grossest flaws of the Mannian pseudo-PC methodology, but is still inadequate to detect a spurious regression against the Graybill methodologies, then, in a bizarro-world, bizarro-scientists might argue that the flawed methodology didn’t “matter” because they were going to do the calculation incorrectly anyway. Leading bizarro-scientists would perhaps go further, arguing, in addition, that the fact that they could go on to make completely different errors meant that criticisms of the original errors were “wrong”.

At the end of the day, the issue, as the NAS panel realized, is about proxies and verification statistics. That doesn’t mean that the criticisms of the PC methodology are incorrect; they aren’t. Just that the PC issues could be coopered up without settling the key issues on proxies and verification.

Preisendorfer described PC methodology as “exploratory” and this is precisely how we (but not Mann) applied PC methodology. Mannian pseudo-PC methodology identified the most HS-shaped series quite effectively. We used this to explore the NOAMER network and found that its selections were not random – it picked out the Graybill bristlecones. The scientific issue is then whether these are valid proxies – and this is an issue that is not settled by Rule N, but one that requires scientific evidence. And in all the discussion to date, Mann et al have produced no such evidence.

So did the PC error “matter”? Well, it probably mattered in a different way than people think.

Consider what would have happened had MBH not used an erroneous PC methodology. Let’s suppose that they used a centered PC calculation together with Preisendorfer’s Rule N, so that they retained 5 PCs in the AD1400 network, including the bristlecones, and everything reconciled the first time. What would have happened? In 2003, I’d probably have more or less replicated their results and thought no more about it. I would probably not have peered beneath the surface inquiring about the PC4 and bristlecones, verification r2 statistics and so on. I’d be making a handsome living in speculative mining stocks.

I followed the magic flute instead.



