I’ve had a fondness for Daryl Bem ever since his coauthored paper appeared in Psychological Bulletin back in 1994: a meta-analysis purporting to show replicable evidence for psionic phenomena. I cited it in Starfish, when I was looking for some way to justify the rudimentary telepathy my rifters experienced in impoverished environments. Bem and Honorton gave me hope that nothing was so crazy-ass that you couldn’t find a peer-reviewed paper to justify it if you looked hard enough.

Not incidentally, it also gave me hope that psi might actually exist. There’s a whole shitload of things I’d love to believe in if only there was evidence to support them, but can’t because I fancy myself an empiricist. But if there were evidence for telepathy? Precognition? Telekinesis? Wouldn’t that be awesome? And Bem was no fruitcake: the man was (and is) a highly-regarded social scientist (setting aside the oxymoron for the moment) out of Cornell, and at the top of his field. The man had cred.

It didn’t really pan out, though. There were grumbles and rebuttals about standardisation between studies and whether the use of stats was appropriate: the usual complaints that surface whenever analysis goes meta. What most stuck in my mind back then was the point (mentioned in the Discussion) that these results, whatever you thought of them, were at least as solid as those used to justify the release of new drugs to the consumer market. I liked that. It set things in perspective (although in hindsight, it probably said more about the abysmal state of Pharma regulation than it did about the likelihood of Carrie White massacring her fellow graduates at the high school prom).

Anyhow, Bem is back, and has made a much bigger splash this time around: nine experiments to be published in the Journal of Personality and Social Psychology, eight of which are purported to show statistically-significant evidence of not just psi but of actual precognition. The New York Times picked it up; everyone from Time to the Huffington Post sat up and took notice. Most of the mainstream reaction has been predictable and pretty much useless: Time misreads Bem’s description of how he controlled for certain artefacts as some kind of confession that those artefacts weren’t controlled for; the Winnipeg Free Press simply cites the study as one of several examples in an extended harrumph about the decline of peer-reviewed science. Probably the most substantive critiques hail from Wagenmakers et al (in a piece slated to appear in the same issue as Bem’s) and an online piece from James Alcock over at the Skeptical Inquirer website (which has erupted into a kind of three-way slap-fight with Bem and one of his supporters). And while I by no means dismiss all of the counter-arguments, even some of the more erudite skeptics’ claims seem a bit suspect (one might even use the word dishonest) if you’ve actually read the source material.

I’m not going into exquisite detail on any of this; click on the sources if you want details. But in general terms, I like what Bem set out to do. He took classic, well-established psychological tests and simply ran them backwards. For example, our memory of specific objects tends to be stronger if we have to interact with them somehow. If someone shows you a bunch of pictures and then asks you to, say, classify some of them by color, you’ll remember the ones you classified more readily than the others if presented with the whole set at some later point (the technical term is priming). So, Bem reasoned, suppose you’re tested against those pictures before you’re actually asked to interact with them? If you preferentially react to the ones you haven’t yet interacted with but will at some point in the future, you’ve established a kind of backwards flow of information. Of course, once you know what your subjects have chosen there’s always the temptation to do something that would self-fulfil the prophecy, so to speak; but you get around that by cutting humans out of the loop entirely, letting software and random number generators decide which pictures will be primed.

I leave the specific protocols of each experiment as an exercise for those who want to follow the links, but the overall approach was straightforward. Take a well-established cause-effect test; run it backwards; if your pre-priming hit rate is significantly greater than what you’d get from random chance, call it a win. Bem also posited that sex and death would be powerful motivators from an evolutionary point of view. There weren’t that many casinos or stock markets on the Pleistocene savannah, but knowing (however subconsciously) that something was going to try and eat you ten minutes down the road — or knowing that a potential mate lay in your immediate future — well, that would pretty obviously confer an evolutionary advantage over the nonpsychics in the crowd. So Bem used pictures both scary and erotic, hoping to increase the odds of significant results.

Note also that his thousand-or-so participants didn’t actually know up front what they were doing. There was no explicit ESP challenge here, no cards with stars or wavy lines. All that these people knew was that they were supposed to guess which side of a masked computer screen held a picture. They weren’t told what that picture was.

When that picture was neutral, their choices were purely random. When it was pornographic or scary, though, they tended to guess right more often than not. It wasn’t a big effect; we’re talking a hit rate of maybe 53% instead of the expected 50%. But according to the stats, the effect was real in eight out of nine experiments.
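To put a rough number on why a three-point edge can matter, here is a minimal sketch of a one-sided exact binomial test against coin-flip guessing. This is not the analysis Bem ran (his paper leans on t-tests and nonparametric alternatives), the trial counts are hypothetical round numbers rather than his data, and `binomial_tail` is just a name I made up for illustration.

```python
from math import comb

def binomial_tail(hits: int, trials: int) -> float:
    """One-sided exact p-value: P(X >= hits) when every guess is a fair coin flip.

    Uses exact integer arithmetic so large trial counts do not underflow.
    """
    favourable = sum(comb(trials, k) for k in range(hits, trials + 1))
    return favourable / 2**trials

# Hypothetical sample sizes, all with the same 53% hit rate (NOT Bem's data).
for trials in (100, 1000, 10000):
    hits = trials * 53 // 100
    print(f"{hits}/{trials} hits: p = {binomial_tail(hits, trials):.3g}")
```

The same 53% hit rate that is indistinguishable from coin-flipping over a hundred guesses becomes very hard to blame on chance over ten thousand, which is why a small effect and a significant one are not mutually exclusive.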

Now, of course, everyone and their dog is piling on to kick holes in the study. That’s cool; that’s what we do, that’s how it works. Perhaps the most telling critique is the only one that really matters: nobody has been able to replicate Bem’s results yet. That speaks a lot louder than some of the criticisms that have been leveled against Bem in recent days, at least partly because some of those criticisms seem, well, pretty dumb. (Bem himself responds to some of Alcock’s complaints here.)

Let’s do a quick drive-by on a few of the methodological accusations folks have been making: Bem’s methodology wasn’t consistent. Bem log-transformed his data; oooh, maybe he did it because untransformed data didn’t give him the results he wanted. Bem ran multiple tests without correcting for the fact that the more often you run tests on a data set, the greater the chance of getting significant results through random chance. To name but a few.

Maybe my background in field biology makes me more forgiving of such stuff, but I don’t consider tweaking one’s methods especially egregious when it’s done to adapt to new findings. For example, Bem discovered that men weren’t as responsive as women to the level of eroticism in his initial porn selections (which, as a male, I totally buy; those Harlequin Romance covers don’t do it for me at all). So he ramped the imagery for male participants up from R to XXX. I suppose he could have continued to use nonstimulating imagery even after realising that it didn’t work, just as a fisheries biologist might continue to use the same net even after discovering that its mesh was too large to catch the species she was studying. In both cases the methodology would remain “consistent”. It would also be a complete waste of resources.

Bem also got some grief for using tests of statistical significance (i.e., what are the odds of getting results like these through random chance alone?) rather than Bayesian methods (i.e., given these results and what we believed going in, what are the odds that our hypothesis is true?). (Carey gives a nice comparative thumbnail of the two approaches.) I suspect this complaint could be legit. The problem I have with Bayes is that it takes your own preconceptions as a starting point: you get to choose up front the odds that psi is real, and the odds that it is not. If the data run counter to those odds, the theorem adjusts them to be more consistent with those findings on the next iteration; but obviously, if your starting assumption is that there’s a 99.9999999999% chance that precognition is bullshit, it’s gonna take a lot more data to swing those numbers than if you start from a bullshit-probability of only 80%. Wagenmakers et al tie this in to Laplace’s famous statement that “extraordinary claims require extraordinary proof” (to which we shall return at the end of this post), but another way of phrasing that is “the more extreme the prejudice, the tougher it is to shake”. And Bayes, by definition, uses prejudice as its launch pad.
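For the curious, the prior-dependence is easy to sketch. What follows is a toy illustration of Bayesian updating in odds form, not the procedure Wagenmakers et al used: the function names are mine, the Bayes factor of 3 per study is a hypothetical value picked for the example, and it assumes every study is independent and equally convincing, which real studies never are.

```python
def posterior(prior: float, bayes_factor: float) -> float:
    """Update P(H1) given a Bayes factor = P(data | H1) / P(data | H0)."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1 + post_odds)

def studies_needed(prior: float, bayes_factor: float, target: float = 0.5) -> int:
    """Count identical, independent studies needed to lift P(H1) above target."""
    if bayes_factor <= 1:
        raise ValueError("evidence must favour H1 or belief never rises")
    belief, count = prior, 0
    while belief <= target:
        belief = posterior(belief, bayes_factor)
        count += 1
    return count

# Hypothetical priors for "psi is real": the 80% skeptic and the hardliner.
for prior in (0.2, 1e-12):
    print(f"prior {prior:g}: {studies_needed(prior, 3.0)} studies to pass 50%")
```

Same data, same arithmetic; the only thing that changes the answer is how sure you were before you looked.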

Wagenmakers et al ran Bem’s numbers using Bayesian techniques, starting with standard “default” values for their initial probabilities (they didn’t actually say what those values were, although they cited a source). They found “substantial” support for precognition (H1) in only one of Bem’s nine experiments, and “substantial” support for its absence (H0) in another two (they actually claim three, but for some reason they seem to have run Bem’s sixth experiment twice). They then reran the same data using a range of start-up values that differed from these “defaults”, just to be sure, and concluded that their results were robust. They refer the reader to an online appendix for the details of that analysis. This is what you’ll find there:

Notice the figure caption: “… the results in favor of H1 are never compelling, except perhaps for the bottom right panel.” Except perhaps? I’m sorry, but that last panel looks pretty damn substantial to me, if for no other reason than that so much of the curve falls into the evidentiary range that the axis itself labels as, er, substantial. In other words, even assuming that these guys were right on the money with all of their criticisms, even assuming that they’ve successfully demolished eight of Bem’s nine claims to significance, they’re admitting to evidence for the existence of precognition by their own reckoning. And yet, they can’t bring themselves to admit it, even in a caption belied by its own figure.

To some extent, it was Bem’s decision to make his work replication-friendly that put this particular bullseye on his chest. He chose methods that were well-known and firmly established in the research community; he explicitly rejected arcane statistics in favor of simple ones that other social scientists would be comfortable with. (“It might actually be more logical from a Bayesian perspective to believe that some unknown flaw or artifact is hiding in the weeds of a complex experimental procedure or an unfamiliar statistical analysis than to believe that genuine psi has been demonstrated,” he writes. “As a consequence, simplicity and familiarity become essential tools of persuasion.”) Foreseeing that some might question the distributional assumptions underlying t-tests, he log-transformed his data to normalise it prior to analysis; this inspired Wagenmakers et al to wonder darkly “what the results were for the untransformed RTs—results that were not reported”. Bem also ran the data through nonparametric tests that made no distributional assumptions at all; Alcock then complained about unexplained, redundant tests that added nothing to the analysis (despite the fact that Bem had explicitly described his rationale), and about the use of multiple tests that didn’t correct for the increased odds of false positives.

This latter point is true in the general but not in the particular. Every grad student knows that desperate sinking feeling that sets in when their data show no apparent patterns at all, and the temptation to inflict endless tests and transforms in the hope that please God something might show up. But Bem already had significant results; he used alternative analyses in case those results were somehow artefactual, and he kept getting significance no matter which way he came at the problem. Where I come from, it’s generally considered a good sign when different approaches converge on the same result.
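The "true in the general" part is simple arithmetic, sketched below. It assumes the tests are independent, which is exactly what does not hold when you run alternative analyses on the same data set, so treat it as the worst-case fishing-expedition scenario rather than a description of what Bem did. The function names are hypothetical, mine rather than anyone's library.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test cutoff that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests

# How fast a nominal 0.05 threshold erodes as the number of tests grows.
for n_tests in (1, 5, 10, 20):
    print(
        f"{n_tests:2d} tests: familywise error = {familywise_error(0.05, n_tests):.3f}, "
        f"Bonferroni cutoff = {bonferroni_threshold(0.05, n_tests):.4f}"
    )
```

Run enough independent tests on pure noise and something will eventually clear 0.05; that is the legitimate worry behind the complaint, whether or not it applies here.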

Bem also considered the possibility that there might be some kind of bias in algorithms used by the computer to randomise its selection of pictures; he therefore replicated his experiments using different random-number generators. He showed all his notes, all the messy bits that generally don’t get presented when you want to show off your work in a peer-reviewed journal. He not only met the standards of rigor in his field: he surpassed them, and four reviewers (while not necessarily able to believe his findings) couldn’t find any methodological or analytical flaws sufficient to keep the work from publication.

Even Bem’s opponents admit to this. Wagenmakers et al explicitly state:

“Bem played by the implicit rules that guide academic publishing—in fact, Bem presented many more studies than would usually be required.”

They can’t logically attack Bem’s work without attacking the entire field of psychology. So that’s what they do:

“… our assessment suggests that something is deeply wrong with the way experimental psychologists design their studies and report their statistical results. It is a disturbing thought that many experimental findings, proudly and confidently reported in the literature as real, might in fact be based on statistical tests that are explorative and biased (see also Ioannidis, 2005). We hope the Bem article will become a signpost for change, a writing on the wall: psychologists must change the way they analyze their data.”

And you know, maybe they’re right. We biologists have always looked at those soft-headed new-agers over in the Humanities building with almost as much contempt as the physicists and chemists looked at us, back before we owned the whole genetic-engineering thing. I’m perfectly copacetic with the premise that psychology is broken. But if the field is really in such disrepair, why is it that none of those myriad less-rigorous papers acted as a wake-up call? Why snooze through so many decades of hack analysis only to pick on a paper which, by your own admission, is better than most?

Well, do you suppose anyone would be eviscerating Bem’s methodology with quite so much enthusiasm if he’d concluded that there was no evidence for precognition? Here’s a hint: Alcock’s critique painstakingly picks at every one of Bem’s experiments except for #7. Perhaps that seventh experiment finally got it right, you think. Perhaps Alcock gave that one a pass because Bem’s methodology was, for once, airtight? Let’s let Alcock speak for himself:

“The hit rate was not reported to be significant in this experiment. The reader is therefore spared my deliberations.”

Evidently bad methodology isn’t worth criticising, just so long as you agree with the results.

This leads nicely into what is perhaps the most basic objection to Bem’s work, a more widespread and gut-level response that both underlies and transcends the methodological attacks: sheer, eye-popping incredulity. This is bullshit. This has to be bullshit. This doesn’t make any goddamned sense.

It mustn’t be. Therefore it isn’t.

Of course, nobody phrases it that baldly. They’re more likely to claim that “there’s no mechanism in physics which could explain these results.” Wagenmakers et al went so far as to claim that Bem’s effect can’t be real because nobody is bankrupting the world’s casinos with their psychic powers, which is logically equivalent to saying that protective carapaces can’t be advantageous because lobsters aren’t bulletproof. As for the whacked-out argument that there’s no theoretical mechanism in place to describe the data, I can’t think of a more effective way of grinding science to a halt than to reject any data that don’t fit our current models of reality. If everyone thought that way, earth would still be a flat disk at the center of a crystal universe.

Some people deal with their incredulity better than others. (One of the paper’s reviewers opined that they found the results “ridiculous”, but recommended publication anyway because they couldn’t find fault with the methodology or the analysis.) Others take refuge in the mantra that “extraordinary claims require extraordinary evidence”.

I’ve always thought that was a pretty good mantra. If someone told me that my friend had gotten drunk and run his car into a telephone pole I might evince skepticism out of loyalty to my friend, but a photo of the accident scene would probably convince me. People get drunk, after all (especially my friends); accidents happen. But if the same source told me that a flying saucer had used a tractor beam to force my friend’s car off the road, a photo wouldn’t come close to doing the job. I’d just reach for the Photoshop manual to figure out how the image had been faked. Extraordinary claims require extraordinary evidence.

The question, here in the second decade of the 21st Century, is: what constitutes an “extraordinary claim”? A hundred years ago it would have been extraordinary to claim that a cat could be simultaneously dead and alive; fifty years ago it would have been extraordinary to claim that life existed above the boiling point of water, kilometers deep in the earth’s crust. Twenty years ago it was extraordinary to suggest that the universe was not only expanding but accelerating. Today, physics concedes the theoretical possibility of time travel (in fact, I’ve been led to believe that the whole arrow-of-time thing has always been problematic to the physicists; most of their equations work both ways, with no need for a unidirectional time flow).

Yes, I know. I’m skating dangerously close to the same defensive hysteria every new-age nutjob invokes when confronted with skepticism over the Healing Power of Petunias; yeah, well, a thousand years ago everybody thought the world was flat, too. The difference is that those nutjobs make their arguments in lieu of any actual evidence whatsoever in support of their claims, and the rejoinder of skeptics everywhere has always been “Show us the data. There are agreed-upon standards of evidence. Show us numbers, P-values, something that can pass peer review in a primary journal by respectable researchers with established reputations. These are the standards you must meet.”

How often have we heard this? How often have we pointed out that the UFO cranks and the Ghost Brigade never manage to get published in the peer-reviewed literature? How often have we pointed out that their so-called “evidence” isn’t up to our standards?

Well, Bem cleared that bar. And the response of some has been to raise it. All along we’ve been demanding that the fringe adhere to the same standards the rest of us do, and finally the fringe has met that challenge. And now we’re saying they should be held to a different standard, a higher standard, because they are making an extraordinary claim.

This whole thing makes me deeply uncomfortable. It’s not that I believe the case for precognition has been made; it hasn’t. Barring independent confirmation of Bem’s results, I remain a skeptic. Nor am I especially outraged by the nature of the critiques, although I do think some of them edge up against outright dishonesty. I’m on public record as a guy who regards science as a knock-down-drag-out between competing biases, personal more often than not. (On the other hand, if I’d tried my best to demolish evidence of precognition and still ended up with “substantial” support in one case out of nine, I wouldn’t be sweeping it under the rug with phrases like “never compelling” and “except perhaps”; I’d be saying “Holy shit, the dude really overstated his case but there may be something to this anyway…”)

I am, however, starting to have second thoughts about Laplace’s Principle. I’m starting to wonder if it’s especially wise to demand higher evidentiary standards for any claim we happen to find especially counterintuitive this week. A consistently-applied 0.05 significance threshold may be arbitrary, but at least it’s independent of the vagaries of community standards. The moment you start talking about extraordinary claims you have to define what qualifies as one, and the best definition I can come up with is: any claim which is inconsistent with our present understanding of the way things work. The inevitable implication of that statement is that today’s worldview is always the right one; we’ve already got a definitive lock on reality, and anything that suggests otherwise is especially suspect.

Which, you’ll forgive me for saying so, seems like a pretty extraordinary claim in its own right.

Maybe we could call it the Galileo Corollary.