As I have mentioned a number of times, I teach data mining to the MBA students here at the Tepper School. It is a popular course, with something like 60% of our students taking it before graduating. I offer an operations research view of data mining: here are the algorithms, here are the assumptions, here are the limits. We talk about how to present results and how to clean up data. While I talk about limits, I know that the students get enthusiastic due to the not-insignificant successes of data mining.

We give students access to a real data mining system (currently SAS’s Enterprise Miner; in the past we used SPSS’s Clementine, since rebranded PASW for some reason). Invariably, students start feeding in financial data in an attempt to “beat the stock market”. In class, I talk about how appealing this approach is, and how unlikely it is to work. But students try anyway.

The Intelligent Investor column of the Wall Street Journal has a nice article on this (thanks Jainik for the pointer!) with the title: “Data Mining Isn’t a Good Bet for Stock-Market Predictions”.

The Super Bowl market indicator holds that stocks will do well after a team from the old National Football League wins the Super Bowl. The Pittsburgh Steelers, an original NFL team, won this year, and the market is up as well. Unfortunately, the losing Arizona Cardinals also are an old NFL team. The “Sell in May and go away” rule advises investors to get out of the market after April and get back in after October. With the market up 17% since April 30, that rule isn’t looking so good at this point. Meanwhile, dozens — probably hundreds — of Web sites hawk “proprietary trading tools” and analytical “models” based on factors with cryptic names like McMillan oscillators or floors and ceilings. There is no end to such rules. But there isn’t much sense to most of them either. An entertaining new book, “Nerds on Wall Street,” by the veteran quantitative money manager David Leinweber, dissects the shoddy thinking that underlies most of these techniques.

The article then gives a great example of how you can “predict” the US stock market by looking at Bangladeshi butter production. Now, I don’t buy the starkly negative phrasing of the column: he (Jason Zweig) refers to data mining as a “sham”, which I think goes much too far. Later in the article, he talks about what it takes to do data mining right:
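The butter example is easy to reproduce for yourself: search enough unrelated data series and one of them will correlate strongly with the market by pure chance. Here is a minimal sketch of that effect using only the standard library; the numbers (20 “annual” returns, 500 noise series) are my own illustrative choices, not anything from the column:

```python
import random

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
# 20 fake annual market returns.
market = [random.gauss(0, 1) for _ in range(20)]
# 500 candidate "predictors" that are pure noise
# (butter production, Super Bowl winners, ...).
candidates = [[random.gauss(0, 1) for _ in range(20)] for _ in range(500)]

best = max(abs(corr(c, market)) for c in candidates)
print(f"best in-sample |correlation| among 500 noise series: {best:.2f}")
```

With this many candidates and so few observations, the best in-sample correlation is typically striking, even though every series is random noise. That is exactly the trap: the correlation is real in the sample and meaningless out of it.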

That points to the first rule for keeping yourself from falling into a data mine: The results have to make sense. Correlation isn’t causation, so there needs to be a logical reason why a particular factor should predict market returns. No matter how appealing the numbers may look, if the cause isn’t plausible, the returns probably won’t last.

The second rule is to break the data into pieces. Divide the measurement period into thirds, for example, to see whether the strategy did well only part of the time. Ask to see the results only for stocks whose names begin with A through J, or R through Z, to see whether the claims hold up when you hold back some of the data.

Next, ask what the results would look like once trading costs, management fees and applicable taxes are subtracted. Finally, wait. Hypothetical results usually crumple after they collide with the real-world costs of investing. “If a strategy’s worthwhile,” Mr. Leinweber says, “then it’ll still be worthwhile in six months or a year.”
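The second rule (break the data into pieces) is just holdout validation, and it takes only a few lines to apply to any claimed trading rule. A minimal sketch, with a made-up noise “signal” and toy backtest standing in for a real strategy:

```python
import random

def split_into_thirds(series):
    """Split a series into three consecutive sub-periods,
    per the rule of checking a strategy on each piece separately."""
    n = len(series)
    return series[: n // 3], series[n // 3 : 2 * n // 3], series[2 * n // 3 :]

def backtest(returns, signal):
    """Toy backtest: average return on the days the signal says 'buy'."""
    picked = [r for r, s in zip(returns, signal) if s]
    return sum(picked) / len(picked) if picked else 0.0

random.seed(0)
# 900 fake daily returns and a 'buy' signal that is pure noise.
returns = [random.gauss(0.0005, 0.01) for _ in range(900)]
signal = [random.random() < 0.5 for _ in range(900)]

for i, (r_chunk, s_chunk) in enumerate(
        zip(split_into_thirds(returns), split_into_thirds(signal)), 1):
    print(f"period {i}: mean return on signal days = "
          f"{backtest(r_chunk, s_chunk):+.5f}")
```

A strategy with a genuine edge should look roughly as good in each third; a data-mined one usually shines in one sub-period and evaporates in the others. (Trading costs, fees, and taxes, the column’s next rule, would then be subtracted from whatever survives.)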

This is all good advice, and part of what I try to talk about in the course (though having the article makes things much easier). My conclusion: there is “sham” data mining, but that doesn’t mean all data mining is a sham. I’d love to read the book, but the Kindle version is running at $23.73, a price that I suspect was set by data mining.