while (true) { error=error.calculateError(); }

[SPOILER ALERT: THIS WEEK’S NFL OUTCOMES BELOW]

So, yesterday I just finished running diagnostics on that old Windows 98 machine, in an effort to figure out where its sudden fit of apparent self-awareness, and interest in American Football, had come from.

Churning through its logs (messy business), I discovered a JVM stack dump pointing to some code in a file which no longer existed. After a bit of undeleting and decompiling I actually found the source of the crash.

Here’s what happened:

Not content with coming up with a set of predictions for the outcomes of last week’s games, it seems that the computer was also trying to come up with an indication of how certain its predictions were.

This struck me as remarkably forward thinking for such a humble machine. Of course, any pundit can have a stab at what they think the outcome of a game will be but – depending on the specific expertise of the pundit, the information available, and how closely matched the game is – some stabs are going to be aimed precisely at the jugular, while others are going to more closely resemble a blindfolded kid trying to whack a Piñata (a fact which last week’s predictions demonstrate only too well).

Certainly, if we’re going to be placing actual bets based on algorithmic predictions of game outcomes, we are going to want to weigh this information up very carefully; if the likelihood that a given prediction is correct is as slight as the potential winnings associated with it, it probably wouldn’t be prudent to put any money down at all.
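To put a number on that trade-off, here is a minimal sketch of the expected profit of a single bet. The decimal odds of 1.91 are purely an assumption for illustration (they are not taken from this post), and the function names are made up for the example.

```python
def expected_profit(p_win, decimal_odds, stake=1.0):
    """Expected profit of a bet that wins with probability p_win.

    A win returns stake * decimal_odds (so the profit is stake * (odds - 1));
    a loss costs the stake.
    """
    win_profit = stake * (decimal_odds - 1.0)
    return p_win * win_profit - (1.0 - p_win) * stake


def break_even_probability(decimal_odds):
    """Win probability at which the expected profit is exactly zero."""
    return 1.0 / decimal_odds


# ASSUMPTION: illustrative decimal odds, not a real quoted price.
ODDS = 1.91

for p in (0.52, 0.65, 0.98):
    profit = expected_profit(p, ODDS)
    print(f"confidence {p:.0%}: expected profit per unit staked = {profit:+.3f}")

print(f"break-even confidence at odds {ODDS}: {break_even_probability(ODDS):.1%}")
```

Under these assumed odds, a prediction the system is only 52% sure about has a negative expected profit, which is the "probably wouldn’t be prudent" case described above.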

That’s all well and good then. But what actually caused the machine to self-destruct?

Well, it’s all to do with the way in which it was trying to calculate its confidence. After training a machine learning algorithm to predict the actual outcomes, it studied the size of the difference between the predicted outcomes and the actual outcomes (i.e. the absolute error), and then went about training a second machine learning algorithm to predict that figure. That is, an algorithm which could predict how well the first algorithm could predict game outcomes. Algorithm heaven!
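For the curious, the two-stage idea can be sketched roughly as follows. This is not the machine’s actual code (which I can’t read): it assumes a single made-up feature, a plain least-squares line for both models, and synthetic data whose noise grows with the feature. In practice you would probably want to measure the errors on games the first model had not been trained on.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


# ASSUMPTION: made-up data. The noise term grows with x, so predictions
# for large x deserve less trust than predictions for small x.
xs = list(range(1, 21))
ys = [2.0 * x + 0.5 * x * (-1) ** x for x in xs]

# Stage 1: a model that predicts the outcome.
outcome_model = fit_line(xs, ys)

# Stage 2: a model that predicts the SIZE of stage 1's error.
abs_errors = [abs(y - predict(outcome_model, x)) for x, y in zip(xs, ys)]
error_model = fit_line(xs, abs_errors)

for x in (2, 20):
    print(
        f"x={x}: predicted outcome {predict(outcome_model, x):.2f}, "
        f"predicted absolute error {predict(error_model, x):.2f}"
    )
```

The second model ends up predicting a larger error for large x than for small x, which is exactly the "how much should I trust this one?" signal the machine was after.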

Note that predicting the size of the error is not the same as predicting the actual difference between the predicted outcomes and the true outcomes (or the residuals). If you could do this, then you would simply be able to make better original predictions. Boosting algorithms are based on this principle, and exploit the fact that algorithms used further down the chain can be selected and tuned specifically towards the problem of predicting the things (e.g. game outcomes) which preceding algorithms failed to. However, the gains from doing this are usually slight, owing to the fact that all algorithms are essentially trying to solve the same problem, and no new information is brought into the fray at each step.

In comparison, learning to predict the size of the error (i.e. the absolute or squared residuals) is actually quite a different problem compared to making the original predictions (and in many cases may even be an easier one).
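A small sketch of that contrast, again under made-up assumptions (one synthetic feature, a least-squares line as the only kind of model): refitting the same model to the signed residuals finds nothing left to learn, while fitting it to the absolute residuals picks up a clear trend.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# ASSUMPTION: synthetic data with noise that grows with x.
xs = list(range(1, 21))
ys = [2.0 * x + 0.5 * x * (-1) ** x for x in xs]

slope, intercept = fit_line(xs, ys)
signed_residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
absolute_residuals = [abs(r) for r in signed_residuals]

# Same model, same feature, signed residuals: least squares has already
# removed any linear trend, so the fitted slope is zero up to rounding.
signed_slope, _ = fit_line(xs, signed_residuals)

# Absolute residuals are a different target, and here they do trend with x.
absolute_slope, _ = fit_line(xs, absolute_residuals)

print(f"slope fitted to signed residuals:   {signed_slope:.6f}")
print(f"slope fitted to absolute residuals: {absolute_slope:.6f}")
```

A boosting setup would normally use a different or differently tuned learner at each stage rather than the identical one, so this only illustrates the "no new information" point in its most extreme form.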

Anyway, the reason the machine “crashed” is that it got a bit obsessed with this problem. After predicting the errors to accompany its original predictions, it then seemingly thought to itself “Hang on a minute, how do I know how good my errors are?” Very shortly thereafter it arrived at the conclusion “What I really need is a model for predicting the error in my errors” (which of course is an entirely new problem). After diligently solving this problem, it seems that it then thought to itself “Hang on a minute, how do I know how good my error errors are?”… and so on.
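Had the machine given itself a stopping rule, the same chain of "errors in errors" might have looked something like the sketch below. Everything here is hypothetical: the `max_depth` cap, the function names, the single synthetic feature and the least-squares line standing in for whatever the machine really used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


def fit_error_chain(xs, ys, max_depth):
    """Fit a model to ys, then a model to its absolute errors, and so on.

    Unlike the loop in the title, this one stops after max_depth models.
    """
    models = []
    targets = ys
    for _ in range(max_depth):
        model = fit_line(xs, targets)
        models.append(model)
        # The next level's target is the size of this level's error.
        targets = [abs(t - predict(model, x)) for x, t in zip(xs, targets)]
    return models


# ASSUMPTION: synthetic data, as no real game data is given in the post.
xs = list(range(1, 21))
ys = [2.0 * x + 0.5 * x * (-1) ** x for x in xs]

chain = fit_error_chain(xs, ys, max_depth=3)
for level, (slope, intercept) in enumerate(chain):
    print(f"level {level}: slope {slope:.3f}, intercept {intercept:.3f}")
```

Level 0 predicts the outcome, level 1 predicts the error, level 2 predicts the error in the error, and then it politely stops.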

I know this because (after the smoke cleared) the stack dump went on for about fifty pages.

Although this obsessive episode may not seem to bode very well for general-purpose AI, the good news is that I have managed to reverse engineer the code to the point that I can now execute it again to generate predictions for this week’s games. Because the code is written in some extremely opaque AI-devised proto-language (itself seemingly compiled from some equally opaque AI-devised proto-language… all built on top of Java), I don’t actually have a clue how it works. As such, the only way I could force it to stop executing and output its predictions (and – importantly – its error estimates), before taking down the power of a whole city block and re-incarnating itself as Gandalf the White, was to construct an unnecessarily complex Rube Goldberg style contraption operated by a big red fruit machine button, which simultaneously operates the CTRL+C keys and an old medium-format film camera pointed at the screen.

So, my sincere apologies if these have taken a while:

FIXTURES

VISITOR @ HOME               VISITOR SPREAD   OVER/UNDER
NY GIANTS @ CHICAGO          7.5              47.5
PITTSBURGH @ NY JETS         2.5              41
DETROIT @ CLEVELAND          -2.5             44
CAROLINA @ MINNESOTA         2.5              44
CINCINNATI @ BUFFALO         -7.5             42
PHILADELPHIA @ TAMPA BAY     -1.5             45.5
OAKLAND @ KANSAS CITY        9.5              40.5
GREEN BAY @ BALTIMORE        -3.5             49
ST.LOUIS @ HOUSTON           7.5              42.5
JACKSONVILLE @ DENVER        26.5             53.5
TENNESSEE @ SEATTLE          13.5             40.5
NEW ORLEANS @ NEW ENGLAND    2.5              50.5
ARIZONA @ SAN FRANCISCO      10.5             41.5
WASHINGTON @ DALLAS          5.5              53.5
INDIANAPOLIS @ SAN DIEGO     -1.5             50

PREDICTIONS

VISITOR-HOME   WINNER       AGAINST SPREAD   TOTAL
19-28          CHI (78%)    CHI (55%)        UNDER (52%)
21-23          NYJ (58%)    PIT (52%)        OVER (63%)
22-20          DET (57%)    CLE (52%)        UNDER (58%)
21-23          MIN (56%)    CAR (51%)        ? (50%)
22-20          CIN (56%)    BUF (67%)        ? (50%)
24-19          PHI (68%)    PHI (63%)        UNDER (60%)
17-23          KC (75%)     OAK (65%)        UNDER (53%)
27-22          GB (71%)     GB (56%)         ? (50%)
21-23          HOU (55%)    STL (63%)        OVER (54%)
15-42          DEN (100%)   DEN (51%)        OVER (60%)
20-22          SEA (58%)    TEN (98%)        OVER (57%)
25-20          NO (68%)     NO (77%)         UNDER (73%)
17-19          SF (57%)     ARI (80%)        UNDER (73%)
24-29          DAL (68%)    WAS (52%)        UNDER (52%)
27-22          IND (71%)    IND (65%)        UNDER (54%)

For your betting pleasure, I’ve highlighted any predictions about which the system is more than 75% confident in white, and those about which it is more than 65% confident in light grey. Predictions with closer to 50% error (which you may notice are the majority) are essentially cases where the algorithm has very little inkling one way or another, and should not be taken as recommendations. If only human pundits were as precise.

-Justin