The Crown Game Affair



What constitutes evidence of cheating?



Faye Dunaway is an Academy Award-winning actress who co-starred with the late Steve McQueen in the 1968 movie “The Thomas Crown Affair.” She plays a freelance insurance fraud investigator, Vicki Anderson, who believes that millionaire playboy Thomas Crown is guilty of instigating a $2.6 million bank heist, but falls in love with him anyway. The most famous scene in the movie shows her both defeating and seducing Crown in a game of chess.

Today I write about the difficulty of detecting fraud at chess, and the role of statistical evidence.

The New York Times this morning joins several chess media outlets covering the allegations against a Bulgarian player who was searched during a tournament in Croatia last month. When we mentioned this case in our “Predictions and Principles” post earlier this month, I had the issue of principles regarding statistical evidence high in my mind, and this is reflected in my exemplar of a formal report. It accompanies a cover letter to the Association of Chess Professionals, raising the issue of what to do when there is no physical or observational evidence but the statistical evidence is strong, and who should have oversight of standards and procedures for statistical tests.

Dunaway also had a small role in the 1999 remake, in which Crown again escapes uncaught but with a different endgame. Crown is played by Pierce Brosnan of James Bond fame. There is a James Bond quality to current speculation about possible cheating methods, from embedded chips to special reflective glasses. They are among items considered at the end of this 70-minute video by Bulgarian master Valeri Lilov, which was also covered by ChessBase.com. But none of this speculation is accompanied by any hard evidence. The real action may lie not with the kind of gadgeteers who interact with M or Q or even Miss Moneypenny, but rather with the actuaries down below who track the numbers.

Arbiter By Numbers

Cheating and gamesmanship at chess are nothing new—only the possible sources of illegal information during games have changed from ‘animal’ and ‘vegetable’ to ‘mineral.’ The following lyrics from the “Arbiter’s Song” in the 1986 musical “Chess” come from actual incidents at championship matches before then.

If you’re thinking of the kind of things

that we’ve seen in the past:

Chanting gurus, walkie-talkies, walkouts, hypnotists,

tempers, fists—

Not so fast.

Now there are calls for directors and judges and arbiters at chess tournaments and matches to take high-tech measures against possible cheating even while the games are going on. One limitation of testing moves in the manner needed by my model is that the games must be analyzed to evaluate all reasonable move choices with equal thoroughness, which takes a fast processor core more time than the duration of the game itself when it was played.

Still, the need for move-analysis tests is recognized by many. Indeed, the very first comment posted on the story when it broke on December 30 came from British master Leonard Barden, who has been chess columnist of The Guardian newspaper for 54 years, and of the Financial Times for a mere 34 years. Barden put the present issues plainly:

Either (1) Borislav Ivanov is probably the first adult (as opposed to a junior talent) with a confirmed low rating ever to achieve a 2600+ GM norm performance in an event of nine rounds or more… or (2) [He] is the first player ever to successfully cheat at a major tournament over multiple rounds without the cheating mechanism being detected.

Here 2600 is a chess rating that usually distinguishes a “strong grandmaster,” while my own rating near 2400 is typical of the lesser title called “international master,” and Ivanov’s pre-tournament rating of 2227 is near the 2200 floor to be called any kind of master. Although Magnus Carlsen recently broke Garry Kasparov’s all-time rating record to reach 2861, my program for “Intrinsic Ratings” clocked Ivanov’s performance in the range 3089–3258 depending on which games and moves are counted according to supplementary information in the case, all higher than any by Carlsen and his two closest pursuers enumerated by me here, or anything here or here.

Barden bears with him the memory of a British prodigy born the same year as he, coincidentally named Gordon Thomas Crown, who passed away of illness in 1947 shortly after defeating Soviet grandmaster Alexander Kotov in one of two games during a “summit match” between Britain and the USSR. He continued:

There are no examples known of devices successfully transmitting chess moves in competitive play via contact lenses, the skin, the brain or other such concepts … [T]he cheating mechanism in this case remains unexplained. That’s why it is important that somebody with access to Houdini or another top program examines the nine [games] with a program. Such a program check of the games may help to establish whether the player used computer assistance.

Thus all the tech-talk takes a back seat to simple numbers when it comes to getting results. The question remains, are they enough?

The Issue

It took me two days to run the main test with two top programs, Rybka 3 and Houdini 3, run several supporting analyses, and then run my statistical analyzer. Writing the report took another week, however, as I also felt responsible for articulating issues of how to evaluate this kind of evidence, and for spelling out scientific particulars for due process. The drift of reactions to others’ early scattershot tests also shifted my intent: I had originally been advised to write my conclusions briefly and simply for chess players, but I ended up writing for experts in statistical fields, and for a student audience such as in a seminar I am running this coming term.

My report gives examples addressing when and why and how odds of “a million to one” should be treated differently from “a thousand to one.” The latter typifies my results in some cases where there was also physical or observational evidence, but here there is as yet none. Here is a different example to the same effect.

Mark Crowther of London has provided an incredible service called The Week In Chess (TWIC), which collects for free download several thousand games played in tournaments over the preceding week. The current week, TWIC 948, has games by over a thousand players—1,010 to be exact—typically 4–6 per player for a weekend tournament up to 9 for an all-week event such as the Zadar Open itself. If one were to dredge all their games, one would expect to find a statistical deviation that would translate to 1,000–1 odds against some kind of “null hypothesis” about cheating. Clearly Inspector Javert should have left the other characters in Les Misérables alone and taken up statistics. The fear of players being fingered this way is remarked by Dylan McClain in today’s New York Times column:

If every out-of-the-ordinary performance is questioned, bad feelings could permanently mar the way professional players approach chess.

Hence my policy has been that such statistical results have meaning only when there is evidence against the player that is independent of performance or move-match tests with computers by others.

With results citing million-to-one odds, however, the considerations are different—at least for chess. To find such a deviation by natural causes, one would need to dredge 20 years of TWIC—and the indefatigable Crowther has just started his 20th year.
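The arithmetic behind the contrast between “a thousand to one” and “a million to one” can be sketched in a few lines of Python. This is only a rough illustration, not the model used in the report: it assumes one independent test per player, reads “N-to-1 odds” as a chance probability of 1/N, and takes the 1,010 players of TWIC 948 as a typical week. The function names are made up for this sketch.

```python
# Rough sketch of the "dredging" arithmetic. Assumptions (not from the report):
# one independent test per player, and "N-to-1 odds" read as probability 1/N.

PLAYERS_PER_WEEK = 1010   # players in TWIC 948, taken as a typical week
WEEKS_PER_YEAR = 52


def expected_hits(num_players: int, odds_against: float) -> float:
    """Expected number of players showing a deviation this rare by chance."""
    return num_players / odds_against


def prob_at_least_one(num_players: int, odds_against: float) -> float:
    """Chance that at least one of the players shows such a deviation."""
    return 1.0 - (1.0 - 1.0 / odds_against) ** num_players


def weeks_until_one_expected(players_per_week: int, odds_against: float) -> float:
    """Weeks of games to dredge before one chance deviation is expected."""
    return odds_against / players_per_week


for odds in (1_000, 1_000_000):
    weeks = weeks_until_one_expected(PLAYERS_PER_WEEK, odds)
    print(
        f"{odds:>9,}-to-1: "
        f"expected hits in one week = {expected_hits(PLAYERS_PER_WEEK, odds):.5f}, "
        f"P(at least one) = {prob_at_least_one(PLAYERS_PER_WEEK, odds):.4f}, "
        f"weeks to expect one = {weeks:.1f} "
        f"(about {weeks / WEEKS_PER_YEAR:.1f} years)"
    )
```

Under these assumptions a 1,000-to-1 deviation is expected about once in a single week of TWIC, while a million-to-one deviation is expected about once in roughly 990 weeks, or about 19 years of weekly issues.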

Fail-Safes?

A second factor is that my tests are not invariantly correlated to quality of performance. My co-author Guy Haworth, who heroically gave me multiple rounds of detailed feedback on my report and helped it achieve clarity and fairness, alerted me to discussion of a similar mercurial performance by Scottish master Alan Tate, also in Croatia, in 2010. I ran Tate’s games through a screening test, and found only 51% move-matching, compared to figures near 70% in the present case. Indeed, Tate’s defeated opponents had higher concordance with the computer in those games.
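For readers unfamiliar with the term, a move-matching percentage is simply the share of a player’s moves that coincide with the engine’s first choice. The sketch below shows that ratio only; the function name and the move lists are made up for illustration, and a real screening test must also decide which games and moves to count, which is not modeled here.

```python
from typing import Sequence


def move_match_rate(played: Sequence[str], engine_first: Sequence[str]) -> float:
    """Fraction of played moves equal to the engine's first choice.

    Both sequences must list one move per analyzed position, in the same order.
    """
    if len(played) != len(engine_first):
        raise ValueError("need exactly one engine move per played move")
    if not played:
        raise ValueError("no moves to compare")
    matches = sum(p == e for p, e in zip(played, engine_first))
    return matches / len(played)


# Hypothetical data, not taken from any real game or engine run.
played = ["e4", "Nf3", "Bb5", "Ba4", "O-O", "Re1", "Bb3", "c3", "h3", "d4"]
engine = ["e4", "Nf3", "Bb5", "Ba4", "O-O", "d3", "Bb3", "c3", "d4", "Nbd2"]

print(f"move-matching: {move_match_rate(played, engine):.0%}")
```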

My tests have also rendered negative results; my letter notes that in two major international Open tournaments they were determinative for awarding a delayed prize. Thus they are not always “bad news” even when presuppositions are heightened.

My report describes two main tests, which are partially independent. Presumably their combined confidence would be higher, though I have not yet worked out how to do this numerically. Several alternative specifications for the tests exhibit much higher deviations: using the player’s rating before rather than after the tournament as the main baseline; excluding one (or two) game(s) where public transmission of moves was switched off amid suspicion of him; and excluding moves after (say) move 70 in very long games, when the time available to think might be too short for some cheating mechanisms. Although the “Intrinsic Rating” component does not come with a statement of odds, it indicates that the inherent quality of the moves, as judged by computers, was beyond what goes with a 2700-level performance to a highly significant degree.
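One textbook way to see why partial independence should raise the combined confidence is to add two z-scores and rescale by the standard deviation of their sum. The sketch below is not the method of the report, which leaves the combination open. It assumes both test statistics are standard normal under the null hypothesis with a known correlation `rho`, and the z-scores used are hypothetical.

```python
import math


def combined_z(z1: float, z2: float, rho: float) -> float:
    """Combine two unit-variance z-scores whose correlation under the null is rho.

    Var(z1 + z2) = 2 + 2*rho, so the sum is rescaled by sqrt(2 + 2*rho).
    """
    if not -1.0 < rho <= 1.0:
        raise ValueError("rho must lie in (-1, 1]")
    return (z1 + z2) / math.sqrt(2.0 + 2.0 * rho)


def one_in(z: float) -> float:
    """Express the upper-tail probability of a z-score as 'one in N'."""
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return 1.0 / p


# Hypothetical z-scores for the two tests; rho is an assumed value, not estimated.
z_a, z_b = 3.0, 3.0
for rho in (0.0, 0.5, 1.0):
    z = combined_z(z_a, z_b, rho)
    print(f"rho = {rho:.1f}: combined z = {z:.3f}, about one in {one_in(z):,.0f}")
```

With two equal scores of 3.0, the combined score runs from 3.0 when the tests are fully dependent (no gain) up to about 4.24 when they are fully independent, so any partial independence yields some gain. How large it is depends on a correlation that would itself have to be estimated.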

Thus I claim specific value for my tests beyond being a metric of performance, which buttresses my point in asking the chess world, what shall we do about all this?

The Letter

In a series of fortunate events after breaking his leg playing soccer, Grandmaster Bartlomiej Macieja of Poland traveled with brace and cane into downtown Warsaw to meet me during MFCS 2011, became co-author on a paper with me and Haworth, became husband and father, was hired as a coach by the University of Texas at Brownsville, and became General Secretary of the ACP—not all in that order. Hence it was logical to address my letter to him as well as ACP President, Grandmaster Emil Sutovsky of Israel. Here are some excerpts:

I pose two questions, of which at least the first should be an immediate concern of ACP in conjunction with FIDE and national organizations. The second is a deeper issue that I believe needs consultation with experts in statistics and computer sciences, and with representatives of bodies in other fields that have established protocols for using evidentiary statistics in fraud detection and arbitration.

(1) What procedures should be instituted for carrying out statistical tests for cheating with computers at chess and for disseminating their results? Under whose jurisdiction should they be maintained?

(2) How should the results of such tests be valued? Under what conditions can they be regarded as primary evidence? What standards should there be for informing different stages of both investigative and judicial processes?

…

The point of approaching ACP is to determine how the contexts and rules should be set for chess. The goals, shared by Haworth and others I have discussed this with, include:

(a) To deter prospective cheaters by reducing expectations of being able to get away with it.

(b) To test accurately such cases as arise, whether in supporting or primary role, as part of uniform procedures recognized by all as fair.

(c) To educate the playing public about the incidence of deviations that arise by chance, and their dependence on factors such as the forcing or non-forcing quality of their games.

(d) To achieve transparency and reduce the frequency of improper accusations.

(e) Finally, hopefully to avert the need for measures, more extreme than commonly recognized ones, that would tangibly detract from the enjoyment of our game by players and sponsors and fans alike.

More simply, I share the worry of many that a few cases of people “being clever” may ruin much pleasure. This extends to accusations I believe have been ill-informed, such as the one noted in the introduction to my “Fidelity” public site. (The data files behind my results are kept private; whether to open them is another hard question.) I hope that certain little details in my report, such as getting such positive results despite there being ten consecutive non-matches in one game and seven in another, will be noticed and deter others from trying to be “cleverer.”

Open Questions

What cases of statistical evidence in your field may best inform this one?

Update (1/15 9:30pm): Slashdot posted a note on this earlier today, and their comment thread has a wealth of further informative comparisons and reactions.