Thirteen Sigma



Monitoring fabrication in industry and chess



Bill Smith joined Motorola as a quality control engineer in 1986. He coined the term Six Sigma to formulate a goal for vastly reducing the fault rate of manufactured components. This required improving not only the monitoring of quality but also the resolution of testing devices and statistical tools, so that they could make reliable projections in units of faults per million rather than per thousand. The resulting empowerment of Motorola’s engineers gave such great and verifiable results that Motorola received the Malcolm Baldrige National Quality Award in 1988.

Today I want to talk about the meaning of high-sigma confidence in areas where the results may not be verifiable.

“Six Sigma” refers to the normal distribution curve, whose major properties were established by Carl Friedrich Gauss. Gauss and others discovered that deviations in scientific measurements of many kinds tended to follow this distribution, and the Central Limit Theorem provided an explanation of its universality. Thus magnitudes of deviations of many kinds can be expressed as multiples of the standard deviation of this distribution. Doing so provides a universal estimate of the frequency of large deviations.

The goal in manufacturing is to make the process so reliable that its standard deviation σ is below 1/6 of the magnitude of deviations that would cause components to fail at point of creation. When only one side of deviations matters, this puts the failure rate below the tail probability 1 − Φ(6), which is almost exactly one part per billion. By the end of assembly the tolerance is relaxed by 1.5σ, so it is really “Four Point Five Sigma” that sets the end-product goal of a failure rate of 1 − Φ(4.5), less than 3.4 parts per million.
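These tail figures are easy to check. Here is a small Python sketch (my own illustration, not part of any Six-Sigma toolkit) computing the one-sided normal tail probability behind the 6σ and 4.5σ numbers:

```python
# One-sided normal tail probability P(Z > z), via the identity
# P(Z > z) = erfc(z / sqrt(2)) / 2 with the complementary error function.
import math

def upper_tail(z: float) -> float:
    """Probability that a standard normal variable exceeds z."""
    return math.erfc(z / math.sqrt(2)) / 2

print(upper_tail(6.0))  # ~9.9e-10: about one failure per billion
print(upper_tail(4.5))  # ~3.4e-6: the famous 3.4 failures per million
```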

Six-Sigma programs spread quickly and evolved a martial-arts mythos. Six-Sigma organizations award officially certified Green Belts, Yellow Belts, Brown Belts, Black Belts, and Master Black Belts. They also have a “Champion” designation.

I wonder if those belts are conferred according to how many Sigmas one achieves, 8 being greater than 7, which is greater than the basic 6 (or rather, 4.5). If so then I should apply—because last week I achieved 13 Sigmas of confidence from my own software process (or rather, 11.3).

Chess Cheating Developments

Last month I was named to a 10-person joint commission of the World Chess Federation (FIDE) and the Association of Chess Professionals (ACP) to combat cheating with computers in human chess events. Discussions have gotten into full swing this month, working toward drafting concrete proposals at the FIDE General Assembly in Tallinn, Estonia, the first week of October. I am on the committee because my statistical model of human decision-making at chess answers a need voiced by many commentators, including Britain’s venerable Leonard Barden as quoted here.

I have, however, been even busier with a welter of actual cases, reporting on four to the full committee on Thursday. One concerned accusations made in public last week by Uzbek grandmaster Anton Filippov about the second-place finisher in a World Cup regional qualifier he won in Kyrgyzstan last month. My results do not support his allegations. Our committee is equally concerned about due-diligence requirements for complaints and curbing careless allegations, such as two against Austrian players in May’s European Individual Championship. A second connects to our deliberations on the highly sensitive matter of searching players, as was done also to Borislav Ivanov during the Zadar Open tournament last December. A third is a private case where I find similar odds as with Ivanov, but the fourth raises the fixing of an entire tournament, and I report it here.

Add to this a teen caught consulting an Android chess app in a toilet cubicle in April and a 12-year-old caught reading his phone in June, plus some cases I’ve heard only second-hand, and it is all scary and sad. It is also highly stressful having my statistics be the only ‘regular’ evidence in several currently unresolved cases, in all of which other players made accusations based on unscientific testing before my work came on the scene. Previously, as with the case of Sébastien Feller (which ended for truth purposes with an accomplice’s confession last year), my results supported clear physical or observational evidence. But these new cases have deviations beyond the pale of selection-effect caveats, while the following story is on another plane.

Unquiet Flows the Don

Over all my playing years I’ve heard nonspecific rumors of rigged tournaments. Besides prizes and qualifying spots for championship competitions, a motive can be achieving a so-called title norm. The titles of FIDE Master (FM), International Master (IM), and Grandmaster (GM) are FIDE’s “belts,” and to earn them one must score a designated number of points according to the strength category of the tournament. I scored two IM norms in early 1977, but they covered only 23 of the 24 required total games, and achieving my third norm took until 1980. The higher titles bring financial benefit along with prestige. However, until now my results on the few famous specific rumors had been inconclusive.

The Don Cup 2010 International was held three years ago in Azov, Russia, as a 12-player round-robin. The average Elo rating of 2395 made it a Category 6 event, with 7 points from 11 games needed for the IM norm and 8.5 for the GM norm. It was prominent enough to have its 66 games published in the weekly TWIC roundup, and they are also downloadable from FIDE’s own website. Half the field scored 7 or higher, while the two tailenders lost every game except for a draw with each other and one other draw, and a third player beat only those two, drew one more game, and lost eight.

My informant suspected various kinds of “sandbagging”: throwing games in the current event, or having an artificially inflated Elo rating from previous fixed events, so as to bring up the category. He noted that some of the tailenders now have ratings 300 points below what they were then. Hence I thought to test for deviations downward. I first took the 21 games involving the bottom two, with their 19 losses, and ran the procedure to compute their “Intrinsic Performance Rating” (IPR), detailed in a new paper whose final version will be presented at the IEEE CIG 2013 conference next month. I wondered whether getting significantly high error with an IPR under 2000 would really constitute evidence of “unreasonably poor” play, but even the high results of my preliminary test did not prepare me for the enormity of the full test’s printout:

IPR = 2925.

When I included the moves made by their opponents in the 21 games, my program gave 3008. This is well above the ratings of the strongest human players, but in the range typical for computer programs before Rybka 3 (my mainstay) emerged in 2008. Moreover, my program gave overwhelming confidence that players with their 2300 ratings would not show so many agreements with Rybka 3.

That was from the losers. I wondered what the winners’ games would look like… So I took the 3 days needed to run all my cores on the other 45 games.

Sigmas Amok

Running all 66 games created a sample of almost 4,000 analyzed moves, after excluding turns 1–8 of any game, so-called repetition moves, and positions where one side has a crushing advantage. Since it covered both sides of every game, it was effectively a 132-game sample. Most cases involving single players have rested on 9 games totaling about 250 analyzed moves, barely one-fifth of the sample size recommended for a reliable poll.
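For concreteness, here is a hedged sketch of that filtering step. The data layout and the three-pawn “crushing advantage” cutoff are my illustrative assumptions, not the model’s exact rules:

```python
# Illustrative move filter: drop opening turns, repetition moves, and
# positions where the evaluation (in pawns) says one side is crushing.
def filter_moves(game):
    """game: list of (turn_number, is_repetition, eval_pawns) tuples."""
    kept = []
    for turn, is_repetition, eval_pawns in game:
        if turn <= 8:               # skip turns 1-8 of any game
            continue
        if is_repetition:           # skip repetition moves
            continue
        if abs(eval_pawns) > 3.0:   # skip lopsided positions (assumed cutoff)
            continue
        kept.append((turn, eval_pawns))
    return kept

sample = [(1, False, 0.2), (9, False, 0.1), (10, True, 0.1), (11, False, 5.0)]
print(filter_moves(sample))  # only turn 9 survives all three filters
```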

Hence the baseline deviation was only about a quarter of the size I usually get, since the standard error shrinks with the square root of the sample size. This lent extra heft to the 2880 IPR for the whole tournament, higher than any human tournament I’ve recorded except 2904 for the 4-player Bilbao Grand Slam Final in 2010. When I took out the 6th and 7th place finishers, the IPR jumped to 2997. This is despite some games having blunders and ending before move 20, while others run on for many moves, discarded by my analyzer, where most humans would have given up long ago.
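The sample-size arithmetic behind that quarter is elementary: the standard error of a mean statistic shrinks like one over the square root of the number of analyzed moves. A quick check:

```python
# Standard error scales as 1/sqrt(n): compare a typical single-player
# sample of ~250 moves with this tournament's ~4,000 moves.
import math

typical_n = 250   # moves in a typical single-player case
this_n = 4000     # moves in the whole-tournament sample

ratio = math.sqrt(typical_n / this_n)
print(round(ratio, 2))  # 0.25: error bars about a quarter their usual size
```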

The IPR does not come with a formal statement of unlikelihood, so I ran my Rybka-agreement test for that purpose. My program last Saturday printed the multiplier (which for a normal distribution is called a z-score) needed for 2400-rated players to produce such computer concordance as:

z = 13.0011.

The last two digits are not significant, owing to my global use of a 4-place C++ format specifier, but they show that the “13” is not rounded up. For reasons described earlier on this blog I divide by 1.15 to report an “adjusted z-score,” which allows for lack of full independence between moves and other modeling error. This yields the aforementioned 11.3. But I’ve tested that policy only for much smaller z-scores; beyond that I have no idea except thinking that dividing by a fixed factor should be mathematically conservative.

There it is: internal high-sigma confidence in a fabrication process, here one used to manufacture games that were not actually played. The corresponding odds (of legitimacy after all) are about 1 in 1.6 × 10^38, meaning

1-in-163,000,000,000,000,000,000,000,000,000,000,000,000.
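Anyone can reproduce the conversion from the z-score to odds of this order with a few lines of Python; the 1.15 divisor is the adjustment policy discussed above:

```python
# Convert the printed z-score into one-sided normal tail odds, and
# apply the 1.15 "adjusted z-score" divisor from the text.
import math

def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2)) / 2

z_raw = 13.0011
z_adj = z_raw / 1.15                 # the adjusted 11.3 reported in the text
p = upper_tail(z_raw)
print(f"adjusted z = {z_adj:.1f}")   # 11.3
print(f"odds about 1 in {1/p:.2g}")  # on the order of 10^38
```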

I don’t know whether any physics experiment for a yes/no predicate has ever claimed 13-sigma confidence; for comparison, 5 sigmas sufficed for the Higgs Boson. However, this still raised for me a question I have understandably been posed on the anti-cheating committee:

Is it a proof?

And here is the difference from Six-Sigma: an industrial process can be verified by later automated testing of the millions of items, but a one-shot predicate about human events often cannot be.

It Shines Like Truth

The German word for probability, Wahrscheinlichkeit, has the great feature of literally meaning “the quality of shining like truth.” The corresponding root for our own word is the Latin probare, meaning “to test” or “to prove.” Truth or proof: can it be either?

In this case I did not have to wait long for more-than-probability. Another member of our committee noticed by searching his million-game database that:

Six of the sixty-six games are move-by-move identical with games played in the 2008 World Computer Chess Championship.

For example, three games given as won by one player are identical with Rybka’s 28-move win over the program Jonny and two losses in 50 and 44 moves by the program Falcon to Sjeng and HIARCS, except one move is missing from the last. One of his victims has three lost games, while another player has two wins and another two losses. Indeed the six games are curiously close to an all-play-all cluster.

I verified this against my own collection of over 11,000 major computer-played games, tolerating 8-move differences, and was surprised to find just the same six identities, no more. So where do the other 60 games come from? My program’s confidence in computer origin is no less, but perhaps someone actually took the trouble to generate them fresh by playing two chess programs against each other?
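The matching itself needs nothing fancy. Here is a minimal sketch of the idea, with move lists as plain strings and a hypothetical `nearly_identical` helper standing in for whatever the real database search used:

```python
# Illustrative near-duplicate test for two games: count positions where
# the move lists differ (including any length mismatch) and compare
# against a tolerance, such as the 8-move allowance used in the text.
def nearly_identical(moves_a, moves_b, tolerance=8):
    """True if the two move lists differ in at most `tolerance` places."""
    diffs = abs(len(moves_a) - len(moves_b))
    if diffs > tolerance:
        return False
    diffs += sum(1 for a, b in zip(moves_a, moves_b) if a != b)
    return diffs <= tolerance

g1 = ["e4", "e5", "Nf3", "Nc6", "Bb5"]
g2 = ["e4", "e5", "Nf3", "Nc6", "Bc4"]
print(nearly_identical(g1, g2))               # True: one move differs
print(nearly_identical(g1, g2, tolerance=0))  # False: exact match required
```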

I am expanding the search to match my database of over 200,000 human games per recent year against the 11,000 computer games, but each year is taking a day. A trial partial search of 2012 turned up a game in a junior tournament identical to the 39-move draw between Garry Kasparov and IBM’s Deep Blue in game 3 of their first match in 1996, but nothing beyond a children’s joke is apparent.

Open Problems

Six identical games may amount to six smoking-gunshots, but why don’t six sigmas, or thirteen?



Note for viewers of this Reddit item: though my 2009–2010 work with Guy Haworth and Giuseppe DiFatta used their Bayesian model, this one is elementary frequentist.

[simplified mention of FIDE’s “belts”; DB-GK draw was 39 not 38 moves; improved overall readability]