A Chess Firewall at Zero?



We halve our blunder rate when infinitesimally ahead, but why?



Crop from Seneca chess quote source

Lucius Seneca was a Roman playwright, philosopher, and statesman of the first century. He is called “the Younger” because his father Marcus Seneca was also a famous writer. His elder brother Lucius Gallio appears as a judge in the Book of Acts. Beside many quotations from his work, Seneca is famous for one he is not known to have said:

“To err is human.”

Lest we cluck at human error in pinning down ancient quotations, the source for the following updated version is also unknown—even with our legions of computers by which to track it:

“To err is human, but to really screw things up requires a computer.”

Today I report a phenomenon about human error that is magnified by today’s computers’ deeper search, and that I believe arises from their interaction with complexity properties of chess.



I have previously reported some phenomena that my student Tamal Biswas and I believe owe primarily to human psychology. This one I believe is different—but I isolated it in full only a week ago so who knows. It is that the proportion of large errors by human players in positions where computers judge them to be a tiny fraction of a pawn ahead is under half the rate in positions where the player is judged ever so slightly behind.

The full version of what Seneca actually wrote—or perhaps didn’t write—is even more interesting in the original Latin:

Errare humanum est, perseverare autem diabolicum, et tertia non datur.

This means: “To err is human; to persevere in error is of the devil; and no third possibility is granted.” The phrase tertium non datur is used for the Law of Excluded Middle in logic. In logic the law says that either Seneca wrote the line or he didn’t, with no third possibility. We will say more about this law in an upcoming post. Amid disputes about whether human behavior and its measurement follow primarily “rational” or “psychological” lines, we open a third possibility: “complexitarian.”

State of Computer Chess

Here are some important things to know about chess and computer chess programs (called “engines”).

Chess is hard. Finding a best move in a given position or even telling if you’re winning is a concrete case of a -hard problem.

-hard problem. Computers can now slaughter the best human players on even terms using commodity hardware; the champion Komodo program recently gave substantial handicaps to US champion Hikaru Nakamura and still won.

All leading programs work in progressively deeper rounds of search. Under the common UCI protocol they can be configured to compute a value for each possible move at each depth of search, but in the usual “Single-Line” playing mode they save time by only bounding inferior moves away from the value of their current best move.

Engines often change their “mind” about the values of moves and which one to rank first, but their verdicts become more stable as the search deepens. The values are commonly measured in units of 0.01 called centipawns—figuratively hundredths of a pawn.

Search depth is measured in plies meaning moves by White or Black, so depth 12 means looking ahead 6 moves for both. Programs have a basic search depth that notches up but can extend their search at any time when the critical nature of the moves warrants doing so.

Basic depth 12 was once projected to subdue the human champion; I suspect it’s really depth 18 with today’s programs but no matter: they hit 18 in seconds and reach the high 20s and beyond in games at standard time controls. We recently discussed how closely they are knocking on perfection.

Perfection in chess is probably a drawn outcome, which gets the value 0.00 in centipawns for both players. One way 0.00 values occur during a search is when both players must repeat the same sequence of moves—where any deviation gives that player a negative centipawn value according to the engine. The tendency of engines to give 0.00 at higher depths in positions that look crazy-complicated to human players has been remarked more and more in game commentary by grandmasters in magazines such as New in Chess.

One upshot is that depth of cogitation is solidly quantifiable in the chess setting. We have previously posted about our papers giving evidence of its connection to human thinking and error. The new phenomenon leans on this connection but we will argue that it has a different explanation.

The Phenomenon

My training sets include all recorded games in the years 2010–2014 between players rated within 10 points of the same century or half-century milepost of the Elo rating system. They range all the way from Elo 1050 to Elo 2800+, that is, from beginning adult tournament-level players to the human world champion and his closest challengers. This “milepost set” has over 1.15 million positions from 18,702 games, counting occurrences of the same position in different games but skipping the first 8 turns for both players. Each position was analyzed to depth at least 19 by both the newly-released version 7 of the number-two ranked Stockfish and last month’s version 9.3 of Komodo, using a multitude of single threads (for reproducibility) on the supercomputing cluster of the University at Buffalo Center for Computational Research.

The value of a position is the same as the value of the best move(s) in the position. In multi-line mode we can get the value of the played move directly, while in single-line mode we can if needed take the value of the next position. A value of +1.50 or more is commonly labeled a “decisive advantage” by chess software (though there are many exceptions), whereas values between -0.30 and +0.30 are considered grades of equality—at least by some chess software. Accordingly let’s call any move that drops the value by 1.50 or more a blunder. We will tabulate ranges in -0.40…+0.40 where a blunder most matters. Value exactly 0.00 gets a range to itself.

The final wrinkle is that we will use the engines’ highest-depth values to group the positions, but distinguish cases where the human player’s move was regarded as a blunder at depth 5, 10, 15, and 20. The last includes some depth-19 values. Doing so distinguishes immediate blunders like hanging your queen from subtle mistakes that require a lot of depth to expose. The data divides fairly neatly into thirds, one for “amateur” players under 2000 (387,024 positions from 6,187 games), then the 2000–2250 range (394,186 from 6,380), then 2300 and above (372,587 from 6,035). The depth-5 numbers for Stockfish 7 are “off” for reasons I demonstrated in a chess forum post; responses seem to confirm a recent bug that has lessened impact for depths 9 and higher. (See Update below.) The tables are for single-line mode (the multi-line mode results are similar including the Stockfish anomaly), and positions are analyzed sans game history so that only repetitions downstream in the search are detected.







The point is not that weaker players make more large mistakes, but rather the “Red Sea” effect at 0.00. It becomes more pronounced at the greatest depths and for the strongest players. Here are the figures for players rated 2600 and above, from 90,045 positions in 1,335 games. They are tabled in smaller intervals of 0.10 within the “equality” range -0.30 to +0.30.







Although Stockfish and Komodo have differences in their evaluation scales—happily less pronounced than they were 1 and 2 years ago—they agree that the world’s elite made six times more large errors when on the lower side of equality. This begs asking not only why but whether it is a real human phenomenon.

What Can Explain It?

There seems no possible rational explanation—indeed the sharp change at 0.00 counters the rational hypothesis discussed in this paper and toward the end of my 2014 TEDxBuffalo talk. But it is also hard to sustain a psychological explanation. If being slightly behind puts one off one’s game, why is the ratio accentuated for the top players? That this training set includes only games in which the players are evenly matched minimizes any variance from overconfidence or “fear factors.”

My proposed explanation leans on the above-mentioned engine behavior with repetitions and 0.00 values. Spot-checks affirm that the wide 0.00 bin includes many positions where the armies trade blows but fall into a repeating sequence from which neither can veer without cost. If the machine judges your position worth +0.01 or more then it places you discretely above this no-man’s zone. Any notionally big slip you make still has a good chance of being caught in this large attraction basin and hence being charged as only a small error by the computer. Whereas a slip from 0.00 or below has no safety net and falls right down.

This hypothesis comes with one large falsifiable prediction and perhaps a few others regarding games played by computers and computer-human teams as opposed to humans alone. My entire model uses no feature of chess apart from the move values supplied by engines at progressively increasing depths of search. It thus transfers to any alternating-decision game in which sufficiently authoritative move values can be computed. Some games like Go rule out repeating sequences, while others like Shogi allow them but rarely end in draws.

For any strategy game that disallows repetitions and/or has negligible draw frequency, there will be no similar “firewall at 0.00” effect on the rate of human mistakes.

Go and Shogi are just as hard as chess, so this hypothesis does not pertain to worst-case complexity. Rather it addresses heuristically complex “most-case” behavior. It also highlights a general risk factor when using computers as judges: notional human errors are necessarily being evaluated by machines but the machines are evidently not doing so with human-relevant accuracy.

How to Handle It?

This last issue is not theoretical or didactic for me. In applying my statistical model to allegations of cheating I need to project distributions of errors by unaided human players of all skill levels accurately. I have been aware of ramped-up error below and away from 0.00 since 2008, when I devised a pleasingly symmetric way to smooth it out: Take a differential dμ(x) that has value 1 at x = 0.00 but tapers off to 0 symmetrically. This says that the marginal value of a centipawn is the same whether you are ahead or behind the same amount but lessens when the (dis)advantage is large. When a player makes an error of raw magnitude e in a position of value v, charge not e but rather the integral of dμ(x) from x = v – e to x = v. The metric can be weighted to make 0.00 effectively as wide as neighboring intervals.

The idea is that if you blunder when, say, 0.50 ahead, the integral will go through the fattest part of the metric and so charge near face value. But when 0.50 behind it will go through the thinner region -0.50 to -0.50 – e and so record a lesser value for my analyzer. This “metric correction” balanced and sharpened my system in ways I could verify by trials such as those I described on this blog in 2011 here.

Privately I’ve indulged an analogy to how relativity “corrects” the raw figures from Newtonian physics. But with today’s chess programs cannonballing to greater depths, I can tell when carrying over my previous calibrations from the earlier Rybka and Houdini engines that this method is being strained. Having now many times the amount of data I previously took on commodity PCs has made the quantum jump at 0.00 so clear as to need special attention. Making an ad hoc change in my model’s coefficients for positions of non-positive value still feels theoretically uncomfortable, but this may be what the data is dictating.

Open Problems

How can you explain the phenomenon? How should regard for phenomenology—in both experimental and philosophical senses—influence the manner of handling it?



Update 2/6/16: I received a suggestion to modify one line of the Stockfish 7 code to avoid shallow-depth “move count based pruning” at nodes belonging to the current principal variation. This inserts a third conjunct && !PvNode into what is currently the second block of “step 13” in the Stockfish source file search.cpp . Re-running the data for the Elo 2600–2800 levels produced the following table:

Evidently the change greatly reduces the depth-5 “noise” but still leaves it distinguishably higher than the depth-5 figures for Komodo 9.3.

[fixed typo in -10 to -01 range label for 2600+ blunder tables]