Using Heuristic-Search Based Engines for Estimating Human Skill at Chess

By Matej Guid and Ivan Bratko

University of Ljubljana, Faculty of Computer and Information Science, Artificial Intelligence Laboratory, Ljubljana, Slovenia. This article is based on a paper by the same authors published in the ICGA Journal; full reference is given below.

In 2006, we carried out a computer comparison of world chess champions. We analysed moves played in World Championship matches with the chess program C RAFTY , in an attempt to assess objectively one aspect of the playing strength of chess players of different eras. The basic criterion for comparison among the players was, to what extent a player’s moves deviate from the computer’s moves. More precisely, we considered the average deviation between computer's evaluations of moves played by the human and machine-preferred moves. Also, we designed a method for assessing the complexity of a position, in order to take into account the different playing styles of the players and the difficulty of the positions they faced.

Many people found the results surprising. For example, Capablanca emerged as the top scorer according to the main criterion in that analysis. One reservation regarding these results was: Was Crafty, whose rating is no more than 2620, sufficiently strong for such a demanding comparison? What if a stronger chess program was used? Here we present some results from two further studies (Guid et al. 2008; Guid and Bratko 2011) that are relevant to these questions. These results, which include analyses with programs S HREDER , R YBKA 2 and R YBKA 3, can be summarised as follows:

In a repeat of our earlier analysis of chess champions, the results obtained with one of the strongest programs, R YBKA 3, were qualitatively very similar to the results obtained with C RAFTY .

The rough ranking of the players according to different programs is quite stable. More precisely, given two chess players, either all the programs unanimously rank one player to be clearly stronger than the other, or all the programs assess their strengths to be similar.

In order to obtain a sensible ranking of the players according to the criterion considered, it is not necessary to use a computer that is stronger than the players themselves. Interestingly, even considerably weaker programs tend to consistently produce similar rankings.

Introduction

The aim of our computer analysis of individual moves in World Chess Championship matches was to assess as objectively as possible one aspect of the playing strength of chess players of different eras. The open-source chess program C RAFTY was used. Among several criteria considered, the basic criterion for comparison among players was the average deviation between computer's evaluations of moves played by humans and machine-preferred moves.

Subsequent statistical analysis (2008) demonstrated that, at least for pairs of the players whose scores differed significantly, it is not very likely that the relative rankings of the champions according to the criterion considered would change if (1) a stronger chess program was used, or (2) if the program would search more deeply, or (3) larger sets of positions were available for the analysis. In current article, we further verify propositions (1) and (2) empirically by applying three chess programs stronger than CRAFTY for the same type of computer analysis of the players’ skills.

We also repeated our earlier analysis with the program CRAFTY of World Chess Champions with currently one of the strongest chess programs, RYBKA 3, and obtained qualitatively very similar results as with CRAFTY (click if you're too curious and want to see the results of the computer analysis with RYBKA 3 immediately).

Mission impossible?

Establishing computer programs for estimating chess players' skills may seem impossible, since it is well known that both programs' evaluations and programs’ decisions tend to change as the depth of search increases. Moreover, different programs typically assign different evaluations to a given position, even when using the same depth of search. Which program is therefore to be the most trusted one as the estimator: the strongest one in terms of the programs’ competition strength, or perhaps the one that is most equally unfair to all the players that are a subject of evaluation?

Program Evaluation Chessmaster 10 0.15 Crafts 19.19 0.20 Crafty 20.14 0.08 Deep Shredder 10 -0.35 Deep Shredder 11 0.00 Fritz 6 -0.19 Fritz 11 0.07 Rybka 2.2n2 -0.01 Rybka 3 -0.26 Zappa 1.1 0.13

Lasker-Capablanca, St. Petersburg 1914, position after White’s 12th move. The table on the right shows backed-up heuristic evaluations obtained by various chess programs, using 12-ply search.

All the above stated issues seem like an obstacle on the way to establishing heuristic-search based methods as competent estimators of human skill levels, especially in complex games such as chess.

In the present research, we provide an analysis of the differences between heuristic-search based programs in estimating chess skill, using a similar approach as in our earlier studies (2006-2008). We are particularly interested in analysing the behaviour of different chess programs, and to what extent the scores and the rankings of the players are preserved at different depths of search. Since the strength of computer-chess programs increases by search depth, the preservation of the rankings at different search depths would therefore suggest not only that the same rankings would have been obtained by searching more deeply, but also that using stronger chess programs would probably not affect the results significantly.

In the sequel, we present some of our results from the following scientific papers:

M. Guid and I. Bratko. Using Heuristic-Search Based Engines for Estimating Human Skill at Chess. ICGA Journal, Vol. 34, No. 2, pp. 71-81, 2011.

M. Guid, A. Pérez and I. Bratko. How Trustworthy is C RAFTY 's Analysis of World Chess Champions? ICGA Journal, Vol. 31, No. 3, pp. 131-144, 2008.

An interested reader will find the links to both papers below.

To avoid possible misinterpretation of the presented work, it should be noted that these articles are not concerned with the question of how appropriate the particular measure of the playing strength (deviation of player's moves from computer-preferred moves, using a sufficiently large sample of positions for the computer analysis) is as a criterion for comparing chess players' ability in general. It is only one of many sensible criteria of this kind. Nevertheless, we believe that the relatively good preservation of the rankings at different levels of search using various programs of different competition strengths may represent a solid ground on the way to discover increasingly better methods for establishing heuristic-search based computer programs as reliable estimators of skill levels in game playing.

What if a WEAKER program than (12-ply) Crafty was used for the computer analysis of the champions?

Various discussions about our 2006 article took place at different popular blogs and forums across the internet, and Chessbase soon published some interesting responses by various readers from all over the world, including some by scientists. A frequent comment by the readers could be summarised as: "A very interesting study, but it has a flaw in that program C RAFTY , of which the rating is only about 2620, was used to analyse the performance of players stronger than C RAFTY . For this reason the results cannot be useful." Some readers speculated further that the program will give a better ranking to players that have a similar strength to the program itself.

The main two objections by the readers to the used methodology were: (1) the program used for analysis was too weak, and (2) the search depth of 12 plies performed by the program was too shallow.

We addressed these objections in order to assess how reliable C RAFTY (or, by extrapolation, any other fallible chess program) is as a tool for comparison of chess players, using the suggested methodology. In particular, we were interested in observing to what extent the scores and the rankings of the players are preserved at different depths of search. We first observed C RAFTY 's rankings of the champions at lower search depths.

In each position, we performed searches to depths ranging from 2 to 12 plies extended with quiescence search to ensure stable static evaluations. Backed-up evaluations of each of these searches were used for analysis. PV stands for principal variation.

Searches to various depths were conducted (including quiescence search to obtain stable evaluations). Then the scores obtained by the program are the average differences between computer evaluations of the players' choices of moves and the computer's choices at each particular search depth. Based on players' scores, rankings of the players are obtained in such a way that a lower score results in a better ranking. Additional constraints and other details about the methodology used are described in our ICGA Journal paper (2008).

The computer analysis was based on the evaluation of the games played by the chess champions in the classical World Chess Championship matches between 1886 (Steinitz-Zukertort) and 2006 (Kramnik-Topalov). The following fourteen champions were therefore included in the analysis: Steinitz, Lasker, Capablanca, Alekhine, Euwe, Botvinnik, Smyslov, Tal, Petrosian, Spassky, Fischer, Karpov, Kasparov, and Kramnik.

Our results show, possibly surprisingly, that at least for the players whose scores differ sufficiently from the others the ranking remains preserved, even at very shallow search depths.

C RAFTY 's scores (average deviations between the evaluations of played moves and best-evaluated moves according to C RAFTY ) of each "classical" world champion (1886-2006) at different depths of search. Highlighted are the scores of Capablanca (bottom curve), Kramnik, Euwe and Steinitz (top curve).

Some champions whose rankings significantly deviate from the others were identified: Capablanca, Euwe, and Steinitz. Their relative rankings among all the champions were preserved at each level of search. The stability of their rankings was further supported by the statistical analysis of the results and by the fact that Capablanca had the best score in 95% of all the subset-depth combinations in 100 samples consisting of 500 randomly chosen positions. As we did in our study, this result should be interpreted in the light of Capablanca's playing style that tended towards low complexity positions.

Using stronger programs for the analysis

To study whether different heuristic-search based programs can be used as estimators of skill levels, we will now focus on these three players and check whether other chess programs rank them in the same order as C RAFTY did. We will observe variations with depth of average differences between player’s and program’s decisions on a large subset of randomly chosen positions from the World Chess Championship matches using three chess programs stronger than C RAFTY , namely: S HREDDER , R YBKA 2, and R YBKA 3.

All positions for the analysis were taken from classical World Chess Championship matches (1886-2006). Besides analysing the moves of the three players, we also randomly picked more than 20,000 positions of the rest of the players as a control group.

The scores of Capablanca (bottom curve), the control group, Euwe, and Steinitz (top curve), obtained on a large subset of games from World Chess Champion matches, using S HREDDER .

The scores of Capablanca (bottom curve), the control group, Euwe, and Steinitz (top curve), obtained on a large subset of games from World Chess Champion matches, using R YBKA 2.

The scores of Capablanca (bottom curve), the control group, Euwe, and Steinitz (top curve), obtained on a large subset of games from World Chess Champion matches, using R YBKA 3.

The relative ranking of Capablanca, Euwe, Steinitz, and the players in the control group are preserved at all depths using any of the programs at any level of search. These experimental findings reject possible speculations that using chess programs stronger than C RAFTY for the analysis could lead to completely different results, and that Capablanca’s result is merely a consequence of his style being similar to C RAFTY ’s style. We also view these results as another confirmation that in order to obtain a sensible ranking of the players, it is not necessary to use a computer that is stronger than the players themselves.

Computer analysis of World Chess Champions revisited with R YBKA 3

We present the results of computer analysis of World Chess Champions obtained by R YBKA 3. We analysed the same matches as in (2008), in order to make the comparison of the results obtained by chess programs C RAFTY and R YBKA 3 as relevant as possible. That is, games for the title of World Chess Champion, in which the fourteen classic World Champions (from Steinitz to Kramnik) contended for or were defending the title, were selected for the analysis. The details of the methodology are given in the paper (2011). Searches to 11 and 12 plies were omitted in order to shorten the computational time required for such an extensive analysis.

R YBKA 's scores (average deviations between the evaluations of played moves and best-evaluated moves according to R YBKA 3) of each "classical" world champion (from 1886 to 2006) at different depths of search. Highlighted are the scores of Capablanca (bottom curve), Kramnik, Euwe and Steinitz (top curve). Based on players' scores, rankings of the players are obtained in such a way that a lower score results in a better ranking.

Let us take a closer view:

A closer view: the scores of the champions obtained by chess program R YBKA 3 at various depths of search. For almost all depths it holds that rank(Capablanca) < rank(Kramnik) < rank (Kasparov) < rank (Karpov) < rank(Petrosian) < rank(Botvinnik) < rank(Euwe) < rank(Steinitz).

The results closely resemble the ones obtained by C RAFTY (2008). Again the rankings stay the same for almost all depths: rank(Capablanca) < rank(Kramnik) < rank (Karpov, Kasparov) < rank(Petrosian) < rank(Botvinnik) < rank(Euwe) < rank(Steinitz). However, there are nevertheless some obvious differences between the results obtained by the two programs. Firstly, C RAFTY ranked Karpov better than Kasparov at all depths: just the opposite holds for R YBKA 3. Nevertheless, the scores of both (and also many other) players are rather similar to each other.

Secondly, the performance of Fischer as seen by R YBKA 3 is significantly better in comparison with the results obtained by C RAFTY . This finding is associated with the following (so far unanswered) question: Does a program's style of play exhibit preference for the styles of any particular players? The comparison of the results of computer analysis of the champions obtained by C RAFTY and R YBKA 3 suggests that a program’s style of play may affect the rankings, but only to a limited extent.

The meaning of the scores

One frequent question by the readers was associated with the meaning of the players’ scores obtained by the program. A typical misinterpretation of their meaning went like this: "For every 8 moves on average, C RAFTY expects to gain an advantage of one extra pawn over Kasparov." We would like to emphasize here that the scores obtained by the program only measure the average differences between the players' choices of move and the computer's choice. However, as the analysis shows these scores that are relative to the computer used, have good chances to produce sensible rankings of the players.

The experimental results not only confirm that the scores are not invariable for the same program at different depths of search, the scores also differ significantly when using different programs. This is most clearly seen in the following graph, where average scores of all the players, obtained on the same large subset of games from World Chess Champion matches, are shown for the three programs, and compared to average scores of all the players according to C RAFTY . While the scores of S HREDDER are very similar to the scores of C RAFTY , the scores of the two R YBKA s differ considerably from those of the other two programs.

Comparison of average scores of all the players (including the ones in the control group), obtained

by C RAFTY , S HREDDER , R YBKA 2, and R YBKA 3.

In the paper (2011), we provided the following explanation regarding the magnitude of the scores with respect to the search depth. The decreasing tendency of the scores with an increasing search depth (typical at the shallower search depths) is a consequence of increasingly better choices by the program. At higher search depths, however, the higher cost of not choosing the best move according to the program may even lead to increased scores due to the monotonicity property of heuristic evaluation functions. That is, in won positions programs' evaluations tend to increase monotonically with depth of search.

Where to go on from here?

Our approach to estimating skill levels is aimed at assessing the quality of play regardless of the game score, and assumes using fallible estimators (as opposed to infallible estimator such as chess tablebases). Since our paper in 2006, several related and interesting scientific works emerged. Haworth (2007) defined a mapping of the apparent player’s skill into a Referent Agent Space, using a stochastic agent and a Bayesian inference method. The results of the experimental analysis with this approach (Di Fatta, Haworth, and Regan, 2009; Haworth, Regan, and Di Fatta, 2010; Regan and Haworth, 2011) showed that the inferred probability distributions of the apparent skills are able to discriminate players’ performances in different Elo ranges. These results further speak in favour of a possibility to estimate human skills based on computer heuristic evaluation. Full references of these papers are given below.

Both of these approaches to estimating skill levels in complex games have something in common, besides that they are based on computer analysis of player’s actions: they both assume (implicitly or explicitly) that average differences in computer evaluations between the player’s moves and the computer’s moves provide a solid ground for sensible benchmarking of the players, when a sufficiently large amount of data for the analysis is available. That is, regardless of how this measure is further refined to take into account the full context of the player’s decisions, these average differences alone have good chances as rather stable performance measures, even when using different programs as estimators of the players’ skills.

A remarkable amount of computation for a similar computer analysis of chess champions was performed by Charles Sullivan. According to the author, for 24 hours a day for 15 months (from February 2007 through May 2008), 12 computing threads (on three Intel quad-core Q6600 computers running at 3.0 GHz) analyzed the games of the World Champions. Entire playing careers were analyzed - in all, 617,446 positions from 18,785 games were processed. The results are published at Truechess.com. The author considers "Best 10-Year Period" statistics to be useful for selecting "the greatest chessplayer of all time." His obtained rankings were somewhat different from ours. However, these results cannot be directly compared because his analysis included ALL the games played, but limited to the best 10-year period. It should also be noted that in his analysis, the amount of computation per analysed position was limited by time and not by depth. Thus, the average differences between the players' choices and the computer's choices in Sullivan's analysis were obtained at different search depths and may not be directly comparable, due to the monotonicity property of heuristic evaluation functions (for a detailed explanation refer to Matej Guid's Ph.D. Thesis, Chapter 8.5.1). We strongly recommend using fixed search depth for this type of computer analysis of chess games.

One of the important (and still not completely resolved) issues is: How to take into account the differences between players in the average difficulty of the positions encountered in their games? In 2006, we devised a heuristic-search based method to assess average difficulty of positions used for estimating the champions’ performance and presented the results of applying this method in order to compare chess players of different playing styles. Those results suggested that Capablanca’s outstanding score in terms of low average differences in computer evaluations between the player’s moves and the computer’s moves should be interpreted in the light of his playing style that tended towards low complexity positions. However, we believe that further work on the above question is required.

The authors Matej Guid has received his Ph.D. in computer science at the University of Ljubljana, Slovenia. His research interest include computer game-playing, automated explanation and tutoring systems, heuristic search, and argument-based machine learning. Some of his scientific works, including the Ph.D. thesis titled Search and Knowledge for Human and Machine Problem Solving, are available on Matej's Research page. Chess has been one of Matej's favourite hobbies since his childhood. He was also a junior champion of Slovenia a couple of times, and holds the title of FIDE master.

Ivan Bratko is professor of computer science at University of Ljubljana, Slovenia. He is head of Artificial intelligence Laboratory, Faculty of Computer and Information Science of Ljubljana University, and has conducted research in machine learning, knowledge-based systems, qualitative modelling, intelligent robotics, heuristic programming and computer chess (do you know the famous Bratko-Kopec test?). Professor Bratko has published over 200 scientific papers and a number of books, including the best-selling Prolog Programming for Artificial Intelligence. Chess is one of his favourite hobbies.

References

Copyright Guid/Bratko/ChessBase