3/27/2016 – "I write for ChessBase because, as a kind of ‘community service’, we academics are expected to convey our research to the public in more palatable and widespread forms than just technical papers," writes Dr Azlan Iqbal. Unfortunately, some readers interpreted his last article as misogynistic, having “gratuitous sexist content”. The author replies to his critics and describes the application of the scientific method to an area as nebulous as aesthetics in chess.

Do Women Play More Beautiful Chess? – A Response to Critics

By Azlan Iqbal, Ph.D.

In response to the original article, I received a lot of feedback, most of it negative and the rest neutral. This was not entirely unexpected given the subject matter. Regardless, not all of the feedback (including the personal attacks) was seemingly from militant feminists and men who felt as if I had just insulted their girlfriends or wives. The former may feel that we are still at risk of regressing to a time when the roles of men and women in society were more clearly defined, and the latter are probably just succumbing to what some might say are protective or defensive evolutionary instincts (e.g. mother for child, men for ‘defenseless’ women). There is only so much value or credibility one can attach to anonymous commentators on the Internet. Are they really who they claim to be? One also has to wonder if they would have responded similarly, or in stark contrast praised the work, had I found that women played more beautiful chess (even if just within the scope of mate-in-3 sequences).

Besides, anyone taking single sentences totally out of context and drawing conclusions from them about the whole body of work, or writing a long, angry-sounding ‘rebuttal’ the very next day (especially with admittedly little or no background in artificial intelligence or computer science), is doing a poor job of making a cogent argument or providing constructive criticism. For the record, and to assuage any concerns, I did not doctor any of the experimental results, and ChessBase did not strike some kind of Faustian bargain with me to exploit women for the sake of a few more mouse clicks. Our relationship goes back many years and has virtually nothing to do with women or money. As academics, we are actually used to negative feedback, especially when it pertains to new or controversial ideas being proposed. I recall that when I first met Lotfi Zadeh (who introduced the concept of fuzzy sets) at Cambridge University back in 2008, he said that when he first proposed some of his ideas, some of his peers said he “should be lynched” for trying to promote a ‘lack of precision’ in computing. Fortunately, to my knowledge, none of my peers feel that way about my work.

Anyway, some of the feedback I received nevertheless consisted of genuine questions and concerns (including about my credentials and credibility) that I felt I should address in the interest of science; hence this follow-up article. I will not list the questions individually as many of them overlap. Instead, I will address them collectively based on my interpretation of what they were trying to get at. Let me begin by saying that the aesthetics model in chess that I developed for my Ph.D. is considered a seminal piece of work in then uncharted waters. Therefore a realistic and doable scope needed to be set, i.e. three-move mate sequences. This is actually how one goes about doing a Ph.D.: you should not bite off more than you can chew or you will never complete the degree. ‘Aesthetics’ was also defined as a common ground between the domains of chess problem composition and real games, since both are not without beauty. The necessary experiments were performed and the results showed that the computer could indeed apply the model to recognize aesthetics within the game (given three-movers) in a way that correlated positively and well with domain-competent human assessment. This had never been demonstrated before and was a significant contribution to the pool of knowledge in artificial intelligence. It also had practical applications, such as allowing the aesthetic analysis of thousands upon thousands of chess problems and winning move sequences in games that would be far too difficult for humans to do reliably.

During my study, I worked with many chess experts as well. In order to understand all this satisfactorily, you will probably have to read my thesis in its entirety. There are no shortcuts, just as there were none for me in preparing it. Even though we tend not to read as much as we used to (we see more pictures and videos now), we are not yet at the stage where we can depend on a computer to comprehend complex written texts and answer intelligent questions intelligently, thereby saving us a lot of time and effort. I was awarded my Ph.D. by the University of Malaya which, when I graduated, was the top university in Malaysia. They also had a policy that, in addition to my own supervisor, there would be one internal and two external reviewers, all of whom must be full professors with related expertise. The external reviewers must also be from overseas, which in my case happened to be from renowned universities in the UK and Australia. My thesis was with them for seven months.

Also, unlike at some institutions, all four professors (including my supervisor) must unanimously agree that the Ph.D. be awarded. It is not unusual or uncouth for one who has successfully attained a Ph.D. to put the ‘Ph.D.’ title at the end of their name in scientific reports or articles, just as medical doctors put ‘M.D.’ at the end of theirs. Chess grandmasters are also known for putting ‘GM’ at the front of theirs in articles, and it is ironic and hypocritical that any of them should think doing this is cocky or ‘showing off’. Given all this, for anyone to suggest that the aesthetics model developed lacks credibility is to declare one’s ignorance or preconceptions about me and the region I am from. While I have enjoyed visiting the West many times on business and pleasure, I have never had any intention of actually working or staying there. Not once at any stage of my education or career have I ever even applied to do so. I am quite happy living in and serving my own country. Thanks to the growth of the Internet, we are living in a virtually borderless world anyway.

The three-mover aesthetics model I developed and tested was furthermore extended in our 2012 IEEE paper, also with the help of chess experts (who also happen to be Ph.D. holders), to include not only three-movers but also studies (and, logically, longer mates). The paper, of course, was thoroughly peer-reviewed and had to be revised before being accepted for publication, so the extended model is also ‘experimentally validated’. It is true that in validating the model, the average of three cycles of evaluation by Chesthetica (as opposed to just one cycle) was tested for each move sequence, but using just one cycle in future experiments is valid as well because, like a human judge of aesthetics, Chesthetica may or may not deliver exactly the same evaluation each time it looks at the same sequence (you will have to read the IEEE paper carefully to learn why this works well). This phenomenon is of little concern because the program’s consistency and reliability ‘over time’ has already been demonstrated by taking the average of its evaluations using multiple cycles. It does not imply that multiple cycles should always be used in the future and that only crisp, unchanging aesthetic values are acceptable for each sequence. The possibility of slight variations in an aesthetic evaluation makes the model more dynamic yet still, on average, consistent and reliable (much like a human judge). As for replication of experimental results, that depends on the hypothesis. Should the original hypothesis have stated that a single cycle of evaluation be used, then the replication of the experiment should use a single cycle as well and the result accepted, whatever it may be. Analogously, the p-value threshold for statistical significance should also be determined beforehand and not reset after the experiment to better suit the results (e.g. changing it from 0.01 to 0.05).
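The multiple-cycle argument can be sketched numerically. The snippet below is a minimal illustration only, and not Chesthetica's actual code: `noisy_aesthetic_score` is a hypothetical stand-in for a single evaluation cycle, and the noise level is an invented assumption. It simply shows why averaging a few slightly varying evaluations yields a value that is consistent and reliable 'over time'.

```python
import random
import statistics

def noisy_aesthetic_score(true_score, noise=0.05, rng=random):
    """Hypothetical stand-in for one evaluation cycle: the 'true'
    aesthetic value plus slight run-to-run variation."""
    return true_score + rng.uniform(-noise, noise)

def averaged_score(true_score, cycles=3, rng=random):
    """Average several evaluation cycles, as in the validation
    experiments described in the text."""
    return statistics.mean(
        noisy_aesthetic_score(true_score, rng=rng) for _ in range(cycles)
    )

# With a fixed seed: single-cycle scores wobble slightly around the
# true value, while three-cycle averages wobble noticeably less.
rng = random.Random(42)
singles = [noisy_aesthetic_score(1.8, rng=rng) for _ in range(1000)]
averages = [averaged_score(1.8, rng=rng) for _ in range(1000)]
```

Either mode is defensible; the point is only that a single cycle already varies within a narrow band, and averaging tightens that band further.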



While I have written many papers, to the extent memory serves, I do not self-publish at all, even though admittedly some of my publications are certainly better or more prestigious than others (as with any academic). For example, Britannica invited me to write the entry for “computational aesthetics” (my Ph.D. field of study) in their encyclopedia, and Springer recently published our book on the DSNS approach that my Chesthetica software uses to create original chess problems. Not to mention the many papers published in the ICGA Journal, a reputable computer games journal with a high standard of publication; Ken Thompson published there too. I also have papers in high-ranking AI conferences such as AAAI and IJCAI. I do not ordinarily like to draw attention to these things but, when questioned, I suppose I must set the record straight. As a side note, it is probably not a good idea to prepare conference slides at the last minute because typos may show up, and you really cannot tell how seriously some people might take things like that and use them to draw conclusions about you.

As for ‘impact factor’, academics are well aware of its limitations, which interested readers might care to look up as well. In short, it is not necessarily a good indicator of the quality of any particular piece of research work. For instance, a paper essentially reminding us yet again about the dangers of consuming too many burgers, fries or sodas could have a high impact factor largely because it is published in a popular medical journal, because medical science tends to get the most research funding, and because many medical researchers tend to study our eating habits (a lot more people than those looking into, say, the computational aesthetics aspect of chess). On the other hand, I write for ChessBase (with no impact factor) because it is a kind of ‘community service’. As academics, we are expected to convey our research to the public in more palatable and widespread forms than just technical papers, which tend to be of rather limited distribution and beyond the typical scope of understanding of the layman and, in many cases, even of those from outside the particular field.

With regard to my chess-playing expertise, I never bothered to obtain an official chess rating even though I have been playing casually for 30 years and have won several medals in local tournaments. In fact, I am quite confident I could last at least 20 moves even against Magnus Carlsen under tournament conditions. If I had an official Elo rating, my probability of winning that match (or my ‘expected score’) could indeed be calculated and would probably be so low one might think I would lose faster than Bill Gates. I could probably beat him too, by the way. So, yes, I can say that I do indeed “know how to play” but would not consider myself an ‘official master’ at the game. The truth is, given my line of work, I simply do not need to be a chess master as there are many official chess masters only too happy to assist and work with me on projects. I am frankly quite amazed at how open-minded and forward-thinking some of them are.
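The ‘expected score’ mentioned here comes from the standard Elo model, under which player A's expected score against player B is 1 / (1 + 10^((R_B − R_A) / 400)). A minimal sketch follows; the ratings used (1600 for a hypothetical club player, 2850 for a world-champion-level player) are chosen purely for illustration and do not come from the article.

```python
def elo_expected_score(r_a, r_b):
    """Expected score of player A (rated r_a) against player B (rated
    r_b) under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Hypothetical illustration: a ~1600 club player against a ~2850 player.
long_shot = elo_expected_score(1600, 2850)
```

For a 1,250-point gap the expected score works out to roughly 0.00075, i.e. well under one point per thousand games, which is the sense in which such a probability "would probably be so low".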

The same can be said about scientists who study, say, bodybuilding. They are not, and need not necessarily be, renowned bodybuilders themselves (though they probably do work with a few). Having said all that, I do not think I am smarter or “more intelligent than everyone else”. It is not as if my IQ is in the 180 range or anything like that. I took a scientifically accurate test back in 2003 and it was only 131, with the “unusual distinction of being equally good at math and verbal skills”. It is unfortunate that some people interpreted the original article here on ChessBase to be misogynistic, having “gratuitous sexist content”, and claimed that perhaps I did not even know any women. I was also ‘threatened’ that my academic standing and credibility would be undermined by all this and that I should think about my future in academia. Untrue on all counts, I would say. I have known plenty of women in my time. At last count, 52 from 23 different countries, as a matter of fact; and most of them would only have nice things to say about me, I am fairly confident.

As for academic standing, I am more concerned about scientific truth than what the effects of revealing it might have on my career. Certainly, not revealing it (the file-drawer effect) or trying to bury it without a good enough reason would have a greater effect on society (myself included). Besides, not all academics are so desperately looking for tenure or its equivalent and would ‘do or conceal anything’ to get it. Some of us (though I am not necessarily claiming to be in this group) – and presumably just like some grandmasters – may also be independently wealthy and could retire tomorrow if we pleased; never having to work another day in our lives. So now, after hopefully having set the record straight on these matters, let us look into some of the other concerns about the experiments in my paper that suggested men play more beautiful chess than women.

The first thing one should realize when reading a scientific paper is that there is probably always a scope specified (e.g. three-move mate sequences). There ought to be. Scientists do not claim to know everything and the scope serves as an indicator about the extent to which whatever was being tested was actually tested or could be tested. This does not mean nothing useful can be said about the subject matter. For instance, we may only know how certain parts of the brain function with respect to certain aspects of human activity, but that does not mean those findings are useless until and unless neuroscientists know how the whole brain works with regard to all of human activity. Science is a cumulative and self-corrective process.

Now, some ‘experts’ may feel their personal or collective intuitions about certain things trump experimental validation. However, from a scientific standpoint they are wrong. What you need to trump experimental validation is more or better experimental validation. ‘Common sense’ is not a scientific argument and has been known to be wrong or misleading, just as one might be inclined to think that a bowling ball would hit the ground faster than a feather when both are dropped from the same height in a vacuum chamber. So if anyone would like to analyze longer or different types of sequences in chess using, say, some other method, they will first need to develop and experimentally validate their own aesthetics model for those types of sequences; otherwise all they have is essentially just personal (and quite probably biased) intuition. Being a master player does not help you scientifically here.

Moving on to the perfectly valid question of whether playing strength correlates with aesthetics: in other words, do stronger players play more aesthetically? In the original study, that was not taken into account, but the study does contrast the aesthetics of play between two engines, i.e. Rybka 3 vs. Fritz 8 (10+10) and Rybka 3 vs. Fritz 8 (1+1), scoring, on average, 1.979 and 1.992, respectively. The difference was not statistically significant. This would suggest that playing strength is not necessarily relevant to beauty. However, I did happen to have two older databases with me of 1,000 randomly selected games each that ended in mate: one between players with an Elo rating above 2,500 and the other between players with an Elo rating below 1,500. The games were sourced from Big Database 2011 and gender was irrelevant here, even though most were likely games between men, especially in the first set (so perhaps the result does not even apply to games between women). The average aesthetics scores (using the same statistical approaches as described in the original paper) were 1.815 and 1.693, respectively, and the difference was indeed statistically significant. So this further suggests that playing strength is relevant to the aesthetics of three-move mating sequences that result from play between humans.
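For readers curious how such a difference in average scores is typically checked for statistical significance, here is a hedged sketch. Only the two averages (1.815 and 1.693) and the sample size of 1,000 come from the text; the per-game spread of 0.4 is an invented assumption, the normally distributed simulated scores are not the real data, and the original paper may well have used a different test.

```python
import math
import random
import statistics

def two_sample_z_test(sample_a, sample_b):
    """Difference-of-means test using a Welch-style standard error and
    a normal approximation (reasonable at ~1,000 observations per
    sample). Returns (z, two_sided_p)."""
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    se = math.sqrt(statistics.variance(sample_a) / len(sample_a)
                   + statistics.variance(sample_b) / len(sample_b))
    z = mean_diff / se
    p = 2.0 * (1.0 - statistics.NormalDist().cdf(abs(z)))
    return z, p

# Simulated score lists: the means match the averages reported in the
# text, but everything else here is an illustrative assumption.
rng = random.Random(0)
stronger = [rng.gauss(1.815, 0.4) for _ in range(1000)]
weaker = [rng.gauss(1.693, 0.4) for _ in range(1000)]
z, p = two_sample_z_test(stronger, weaker)
```

This also illustrates the earlier point about fixing the significance threshold (say, p < 0.01) before running the test rather than adjusting it afterwards.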

What are the implications of this? Should playing strength have been taken into account in the original study, so that only games between women within the same Elo range as the games between men were used? Perhaps it should have been, but unfortunately this was not possible without tampering with the selection process, which is supposed to be random: there is no automatic (and unbiased) way to search for players based on their gender, and there were simply not enough games between women in the database ending with ‘exclusivity’ (read the original paper to learn what this means) and mate that were also within any particular Elo range. Of course, most games between strong players do not even end in mate (they tend to resign), but again, the aesthetics of games like that are at present not scientifically testable. Besides, in comparing samples of the same kind (i.e. three-move mate sequences) drawn from a normal population (i.e. whatever was randomly obtainable from the database), the differences between men and women are still valid (within that scope, obviously).

The original study minimized introducing any kind of bias into the samples (of both men and women) by assuming that whatever was in the 6+ million game database used was an unbiased representation of games played throughout the world by both men and women. If it happens that there was a greater number of strong male players than female players in that database, and therefore the samples of each used also reflected that and thus the games between females would necessarily score lower aesthetically... well, that raises the question: why, in a normal population, are there more games by stronger male players to begin with? This is not something that can be ‘adjusted for’ without introducing bias into the samples. If I were to select only specific, strong female players to compare against specific, strong male players... that would introduce so much bias I would have to justify how and why each of those players was chosen. It does not reflect what is typically found in the real-world population of players and what can realistically be selected at random from it. Now, imagine the additional biases introduced if arbitrary, ‘manual’ filters based on age were also applied.

Similarly, if I were to test a mixture of longer mates and study-like endings (which Chesthetica can also analyze aesthetically now) along with three-movers, arguments could be made that one sample had more of one type of mate or ending than the other sample and that affected the outcome because experiments also show that studies score, on average, higher aesthetically using the model than mates. Never mind yet the issue of deciding how far back one needs to go from the ending of a game to determine where the ‘study’ starts and how that decision was made for each game (talk about introducing bias!). This is why a doable, testable scope and consistency in experimentation is paramount. Otherwise, it makes the conclusions and implications of the research only more tenuous. So in summary, the original study assumed that the database used had an ‘as-fair-as-can-get-without-introducing-bias’ distribution of games between men and between women and compensated further for bias by using the average aesthetics score.

The two games shown in the original ChessBase article, for instance, should therefore not be seen as comparing apples and oranges but rather as what the aesthetics model thinks of the sequences themselves, independent of who the players are or the conditions under which those moves were made (something humans might find very difficult to ignore). Additionally, by themselves, these two sequences are not ‘proof’ of anything and were never intended to be. The samples of 1,069 games each that were used surely also contained some games between women that were of higher quality than some of the games between men. This is the beauty of random selection and the bell curve. Hence the necessity of comparing only averages and not drawing grand conclusions from individual games or sequences. More games, I suppose, could have been used (e.g. by artificially flipping the colors where Black mates and treating the position as White mates, in games that never actually occurred in that form) but again, this would have introduced bias, especially if playing with the white or black pieces influences the way people play at all. So since both samples featured only White wins (as is standard for most chess problems), comparisons between them are technically still valid. Besides, the randomly selected 1,069 games in each sample were considered a sufficient number for experimental purposes.
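The point about averages and the bell curve can be illustrated with simulated data. Everything below is hypothetical (the means, the spread and the seed are all invented); it merely shows that two overlapping bell curves can differ on average even though individual draws from the 'lower' sample routinely beat draws from the 'higher' one, which is why individual games prove nothing either way.

```python
import random
import statistics

# Hypothetical aesthetic scores: two overlapping bell curves whose
# population means differ only slightly (all numbers invented).
rng = random.Random(7)
sample_a = [rng.gauss(1.80, 0.5) for _ in range(1069)]
sample_b = [rng.gauss(1.70, 0.5) for _ in range(1069)]

# The best individual game in the lower-scoring sample still beats the
# weakest game in the higher-scoring sample; only the averages separate.
best_of_b, worst_of_a = max(sample_b), min(sample_a)
mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
```

With 1,069 draws per sample, the sample means sit very close to their population values, so comparing averages is far more informative than comparing any pair of individual games.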

What, then, about games between men and women, or games between higher and lower rated players? How does the aesthetics analysis account for these? In the original study, games between men and women were scarcer still and virtually impossible to obtain automatically and randomly, so that is why they were not used. As was pointed out to me, it is also the case that there are ‘women only’ tournaments but no ‘men only’ tournaments: a strange (perhaps even sexist) double standard that automatically excludes men (even low Elo ones) from some tournaments but does not exclude women from any. This would further explain the aforementioned scarcity. As for higher rated players versus lower rated players, this was considered to introduce more variability (read as ‘lack of consistency’) into the samples compared to using players of about the same rating. For instance, if the mate occurred as a result of a 2,500 Elo player defeating a 1,600 Elo player (I am guessing such games rarely take place to begin with), the larger gap in rating points (i.e. 900) would inherently introduce more things to be accounted for than if the difference was only, say, 150 Elo points.

There is also no evidence that a large Elo gap necessarily permits the stronger player to play more beautiful chess, but it is certainly something I could test in future experiments given sufficient data. I do not know if there were necessarily more games like this in the female sample used in the original study, but trying to find out and then arbitrarily deciding which ones to include and which ones to reject (and then doing the same for the male sample) would, once again, introduce more bias than the source database itself yielded automatically and with no interference from me. How about the argument that ‘forced’ three-move mate sequences undermine creativity and aesthetics? Well, in previous research work (Aesthetics in Mate-in-3 Combinations: Part II: Normality, ICGA Journal, December 2010), I have shown that, according to the experimentally validated model and in the case of games between human players, forced mates are, on average, actually no different aesthetically from those that are not forced.

A human player or composer may, however, be influenced to think somewhat less of a sequence that is not forced upon doing some deeper analysis of the position. This is why forced sequences are typically considered more beautiful and are preferred in experiments: eventually, humans are going to perceive them. Again, as long as both samples are similar in the sense of being forced (or unforced) mates, the comparisons between them are more credible than, say, if one sample was forced and the other was not. Now, do not get me wrong. Overall there are probably several dozen if not hundreds of different variations or permutations of the original study that could also have been done by filtering this out and compensating for that in order to test specifically for this with regard to that, but those, precisely, are other experiments with different scopes and different sets of constraints and limitations. I really do hope there are people who can find the funding and time to do them all; armchair commentators included. I would certainly be interested to read about the results and happy that they have contributed to the literature on the subject, in however small a way.

Finally, the conclusions of the original study are actually supported by the fact that in the world of chess problem composition (typically having the highest aesthetics scores, even according to my model), the best compositions (if not just about all of them) are by men. It could be that the male ‘patriarchy’ of the composition world is secretly dismissing some of the most fantastic compositions ever composed simply because they are submitted by women, or it could be that women, in general, are less interested in the aesthetics of chess for reasons that neuroscientists might be curious about (assuming learning more about the physiological differences between male and female brains and their implications is not yet considered forbidden research). I will leave it to readers to decide for themselves which explanation is more likely. I have no vested interest in the outcome and am more interested in the truth. I doubt women are so feeble-minded and lacking in confidence as to be discouraged from chess by findings such as this, and even if the findings turn out to be true, it is not necessarily something that cannot be compensated for with the right tutelage from a man (or woman) of greater skill. If anything, I hope the original study motivates even more women to play the game and to enter the world of chess problem composition, to prove they are indeed equal or even superior to men in this regard as well.

Wrapping up, let me also congratulate Google’s DeepMind on AlphaGo’s victory over humanity’s Go champion, Lee Sedol. I had absolutely no doubt this would happen and can only wonder why it took so long to achieve. By the way, Google, if you happen to have one of your quantum computers just lying around doing nothing, I would love to plug Chesthetica into it for a while for some serious computational creativity DSNS processing if the two are compatible. Just kidding (well, not really). Anyway, good show and respect to all the industry big boys out there breaking new ground and taking board game AI seriously.

This article was first uploaded to ResearchGate and you may contact Dr Azlan Iqbal via e-mail with any further questions or concerns you may have, at his official email address with a c.c. to his private address. Yes, he does reply to the best of his ability.

Previous ChessBase articles by Prof. Azlan Iqbal