Welcome to my data analysis blog! This is my first project playing around with R, which I'm completely new to. In this project I'll be looking at the dataset of chess games at http://chess-research-project.readthedocs.io/en/latest, which as a chess player is of major interest to me. The dataset contains over 3.5 million games between highly rated players. I'll start by loading and cleaning the data, then do some basic analysis.

The code for this project is on my Github at https://github.com/wanshun123/chess-dataset.

Getting started: loading the data into a data frame

First I’ll load the libraries I’ll be using during this project, as follows:

library(tidyverse)
library(sqldf)
library(ggplot2)
library(ggpubr)
library(lubridate)

Now we'll load the file "all_with_filtered_annotations.txt" into a data frame, using the read_lines function from the tidyverse (readr) package. From the website, we know the first 5 lines of this file are just comments, and every line after that is a single chess game (along with plenty of information about that game). The following command reads the text file, skipping the first 5 lines, and assigns the result to the variable df:

df <- read_lines("all_with_filtered_annotations.txt", skip = 5)

Per the website, each line of data here should contain the moves of a chess game plus 16 other pieces of information, such as the date the game was played, the rating of the white player, the rating of the black player, the result of the game and so on. The website gives a sample of what the first 8 lines look like – the first line after the 5 lines of comments is like this:

1 2000.03.14 1-0 2851 None 67 date_false result_false welo_false belo_true edate_true setup_false fen_false result2_false oyrange_false blen_false ### W1.d4 B1.d5 W2.c4 B2.e6 W3.Nc3 B3.Nf6 W4.cxd5 B4.exd5 W5.Bg5 B5.Be7 W6.e3 B6.Ne4 W7.Bxe7 B7.Nxc3 W8.Bxd8 B8.Nxd1 W9.Bxc7 B9.Nxb2 W10.Rb1 B10.Nc4 W11.Bxc4 B11.dxc4 W12.Ne2 B12.O-O W13.Nc3 B13.b6 W14.d5 B14.Na6 W15.Bd6 B15.Rd8 W16.Ba3 B16.Bb7 W17.e4 B17.f6 W18.Ke2 B18.Nc7 W19.Rhd1 B19.Ba6 W20.Ke3 B20.Kf7 W21.g4 B21.g5 W22.h4 B22.h6 W23.Rh1 B23.Re8 W24.f3 B24.Bb7 W25.hxg5 B25.fxg5 W26.d6 B26.Nd5+ W27.Nxd5 B27.Bxd5 W28.Rxh6 B28.c3 W29.d7 B29.Re6 W30.Rh7+ B30.Kg8 W31.Rbh1 B31.Bc6 W32.Rh8+ B32.Kf7 W33.Rxa8 B33.Bxd7 W34.Rh7+

So we can see each item of information about a game is separated by a space " ", and if we're to load all this data into a table, each of these properties should become a column. Note though that the moves of the game (at the end of the line) are also separated by spaces, and of course we don't want a separate column for each game move – rather one column containing all the game moves. The following command splits each line at the first 17 spaces only, producing 18 pieces – the last piece, containing all the game moves, is left unsplit:

df <- str_split(df, " ", 18, TRUE)

The TRUE parameter (simplify) turns df into a character matrix (if it were FALSE, df would be a list of character vectors); matrices are used when all data is of the same type. What we want though is a data frame, since different columns will be of different types (e.g. a player's rating would be numeric, and most of the other columns would be logical, i.e. TRUE/FALSE). The following command turns df into a data frame (a tibble):

df <- as_tibble(df)

Now we can see what our data frame looks like:

view(df)

So far so good. There are over 3.5 million rows in this data frame and the next step is to clean it up.

Cleaning the data

First thing you might notice is that the V17 column is useless, just containing “###”. That can be removed as follows:

df <- subset(df, select = -c(V17))

We also want to give proper names to each column, rather than them just being V1, V2 etc. This can be done as follows (the website helpfully gives an explanation of what each column is):

colnames(df) <- c("number", "date", "result", "white_rating", "black_rating",
                  "total_moves", "date_missing", "result_missing",
                  "white_rating_missing", "black_rating_missing",
                  "event_date_missing", "different_start_position",
                  "different_start_position2", "result_not_properly_provided",
                  "out_of_year_range", "bad_length", "moves")

We can immediately start checking if there are rows of data here that are useless or make no sense. For example, we should see if there are any rows where the game moves are missing, and remove them. As I’m most familiar with SQL, I’ll be using the sqldf package to query the data frame just as if it were a table in a database. The following command will count how many rows there are where the total_moves column is 0:

sqldf('select count(*) from df where total_moves is 0')

That returns 37978. So there are that many rows that don’t even have any game moves, according to the total_moves column. Those can be removed as follows:

df <- df[df$total_moves != 0,]

If we then run the query again to check how many rows have no game moves, it should come back with 0. Next let's check whether any games have a strange result. In chess there are only 3 possible results: 1-0 (White won), 0-1 (Black won) and 1/2-1/2 (draw). Let's see if the result column contains anything other than that. Using unlist returns all the values in the column, and table() then counts each distinct value:

table(unlist(df$result))

That returns the following:

      *     0-1     1-0 1/2-1/2
     10 1093539 1362354 1067589

So almost every row has a valid result, except for 10. To remove those we can use this command:

df <- df[-grep("1-0|0-1|1/2-1/2", df$result, invert = TRUE),]

That says: find everything in the result column that doesn't contain "1-0", "0-1" or "1/2-1/2" (invert = TRUE matches the items that don't match the pattern), and remove those rows (via the minus sign on the index).
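A toy example makes the invert behaviour clear (a hypothetical vector, not the real data):

```r
# Hypothetical vector of results, including two invalid entries
results <- c("1-0", "0-1", "*", "1/2-1/2", "?")

# Indices of entries that do NOT match any valid result
bad <- grep("1-0|0-1|1/2-1/2", results, invert = TRUE)
bad             # 3 5

# Dropping those indices leaves only valid results
results[-bad]   # "1-0" "0-1" "1/2-1/2"
```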

There's plenty more data cleaning we could still do. Beyond rows that have no game moves or an invalid result, we could easily remove rows that are missing a date (the date_missing column tells us if that's the case), rows where the white or black player's rating is missing, or even rows where the white or black rating is too low for the game to be considered high quality. However, for the purposes of this project I'm not concerned about those being included, so let's go to the next step.

Casting the data

Earlier I mentioned that we want the data in a data frame, as that allows for each column to be a different data type, such as numeric or logical. Currently all the columns in our df data frame are just of type character (chr), as can be verified by checking the structure of df as follows:

str(df)

This returns the following:

Classes 'tbl_df', 'tbl' and 'data.frame': 3523492 obs. of 17 variables:
 $ number       : chr "1" "2" "3" "4" ...
 $ date         : chr "2000.03.14" "2000.03.14" "1999.11.20" "1999.11.20" ...
 $ result       : chr "1-0" "1-0" "1-0" "1-0" ...
 $ white_rating : chr "2851" "2851" "2851" "2851" ...

etc.

Let’s turn each column into a suitable data type. To begin with, the number, white_rating, black_rating and total_moves columns should all be of type numeric, since they are just a single number:

df$number <- as.numeric(df$number)
df$white_rating <- as.numeric(df$white_rating)
df$black_rating <- as.numeric(df$black_rating)
df$total_moves <- as.numeric(df$total_moves)

We can also parse the date column into a proper year-month-day date (the ymd function is part of the lubridate package):

df$date <- ymd(df$date)

All the remaining columns (except for the moves column which can remain as a character type), such as date_missing, result_missing and so on, are all just true or false values and should therefore be cast to logical. In the date_missing column, for example, it’ll have values of either date_false or date_true, depending on whether the date is missing, and we want to turn the whole column into a logical type where date_false will be replaced with FALSE and date_true will be replaced with TRUE.

The following large block of code goes through the 2 possible values of each column we want to change to a logical type, setting each value to TRUE or FALSE as appropriate (and if a value is anything other than the 2 we expect, it'll be set to NA):

df <- df %>% mutate(
  date_missing = case_when(date_missing == "date_false" ~ FALSE,
                           date_missing == "date_true" ~ TRUE,
                           TRUE ~ NA),
  result_missing = case_when(result_missing == "result_false" ~ FALSE,
                             result_missing == "result_true" ~ TRUE,
                             TRUE ~ NA),
  white_rating_missing = case_when(white_rating_missing == "welo_false" ~ FALSE,
                                   white_rating_missing == "welo_true" ~ TRUE,
                                   TRUE ~ NA),
  black_rating_missing = case_when(black_rating_missing == "belo_false" ~ FALSE,
                                   black_rating_missing == "belo_true" ~ TRUE,
                                   TRUE ~ NA),
  event_date_missing = case_when(event_date_missing == "edate_false" ~ FALSE,
                                 event_date_missing == "edate_true" ~ TRUE,
                                 TRUE ~ NA),
  different_start_position = case_when(different_start_position == "setup_false" ~ FALSE,
                                       different_start_position == "setup_true" ~ TRUE,
                                       TRUE ~ NA),
  different_start_position2 = case_when(different_start_position2 == "fen_false" ~ FALSE,
                                        different_start_position2 == "fen_true" ~ TRUE,
                                        TRUE ~ NA),
  result_not_properly_provided = case_when(result_not_properly_provided == "result2_false" ~ FALSE,
                                           result_not_properly_provided == "result2_true" ~ TRUE,
                                           TRUE ~ NA),
  out_of_year_range = case_when(out_of_year_range == "oyrange_false" ~ FALSE,
                                out_of_year_range == "oyrange_true" ~ TRUE,
                                TRUE ~ NA),
  bad_length = case_when(bad_length == "blen_false" ~ FALSE,
                         bad_length == "blen_true" ~ TRUE,
                         TRUE ~ NA))
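The ten near-identical case_when() calls could also be collapsed into a single mutate(across(...)), assuming every flag value really does end in _false or _true. Here's a sketch on a toy tibble (not run on the real data; in the real frame, flag_cols would list all ten flag columns):

```r
library(dplyr)

# Toy stand-in for two of the ten flag columns
toy <- tibble::tibble(date_missing = c("date_false", "date_true"),
                      bad_length   = c("blen_true", "blen_false"))

flag_cols <- c("date_missing", "bad_length")

# One transformation applied to every flag column, keyed off the suffix alone
toy <- toy %>%
  mutate(across(all_of(flag_cols),
                ~ case_when(endsWith(.x, "_false") ~ FALSE,
                            endsWith(.x, "_true")  ~ TRUE,
                            TRUE ~ NA)))
```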

Now calling str(df) on df will show that all the columns are of a suitable type according to the data they contain, such as numeric, date, character or logical, rather than all just being character:

Classes 'tbl_df', 'tbl' and 'data.frame': 3523482 obs. of 17 variables:
 $ number       : num 1 2 3 4 5 6 7 8 9 10 ...
 $ date         : Date, format: "2000-03-14" "2000-03-14" "1999-11-20" "1999-11-20" ...
 $ result       : chr "1-0" "1-0" "1-0" "1-0" ...
 $ white_rating : num 2851 2851 2851 2851 2851 ...
 $ black_rating : num NA NA NA NA 2633 ...
 $ total_moves  : num 67 53 57 49 97 52 79 71 72 49 ...
 $ date_missing : logi FALSE FALSE FALSE FALSE FALSE FALSE ...

etc.

Visualising some of the metrics

To get a better feel for this data it’s worth first graphing some of the metrics such as player ratings, results and the length of games.

Typical number of moves in a chess game

ggplot(df, aes(x = total_moves)) + geom_histogram(binwidth = 5)

This is a count of how many games are of a certain number of total moves, in groups of 5 moves at a time (binwidth = 5). Looking at that graph there must be at least one game with a monster number of ~600 total moves, so we can make a new graph with an x axis going to a max of 200 to get rid of that and see things a bit clearer. Let’s also change the binwidth to 1 instead of 5:

ggplot(df, aes(x = total_moves)) + geom_histogram(binwidth = 1) + xlim(0, 200)

Note this is total moves (i.e. half-moves), so for example a game of 1. e4 c5 counts as two moves, not one. Two things here look a little peculiar: first, the surge in games ending at around 80 and 120 total moves; second, the sawtooth spikes at every other value of total_moves – if we tabulate total_moves we see something like the following (only part shown):

table(unlist(df$total_moves))

   51    52    53    54    55    56
42955 35061 44714 36743 46481 38422

I'd expect that to be because White wins more often than Black, and players almost always resign on their own move (not their opponent's) – and every odd value of total_moves means White made the last move. This is easily verified. The command below divides the number of games where total_moves is odd (total_moves % 2 == 1), meaning White made the last move, and White won (result is "1-0"), by the number of games where Black made the last move (total_moves % 2 == 0) and White won:

sqldf('select count(*) from df where total_moves % 2 == 1 and result is "1-0"') /
  sqldf('select count(*) from df where total_moves % 2 == 0 and result is "1-0"')

This returns 10.07794, so there are over 10 times as many games where White won and White made the last move as games where White won and Black made the last move. Let's also check how much more often White wins compared to Black:

sqldf('select count(*) from df where result is "1-0"') /
  sqldf('select count(*) from df where result is "0-1"')

That returns 1.245821, ie. White wins nearly 25% more often than Black. So we can see why there are far more games with an odd number of total moves than an even number.
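The same ratios can also be computed without sqldf using vectorized base R; here's a sketch on a toy frame (toy counts, not the real ones):

```r
# Toy frame standing in for df
toy <- data.frame(total_moves = c(67, 53, 40, 41, 58),
                  result = c("1-0", "1-0", "1-0", "0-1", "1/2-1/2"))

white_won <- toy$result == "1-0"
odd_moves <- toy$total_moves %% 2 == 1

# Games White won with the last move, vs games White won where Black moved last
sum(white_won & odd_moves) / sum(white_won & !odd_moves)   # 2

# How often White wins per Black win
sum(toy$result == "1-0") / sum(toy$result == "0-1")        # 3
```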

Why there should be a spike in games with 80 and 120 total moves is unclear to me and an exercise for later.

White wins, Black wins, draws

ggplot(df, aes(x = result)) + geom_bar()

We’ve already covered this and there’s nothing terribly interesting here. White wins significantly more than Black, and there are a tad more Black wins than draws.

White and Black ratings

ggplot(df, aes(x = white_rating)) + geom_histogram(binwidth = 5)

ggplot(df, aes(x = black_rating)) + geom_histogram(binwidth = 5)

The rating distributions for White and Black are essentially identical, which you'd expect, since any player should play both colours about equally. While in the vast bulk of games ratings are above 2000 (which could certainly be considered a fairly strong player), a significant number of games involve weaker players, and it'd be reasonable to drop those if we only wanted to analyse games between strong players (or we could create two data frames – say, games where both players are rated over 2000, and the rest – and compare metrics between the two).

ggscatter(df, x = "white_rating", y = "black_rating", cor.coef = TRUE)

As expected, White and Black ratings are very strongly correlated.

Years when games were played

The first 4 characters of the date column give the year the game was played, so we can extract just that part of the column (and make it a tibble so it can be graphed):

years <- as_tibble(substr(df$date, 1, 4))

And turn it into numeric:

years$value <- as.numeric(years$value)

If we graph this it’s obvious there are some years that make no sense:

ggplot(years, aes(value)) + geom_histogram(binwidth = 1)

According to the website the earliest game in this dataset was supposed to be played in 1783, though we’re picking up years before that, and it’s clear there’s very little before around 2000. Let’s try narrowing it down a bit:

ggplot(years, aes(value)) + geom_histogram(binwidth = 1) + xlim(1950, max(years$value))

We can see the vast majority of games took place from the late '90s to 2008. The roughly 30,000 games dated 2020 are obviously a mistake, assuming this dataset didn't come from sometime in the future 🙂

Opening move analysis

Now we get into the fun stuff where we can do some analysis on the success of different moves. To start with something fairly basic, I want to see what starting move with white is the most successful. Let’s say we want to find out what percentage of games where White begins with 1. e4 result in a win for White. That could be done as follows:

sqldf('select count(*) from df where moves like "%W1.e4%" and result is "1-0"') /
  sqldf('select count(*) from df where moves like "%W1.e4%"') * 100

The above finds the number of games where White started with 1. e4 and won, divides it by the total number of games where White played 1. e4, and multiplies by 100 to get a percentage. We get 38.94065.

As we want to test a number of different starting moves (there are 20 possible in total), it'd make much more sense to write a function where we pass a move in as a parameter and get back the percentage of games featuring that move that the given side won. Here's my attempt at that:

win_tendency <- function(move, colour = "W") {
  total_occurrences <- fn$sqldf('select count(*) from df where moves like "%$move%"')
  if (colour == "W") {
    wins <- fn$sqldf('select count(*) from df where moves like "%$move%" and result is "1-0"')
  } else if (colour == "B") {
    wins <- fn$sqldf('select count(*) from df where moves like "%$move%" and result is "0-1"')
  }
  return(wins/total_occurrences*100)
}

Then we can call it as follows, to see what percentage of games starting with 1. e4 result in a win for white:

win_tendency("W1.e4")

… And we get the same 38.94065. The function takes two parameters. The first is a move – any move by either side, not just a starting move, e.g. "B20.Bf5" – or even a sequence of moves such as "W1.e4 B1.c5 W2.Nf3". The second is the player colour, W or B (the function will fail if anything else is passed in), which determines which player we're calculating the win percentage for. By default the colour parameter is W, but if I change it to B I can calculate the percentage of games that Black wins when White plays 1. e4, e.g.:

win_tendency("W1.e4", "B")

That returns 32.64099, ie. Black wins 32.6% of the time when White plays 1. e4.
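Since win_tendency() fails ungracefully on any colour other than "W" or "B", here's a hedged sketch of the same calculation with match.arg() validation, using vectorized grepl() instead of sqldf (win_tendency2 and the toy games frame are hypothetical names, not from the code above):

```r
# Toy stand-in for df
games <- data.frame(
  moves  = c("W1.e4 B1.c5", "W1.e4 B1.e5", "W1.d4 B1.d5"),
  result = c("1-0", "0-1", "1-0")
)

win_tendency2 <- function(move, colour = c("W", "B"), data = games) {
  colour <- match.arg(colour)   # errors with a clear message on anything else
  target <- if (colour == "W") "1-0" else "0-1"
  hits <- grepl(move, data$moves, fixed = TRUE)   # games containing the move
  sum(hits & data$result == target) / sum(hits) * 100
}

win_tendency2("W1.e4")        # 50: White won 1 of the 2 toy games opening 1. e4
win_tendency2("W1.e4", "B")   # 50: Black won the other one
```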

We can also easily write a function to return the percentage of the time a certain move results in a draw:

draw_tendency <- function(move) {
  total_occurrences <- fn$sqldf('select count(*) from df where moves like "%$move%"')
  draws <- fn$sqldf('select count(*) from df where moves like "%$move%" and result is "1/2-1/2"')
  return(draws/total_occurrences*100)
}

Let’s run two popular starting moves for white, 1. e4 and 1. d4, through that:

draw_tendency("W1.e4")
draw_tendency("W1.d4")

This returns 28.41836 and 31.77669 respectively. That matches what I largely suspected: 1. d4 results in more draws than 1. e4. For the chess players reading, very generally speaking, 1. d4 tends to lead to quieter positional games (Queen's Gambit Declined type positions), while 1. e4 can lead to sharper games like the Sicilian.

Before testing other moves, it's apparent that the win_tendency() and draw_tendency() functions above are limited: they're slow, they don't save their results, and each only gives part of the picture. It'd be better to have one function that returns all the important statistics on a move – its total occurrences, wins, losses and draws – and saves the result in a data frame so we can look at it again later. Let's try that. First we'll create a data frame to store the information on each move:

movedf <- data.frame(matrix(ncol = 6, nrow = 0), stringsAsFactors = FALSE)
colnames(movedf) <- c("move", "colour", "occurrences", "win %", "loss %", "draw %")

And now a function that will take any move, find all relevant statistics on it, and add that information to the movedf data frame we just created:

move_stats_append_df <- function(move, colour = "W") {
  if (colour == "W") {
    wins <- fn$sqldf('select count(*) from df where moves like "%$move%" and result is "1-0"')
    losses <- fn$sqldf('select count(*) from df where moves like "%$move%" and result is "0-1"')
  } else if (colour == "B") {
    wins <- fn$sqldf('select count(*) from df where moves like "%$move%" and result is "0-1"')
    losses <- fn$sqldf('select count(*) from df where moves like "%$move%" and result is "1-0"')
  }
  draws <- fn$sqldf('select count(*) from df where moves like "%$move%" and result is "1/2-1/2"')
  total_occurrences <- wins + losses + draws
  result_list <- list("move" = move, "colour" = colour,
                      "occurrences" = total_occurrences,
                      "win %" = wins/total_occurrences*100,
                      "loss %" = losses/total_occurrences*100,
                      "draw %" = draws/total_occurrences*100)
  movedf[nrow(movedf) + 1,] <<- result_list
  return(list(result_list))
}

This function may look complicated but is straightforward. It just does the following:

1. Determines the number of wins, losses and draws for the given move, for either White or Black (depending on whether the function is called with the colour parameter = "B" or not)
2. Finds the total occurrences of the move by adding the wins, losses and draws
3. Puts all the information it has (move, colour, occurrences, win %, loss % and draw %) into a list called result_list
4. Adds result_list to the movedf data frame as a new row
5. Returns result_list to the user so they can see it immediately

We can try running every possible white starting move through this function, and it'll populate the movedf data frame. This takes a long time on my machine, and I'm sure there are much faster ways of doing this without using the sqldf package.

move_stats_append_df("W1.a3")
move_stats_append_df("W1.a4")
move_stats_append_df("W1.b3")
move_stats_append_df("W1.b4")
move_stats_append_df("W1.c3")
move_stats_append_df("W1.c4")
move_stats_append_df("W1.d3")
move_stats_append_df("W1.d4")
move_stats_append_df("W1.e3")
move_stats_append_df("W1.e4")
move_stats_append_df("W1.f3")
move_stats_append_df("W1.f4")
move_stats_append_df("W1.g3")
move_stats_append_df("W1.g4")
move_stats_append_df("W1.h3")
move_stats_append_df("W1.h4")
move_stats_append_df("W1.Na3")
move_stats_append_df("W1.Nc3")
move_stats_append_df("W1.Nf3")
move_stats_append_df("W1.Nh3")
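For reference, one much faster approach would extract each game's first move once and cross-tabulate against the result in a single pass, rather than running 60 sqldf queries. A sketch on toy data (toy and first_move are hypothetical names, not used above):

```r
# Toy stand-in for df
toy <- data.frame(
  moves  = c("W1.e4 B1.c5 W2.Nf3", "W1.e4 B1.e5", "W1.d4 B1.d5", "W1.d4 B1.Nf6"),
  result = c("1-0", "0-1", "1/2-1/2", "1-0")
)

# Everything up to the first space is White's first move
first_move <- sub(" .*", "", toy$moves)

# Occurrences and result breakdown per first move, in one pass
tab <- table(first_move, toy$result)
prop.table(tab, margin = 1) * 100   # win/loss/draw % for each first move
```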

Then the movedf data frame might as well be ordered by occurrences of each move, from highest to lowest. That actually requires two commands:

movedf <- as.data.frame(lapply(movedf, unlist))
movedf <- movedf[order(movedf$occurrences, decreasing = TRUE),]

Now viewing movedf gives some very interesting data:

Only relevant to chess players: As a 1. e4 player, this surprised me because it shows 1. d4 as an objectively better move – it wins only about 0.13% less often than 1. e4, yet loses over 3% less often. I'd have expected 1. e4 to lead to significantly more wins for White than 1. d4. 1. e4 also tends to lose more than the next four most common white moves (1. d4, 1. Nf3, 1. c4 and 1. g3), which didn't surprise me, as those often lead to fairly similar, quieter positions – nor did the fact that a crazier move like 1. f4 has a far higher loss percentage for White, since it easily results in very unbalanced games, especially after something like 1. f4 e5 (From's Gambit). Looking at the occurrences of each move, the vast majority of games start with a sensible move, which along with the ratings is a decent indication these games are of reasonable quality (if there were many amateur games here I'd expect a lot more 1. a4 and 1. h4).

Earlier I mentioned these functions can be used to look up a sequence of moves at any point (not just a single starting move), so let's try plugging in some of the most popular openings beyond the first move:

# Sicilian Najdorf
move_stats_append_df("W1.e4 B1.c5 W2.Nf3 B2.d6 W3.d4 B3.cxd4 W4.Nxd4 B4.Nf6 W5.Nc3 B5.a6")
# King's Indian
move_stats_append_df("W1.d4 B1.Nf6 W2.c4 B2.g6 W3.Nc3 B3.Bg7 W4.e4 B4.d6")
# Petrov
move_stats_append_df("W1.e4 B1.e5 W2.Nf3 B2.Nf6")
# Ruy Lopez
move_stats_append_df("W1.e4 B1.e5 W2.Nf3 B2.Nc6 W3.Bb5")
# Ruy Lopez / Berlin
move_stats_append_df("W1.e4 B1.e5 W2.Nf3 B2.Nc6 W3.Bb5 B3.Nf6")
# Dutch Defense
move_stats_append_df("W1.d4 B1.f5")
# Queen's Gambit Declined
move_stats_append_df("W1.d4 B1.d5 W2.c4 B2.e6")

Here’s what movedf then looks like for these (excuse the bad formatting):

Only relevant to chess players: The Sicilian Najdorf is the best opening for Black here, scoring an impressive 35% win rate – a much better choice when Black needs to win than something like the Petrov, which is far more drawish. A couple of things surprised me. First, how well the Dutch Defense did for Black – a fairly uncommon opening at top level that easily leads to unbalanced, unique positions – outperforming both the King's Indian and the much more conservative Queen's Gambit Declined, which actually did quite badly. Also interesting: the Berlin variation of the Ruy Lopez, which has a reputation for being solid and drawish, actually scored worse for Black than the Ruy Lopez as a whole.

Analyzing patterns of moves

So far I've only analyzed fixed opening moves. However, there are certain move combinations that could take place at any time and aren't exact. For example, I may want to pull all games where Black captured a piece on a1 with check, White moved their king, and Black then captured a piece on h1 (that would likely mean White sacrificed both rooks). To identify these games, regular expressions (regex) have to be used. Here's how I created a separate data frame containing only games that match the above pattern:

white_double_rook_sac <- df[grep("B\\d+\\.Qxa1\\+ W\\d+\\.K\\D\\d B\\d+\\.Qxh1", df$moves),]

If I have the regex right, it says: within the moves column of the df dataframe, keep only games where Black played Qxa1+ on any move, White responded by moving their king (the \\D\\d part matches the destination square, e.g. Ke2, Kf2, Kd2, etc.), and Black then played Qxh1. Those games are saved in a new dataframe white_double_rook_sac (I couldn't get the sqldf package working properly with regex, so I can't just plug the above regex into the move_stats_append_df() function).
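To sanity-check the regex, we can try it against a couple of hand-written move fragments (hypothetical strings, not games from the dataset):

```r
pattern <- "B\\d+\\.Qxa1\\+ W\\d+\\.K\\D\\d B\\d+\\.Qxh1"

# A fragment that should match: ...Qxa1+, a king move, then Qxh1
grepl(pattern, "W20.Rb1 B20.Qxa1+ W21.Kd2 B21.Qxh1 W22.Qxh1")   # TRUE

# A fragment where White plays a knight move instead of a king move: no match
grepl(pattern, "W20.Rb1 B20.Qxa1+ W21.Nxa1 B21.Rd8")            # FALSE
```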

The resulting white_double_rook_sac dataframe only has 100 rows compared to the original dataframe which has over 3.5 million rows, so it’s only once in a blue moon that white sacrifices two rooks this way. Let’s see the results of those games:

table(unlist(white_double_rook_sac$result))

That gives:

    0-1     1-0 1/2-1/2
     64      25      11

Unfortunately, most of the time White loses two rooks like this, they go on to lose the game. But presumably many of the 25 games White won involved brilliant, deliberate sacrifices. Because the white_double_rook_sac dataframe is so small, these games can easily be viewed by looking at the dataframe directly, or simply pulled as follows:

sqldf('select moves from white_double_rook_sac where result is "1-0"')

Percentage of games that are unique after a certain number of moves

Another thing I wanted to know is how many games are unique after a certain number of moves, i.e. how many cases there are where only ONE game has a given sequence of moves. Obviously, the more moves played, the higher the percentage of games that should be unique. As when analyzing opening moves, it makes sense to create a new data frame with a row for each move number (e.g. row 1 for games unique after 1 move by both players, row 2 for games unique after 2 moves by each player, and so on).

First we’ll create a data frame with two columns:

unique_games <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(unique_games) <- c("total_unique", "percentage_unique")

And now the function that’ll populate this data frame:

find_unique_games <- function(up_until) {
  vec <- substr(df$moves, 1, regexpr(up_until, df$moves) - 1)
  unique <- sum(!duplicated(vec) & !duplicated(vec, fromLast = TRUE))
  total_values <- length(grep(up_until, df$moves))
  percentage_unique <- unique/total_values*100
  result_list <- list("total_unique" = unique, "percentage_unique" = percentage_unique)
  unique_games[nrow(unique_games) + 1,] <<- result_list
}

The up_until parameter is the move number we look up to. For example, "W2\\." looks at only the first move by both players, while "W10\\." looks at the game up until both players have made nine moves. vec holds the moves of every game truncated at the up_until move, and unique is the number of strings in vec that occur exactly once. We can then easily add the number of unique games and the percentage of games that are unique to the unique_games data frame.
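The duplicated() idiom at the heart of the function is worth a toy illustration – it counts values that appear exactly once:

```r
vec <- c("a", "b", "a", "c")

duplicated(vec)                   # FALSE FALSE  TRUE FALSE (dupes scanning forward)
duplicated(vec, fromLast = TRUE)  #  TRUE FALSE FALSE FALSE (dupes scanning backward)

# A value is unique iff it's a duplicate in neither direction
sum(!duplicated(vec) & !duplicated(vec, fromLast = TRUE))   # 2 ("b" and "c")
```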

We can now try seeing how many games are unique for each move (we’ll do it up till move 20):

find_unique_games('W2\\.')
find_unique_games('W3\\.')
find_unique_games('W4\\.')
find_unique_games('W5\\.')
find_unique_games('W6\\.')
find_unique_games('W7\\.')
find_unique_games('W8\\.')
find_unique_games('W9\\.')
find_unique_games('W10\\.')
find_unique_games('W11\\.')
find_unique_games('W12\\.')
find_unique_games('W13\\.')
find_unique_games('W14\\.')
find_unique_games('W15\\.')
find_unique_games('W16\\.')
find_unique_games('W17\\.')
find_unique_games('W18\\.')
find_unique_games('W19\\.')
find_unique_games('W20\\.')

That gives the following:

That says that by the time both players have made 8 moves each, there's just over a 50% chance the game is entirely unique, i.e. no other game in the dataset has that sequence of starting moves. By the time a game reaches 19 moves played by both players, there's well under a 1% chance any other game has the same moves. Note this doesn't treat games that reach the same position through a different move order as being the same.

Tracking the Queens

It's also possible to get some stats on individual pieces in each game, such as their last position and whether they were captured. For this example I'm only tracking Queen moves, firstly because the Queen is the most powerful piece and secondly because it's far easier to track while each side only has one. Here are the steps to do this for White:

1. Find the index in the string of the last Queen move – for White this is the regex "W\\d+\\.Qx?" (e.g. a move like W10.Qf3, or a capture like W10.Qxf3)
2. If a Queen move was made, extract the two characters after the "W\\d+\\.Qx?" match to get the square (e.g. f3) – otherwise, if no match for the regex was found, mark the Queen's position as the default d1 (where it starts)
3. Search the remainder of the moves after the last Queen move to see if the two characters marking the Queen's square appear again – if so, an enemy piece must have moved to that square and captured the Queen

We'll add 4 new columns to the dataframe ("white_queen_last_position", "white_queen_captured", "black_queen_last_position" and "black_queen_captured") to store this information. While I was able to write a loop that goes through each game one by one and populates these columns, it ran into major performance issues – it turns out loops in R are often horrendously inefficient compared to most other languages. Each iteration was actually getting slower as the loop progressed, and after over 24 hours my computer still had a long way to go through all 3.5 million games. I posted on StackOverflow for help on making it faster (https://stackoverflow.com/questions/51296632/r-improving-performance-of-complicated-loop) and used another user's slightly modified version of my code, which uses the adply function and got the job done in under an hour. Here it is:

library(plyr)
library(stringi)  # for stri_locate_last_regex

add_queen <- function(datarow) {
  # white
  white_last_queen_position_index <- stri_locate_last_regex(datarow$moves, "W\\d+\\.Qx?")[2]
  if (is.na(white_last_queen_position_index)) {
    # no white Queen move at all: the Queen never left its starting square
    white_last_queen_position <- 'd1'
    white_remainder_of_game <- datarow$moves
  } else {
    white_last_queen_position <- str_sub(datarow$moves, white_last_queen_position_index + 1,
                                         white_last_queen_position_index + 2)
    white_remainder_of_game <- str_sub(datarow$moves, white_last_queen_position_index + 3,
                                       nchar(datarow$moves))
  }
  # if the square appears again later, something captured the Queen there
  white_is_captured <- grepl(white_last_queen_position, white_remainder_of_game)
  # black
  black_last_queen_position_index <- stri_locate_last_regex(datarow$moves, "B\\d+\\.Qx?")[2]
  if (is.na(black_last_queen_position_index)) {
    black_last_queen_position <- 'd8'
    black_remainder_of_game <- datarow$moves
  } else {
    black_last_queen_position <- str_sub(datarow$moves, black_last_queen_position_index + 1,
                                         black_last_queen_position_index + 2)
    black_remainder_of_game <- str_sub(datarow$moves, black_last_queen_position_index + 3,
                                       nchar(datarow$moves))
  }
  black_is_captured <- grepl(black_last_queen_position, black_remainder_of_game)
  # return a one-row dataframe of results for this game
  ndf <- data.frame(white_queen_last_position = white_last_queen_position,
                    white_queen_captured = white_is_captured,
                    black_queen_last_position = black_last_queen_position,
                    black_queen_captured = black_is_captured)
  return(ndf)
}

df <- adply(df, 1, add_queen)

This function tracks Queen positions for both players via the 3 steps above. The part that tracks Black’s Queen is the same, replacing the W in the regex with B of course and using d8 instead of d1 as the default position if no move by the Black Queen was found. For each row a new dataframe ndf is generated, the results are put into it and it’s returned; adply then binds all of these results onto our df dataframe.
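To see how the extraction works in isolation, here’s a small sketch on a made-up moves string (the game fragment below is invented purely for illustration):

```r
library(stringi)
library(stringr)

# hypothetical fragment where White's last Queen move is W10.Qxe5
moves <- "W9.Qd3 B9.Qe7 W10.Qxe5 B10.Qxe5 W11.Re1"

idx <- stri_locate_last_regex(moves, "W\\d+\\.Qx?")[2]  # end of last "Q" or "Qx" for White
square <- str_sub(moves, idx + 1, idx + 2)              # the two characters after it
rest <- str_sub(moves, idx + 3, nchar(moves))           # everything played afterwards
square              # "e5"
grepl(square, rest) # TRUE - "e5" reappears later, so White's Queen was captured
```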

After the 4 new columns are added to our dataframe, the columns “white_queen_last_position” and “black_queen_last_position” are of type factor – we ought to change them to type character to keep the type consistent with the other columns:

df$white_queen_last_position <- as.character(df$white_queen_last_position)
df$black_queen_last_position <- as.character(df$black_queen_last_position)

Here’s what the four columns generated to track the Queens look like after running the above code (just looking at the first few rows of course):

That looks good, though it’s worth mentioning there’s one small thing I’d overlooked. After running the above code, if we check every entry in the new column “white_queen_last_position” as follows:

table(unlist(df$white_queen_last_position))

There are some entries (albeit just a tiny fraction) that don’t make sense, such as “gf”, “hx”, “ba” etc. If the data were right we’d expect everything to be in the format [a-h][1-8], representing the 64 squares on a chess board. The strange entries occur when a player has TWO Queens – which happens very rarely, when a pawn is promoted to a Queen while the original Queen is still on the board. If such a player then makes a Queen move where either of their Queens could have reached that square, the notation has to specify which Queen moved, eg. “Qaa5” (the Queen on the a file moves to a5) or “Q1a5” (the Queen on row 1 moves to a5). Since the function just takes the 2 characters after the “Q” (or “Qx”), it doesn’t pull the correct square in those cases. I could modify the regex and run the function again, but will instead just replace any entry that doesn’t represent a valid square with “na”, as follows:

df$white_queen_last_position <- gsub("^(?![a-h][1-8]).*$", "na", df$white_queen_last_position, perl = TRUE)
df$black_queen_last_position <- gsub("^(?![a-h][1-8]).*$", "na", df$black_queen_last_position, perl = TRUE)

The regex “^(?![a-h][1-8]).*$” above will match anything that is NOT one of the valid squares on a chess board like a1, b2, c3 etc, and the gsub command replaces any of those with “na”. We’ll know that any entry with “na” was a game where that player had two Queens.
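As a quick sanity check of that negative lookahead, here’s the same substitution applied to a toy vector (the values are invented examples, including some of the malformed entries mentioned above):

```r
positions <- c("d1", "gf", "a8", "hx", "ba", "e4")
# valid squares pass through untouched; anything else becomes "na"
gsub("^(?![a-h][1-8]).*$", "na", positions, perl = TRUE)
# "d1" "na" "a8" "na" "na" "e4"
```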

Now (finally!) some basic stats can be pulled on this new data. First how many times White’s Queen is captured vs Black’s:

table(unlist(df$white_queen_captured))

  FALSE    TRUE
1616462 1907020

And for Black:

table(unlist(df$black_queen_captured))

  FALSE    TRUE
1713955 1809527

It is very strange that White’s Queen gets captured more often than Black’s, especially when White wins more games. Without really digging into the games in a lot more detail it’s hard to say why this should be the case. We can also check the results for different scenarios where Queens are captured or not, eg. the query below returns the percentage of games where White wins after losing their Queen despite never capturing Black’s Queen (note that for a logical column, TRUE is treated as 1 and FALSE as 0 in the query):

sqldf('select count(*) from df where result is "1-0" and white_queen_captured is 1 and black_queen_captured is 0') /
  sqldf('select count(*) from df where white_queen_captured is 1 and black_queen_captured is 0') * 100

That returns only 14.97%, so as you’d expect White has a far worse winning percentage after losing their Queen without ever capturing Black’s than they normally do. Most of the games White won here would be deliberate Queen sacrifices, or games where Black was inevitably going to lose their own Queen as well and resigned before it happened.
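The same percentage can also be computed without sqldf using plain logical indexing – a sketch, assuming the column names created above:

```r
# games where White's Queen was captured but Black's never was
subset_idx <- df$white_queen_captured & !df$black_queen_captured
# percentage of those games that White still won
mean(df$result[subset_idx] == "1-0") * 100
```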

We can also take a look at the squares White’s Queen is most likely to finish on. In the command below, by the way, the ggsave at the end is saying to generate a graph 20 inches wide – I’ve made the graph super big because there are 65 categories it needs to fit on the x axis (the 64 squares plus the “na” from above), and this spaces them out much more nicely:

ggplot(df, aes(x = white_queen_last_position)) + geom_bar() + ggsave("queen.png", width = 20, height = 10)

(Click the graph to view the full image)

The default position White’s Queen starts on, d1, is not surprisingly the most common square it ends up on, and the spikes for squares b3, c2, e3 and f3 (compared to other squares in those columns) are likely just because these squares are reachable in 1 move (if there are no pawns or pieces in the way) from the Queen’s starting square. There are a few funny things in this graph though, such as that the Queen is far more likely to finish on the corner square a1 than the corner square h1, yet far more likely to finish on h3/h4/h5/h6 than a3/a4/a5/a6. Again, we can’t explain (only guess) the why behind some of the peculiarities in Queen captures and final positions without going a lot deeper into the games.

Number of captures made by each player

This is much easier than tracking the Queens above, and will also let us filter for games where one player finished down a lot of material. We know that a capture will have an “x” in it (eg. the move Qxf3, meaning White captured a piece on f3), and as this is the only context where an x appears in the game moves (there’s no file x and no piece abbreviated x), we can just count the number of x’s that appear. To get the captures for each player separately some regex needs to be used:

df$white_captures <- str_count(df$moves, pattern = "W\\d+\\.\\Dx")
df$black_captures <- str_count(df$moves, pattern = "B\\d+\\.\\Dx")
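On a small invented fragment, the counting looks like this (the pattern matches a move prefix like “W2.” followed by any non-digit and then an “x”):

```r
library(stringr)

moves <- "W1.e4 B1.d5 W2.exd5 B2.Qxd5 W3.Nc3 B3.Qa5 W4.d4 B4.Nf6"
str_count(moves, "W\\d+\\.\\Dx")  # 1 - only W2.exd5 is a capture by White
str_count(moves, "B\\d+\\.\\Dx")  # 1 - only B2.Qxd5 is a capture by Black
```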

Using these two new columns we can now check, for example, the number of games where White captured at least 3 more pieces than Black did:

sqldf('select count(*) from df where (white_captures - black_captures > 2)')

And the results of those:

sqldf('select count(*) from df where (white_captures - black_captures > 2) and result is "1-0"') /
  sqldf('select count(*) from df where (white_captures - black_captures > 2)') * 100

That returns 92.31% – so the vast majority of the time that White captures a lot more pieces than Black, they win. As with the games where a player lost their Queen, never captured their opponent’s, and still won, if we searched for games where a player finished significantly down in material and won, there would be some nice games in there.
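As a starting point for finding those games, one could sort White’s wins by the capture differential – a sketch using base R, assuming the columns created above:

```r
# White wins where White made at least 3 fewer captures than Black
upsets <- df[df$result == "1-0" & (df$black_captures - df$white_captures) >= 3, ]
# the most lopsided material deficits first
upsets <- upsets[order(upsets$black_captures - upsets$white_captures, decreasing = TRUE), ]
head(upsets[, c("white_captures", "black_captures", "result")])
```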

There’s loads more I could look into with this data, and why a lot of the results are the way they are remains a bit of a mystery to me. However, as my first R project this is enough digging into this data set for now. I’m sure there are also loads of places where my code could be improved. Comments/criticism welcome.