Writing AI to play games is a special kind of crack for me. Seeing my AI face off against other people - or better yet, AIs written by other people! It’s so… I don’t know how to describe it. I cannot resist its siren call.

I once lost several months of my life writing python-based AI for an open source game called The Battle for Wesnoth. What started as a fun way to learn python quickly turned into a several month-long obsession of unhealthy proportions. Towards the end of it I had an AI with an ego-inflating 98% win rate versus the default AI across all in-game factions and a selection of the smaller maps.

Then, on the very edge of victory, it all went horribly wrong.

Pride comes before a fall

My downfall begins on a hot summer’s night in Munich. I can’t sleep and I’m thinking about Wesnoth. I long to capture the last 2% and have an AI to share that wins every match, but I’ve exhausted all of the small tweaks and improvements I can think of.

Tonight I decide the time has come for a significant rewrite. It will put the AI on a much more strategically sound basis and practically guarantee total domination! I refil my glass, sit down and get started.

It take several evenings to finish tidying the changes up, but finally the moment for a full run-through comes. I hit the button and send my newly revamped AI into a gauntlet of battles against its soon-to-be-vanquished opponents. But something is wrong. It’s losing a match! Now another! And another! After thirty minutes or so the final figure comes back: 72%. SEVENTY. TWO. PERCENT.

The road from 72% to 98% took me almost a month of occasional evenings the first time around. What do I do now? Revert the changes and try something else? Or keep working on them, possibly throwing my time away for nothing?

Questions whirl around in my sleep-deprived mind, taunting with their unanswerability. Are these changes fundamentally worse than the previous strategy? Or are subtle bugs causing my AI to make stupid mistakes? If I spend two more weeks fine tuning this version, will it ever overtake my previous plateau or am I wasting my time?

In one especially lucid moment I feel a sudden connection to the hopeless fate of the hill-climbing algorithm, never aware whether it is stuck in a local minimum or not, always caught between tenacious hope and the terrible, crushing despair of futility.

Charlie Brown says that the tears of adversity water the soul. I learned that a single number just isn’t enough feedback to meaningfully understand the performance of a complex system. I also developed a deep sympathy for anyone who spends their life A/B testing website changes.

A few weeks later the Wesnoth team discontinue the python AI interface and I am rehabilitated back into normal life, but the experience makes a lasting an impression on me. How can we write better AI? How can we test and understand the behaviour of code that interacts with an overwhelmingly complex, stochastic environment in unpredictable ways?

Back on the wagon

The seasons change and years pass peacefully by. Purely by chance I stumble upon the Wesnoth website again and notice they have quite a decent lua-based AI subsystem in place now.

Unable to resist, I clone and compile the latest version and start learning how to use the new lua interfaces. It’s not long before I have a skeletal AI - in every sense of the word - managing a 30% win rate with lots of low-hanging improvements still unexplored. Yet I am uneasy. I can feel the same problem out there, an undefeated nemesis lying in wait.

I live by the sea now, so one evening I take my notepad and pen down to the beach, write down the problem as clearly as I can and stare helplessly at the shimmering waves, waiting for inspiration.

Despite vicious sea wind’s tenacious attempts to strip the flesh from my bones I manage to sketch out an idea for a workflow that feels like it has potential. As it turns out - to my own surprise and joy - it’s absolutely brilliant. Sometimes the Feynmann algorithm is the only one you need.

Machine learning, or is it machine teaching?

The problem I wrote down is that I’m drowning in an ocean of raw data without an easy way to understand the stories within it. Each automated match consists of hundreds of moves from a wide range of different units and to get any meaningful comparison I really need to run dozens or even hundreds of matches per change. Abstracting all of that as a single percentage throws too much information away but watching fifty replays each time I tweak a number is never going to happen.

At the beach I realized I need a machine to watch them for me.

This is how I built one.

Step one: visualize a single battle

Realizing I needed something between clicking through an entire replay and a single win/loss figure I started looking through battle logs picking out other statistics that seemed strategically relevant.

In Wesnoth there are 5 key metrics that together give you a pretty good feeling for which way the battle is going. I logged just these over time for each battle and added some unicode sparklines for readability:

yt_simple vs ai_default_rca | winner: 2 units: ▄▄▄▄▄▄▄▄▃▃▃▂▂▂ 49 53 49 45 49 49 49 46 42 38 30 21 19 18 cost : ▃▄▄▄▄▄▄▃▃▃▂▂▂▂ 40 48 45 43 44 44 44 41 38 31 27 21 22 18 gold : ▄▁▄▄▄▄▄▅▅▃▃▄▃▃ 49 0 57 54 52 46 54 60 69 40 38 53 41 40 vills: ▁▁▄▄▄▅▅▅▅▄▄▃▃▃ 0 0 55 52 52 63 61 59 61 55 49 30 33 33 inc : ▃▂▄▄▄▅▅▅▅▄▄▃▃▃ 33 24 54 52 52 61 60 58 60 55 49 33 34 34

Even so, reading through fifty of these reports at a time isn’t very illuminating, even when they are sorted. Sorting is my favourite cheap visualization trick but here I needed something more.

Step two: the unreasonable effectiveness of normalized compression distances

Over the past year I’ve been fascinated by the usefulness of the normalized compression distance (NCD) as a similarity measure. Basically it states that the degree of similarity between two objects can be approximated by the degree to which you can better compress them by concatenating them into one object rather than compressing them individually.

You can use this to detect the similarity between music, images, DNA and all sorts of arbitrary things wth surprisingly good performance. It’s one of the coolest and most implausibly effective hacks I’ve come across and I thouroughly recommend reading Vitanyi, Cilibrasi and Cohen’s papers on the subject.

NCD was perfect for this problem. I built up a similarity matrix by compressing my visual output summaries then used that to cluster them into an unrooted binary tree. A small python script injects some extra details to dot’s SVG output and suddenly I am looking at this:

The nice thing about having these in SVG files is they are mildly interactive - moving the mouse over any particular node shows the run summary in the bottom-left, as you can see for run 11 in the image above. The NCD clustering does a really great job here – the clusters really do seem intuitively similar and semantically meaningful even after deeper investigation.

Of particular interest are the battles that were lost but are clustered close to several wins. These scream “everything was fine until something unexpected happened” - a bug or unforseen situation the AI has blindly wandered into again. Watching the replay for one of these battles almost always shows a (sometimes rarely-occurring) bug at work. Just fixing these improved my AI massively - machine assisted debugging at its finest!

Step three: profit! My new machine-learning boosted workflow

When I sit down to improve my AI it now looks like this:

Code up whichever improvement or brilliant/stupid idea I’ve been dying to try out Hit a key to run 50 battles in parallel using 8 cores, cluster the results and produce a prettified SVG file Get a feel for the overall shape of the results. Are there lots of quick wins, or are matches more often drawn out? Are there interesting clsuters of wins and losses? Tap on a few nodes to understand what is represented then pick one to look at in more detail, e.g. a loss surrounded by wins, or an example from the center of an interesting cluster of losses. Type in the number of that node into my terminal to launch Wesnoth and view the replay of the run from that node. I can click forwards and backwards through the match, watching for mistakes both obvious and subtle in the behaviour of my AI. Sometimes I look at a couple of runs for comparison, such as a similar win/loss or two similar lost runs. Sometimes I can see what the AI did wrong but don’t understand why. In this case I can pull up the AI debug log for that numbered run, jump to the turn I’m looking at in the replay viewer and read in depressing detail the reason my thief decided to go toe-to-toe with a mounted Lancer in waist-deep water instead of e.g. nipping onto that nice safe village and stabbing him in the back for 2x damage. Come up with a new idea to improve the existing code and repeat from step 1.

Working like this is so beautiful because there’s a smooth multi-level mapping from the complex emergent behaviour in the game all the way back to the actual lines of code that evoked it and it’s so frictionless to move between levels of detail that it’s joyful in and of itself! It’s also extremely effective.

Some Promising Results

Within a week my lua AI has reached a 98% win rate and I haven’t even brought out my best tricks yet.

Every time a change decreases the win rate I’m able to quickly find the group of replays demonstrating the negative impact and follow through from there to the debug logs to the code, where I can fix it. It’s a whole new world compared to the way I worked last time, with just a single figure telling me “better” or “worse”.

I would be surprised if writing AI is the only application for a workflow like this.

A/B testing might benefit from a similar setup – take the logged or recorded traces of each visitor’s interactions with the site and cluster them so that when a test decreases signups you can look at the clustering and see which behavioural subgroup is failing to sign up, then go and watch a few sample recordings to understand why they might have changed their behaviour.

I don’t write a website analytics package any more, but if you do and you’d like to experiment with this I’d be interested in working with you. This stuff is a lot of fun!

There must be a lot of other situations that would benefit, too. Do you know of any others? Let me know, I love learning new things!

Update: I’ve added a few more links and technical details to the Hacker News discussion.