Last updated on August 5, 2020

For those of you who don’t know me, my name is Ryan Saxe and I’ve been writing Limited content for StarCityGames.com for the last three years. This article is a result of my decision to pursue the intersection of my career (Machine Learning Scientist / Data Scientist) and my passion (Magic: the Gathering Draft).

For the last few months, I have been working with Draftsim to create a Magic: the Gathering draft agent. The purpose of this article is to explore the data provided by Draftsim, the architecture of my algorithm, the results for Throne of Eldraine, and how this model can be used to evaluate and understand future draft formats.

Think of this article as a teaser for what’s to come: in the near future there will be a similar article on what my bot learns from Theros Beyond Death! Additionally, if you’re a mathematician, software developer, data scientist, or just curious, you can view the code for the agent on GitHub.



I am proud to say that my bot drafted a deck that I took to an undefeated record in a Magic Online event. In fact, I believe this is the first case of an MTG AI demonstrating success in a competitive setting.



There are two main reasons I set out to design this draft bot. First and foremost, I want to demonstrate that this is an approachable task with fruitful and deep areas to explore. While I am unsure how the bots on MTG Arena work, I believe there is a lot of room for improvement, and I wanted to explore what a more dynamic bot would look like.

Secondly, I attempted this task as a motivator to think about the fundamental mathematics that govern decision making in draft, which would be the foundation of my AI. This design is what differentiates my bot from the way others have approached this same problem. Furthermore, this approach creates a much more interpretable algorithm, and allows me to write an article like this one that truly delves into the inner machinations of how this bot makes decisions.

Let’s start with an overview of the data, then take a look at the three priors that I incorporated into my model. Finally, you’ll get to see just how well it performed when I put it all together.



What Does the Data Look Like?

To start this project, Draftsim gave me a file with over 80,000 Throne of Eldraine drafts completed on their website. That’s well over 3 million data points (picks), which is plenty for a machine learning algorithm. Unfortunately, after inspecting the drafts, about 35,000 turned out to be unusable (incomplete, duplicate, etc.), leaving roughly 45,000 drafts and approximately 2 million data points. Luckily, that is still a large enough dataset for a machine learning algorithm.

I then randomly split the data, keeping each draft’s picks together to avoid leakage, so that 80% of the data trains the model and 20% tests it. I excluded a validation set because I wanted the test set to be sufficiently large and I didn’t do any hyperparameter tuning; this will likely change in the near future. Each data point consists of the pool of cards the player has drafted so far, the pack they selected from, and the card they selected from that pack.
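A minimal sketch of a leakage-free split, assuming each pick record carries a draft identifier. The field names (`draft_id`, `pool`, `pack`, `pick`) and the helper `split_by_draft` are illustrative assumptions rather than the actual Draftsim schema:

```python
import random

def split_by_draft(picks, test_fraction=0.2, seed=0):
    """Split pick records so that every draft lands entirely in train or in test."""
    draft_ids = sorted({p["draft_id"] for p in picks})
    rng = random.Random(seed)
    rng.shuffle(draft_ids)
    n_test = int(round(len(draft_ids) * test_fraction))
    test_ids = set(draft_ids[:n_test])
    train = [p for p in picks if p["draft_id"] not in test_ids]
    test = [p for p in picks if p["draft_id"] in test_ids]
    return train, test

# Toy data: 10 drafts with 3 picks each (a real draft has 45 picks).
picks = [
    {"draft_id": d, "pool": [], "pack": ["A", "B"], "pick": "A"}
    for d in range(10)
    for _ in range(3)
]
train, test = split_by_draft(picks)
print(len(train), len(test))
```

Splitting on draft identifiers rather than on individual picks matters because consecutive picks from one draft share most of their pool; if they were scattered across both sets, the test score would be inflated.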



80,000 drafts just like yours on Draftsim helped make this analysis possible

However, this dataset has a significant problem: the environment doesn’t represent the real world. First, on Draftsim you draft against bots that don’t navigate drafts the way people do. Second, there’s neither risk (entry fee) nor incentive (prize pool). And finally, while the average skill level of Draftsim users is unknown, the dataset is so large that expert-level drafts are unlikely to make up a large share of it.

Hence, treating this dataset like a beautiful oracle and building a bot that exactly mimics the users would not necessarily produce a capable draft agent. To design an algorithm that would not succumb to the pitfalls of the dataset, I constructed it with reasonable priors based on my expert knowledge of Magic: the Gathering draft. Throughout this article, I believe I demonstrate that this worked by juxtaposing my bot’s tendencies with the human behavior in the dataset.



First Prior: Set of Archetypes

Every single Limited format has a set of archetypes predefined by Wizards of the Coast, often correlated with color combinations. Sometimes there’s a novel addition like Clear the Mind in Ravnica Allegiance, but this nuance is less necessary to capture, and creating the architecture to enable an AI to discover archetypal subtleties is out of the scope of this current project. However, I do intend to explore this in the future.

Instead, I started with an initial exploration of global archetypes:



These archetypal labels are computed via K-means clustering with fifteen clusters, because Throne of Eldraine has fifteen unique archetypes: the ten two-color pairs plus the five mono-colored decks. I took all 45,000 drafts, grabbed their final pools, computed the total number of cards per color, and found fifteen clusters that separate perfectly into all the color combinations and mono-colored archetypes.
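To make the clustering step concrete, here is a minimal sketch in plain Python. It assumes each card in a pool is represented only by its color string (so a gold card counts once for each of its colors), it uses three clusters on six made-up pools instead of fifteen clusters on 45,000 drafts, and it seeds the centers with a deterministic farthest-point rule. All of these are illustrative choices, not a description of the exact pipeline:

```python
COLORS = "WUBRG"

def color_counts(pool):
    """Feature vector for a pool: number of cards containing each color."""
    return [sum(color in card for card in pool) for color in COLORS]

def dist2(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            for p in points]

def kmeans(points, k, iters=20):
    # Farthest-point initialization: start at the first point, then repeatedly
    # add the point that is farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centers), centers

# Six toy final pools: two mostly red, two blue-black, two green-white.
pools = [
    ["R"] * 20 + ["U"] * 3,
    ["R"] * 18 + ["G"] * 5,
    ["U"] * 11 + ["B"] * 10 + ["R"] * 2,
    ["U"] * 9 + ["B"] * 12 + ["UB"] * 2,
    ["W"] * 10 + ["G"] * 12,
    ["W"] * 11 + ["G"] * 10 + ["B"] * 2,
]
points = [color_counts(pool) for pool in pools]
labels, centers = kmeans(points, k=3)
print(labels)
```

On real data, a library implementation with several random restarts would be the more usual choice; the hand-rolled loop here only keeps the example dependency-free.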

One thing I found surprising is that the mono-colored archetypes have a smaller slice of the pie than I expected, as I believe Throne of Eldraine rewards those archetypes significantly. I’m proud to demonstrate that this isn’t the case with my bot! I generated 5,000 drafts with eight copies of my bot, and computed the average number of drafters per table for each archetype. The difference between my bot’s distribution of archetypes and the data’s distribution of archetypes is staggering.



So, how did I get here? How did the bot deviate so far, and in what I believe is a good and functional manner, from the dataset? And how did clustering play an important role?



When I sit down to draft, I’m not taking cards randomly. Each card I take has some distribution explaining how good it is in each archetype. Merfolk Secretkeeper pulls me much harder to Dimir than it does to Simic. This is why a global pick order falls apart so quickly in draft; the context of colors and archetypal biases in a draft pool has a drastic impact on the prioritization of cards in the following packs.

So, in order to give my model a representation of archetypes to embed this nuance, I chunked the training dataset by the K-means clusters and trained a linear model for each archetype by feeding it only data from Pack 2 Pick 3 onwards for that specific archetype.



Second Prior: Bias to an Archetype Given a Draft Pool

What’s the math that governs how a person navigates a Draft? The value of cards in your current draft pool has a large impact on what the correct pick is in future packs. For example, even though Scorching Dragonfire is a better card than Outmuscle, if there are five green cards in a draft pool and zero red cards, Outmuscle is the correct pick.

Because of this, I believe that the bias towards an archetype is a function of archetypal card quality related to the draft pool. This means that draft decisions are not made by the general card quality in the pool by color, but rather by considering the total value a draft pool would provide to any specific archetype.




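Let $D$ be a draft pool, and let $C_a$ be the value of card $C$ in archetype $a$. Then the corresponding bias $\beta_a$ to archetype $a$ for draft pool $D$ is the total value the pool provides to that archetype:

$$\beta_a = \sum_{C \in D} C_a$$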

This bias is used to update the value of the cards in the pack that the bot is going to draft from according to the current draft pool. The proper pick out of any pack is the card with the highest value given the draft pool. Hence, in a format with $n$ archetypes, the correct card $C^{*}$ to pick out of pack $P$ is:

$$C^{*} = \arg\max_{C \in P} \sum_{a=1}^{n} \beta_a \, C_a$$
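Here is a minimal sketch of that pick rule, assuming the bias to an archetype is the sum of the pool’s card values in that archetype and that a card’s value in a pack is its archetype values weighted by those biases. The two-archetype weight table and the "Generic" filler cards are invented for illustration; the real values are learned:

```python
# Hypothetical archetype weights, ordered as [red archetype, green archetype].
WEIGHTS = {
    "Scorching Dragonfire": [0.9, 0.1],
    "Outmuscle": [0.1, 0.7],
    "Generic Red Card": [0.5, 0.0],
    "Generic Green Card": [0.0, 0.5],
}

def archetype_bias(pool, weights):
    """beta_a: total value the pool provides to each archetype a."""
    n = len(next(iter(weights.values())))
    return [sum(weights[card][a] for card in pool) for a in range(n)]

def pick(pack, pool, weights):
    """Return the card in the pack with the highest bias-weighted value."""
    beta = archetype_bias(pool, weights)

    def value(card):
        return sum(b * c_a for b, c_a in zip(beta, weights[card]))

    return max(pack, key=value)

pack = ["Scorching Dragonfire", "Outmuscle"]
green_pick = pick(pack, ["Generic Green Card"] * 5, WEIGHTS)
red_pick = pick(pack, ["Generic Red Card"] * 5, WEIGHTS)
print(green_pick, "|", red_pick)
```

With five green cards in the pool, the green bias dominates and Outmuscle comes out ahead even though Scorching Dragonfire has the higher raw weight; with five red cards the same pack yields Scorching Dragonfire.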



Overcoming Bad Human Behavior

The introduction of this bias tells the bot to stick to its archetype in a formal and structured way. At Pack 3 Pick 1, the bias towards the proper archetype will be large enough that it should fight improper drafting where a player takes an enticing card, often a rare, that they can’t really cast.

Notice that this behavior isn’t hard-coded into the bot; it is an emergent phenomenon that the bot learns as a consequence of the math. Furthermore, it appears that the bot behaves better than humans in this case! In the test set, only about 80% of human picks at Pack 3 Pick 1 are in the proper archetype. The bot, however, takes a card in the proper archetype 91% of the time at Pack 3 Pick 1 on the same test set, which is a substantial improvement.

Believe it or not, people actually raredraft on draft simulators!



Deriving the Defining Cards for Each Archetype

Furthermore, this math incentivizes the model to learn deviations in the value of cards that correspond to their archetypes. This means that the model can learn inflated values for gold cards within their archetypes, but assign them generic values low enough such that they aren’t picked until the bias to that color combination is high enough.

For example, Wandermare might be great and have a high power level in a WG adventure deck, but just like a human drafter would, the bot hedges and doesn’t speculate on the Wandermare until the bias is sufficiently high.

This is a type of nuance that a generic pick order can never represent, but it can easily be aggregated into a pick order for early picks! Below is the generic, unbiased value of each card in the set, also known as my bot’s pick order at Pack 1 Pick 1, which achieved 70% accuracy (matching human picks) on the test set.



This pick order is actually different from the one that would correspond to the sum of the archetypal pick orders described in the previous section. This is because I used the earlier matrix only to initialize my bot’s “understanding” of archetypes, and then allowed the model to update those values in order to satisfy the math I just described.

Additionally, given that the values I used to initialize the model have an embedded representation of archetype and color, the gradients that correspond to these weights — the math that governs how to update/learn the value of cards in this system — are unlikely to maneuver these values in a manner that fundamentally shifts the archetypal vector (e.g. the Izzet vector will not represent aspects of Golgari, however it is possible for smaller deviations where the Mono-Red archetype represents aspects of Izzet). So, according to this methodology, what does the model learn?



Merfolk Secretkeeper is in the top five cards in the set for Dimir and Mono-Blue according to this archetypal representation. But it’s not actually a top five card in the set for those archetypes, because of rares and efficient uncommons. What the lists above effectively represent are the cards with the largest delta between that archetype and the other archetypes, because this feature is used to compute the pull towards an archetype.

Merfolk Secretkeeper doesn’t create much of a bias towards Simic, but it certainly gets me thinking about Mono-Blue and Dimir.



Intuitively, the model learned what I call archetypal inflation. Look at the Selesnya section of the above table. Selesnya has the best adventure rare first, followed by the three uncommon adventure payoffs, and lastly the worst of the hybrid uncommons.

First off, why are any of these represented more than Lucky Clover (which is a couple more slots down, but not far behind)? It’s because Lucky Clover is solid in enough other places that it doesn’t need as incredible of an inflation factor for Selesnya. Wandermare, Mysterious Pathlighter, and Oakhame Ranger all are quite good in Selesnya and mediocre to unplayable elsewhere.

By definition, this means they should create a specific pull towards Selesnya. And Edgewall Innkeeper is incredibly high here because, while it does pull towards green in general — as can be seen by being in the top five for mono-green as well — it creates the largest pull to white as a secondary color thanks to the density of adventure cards.



But these results don’t just represent the obvious synergies like adventure for Selesnya or draw-two for Izzet. The best rares will jump to the top. For example, Garruk, Cursed Huntsman; Embercleave; and Oko, Thief of Crowns all surface because their value has to be so high.



And, as I expected, it learned to largely inflate the value of gold cards in their respective archetypes like Shinechaser and Steelclaw Lance. It’s also why hybrid cards like Arcanist’s Owl, Fireborn Knight, and Elite Headhunter are close to the top in all of their corresponding latent vectors.

And it’s also why Henge Walker is all the way at the top for mono-white. Mono-white has artifact synergies with Arcanist’s Owl and Flutterfox, but it also has a huge gap in the three-drop slot. The only good common three-drop for white is Ardenvale Tactician, and that card is taken so early that it’s impossible to guarantee even one copy. And neither Knight of the Keep nor Lonesome Unicorn is a particularly good card. Hence, Henge Walker is likely heavily inflated in mono-white because it plays an important role there while also being a colorless card that needs reasonable representation.



Inspecting this aspect of my bot is what I’m most excited for in future formats. Henge Walker at the top of mono-white raises an eyebrow, but actually makes a lot of sense upon inspection. Furthermore, note that mono-white is the only mono-colored archetype to contain multiple hybrid cards. This isn’t random. The specific incentive for mono-white is a density of uncommon hybrid cards, where other mono-colored archetypes have other incentives. Delving into the latent archetypal representations of Theros Beyond Death may reveal similar subtleties.



However, not all of these representations are as good. Cauldron Familiar at the top of mono-black is concerning. I believe this is because the math I implemented says nothing explicit about synergy. The algorithm can learn an archetypal representation of synergy, but it can’t easily represent smaller synergy pockets like Cauldron Familiar and Witch’s Oven.

Given that the Oven is not too far behind in this vector, I imagine that the algorithm learned to represent them both very highly in mono-black, and let black bias increase their value. Unfortunately, this yields some odd cases of overfitting, where the algorithm learns to exploit quirks of the dataset it’s trained on rather than the fundamentals of the game. I use techniques like regularization to try to fight overfitting, but I also intend to add synergy logic in the future to handle this case.



Just for completeness, this is the last part of the top five learned archetypal weights. There is a clear problem with mono-red: it’s a lot more representative of Izzet than it likely should be, and I’m not entirely sure why.

My best guess is that in order to reprioritize the proper red card draw synergies (e.g. with multiple copies of Mad Ratter, Merchant of the Vale becomes an extremely high pick even if the deck is not Izzet), they all got additional representation in mono-red to enable that flexibility when navigating a draft. Importantly, this doesn’t mean that mono red is actually drafted like Izzet. Here is a draft log from the bot, where it drafted a pretty phenomenal mono-red deck that clearly demonstrates the ability to differentiate mono-red from base-red Izzet.



Below is the full learned matrix if you want to drill further down than just the top five picks per archetype. Notice that it isn’t perfect, and there are clearly erroneous cases sprinkled throughout, such as Seven Dwarves being represented highly in Dimir. These cases happen due to the complexity of the problem, and I intend to explore how to prevent them. However, the matrix still demonstrates a reasonable abstract representation of archetypes, and I don’t believe these cases undermine the interpretability of the algorithm.

Third Prior: Staying Open Decays over Time

Yea, obviously in my head it’s estimations not calculations, but that is pretty much how I see it. Lots of value early-mid pack 1, then a huge drop off after the early picks in pack 2 bc being open through then at least means you can get broken cards in any color in early pack 2. — Ben Stark (@BenS_MTG) January 9, 2020

Ben Stark’s article, Drafting the Hard Way, is considered to be one of the most influential and fundamental pieces of Limited theory. It can’t be summarized by an easily actionable heuristic like “take removal highly”. It’s a philosophy that describes draft navigation agnostic of format. It’s a card evaluation framework that embeds format context alongside staying open to greatly increase the probability that the drafter ends up in the proper archetype for their seat.

The math described in the previous section is good for teaching the bot how to commit to an archetype, but in consequence that leads to marrying early picks a high percentage of the time. Because of this, I updated my math with a learned function to simulate the concept of “Drafting the Hard Way.”



As always, discerning how to introduce math corresponding to an abstract concept is no easy feat. I knew that a function representing how to stay open would decay, as eventually a drafter is required to solidify into an archetype. But where would this decaying factor (let’s call it $\lambda_t$, where $t$ is the current pick of the draft) fit into the math?



After spending some time thinking, I decided to model “Drafting the Hard Way” as a smoothing function over archetypal bias that decays over time. What this means is that, at Pack 1 Pick 2, there is still a strong bias towards every archetype even if the first pick was a fantastic red card. There is still, by definition, a larger bias towards red, but this math minimizes the delta between the value of red cards and other cards at Pack 1 Pick 2. The smoothing term $\lambda_t$ is added to the bias $\beta_a$ of every archetype $a$, so the card the bot picks out of pack $P$ becomes:

$$C^{*} = \arg\max_{C \in P} \sum_{a=1}^{n} (\beta_a + \lambda_t) \, C_a$$







Effectively this is like simulating a draft pool with good cards in every archetype. The best card in any archetype is represented with a matrix weight between 1 and 1.2, with planeswalkers as exceptions (The Royal Scions, Garruk, and Oko are between 1.6 and 1.8). It’s important to note that this is learned without a representation of card type. The planeswalkers jumped to the top because people take them so highly. If The Scarab God — one of the best Draft cards of all time — was in the format, it would also have a comparably high weight.

Hence, when $\lambda_t > 1$, the bot behaves as if it had at least one of the best cards in every archetype. This means that, for the first couple of picks, the bot takes the best card out of the pack. It’s not until the beginning of Pack 2 that this smoothing function decays below 1, which means the bot should be able to switch colors if necessary up until that point.
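The effect of the smoothing term can be sketched in a few lines, assuming it is simply added to every archetype’s bias before cards are scored. The exponential shape of `stay_open`, its parameters, and the three-card weight table are all invented for illustration; in the bot, the decay is learned from data:

```python
import math

# Hypothetical archetype weights, ordered as [red archetype, green archetype].
WEIGHTS = {
    "Red Bomb": [1.2, 0.0],
    "Red Filler": [0.5, 0.0],
    "Green Bomb": [0.0, 1.1],
}

def stay_open(t, start=1.5, rate=0.03):
    """Illustrative decaying lambda_t, where t counts picks made so far."""
    return start * math.exp(-rate * t)

def pick(pack, pool, weights, lam=0.0):
    """Score each card by its archetype values weighted by (beta_a + lam)."""
    n = len(next(iter(weights.values())))
    beta = [sum(weights[card][a] for card in pool) + lam for a in range(n)]
    return max(pack, key=lambda card: sum(b * w for b, w in zip(beta, weights[card])))

pack = ["Red Filler", "Green Bomb"]
early_pool = ["Red Bomb"]
late_pool = ["Red Bomb"] + ["Red Filler"] * 5

early_open = pick(pack, early_pool, WEIGHTS, lam=stay_open(1))
early_rigid = pick(pack, early_pool, WEIGHTS, lam=0.0)
late_open = pick(pack, late_pool, WEIGHTS, lam=stay_open(20))
print(early_open, "|", early_rigid, "|", late_open)
```

In this toy setup, the smoothed bot takes the green bomb second pick despite its red first pick, the unsmoothed bot takes the red filler, and by the time the pool is deep in red and the smoothing term has decayed, the smoothed bot stays red as well.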

To demonstrate this, here are two draft logs where the bot pivots into the open archetype even though it started with extremely strong cards outside of it:



This raises the question: “How do I know that the bot wouldn’t be able to pivot without this logic?” To test this, I trained two versions of the bot, one with $\lambda_t$ and one without. The difference is so much larger than I expected!

In both drafts above, the bot without the structure for drafting the hard way can’t deviate when necessary. It picks Tuinvale Treefolk over Ardenvale Tactician Pack 1 Pick 3 in the draft starting with Oko, Thief of Crowns. I could see taking So Tiny, but this just goes to show the issue without the staying open logic as Garenbrig Paladin must have added too much to Oko’s green bias without the smoothing function.

In the other draft, the simpler bot takes Locthwain Paladin over Inspiring Veteran Pack 1 Pick 5, which I believe is an egregious pick, but I guess makes sense if there’s no way to fight the bias from the other cards in the pool.



If these were the only examples I gave, it would look like I cherry-picked them to illustrate my point. So, I created a table with 8 copies of the staying open bot, and a separate table with 8 copies of the simpler bot. I then simulated 5,000 full-table drafts, and gave the different tables of bots the identical simulated drafts to see what the different bots would do. It’s not a perfect exploration of the delta between these bots, but given how stark the difference is, I believe it is sufficient.



The bot with this learned decaying smoothing function could cast the card it first picked 64% of the time. The bot without it could cast the card it first picked 77% of the time. And, in the data they were trained on, the humans could cast their first pick 81% of the time.
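One way to compute a statistic like this is sketched below. It assumes that “could cast” means every color of the first pick is among the two most common colors in the final pool, and that each drafted card is represented only by its color string; both are simplifying assumptions made for this example, and the function names are illustrative:

```python
from collections import Counter

def main_colors(draft, n=2):
    """The n most common colors across all drafted cards."""
    counts = Counter(color for card in draft for color in card)
    return {color for color, _ in counts.most_common(n)}

def first_pick_castable(draft):
    """True if every color of the first pick is one of the pool's main colors."""
    return set(draft[0]) <= main_colors(draft)

# Each draft lists the colors of its picks in order; the first entry is Pick 1.
drafts = [
    ["G"] + ["U"] * 12 + ["B"] * 10,  # green first pick, ends up blue-black
    ["R"] + ["R"] * 9 + ["W"] * 11,   # red first pick, ends up red-white
    ["UB"] + ["U"] * 10 + ["R"] * 9,  # gold first pick, ends up blue-red
]
results = [first_pick_castable(d) for d in drafts]
rate = sum(results) / len(results)
print(results, rate)
```

A stricter definition would look at the deck actually built from the pool, including splashes, rather than at raw color counts.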

I believe that an overwhelming majority of expert drafters would agree that both 81%, and even 77%, are much higher than optimal, and that my bot represents closer to the ideal percentage. However, it’s a very common mistake in less experienced players, and so it’s not surprising that the training dataset has this quality.

The model learned a decaying structure that both Ben Stark and I believe properly represents this fundamental draft philosophy, and this version of the model can largely deviate from the mistakes in the dataset because of it. I find that truly amazing. Furthermore, this demonstrates that introducing priors based on the fundamentals of Limited theory into the math that governs draft AI is a fruitful pursuit, and I will continue to explore these types of structures and complexities!



How Well Did it Do?

Bot pick accuracy % by pick number

The above plot demonstrates my algorithm’s ability to successfully predict what a human would pick given a draft pool and a pack of options. The blue line is the accuracy of the bot defined as the percentage of picks in the dataset at a given time where the bot would have taken the same card as the human (e.g. Pack 1 Pick 1 = 70%). The orange line represents that accuracy inflated by counting the bot’s second choice as well.

This yields an average accuracy of 63%, which increases to 83% when the bot’s second choice is also counted. I believe this performance is sufficiently good because it’s unlikely the entire population of people in the dataset abides by the same rules. Many will disagree on picks, and hence it is likely impossible for any agent to achieve accuracy above some threshold.
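For reference, the two accuracy numbers correspond to top-1 and top-2 accuracy. A minimal sketch of the computation follows, assuming each test example stores the pool, the pack, and the human pick, and that the bot exposes a function returning the pack sorted from most to least preferred. The fixed `PICK_ORDER` ranker below is only a stand-in for the bot:

```python
def top_k_accuracy(examples, rank_pack, k=1):
    """Fraction of examples where the human pick is among the top k ranked cards."""
    hits = sum(ex["pick"] in rank_pack(ex["pool"], ex["pack"])[:k] for ex in examples)
    return hits / len(examples)

# Stand-in ranker: a fixed global pick order that ignores the pool.
PICK_ORDER = ["A", "B", "C", "D"]

def rank_pack(pool, pack):
    return sorted(pack, key=PICK_ORDER.index)

examples = [
    {"pool": [], "pack": ["A", "B", "C"], "pick": "A"},  # top choice matches
    {"pool": [], "pack": ["B", "C", "D"], "pick": "C"},  # second choice matches
    {"pool": [], "pack": ["A", "D"], "pick": "D"},       # second choice matches
    {"pool": [], "pack": ["B", "C", "D"], "pick": "D"},  # neither matches
]
top1 = top_k_accuracy(examples, rank_pack, k=1)
top2 = top_k_accuracy(examples, rank_pack, k=2)
print(top1, top2)
```

Grouping the examples by pick number before calling `top_k_accuracy` would give the per-pick curves shown in the plot.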

While I do believe that there’s lots of room for improvement, I’m not certain how far away the upper bound on performance is. Overall, I believe this article demonstrates that the algorithm I created is not only capable at draft navigation, but also structured in a way that lets us analyze and understand how it works. I have achieved the goals I initially set out with, and I am excited to take this to the next level in the near future as I improve it over the course of Theros Beyond Death!

Keep an eye on Draftsim for my next article, when I’ll have some very early Theros Beyond Death data to use in my model. And if this article sparked your technical interest, don’t forget to check out the code on GitHub!

As a final bonus for you, I was asked if I could show what an entire draft table would look like using my bot. Check it out below: