Future Future League: Failures and Fixes

by SaffronOlive // Feb 13, 2017

The last month has not been encouraging with regard to the state of Standard, though not so much because of the gameplay itself. No matter what a Standard format looks and plays like, people are going to complain. While we can argue about how good or bad our current Standard is (or the recently retired Kaladesh Standard was), when it comes right down to it, I'm not convinced either is much worse than Bant Company Standard, Mono Black Devotion Standard, or 34 Rhino Standard. As such, I don't think the actual Standard format is the problem. Could Aether Revolt or Kaladesh Standard be better? Of course, but so could every other Standard format in history, and we often take a short-sighted view of the game and forget some of the truly horrible Standard formats from the past, involving things like Affinity, Necropotence, and Tolarian Academy.

What is worrisome is that the past couple of months haven't been particularly kind to the actual design and development of Standard. First, we had the historic banning of three cards in Smuggler's Copter, Emrakul, the Promised End, and Reflector Mage, and banning cards in Standard has traditionally been Wizards saying, "Sorry everyone, we messed up." This alone could be written off as a one-time error in judgment rather than a systemic failing of design and development, especially since Emrakul, the Promised End was pushed because it's a "story card," and in this era of design, it seems that the worst sin Wizards can commit is having a story card that doesn't see play. Also, Smuggler's Copter is a new card type (and over the history of the game, just about every new card type has led to bannings, and this is doubly true of colorless cards). However, when we combine the Standard bannings with Wizards' recent admission that they simply didn't realize Felidar Guardian formed an infinite combo with Saheeli Rai (a combo that they would not have allowed in the format had they known about it), it seems fair to ask ourselves what is happening in design and development.

It wasn't that long ago that whenever someone asked me about Wizards' design and development, it was the very last thing I'd worry about. Wizards may struggle (although it is improving and has improved) with communication (see: Sam Stoddard's infamous Twitter poll regarding the state of Standard), stumble around with reprints and bounce back and forth too quickly on things like rotation changes, and even slip up on making a strong digital platform, but if there is one thing Wizards was very, very good at, it was making Magic cards. In fact, I'd argue that Wizards is quite literally the best in the world.

This means two things. First, since it is mostly the same people working in research and development, we know the potential is there. I highly doubt that people like Aaron Forsythe and Mark Rosewater suddenly became worse at making Magic cards. Second, if we assume that R&D didn't suddenly lose their abilities like Charles Barkley and Shawn Bradley in Space Jam, it's very fair to ask: what's the problem?

The Future Future League

Let's talk a little bit about how Wizards tests Magic cards from new sets. It's basically a group of Wizards people who get together a couple of times a week to play some games with yet-to-be-released cards; then, there is also a smaller team that dedicates more time to this testing process. Right off the bat, the first thing that comes to mind is the Commander playgroup headed by Commander in Chief Sheldon Menery. It's basically a small group of friends that play together on a regular basis but have extremely outsized powers—they control the banned list that impacts all Commander players. The Future Future League (FFL) sounds similar, except instead of controlling the Commander banned list, they control the Standard format.

The group works approximately a year ahead, so when the FFL was testing Aether Revolt (and remember, they have all the unreleased sets in between), the data they were looking at were from last year's winter set, Oath of the Gatewatch. Each set tends to get about six months in the FFL, and while the FFL is pretty good "at figuring out week-one and week-two metagames," it's not good at figuring out "how the format will evolve past the Pro Tour" (which, remember, comes directly after week two). And, the league doesn't really test for non-Standard formats (although Eldrazi winter has caused them to think more about how mechanics and cards impact Modern, but they still don't playtest the format).

Maybe most frustrating, none of this actually matters in the end because the cards are changed after the FFL ends! One recent example of this (which is currently causing problems in real Standard) is Vehicles. Apparently, they got a major upgrade after FFL testing ended, so people played one version of them in the FFL, but then another version saw print. (Or, actually, people didn't play them, because they weren't good enough, although another article about the development and design of Kaladesh calls this into question: at various points, the FFL made Smuggler's Copter a 3/4, had it literally draw a card instead of looting, or added vigilance, which is confusing, since all of these versions of Smuggler's Copter sound much more powerful.) Another example of this is Aetherworks Marvel, which was a powered-down rare in the FFL and then suddenly became a "let's get Emrakul, the Promised End banned" mythic (trading places with a toned-down Multiform Wonder) after the league ended. So, not only is the process flawed to begin with by its very nature as a small group of friends / coworkers, but also the testers go into the process knowing that their work doesn't really matter (at least as much as it should) because things will change after it's done anyway!

This doesn't even mention the fact that this group of players missed Splinter Twin in Standard, while a Hall of Fame-level pro (Willy Edel) recognized this combo as soon as the card was spoiled. So, what the FFL didn't find in six months of testing took a pro player all of six seconds to identify? At the risk of coming across as harsh, this makes me question just how good the players of the FFL actually are at Magic. While I'm sure they are reasonable Magic players and great people, it seems likely that their actual skill in deck building and recognizing synergies is significantly behind that of a typical pro-level player.

So, let's refresh: the way Wizards tests Standard as a format is by having a bunch of good-but-not-great Magic-playing friends play games a couple of times a week with the knowledge that their "testing" doesn't really matter because the cards will be changed after they are done playing with them anyway? What could possibly go wrong? For a "real world" example of FFL testing, let's take a look at some of the lists players actually play in these events.

Future Future League Decks

Now, I should make it clear right away that it's pretty much impossible to do any data-driven analytics of the Future Future League because we don't have all of the decks and don't really have a good idea of what the metagame looks like. Instead, we get a smattering of lists that Sam Stoddard posts in his "Days of the Future Future League" articles each set. As such, the best way to break down the FFL is probably with some observations about the posted lists.

For Kaladesh (the most recent Days of the Future Future League article), we got a total of 13 FFL decks, running the gamut from aggro to control to combo. Easily the biggest "miss" in the bunch is Smuggler's Copter. While some FFL decks did play the looter scooter, only three and a half (one deck played two copies) actually included Smuggler's Copter, and most of the decks that did play the copter were relying on artifact synergies (like Toolcraft Exemplar and Inventor's Apprentice), rather than playing Smuggler's Copter as a good standalone card. While this could partly be because the card changed over the testing process (and also because some of the lists were creature-light control decks that couldn't reliably crew a vehicle), it's especially striking to see decks like GB Delirium Aggro go copter-less and Bant Tempo run Wharf Infiltrator in the two-drop slot over the now-banned vehicle.

There are also a couple of weird curve-related issues. Maybe the best example is GW Tokens—a deck that definitely deserved testing after being one of the best over the past year, but the posted list is not only playing four copies of Angel of Invention but also four Cultivator of Blades, which must have led to some extremely clunky draws.

If we delve back further to the Eldritch Moon FFL, we find a downright shocking lack of Emrakul, the Promised End and Ishkanah, Grafwidow, two of the highest-impact cards from the set, and this is despite the fact that, at least at some points during the process, Ishkanah, Grafwidow was actually better than the finished version (making four Spider tokens instead of three and / or costing less mana). I mean, neither of the Delirium decks published had Emrakul, the Promised End, and between them, there's just a single copy of Ishkanah, Grafwidow. Actually, the only appearance of Emrakul, the Promised End in any of the 17 published deck lists was in Esper Control (somewhat ironically, one of the only slow, controlling decks that didn't really play the Eldrazi in the real world).

Taken as a whole, it's not that the FFL decks are bad, but they don't seem to be built with the idea of trying to break the format. Rather than sitting down with card X and saying, "What are the most degenerate, powerful, and potentially broken things I can do?" they instead look like fun casual decks that play a lot of the new cards to get a feel for how they work. While this process isn't a bad thing, in and of itself, it's certainly a much different testing process than you'll see in the real world when pro teams gather in the hopes of breaking the format, where instead of, "Is Cultivator of Blades good enough to see play in kitchen-table tokens decks?" the process is more about "How fast can I get an Emrakul, the Promised End on the battlefield?" (answer: consistently Turn 4 or 5) or "What's the most busted thing I can blink with Felidar Guardian?" (answer: Saheeli Rai).

To me, clearly the biggest example of the difference between the FFL and real-world testing is Felidar Guardian. I mentioned at the beginning of the article that Wizards simply didn't realize it could go infinite with Saheeli Rai, and while this itself is pretty troubling for a bunch of reasons (especially considering the timing of the FFL, which is a year behind real life, meaning that they were testing the set exactly when they were banning Splinter Twin in Modern), what's more troubling is that we know Wizards saw the card had combo potential, but the combo they mentioned involved double Felidar Guardian (or a Clone) with something to abuse enters-the-battlefield effects. This is the essence of semi-competitive deck building and testing (I know, because I build a ton of fun semi-competitive decks, and double Felidar Guardian and Panharmonicon was the first thing I thought of when I saw the card). The problem is the real world is testing competitively, so the world that the FFL is testing for is much, much different than the world Magic cards go into once the set is released.

The Solutions

Once again, I should preface this by saying that we're working with somewhat limited information and we don't know everything that goes on during the FFL testing process, so my solutions are based on the information we do have, mostly from the various articles we've been talking about. It's completely possible that some of these things already happen, at least to some extent. Also, while some of these solutions are pretty easy, others are more challenging and may not even be possible.

#1: Stop Changing Cards Post-Testing

This one had to be number one on the list, because it not only seems like a fairly easy change but is also so obvious. Many of the problems we've seen in recent Standards come from changes that happen very late in the design and development process, even after FFL testing is finished. The idea that some cards, especially powerful mythics and rares like Aetherworks Marvel and various Vehicles, are sort of haphazardly thrown together at the last second is pretty scary. I'm sure it's difficult enough to produce a great Magic set with tons of testing, and it has to be nearly impossible if all the testing you do doesn't especially matter because someone is just going to change the card anyway.

This isn't just a recent problem. Skullclamp took on its current broken, banned form thanks to a late change that made it give the equipped creature −1 toughness, and while it may still have been too good even without turning any random x/1 into a Divination, this certainly pushed the card (even more) over the edge. While getting rid of all late-in-the-process card changes is probably impossible (and even undesirable), this type of change should be reserved for extreme cases and not be the norm (which it seems to be at present, based on all of the Sam Stoddard articles).

#2: Create a Pro Testing Team

While it seems that the FFL is fairly good at loosely testing to make sure a format is somewhat functional and fun, it appears to be lacking in the sort of high-end, pro-level testing that Magic cards get once they are released into the wild. The easiest way to fix this problem would be to create a separate testing team of pro (or former pro) players. Rather than replacing the FFL, having a group of eight (or 16) pro players working alongside the current FFL would strengthen the testing process by adding a very different and much more competitive perspective. It seems exceedingly unlikely that a pro testing group would have failed to recognize the Copy Cat combo, and it's possible that a group of high-level players with the goal of breaking a format would have come to much different results with cards like Emrakul, the Promised End and Smuggler's Copter as well. Remember: Wizards employees cannot play in high-level tournaments, which means that, even if the player joins Wizards from the pro community, over the course of their time at Wizards, there will be a tendency to skew towards more casual play (if you follow Wizards people on Twitter, you'll see that a huge majority of their tweets about actual game play are about cubing, Commander, or limited, rather than how they built a busted tournament deck). Adding a pro testing team would not only give a different perspective but would also add in a group outside of the bubble that is the FFL playgroup, and having additional outside eyes on the cards, decks, and format would go a long way to helping identify things that the in-group may overlook (a good example of this is Thragtusk, which was "missed" by the FFL because everyone in the group was enamored with the power of Wolfir Silverheart).

Of course, there's a huge challenge here: how do you get pros to drop from the tournament scene to help test cards? This problem is complicated by the fact that the testing happens a year in advance, which means that to test a single set, a player would need to avoid tournament-level Magic for a significant period of time. That said, we've seen plenty of high-level players go to Wizards, and LSV recently gave up the Pro Tour to do commentary, so perhaps this is the model: get a group of eight high-level pros, defer that pro status for a year (much like LSV, who can return to the Pro Tour at platinum next year), and hire them for one year to work primarily as FFL testers. I expect that at least some players would jump at the opportunity to work at Wizards for a year, and the end result would be a much more rigorous, expansive, and productive testing process for new sets.

#3: Have "Break the Card" Days

This one could already be happening, but one way to break out of the "casual testing" rut would be to have some testing days where the players are given a specific card, with the goal being to do the most absurd thing possible with it (almost like Against the Odds: FFL). It seems like many of the most obvious misses over the past year could have been avoided if direct attempts to do the most broken thing possible with a card were part of the testing process. I'd like to think that most reasonably competent Magic players, if assigned Felidar Guardian and given a database of all of the cards in Standard, would have recognized the Saheeli combo within a relatively short period of time, and having players build a "how fast can we get an Emrakul, the Promised End on the battlefield" deck would likely have illuminated the power of the Eldrazi in a way that "I've got it as a one-of in my Esper Superfriends" deck never could.

While these extreme decks might not ever end up being played in the real world, they would help identify the ceiling of various cards, and this in turn would help to prevent power-level misses. It's also likely that interactions that would otherwise be missed in regular "build a good, fun deck" testing would come to light. Even better, this doesn't require any extra spending, people, or time; instead, it is simply a slight reorganization of the time that the FFL already spends testing. I'm pretty sure that out of six months of testing a set, there is plenty of time to dedicate a day or week to breaking new cards.

#4: Print More Answers

Last but not least, we have something I've been harping on for a long time, and while this doesn't directly change the FFL testing process, it should help the process along the way. By having more answers as a safety net, the job of testing new cards actually becomes much, much easier. Think of it this way: if something like Pithing Needle were in the format, Smuggler's Copter would probably not have been so dominant that it needed to be banned. If something like Tormod's Crypt or Rest in Peace were in the format, Emrakul, the Promised End would likely be fine.

The presence of ample answers is why Wizards doesn't really need to test for Modern or Legacy. If they happen to print something that is too good for either format, there is usually something floating around that can solve the problem and normalize the metagame. While I'm not saying we need to go back to the days of Choke, Blood Moon, and Meekstone, by adding some reasonable sideboard-able answers to the Standard format, Wizards would have a lot more wiggle room when it comes to designing, developing, and testing cards. While it will still be possible to go too far and print something that makes Standard miserable and has to be banned, it will take a really egregious mistake (like Skullclamp) and not just going slightly too far, which is where we are currently in our answerless Standard format.

Thankfully, Wizards recognizes that the lack of answers is a problem, which hopefully means we will see some changes coming down the line in a year or two (thanks to the lag between set design and set release). The question is whether the answers will really be good enough to help fix the problems we've had in Standard or just token attempts to pacify the "we need answers" crowd.

Conclusion

Anyway, that's all for today. What do you think? Why have we suddenly started having problems in the design / development end of Wizards, which has traditionally been the company's biggest strength? Do you have any other ideas of how the process for testing new cards could be improved to make sure the next Felidar Guardian gets identified before it sees print and the next Emrakul, the Promised End is toned down enough that we don't need more painful and unpopular Standard bannings? Let me know in the comments! As always, leave your thoughts, ideas, opinions, and suggestions. You can reach me on Twitter @SaffronOlive or at SaffronOlive@MTGGoldfish.com.