Playing "Moneyball" on EA FIFA 16

I'm an EA FIFA enthusiast; I’ve played since 1994. Last year, I was talking with a friend about the game and he asked me if I was playing Career Mode. My first answer was: "That is boring ... I want to play the real game and not play as a manager". From that day on, he set out to convince me to play Career Mode, and asked me about it almost every day. After a few weeks I decided to give it a try. I selected Barcelona, played a few seasons (simulating all the games), made some trades, and won La Liga and the Champions League.

The next day, I told him about my experience and he said that La Liga is actually too easy and I should try the English Premier League (EPL). That night I used the same approach, this time with Chelsea. Again, after some trades, I won the EPL and the Champions League. The next day I told him, proudly, that the EPL was also too easy, and he said that if I wanted a challenge I should play with a team from the second division. I answered that I would select a team from the third division. Then he said, "You should play with a team from the fourth division ... actually you should play with Portsmouth."

I had a Barney Stinson moment and said ... "win the EPL with Portsmouth" ... challenge accepted.

Update: I presented the result of that challenge last year at the MTL Data meetup. In this blog post, I describe my new challenge with FIFA 16: "Win the Champions League with Accrington Stanley".

Initial Strategy

Winning the Champions League is not hard; you just need the right team. I was pretty confident that with the following team I could win any championship.

However, when I selected Accrington Stanley, I understood the challenge right away: the budget. In my two previous experiences I played with Barcelona (initial budget 100,000,000 euros) and Chelsea (initial budget 85,000,000 euros). Accrington Stanley's initial budget is 500,000 euros, which is 200 times smaller than Barcelona's. For context, here is the distribution of initial budgets for all the 583 teams in the game.

At that point I realized I was facing the "Moneyball problem". In the movie "Moneyball" there is a scene where Billy Beane says to the scouts: "The problem we're trying to solve is that there are rich teams and there are poor teams. Then there's fifty feet of cr*p, and then there's us. It's an unfair game".

I knew that in order to build a team I would need to scout "cheap" players: young players with big potential (for the long term) and unknown players from smaller leagues (for the short term).

I can say that I know football a bit. I follow the main leagues, competitions and players. But how could I get familiar with smaller leagues and unknown players? I needed a data-driven approach.

Dataset

Searching the web, I was able to find some interesting websites with stats on EA FIFA. However, the numbers are usually slightly different from one site to another. My first task was to collect, normalize and aggregate all these datasets.

As a result of the data aggregation, I ended up with a dataset with almost 12,000 players. The final dataset has more than 50 columns including:

Basic information: name, nationality, current club, league, ...

Physical information: age, height, weight, ...

Position on the field

Skills information (28 metrics): acceleration, agility, positioning, interception, vision, jumping, ...

General skills information (6 aggregate metrics): pace, dribbling, shooting, defending, passing and physicality.

Overall Rating: a number from 0-100 that ranks the player's performance

Potential: a number from 0-100 giving the overall rating the player can achieve in his career
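The post does not say how the sources were merged, so the following is only a minimal sketch of one way the "collect, normalize and aggregate" step could look. It assumes each site yields records with `name`, `club` and `overall` fields (hypothetical field names), matches players on an accent-free lower-case key, and takes the median rating across sites. The sample records and ratings are made up for illustration.

```python
import unicodedata
from collections import defaultdict
from statistics import median


def normalize_name(name: str) -> str:
    """Lower-case, strip accents and collapse spaces so names match across sites."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def aggregate(sources):
    """Merge per-site records into one entry per (name, club) with the median rating."""
    grouped = defaultdict(list)
    for records in sources:
        for rec in records:
            key = (normalize_name(rec["name"]), normalize_name(rec["club"]))
            grouped[key].append(rec["overall"])
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical records from three sites that disagree slightly.
site_a = [{"name": "Luis Suárez", "club": "FC Barcelona", "overall": 90}]
site_b = [{"name": "luis suarez", "club": "FC  Barcelona", "overall": 89}]
site_c = [{"name": "Luis Suarez", "club": "fc barcelona", "overall": 90}]

merged = aggregate([site_a, site_b, site_c])
print(merged)
```

The median is used here because it is less sensitive to a single site that is off by a few points; a mean or a "most trusted source wins" rule would be equally reasonable choices.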

Exploratory Data Analysis

My objective was to find players with a good overall rating and especially good potential. I started with some exploratory data analysis to understand which factors influence a player’s overall rating and potential.

Initially I looked at the 28 skill metrics; however, I quickly noticed that the 6 general skill metrics are an aggregation of the 28. For example, in the following image, we can see Neymar's skill information. The numbers in green/yellow/red are the actual numbers that quantify his skill in each criterion. The advantage of the aggregation is that it simplifies both the analysis and the use in the game: instead of tracking many numbers, I just needed to track six.

To validate the aggregation, I used a regression to estimate the weight of each metric. The result is the number in black next to each skill. For example, the “Dribbling” score is a combination of Agility (10%), Balance (5%), Reactions (5%), Ball Control (30%) and Dribbling (50%). Interestingly, I didn't round these numbers; the fitted weights really are that round.

The results from the regression analysis gave me a useful insight. Just by looking at the weights, I could see that if I was interested in a player with good Dribbling, the metrics “Balance” and “Reactions” are not as relevant as “Ball Control” and “Dribbling”. This helped me to find players based on abilities and to train specific skills.

Another interesting exploration was the relevance of the aggregate metrics per position. In the game, each position requires specific skills. The following table presents the importance distribution. For example, a good CAM (Centre Attack Midfielder) needs excellent passing skills and good dribbling and shooting, whereas pace, defense and physicality are barely relevant.
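As a rough illustration of this kind of regression, here is a minimal sketch using ordinary least squares in NumPy. The data is synthetic: 500 made-up players whose aggregate "Dribbling" score is built from the weights quoted above and then rounded to a whole number, as the game displays it. The column names and the absence of an intercept are assumptions; the post does not say which regression setup was actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
skills = ["agility", "balance", "reactions", "ball_control", "dribbling"]
true_weights = np.array([0.10, 0.05, 0.05, 0.30, 0.50])  # weights quoted in the text

# Synthetic stand-in for the real dataset: 500 players, each skill in 30..99.
X = rng.integers(30, 100, size=(500, len(skills))).astype(float)
# The game displays whole numbers, so round the aggregate score.
y = np.round(X @ true_weights)

# Ordinary least squares without an intercept: y ≈ X @ w.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, weight in zip(skills, w):
    print(f"{name:>12}: {weight:.1%}")
```

Even with the rounding noise, the fitted weights land very close to the round percentages, which is consistent with the observation that no manual rounding was needed.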

       CB    LB    RB    CDM   CM    LM    RM    CAM   LW    RW    ST
PAC    0%    2%    2%    1%    1%    2%    3%    1%    5%    2%    1%
SHO    0%    1%    1%    1%    1%    1%    1%    12%   4%    6%    91%
PAS    1%    5%    8%    22%   85%   41%   57%   67%   25%   14%   1%
DRI    1%    3%    2%    1%    6%    54%   38%   19%   65%   76%   4%
DEF    94%   88%   85%   72%   5%    1%    0%    0%    1%    1%    1%
PHY    3%    2%    2%    3%    2%    1%    1%    1%    1%    2%    2%

(CB) Centre Back, (LB) Left Back, (RB) Right Back, (CDM) Centre Defence Midfielder, (CM) Centre Midfielder, (LM) Left Midfielder, (RM) Right Midfielder, (CAM) Centre Attack Midfielder, (LW) Left Wing, (RW) Right Wing, (ST) Striker

This analysis also helped me to focus on the relevant numbers and quickly spot the skills I should look for in a player for each position.
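To make the table easier to use while scouting, it can be turned into a small lookup. The sketch below copies the percentages from the table and lists, for a given position, the aggregate metrics that matter most. The 10% cut-off is an arbitrary assumption chosen for illustration, not something taken from the game.

```python
METRICS = ["PAC", "SHO", "PAS", "DRI", "DEF", "PHY"]

# Importance (in %) of each aggregate metric per position, copied from the table.
IMPORTANCE = {
    "CB":  [0, 0, 1, 1, 94, 3],
    "LB":  [2, 1, 5, 3, 88, 2],
    "RB":  [2, 1, 8, 2, 85, 2],
    "CDM": [1, 1, 22, 1, 72, 3],
    "CM":  [1, 1, 85, 6, 5, 2],
    "LM":  [2, 1, 41, 54, 1, 1],
    "RM":  [3, 1, 57, 38, 0, 1],
    "CAM": [1, 12, 67, 19, 0, 1],
    "LW":  [5, 4, 25, 65, 1, 1],
    "RW":  [2, 6, 14, 76, 1, 2],
    "ST":  [1, 91, 1, 4, 1, 2],
}


def key_skills(position, threshold=10):
    """Return the metrics with importance >= threshold (%), most important first."""
    pairs = zip(METRICS, IMPORTANCE[position])
    relevant = [(metric, pct) for metric, pct in pairs if pct >= threshold]
    relevant.sort(key=lambda item: item[1], reverse=True)
    return [metric for metric, _ in relevant]


for position in IMPORTANCE:
    print(f"{position:>3}: {', '.join(key_skills(position))}")
```

For CAM this yields passing, then dribbling, then shooting, which matches the reading of the table given above.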

Prediction

After the exploratory analysis, my next question was how the video game calculates the overall rating and the potential.

Based on the previous results, it was clear that the predictions would depend on the position. I applied three different algorithms to predict the overall rating. Here are the MAE (mean absolute error) and the RMSE (root mean square error) for each combination of position (CB, LB, RB, CDM, CM, LM, RM, CAM, LW, RW, ST), algorithm (Gradient Boosting, Linear Regression, Random Forest) and feature set (age + 6 aggregated skills, age + 28 skills). The following charts show the errors (in units of overall score), based on cross validation, for each combination.

From the results, I could observe that Linear Regression works better than the other two algorithms. I think this makes sense: EA probably opted for a simple formula that still gives interesting results. Also, the calculation using the 28 skill metrics results in a smaller error than the one using the 6 aggregated metrics. In terms of positions, the biggest errors are for CM (Centre Midfielder), LW (Left Wing) and RW (Right Wing). This is consistent, because players in these three positions have more diverse skill sets. For example, a CM needs to help both in defense and in attack, and the LW/RW are a mix of midfielders, wingers and forwards. When interpreting these errors, keep in mind that the overall score itself varies slightly between sources, because the dataset aggregates several websites.

The prediction of the potential is the most important question in the context of the game. I used the same approach: cross validation with the same set of algorithms and features. The results (presented below) show that the average error for all the positions is between 1.0 and 1.5 units of score. That was exactly what I needed. This error is low enough to give me a very good idea of a player's potential.

Because potential, unlike the overall score, is not a direct description of a player's current skills, Gradient Boosting and Random Forest gave better results than Linear Regression in most cases. In any case, for the final implementation I used an ensemble method to combine the three algorithms.
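For readers who want to see how MAE and RMSE come out of a cross validation, here is a minimal NumPy-only sketch for the linear regression case. Everything in it is an assumption made for illustration: the data is synthetic (age plus six aggregate skills, with a striker-like weighting and Gaussian noise), the number of folds is 5, and the helper `kfold_errors` is a hypothetical name. The same loop would apply to Gradient Boosting or Random Forest by swapping the fit and predict steps.

```python
import numpy as np


def kfold_errors(X, y, k=5, seed=0):
    """Return (MAE, RMSE) of a least-squares linear model under k-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Append a column of ones so the model can learn an intercept.
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        errors.append(A_test @ coef - y[test_idx])
    errors = np.concatenate(errors)
    mae = float(np.mean(np.abs(errors)))
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return mae, rmse


# Synthetic stand-in: columns are age, PAC, SHO, PAS, DRI, DEF, PHY.
rng = np.random.default_rng(1)
age = rng.integers(17, 36, 400)
skills = rng.integers(30, 100, (400, 6))
X = np.column_stack([age, skills]).astype(float)
# Assumed striker-like weighting plus noise; not the game's real formula.
y = X[:, 1:] @ np.array([0.01, 0.91, 0.01, 0.04, 0.01, 0.02]) + rng.normal(0, 1.0, 400)

mae, rmse = kfold_errors(X, y)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}")
```

RMSE is never smaller than MAE, and the gap between the two grows when a few predictions are far off, which is why it is useful to report both.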
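The post does not specify which ensemble method was used. One simple possibility is to average the three models' predictions, optionally giving more weight to the models with a lower cross-validation error. The sketch below shows that idea only; the function name, the prediction values and the error values are all hypothetical.

```python
def ensemble_predict(predictions, cv_errors=None):
    """Combine per-model predictions for one player into a single estimate.

    predictions: dict of model name -> predicted potential
    cv_errors:   optional dict of model name -> cross-validation MAE; when given,
                 each model is weighted by the inverse of its error.
    """
    if cv_errors is None:
        weights = {name: 1.0 for name in predictions}
    else:
        weights = {name: 1.0 / cv_errors[name] for name in predictions}
    total = sum(weights.values())
    return sum(predictions[name] * weights[name] for name in predictions) / total


# Hypothetical outputs of the three models for one player.
preds = {"gradient_boosting": 75.0, "linear_regression": 72.0, "random_forest": 76.0}
maes = {"gradient_boosting": 1.0, "linear_regression": 2.0, "random_forest": 1.0}

print(round(ensemble_predict(preds), 2))        # plain average
print(round(ensemble_predict(preds, maes), 2))  # error-weighted average
```

Averaging tends to smooth out the individual models' mistakes, which is useful here since no single algorithm was best for every position.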

Back to the game

With the insights from the exploratory data analysis and the machine learning models for predictions, I went back to the game to test them. Here are three scenarios where I used the data-driven approach.

Buy the "right" player

During my manager career with Accrington Stanley I was able to find good players that fit the budget. Player values in EA FIFA are hard to analyze because they vary a lot based on the player's position, team, league, age, overall rating and amount of time on the field. Here are some numbers, just as an illustration. On the left is the super team that I mentioned earlier, and on the right is the formation I fielded in the Champions League final.

These numbers are from the start of a new career.

The formation on the right illustrates the "Moneyball" concept. The players start with a medium score (mean 73.6), a low price (mean 10.5 M) and an average delta of 13.4 (the delta is the difference between the potential and the current overall score). However, the players have a very good predicted potential (mean 87.0), which is really close to the mean potential of the team on the left (89.5).

Also, with the capacity to predict the potential and identify promising players, I was able to buy and later sell players with a large delta but a modest final predicted score. Here are the top 3 players I traded over the seasons in terms of profit. Trading players was the best strategy to increase the amount of money available.
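The selection logic behind this can be sketched in a few lines: compute the delta for each candidate, keep the ones that are affordable and have a high predicted potential, and look at the biggest deltas first. The players, prices and thresholds below are hypothetical, and `shortlist` is just an illustrative helper name.

```python
from statistics import mean

# Hypothetical scouting rows: (name, overall, predicted potential, price in M euros).
players = [
    ("Player A", 74, 88, 9.5),
    ("Player B", 72, 87, 7.0),
    ("Player C", 85, 87, 45.0),  # good already, but too expensive
    ("Player D", 70, 74, 3.0),   # cheap, but little room to grow
]


def shortlist(rows, max_price, min_potential):
    """Keep affordable players with a high predicted potential, biggest delta first."""
    picks = [
        {"name": n, "overall": o, "potential": p, "price": c, "delta": p - o}
        for n, o, p, c in rows
        if c <= max_price and p >= min_potential
    ]
    return sorted(picks, key=lambda row: row["delta"], reverse=True)


picks = shortlist(players, max_price=15.0, min_potential=85)
for row in picks:
    print(row["name"], row["overall"], row["potential"], row["delta"], row["price"])
print("mean delta:", mean(row["delta"] for row in picks))
```

The same delta also flags trading candidates: a player with a large delta will gain value as he develops, even if his final predicted score is not high enough to keep him in the squad.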

Scout young players

After a few seasons, the older players start to retire and the video game starts to generate new random players. I know Suarez is a very effective striker and Thiago Silva is a great centre back. But how do we evaluate a random player? Here is an example of two random players and their predicted potential based on the machine learning model.

The player on the left (Kyle Tissot) is a 17-year-old RW (Right Wing). His current overall score is 62 and he has a predicted potential of 71.88. The player on the right (Emery Gagnon-Lapare) is an 18-year-old CM (Centre Midfielder) with a predicted potential of 72.19.

Players in the wrong position

Another interesting use case is players in the wrong position. Based on a player's skills, I was able to evaluate whether the position suggested by the video game was the most effective one. Let's consider the case of Kyle Tissot. As a RW his predicted potential is 71.88 but, as we may recall from the skill-by-position table in the Exploratory Data Analysis section, a RW needs good dribbling skills, and his passing is better than his dribbling. His predicted potential as, for example, a CAM (Centre Attack Midfielder) is 75.52.
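The numbers above come from the machine learning model, which is not reproduced here. As a rough proxy for the same idea, the sketch below re-scores a player for several positions using the importance percentages from the skill-by-position table as if they were linear weights. That is a simplifying assumption, only three positions are included, and the player's skill values are hypothetical.

```python
METRICS = ["PAC", "SHO", "PAS", "DRI", "DEF", "PHY"]

# Importance (%) per position, copied from the skill-by-position table (subset).
IMPORTANCE = {
    "RW":  [2, 6, 14, 76, 1, 2],
    "CAM": [1, 12, 67, 19, 0, 1],
    "CM":  [1, 1, 85, 6, 5, 2],
}


def position_score(skills, position):
    """Importance-weighted average of the six aggregate skills for one position."""
    weights = IMPORTANCE[position]
    total = sum(weights)  # the columns do not always sum to exactly 100
    return sum(skills[m] * w for m, w in zip(METRICS, weights)) / total


# Hypothetical winger who passes better than he dribbles.
player = {"PAC": 70, "SHO": 60, "PAS": 75, "DRI": 66, "DEF": 30, "PHY": 55}

ranking = sorted(IMPORTANCE, key=lambda pos: position_score(player, pos), reverse=True)
for pos in ranking:
    print(pos, round(position_score(player, pos), 2))
```

For a player like this, the passing-heavy midfield positions come out ahead of RW, which is the same pattern as in the Kyle Tissot example.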

Final Result

The video game was a great experimental environment. Because it is a controlled universe, I was able to quickly test and refine my ideas. After 5 seasons I won the Champions League with Accrington Stanley. Challenge completed! Throughout all the seasons, I simulated all the games at the "World Class" level.

I would like to thank Vaughn DiMarco for reviewing drafts of this article.