With the kickoff of the 2018 FIFA World Cup fast approaching, every soccer fan in the world is dying to know: Who will capture the coveted trophy?

If you’re not just a soccer fan but also a techie, you have probably noticed that Machine Learning and Artificial Intelligence are buzzwords right now too. Let us combine the two to predict which country will win the FIFA World Cup.

Disclaimer: This should in no way be used for betting or any financial decision. Should you choose to, who am I to stop you? (Just don’t forget me if you hit a jackpot 😃.)

A lot of factors are involved in the game of football, and not all of them can be captured in a machine learning model. This is just a hacker trying some cool shit with data…

Goal

The goals are to use Machine Learning to predict who is going to win the FIFA World Cup 2018, to predict the outcome of individual matches for the entire competition, and to run a simulation of the later rounds, i.e. the quarter-finals, semi-finals and final.

These goals present a unique real-world Machine Learning prediction problem and involve solving various Machine Learning tasks: data integration, feature modelling and outcome prediction.

Data

I used two data sets from Kaggle. You can find them here. We will use results of historical matches since the beginning of the championship (1930) for all participating teams.

Limitation: The FIFA ranking was only created in the 1990s, so it is missing for a huge portion of the dataset. So let’s stick to historical match records.

Environment and tools: Jupyter Notebook, NumPy, pandas, seaborn, Matplotlib and scikit-learn.

We are first going to do some exploratory analysis on the two datasets, do some feature engineering to select the most relevant features for prediction, do some data manipulation, choose a Machine Learning model and finally deploy it on the dataset.

Let the rubber hit the road!

First things first, import the necessary libraries and load the datasets into a dataframe.

Importing libraries

Loading the datasets…

Ensure the datasets are loaded in the dataframes by calling world_cup.head() and results.head() for both datasets as shown below:
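As a rough sketch of the loading step, here is how one of the files could be read with pandas. The file name results.csv in the comment and the two rows are placeholders I made up; only the column names come from this post. A small in-memory string stands in for the Kaggle file so the snippet runs anywhere, and the world_cup dataframe would be loaded with the same call.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV. The rows below are invented for illustration.
results_csv = io.StringIO(
    "date,home_team,away_team,home_score,away_score,tournament,city,country\n"
    "1990-01-01,Team A,Team B,2,1,Friendly,City X,Country X\n"
    "1990-01-02,Team C,Team A,0,0,Friendly,City Y,Country Y\n"
)

# With the real file this would be something like pd.read_csv("results.csv")
# (hypothetical file name).
results = pd.read_csv(results_csv)

# Quick sanity check that the data landed in the dataframe.
print(results.head())
```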

Exploratory Analysis

After analyzing both datasets, we end up with a single dataset containing data on past matches. This resulting dataset will be useful for analyzing and predicting future matches.

Exploratory analysis and feature engineering, which involve establishing which features are relevant for the Machine Learning model, are the most time-consuming part of any data science project.

Let’s now add goal difference and outcome columns to the results dataset.
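One way this step could look is sketched below. I am assuming the goal difference is the absolute difference between the two scores and that the outcome column (called winning_team here) holds the winner’s name or "Draw". The three rows are invented, not real results.

```python
import pandas as pd

# Invented rows for illustration only; not real match results.
results = pd.DataFrame({
    "home_team": ["Nigeria", "Ghana", "Nigeria"],
    "away_team": ["Ghana", "Nigeria", "Egypt"],
    "home_score": [2, 1, 0],
    "away_score": [1, 1, 3],
})

# Absolute goal difference for each match (assumed definition).
results["goal_difference"] = (results["home_score"] - results["away_score"]).abs()

# Outcome column: start with "Draw", then overwrite where one side won.
home_wins = results["home_score"] > results["away_score"]
away_wins = results["home_score"] < results["away_score"]
results["winning_team"] = "Draw"
results.loc[home_wins, "winning_team"] = results.loc[home_wins, "home_team"]
results.loc[away_wins, "winning_team"] = results.loc[away_wins, "away_team"]
```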

Check out the new results dataframe.

Then we’ll work with a subset of the data, one that includes only games played by Nigeria. This will help us focus on which features are interesting for one country before expanding to all the countries participating in the World Cup.

The first World Cup was played in 1930. Create a column for the year and pick all the games played from 1930 onwards.
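The Nigeria subset and the year filter can be sketched as follows. The rows and dates are invented, and I am assuming the date column holds ISO-style date strings. The column name match_year is the one this post drops later on.

```python
import pandas as pd

# Invented rows for illustration only.
results = pd.DataFrame({
    "date": ["1925-05-10", "1960-08-28", "1994-06-21", "2002-06-02"],
    "home_team": ["Nigeria", "Ghana", "Nigeria", "Brazil"],
    "away_team": ["Ghana", "Nigeria", "Egypt", "Turkey"],
})

# Keep matches where Nigeria played either at home or away.
involves_nigeria = (results["home_team"] == "Nigeria") | (results["away_team"] == "Nigeria")
nigeria = results[involves_nigeria].copy()

# Extract the year from the date and keep games from 1930 onwards.
nigeria["match_year"] = pd.to_datetime(nigeria["date"]).dt.year
nigeria_1930 = nigeria[nigeria["match_year"] >= 1930]
```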

We can now visualize the most common match outcome for Nigeria throughout the years.

Getting the winning rate for every country that will participate in the world cup is a useful metric and we could use it to predict the most likely outcome of each match in the tournament.
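A minimal sketch of a winning rate, assuming it is defined as matches won divided by matches played and that a winning_team column with team names or "Draw" already exists. The teams and outcomes are placeholders.

```python
import pandas as pd

# Placeholder teams and outcomes, invented for illustration.
results = pd.DataFrame({
    "home_team": ["A", "B", "A", "C"],
    "away_team": ["B", "A", "C", "B"],
    "winning_team": ["A", "Draw", "C", "B"],
})

# Matches played per team, counting both home and away appearances.
games_played = pd.concat([results["home_team"], results["away_team"]]).value_counts()

# Matches won per team; reindexing drops "Draw" and fills winless teams with 0.
wins = results["winning_team"].value_counts().reindex(games_played.index, fill_value=0)

# Winning rate per team, best first.
win_rate = (wins / games_played).sort_values(ascending=False)
```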

Venue of the matches won’t matter that much.

Narrowing to the teams participating in the World Cup

Create a dataframe with all the participating teams.

We then further filter the results dataframe to show only matches from 1930 onwards involving teams in this year’s World Cup, and drop duplicates.
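Here is a sketch of that filter, assuming we keep any match in which at least one side is a participating team. The list holds only four of the teams to keep the example short, and the matches are invented. Selecting home and away matches separately and concatenating them is what creates the duplicates that then get dropped.

```python
import pandas as pd

# Illustrative subset of the participating teams, not the full list of 32.
worldcup_teams = ["Nigeria", "Argentina", "Iceland", "Croatia"]

# Invented fixtures for illustration only.
results = pd.DataFrame({
    "home_team": ["Nigeria", "Ghana", "Iceland", "Egypt"],
    "away_team": ["Argentina", "Egypt", "Ghana", "Croatia"],
})

# Matches where a participating team appears as the home or the away side.
df_teams_home = results[results["home_team"].isin(worldcup_teams)]
df_teams_away = results[results["away_team"].isin(worldcup_teams)]

# A match between two participating teams shows up in both selections,
# so drop the duplicate rows after concatenating.
df_teams = pd.concat([df_teams_home, df_teams_away]).drop_duplicates()
```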

Create a year column and drop games before 1930, as well as columns that won’t affect the match outcome, for example date, home_score, away_score, tournament, city, country, goal_difference and match_year.

Modify the “Y” (prediction label) in order to simplify our model’s processing.

The winning_team column will show “2” if the home team has won, “1” if it was a tie, and “0” if the away team has won.
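The relabelling could be done along these lines. I am assuming winning_team currently holds the winner’s name or "Draw"; the rows are invented.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "home_team": ["Nigeria", "Ghana", "Nigeria"],
    "away_team": ["Ghana", "Nigeria", "Egypt"],
    "winning_team": ["Nigeria", "Draw", "Egypt"],
})

home_win = df["winning_team"] == df["home_team"]
draw = df["winning_team"] == "Draw"

# Start from 0 (away win), then mark draws as 1 and home wins as 2.
label = pd.Series(0, index=df.index)
label.loc[draw] = 1
label.loc[home_win] = 2
df["winning_team"] = label
```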

Convert home_team and away_team from categorical variables to numeric inputs by setting dummy variables.

We do this with the pandas get_dummies() function. It replaces categorical columns with their one-hot (numbers ‘1’ and ‘0’) representations, enabling them to be loaded into a scikit-learn model.
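A small sketch of what get_dummies() does to the two team columns. The rows are invented, and dtype=int is my choice so the output shows 1 and 0 rather than True and False.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "home_team": ["Nigeria", "Ghana"],
    "away_team": ["Ghana", "Nigeria"],
    "winning_team": [2, 0],
})

# One column per (side, team) pair, e.g. home_team_Nigeria, away_team_Ghana.
final = pd.get_dummies(
    df,
    prefix=["home_team", "away_team"],
    columns=["home_team", "away_team"],
    dtype=int,
)
print(final)
```

The label column winning_team passes through untouched; only the columns listed in columns are encoded.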

We then separate the X and Y set and split the data into 70 percent training and 30 percent test.
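For the split itself, a sketch with scikit-learn’s train_test_split. The ten-row dataframe is a toy stand-in for the one-hot encoded data, and random_state=42 is an arbitrary seed I picked so the split is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded dataframe; values are invented.
final = pd.DataFrame({
    "winning_team": [2, 1, 0, 2, 1, 0, 2, 1, 0, 2],
    "home_team_A": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "away_team_B": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# X holds the features, y the label we want to predict.
X = final.drop("winning_team", axis=1)
y = final["winning_team"]

# 70 percent of the rows for training, 30 percent held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
```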

We will use logistic regression, a classification algorithm. How does this algorithm work? It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, specifically the cumulative logistic distribution.

In other words, logistic regression attempts to predict an outcome (a win, a draw or a loss) given a set of data points (stats) that likely influence that outcome.

The way this works in practice is you feed the algorithm one game at a time, with both the aforementioned “set of data” and the actual outcome of the game. The model then learns how each piece of data you feed it influences the outcome of the game positively, negatively and to what extent.

Give it enough (good) data and you have a model that you can use to predict future outcomes.

A model is as good as the data you give it.
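To make the “estimating probabilities” part concrete, here is a bare-bones NumPy sketch of the logistic function, plus softmax, which is its usual generalisation when there are more than two outcomes (as with our win, draw and loss labels). The weights, bias and scores are numbers I made up; a fitted model would learn them from the data.

```python
import numpy as np


def logistic(z):
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


# One match as a one-hot feature vector, with hypothetical learned weights.
features = np.array([1.0, 0.0, 0.0, 1.0])
weights = np.array([0.8, -0.2, 0.1, -0.5])
bias = 0.1

# Linear score first, then the logistic function turns it into a probability.
z = features @ weights + bias
p = logistic(z)

# Three-outcome case: hypothetical scores for away win, draw and home win.
outcome_probs = softmax(np.array([-0.3, 0.1, 0.6]))
```

A positive weight pushes the probability up and a negative one pushes it down, which is the “positively, negatively and to what extent” part described above.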

Let’s have a look at our final dataframe:

Looks great. We are now ready to pass this to our algorithm:
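The fitting and scoring step might look like the sketch below. It runs on synthetic data I generate on the spot, so the accuracies it prints will not match the ones reported in this post; the point is only the fit() and score() calls.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the one-hot features (not the real match data).
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 6)).astype(float)

# Labels 0, 1, 2 loosely tied to the first two columns plus noise.
signal = X[:, 0] - X[:, 1] + rng.normal(scale=0.8, size=300)
y = np.digitize(signal, bins=[-0.5, 0.5])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Fit on the training split only.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
train_acc = logreg.score(X_train, y_train)
test_acc = logreg.score(X_test, y_test)
print(f"Training set accuracy: {train_acc:.3f}")
print(f"Test set accuracy: {test_acc:.3f}")
```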

Our model got a 57% accuracy on the training set and 55% accuracy on the test set. This doesn’t look great but let’s move on.

At this point we will create the dataframe on which we will deploy our model.

We will start by loading the FIFA ranking as of April 2018 dataset and a dataset containing the fixtures of the group stage of the tournament, obtained from here. The team that is positioned higher in the FIFA ranking will be considered the “favourite” for the match and will therefore be placed in the “home_team” column, since there are no “home” or “away” teams in World Cup games. We then add teams to the new prediction dataset based on the ranking position of each team. The next step will be to create dummy variables and deploy the machine learning model.
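The favourite-as-home-team rule can be sketched like this. The column names Team, Position, Home Team and Away Team are my assumptions about the two files, and the teams and positions are placeholders rather than the real April 2018 ranking.

```python
import pandas as pd

# Placeholder ranking: a lower position number means a higher rank.
ranking = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C", "Team D"],
    "Position": [1, 2, 3, 4],
})

# Placeholder group-stage fixtures.
fixtures = pd.DataFrame({
    "Home Team": ["Team C", "Team B"],
    "Away Team": ["Team A", "Team D"],
})

# Look up each side's ranking position.
position = ranking.set_index("Team")["Position"]
first_pos = fixtures["Home Team"].map(position)
second_pos = fixtures["Away Team"].map(position)

# True where the first-listed team is the favourite.
first_is_favourite = first_pos < second_pos

# The favourite goes under home_team, the other side under away_team.
pred_set = pd.DataFrame({
    "home_team": fixtures["Home Team"].where(first_is_favourite, fixtures["Away Team"]),
    "away_team": fixtures["Away Team"].where(first_is_favourite, fixtures["Home Team"]),
})
```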

Match Prediction

By now you are wondering: will we ever get to the predictions? That has been too much code and talk; when will you show us the predictions? Just hold on tight, we are almost there…

Deploying the model to the dataset

We start with deploying the model to the group matches.
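Here is a sketch of what deploying the model to a fixture could look like. The training rows and the fixture are invented. The one detail worth calling out is the reindex() call: the prediction set has to have exactly the same dummy columns, in the same order, as the data the model was trained on, with 0 for teams that don’t appear in a given fixture.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented training matches: 2 = home win, 1 = draw, 0 = away win.
train = pd.DataFrame({
    "home_team": ["Team A", "Team A", "Team B", "Team C", "Team B", "Team C"],
    "away_team": ["Team B", "Team C", "Team C", "Team A", "Team A", "Team B"],
    "winning_team": [2, 2, 1, 0, 0, 1],
})
X_train = pd.get_dummies(train[["home_team", "away_team"]], dtype=int)
y_train = train["winning_team"]
logreg = LogisticRegression().fit(X_train, y_train)

# One group-stage fixture, with the favourite already under home_team.
pred_set = pd.DataFrame({"home_team": ["Team A"], "away_team": ["Team C"]})

# Encode the fixture and align its columns with the training columns.
X_pred = pd.get_dummies(pred_set, dtype=int).reindex(
    columns=X_train.columns, fill_value=0
)

predictions = logreg.predict(X_pred)
probabilities = logreg.predict_proba(X_pred)

# Turn the numeric label back into something readable.
outcome_names = {2: "home team wins", 1: "draw", 0: "away team wins"}
for (_, match), label in zip(pred_set.iterrows(), predictions):
    print(match["home_team"], "vs", match["away_team"], "->", outcome_names[int(label)])
```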

Here are the results of the group stage.