It’s mid-March and in the United States that can mean only one thing – it’s time for March Madness! Every year countless people fill out a bracket trying to pick which college basketball team will take it all. Do you have a favorite team to win in 2018?

In this blog post, we’ll show you how to create a March Madness predictor using Amazon SageMaker. Amazon SageMaker is a fully managed service that enables developers and data scientists to easily build, train, and deploy machine learning models at scale. This post walks you through the creation process from scratch. For modeling we’ll leverage data from the public front page of kenpom.com (https://kenpom.com/), which has wonderful team-based efficiency statistics (from 2002 to present), and College Basketball Reference (https://www.sports-reference.com/cbb/) to pull historic scores (from 2011 to present). We provide instructions for downloading the Jupyter Notebook in which you will process and explore the data, build models that predict the outcome of college basketball games, and finally create an Amazon SageMaker endpoint.

Predict March Madness using Amazon SageMaker

Build a March Madness predictor application supported by Amazon SageMaker

This post focuses on the modeling and hosting process using Amazon SageMaker. A future post will provide instructions on how to create an application built using your model. For example, you could use Amazon SageMaker endpoints to power a website that can automatically generate predictions on scheduled future games, simulate the 2018 NCAA tournament, and respond to user input for hypothetical matchups.

Xavier should watch out in the Sweet 16 for Gonzaga! While Duke and Michigan State are the second and third seeds, respectively, in the Midwest, both appear to be favorites over one-seed Kansas in a hypothetical Elite 8 matchup. While the model will mostly pick favorites to win, keep in mind that the likelihood of all of the favored teams winning is extremely remote, so use the predicted win probabilities to make smart upset picks!

Environment setup

To start let’s create an Amazon S3 bucket to store data used for modeling with Amazon SageMaker. Open the AWS Management Console at aws.amazon.com and then search for the Amazon S3 console.

In the Amazon S3 console, choose the create bucket icon and then choose a name and AWS Region for your bucket. Keep in mind that this name must be globally unique, and the Region you choose here must be the same Region that you use for Amazon SageMaker. I use the US West (Oregon) (us-west-2) Region for this exercise, so it’s easiest for you to follow along using that Region. Choose Next, and on the Set Properties tab enable versioning and add a tag specific to this project (for Key I use “public_blog” and for Value I use “cbb” throughout this project).

Versioning allows you to keep track of multiple versions of model files, and tags allow you to keep track of the different resources associated with this project. Versioning and tagging are both AWS best practices. You can change any of the other settings as well, but they are not necessary for this exercise.

Choose Next to open the Set Permissions tab. Keep all of the default settings. Choose Next, and then Create Bucket. Choose the bucket you created, and the following screen should appear:

Awesome, you’ve set up an S3 bucket! Now, let’s move on to Amazon SageMaker and get set up for modeling.

Download the Jupyter Notebook from here, or use the following AWS Command Line Interface (AWS CLI) command, replacing <local_file_directory> with your desired local directory. If you’ve never used the AWS CLI before, see this guide for installation and usage details.

aws s3 cp s3://aws-ml-blog/artifacts/bball/WP_Blog_CBB.ipynb <local_file_directory>

In the AWS Management Console search for Amazon SageMaker:

In the Amazon SageMaker console, choose Create notebook instance to create a hosted Jupyter Notebook. Jupyter is a popular data science tool built for experimentation, used for processing, exploring, modeling, and evaluating data throughout the data science process.

Next give your notebook instance a name and create a new IAM role giving Amazon SageMaker access to S3 by choosing Create a new role in the IAM Role selector. This allows you to communicate and transfer data between Amazon S3 and Amazon SageMaker, which is necessary during this exercise.

We’re going to give our Amazon SageMaker notebook access to two specific buckets: wp-public-blog-cbb (contains the data necessary for modeling) and the bucket you created in the first section of the post (I’ve named my bucket wp-cbb-mods).

Next choose Create role. Then add a tag that matches the tag created in the S3 section, and choose Create Notebook Instance on the next page.

You should now see the following at the top of the Amazon SageMaker console:

Nice work! It takes a few minutes for the Amazon SageMaker notebook to be available. When it’s available, choose Open in the Actions column. This will take you to the Jupyter environment, where we’ll spin up a new notebook.

Upload the WP_Blog_CBB.ipynb Jupyter Notebook you downloaded earlier by using the Upload option within the Jupyter console.

Choose the WP_Blog_CBB.ipynb file, rename the file if you want to, and choose Upload again.

Now choose that file within the Jupyter Notebook, and you should see a screen that looks like the following:

Amazon SageMaker deep dive

Now we’re ready to start working with the data. The following list describes the files used in this exercise (these descriptions can be found in the sample notebook as well):

kenpomnew_active.csv: Contains the most recent available data for the current season from the front page of https://kenpom.com. For more details on this data, I recommend visiting the site.

KenPomHistoric_2002_2017.csv: Contains data from past seasons from the front page of https://kenpom.com. Note that these statistics reflect teams at the end of each season.

historic_game_data.csv: Historic game results sourced from College Basketball Reference, from the 2010-2011 season through yesterday’s games. This file is updated nightly with new scores.

kenpom_teams.csv: Required for converting team names between the KenPom files and the game results from College Basketball Reference. “Team” in both of the KenPom files corresponds to the “kenpom_unique” column in this file.

All details and commentary about the code used in this blog can be found in the Jupyter Notebook provided. In this blog we’re just going to focus on specific sections of the notebook (notably data loading, building and evaluating models, and creating a hosted endpoint), but you can reference the sample notebook for full details.

One necessary step in the first code block (Step 0) is to change the “mybucket” variable to reference the S3 bucket you created at the start of this exercise. This allows you to save your model files in your S3 bucket. Note that the rest of the code will fail if you skip this step.

The next few blocks of code deal with preprocessing the data – feel free to adapt that section to fit your use case or copy it directly. One key aspect is the join with the kenpom_teams.csv file (teams_data reference) shown in the following code. This converts the team names to be joined across the College Basketball Reference and Kenpom files. Skipping this step will lead to significant data loss when joining KenPom efficiency data with historic game outcomes since the team names don’t match across files. Another optional pre-processing step taken was to randomize home and away teams. Home teams are more likely to win (over 64% in the dataset) and the data comes in an away-then-home structure inherently. This randomization had a positive impact on the model area under the curve (AUC).
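As a rough illustration of those two preprocessing steps, here is a minimal pandas sketch on toy data. The column names (`kenpom_unique`, `cbb_ref_name`, `away_team`, and so on) and team mappings are illustrative assumptions, not the notebook’s exact schema:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for kenpom_teams.csv and historic_game_data.csv;
# column names and values here are illustrative only.
teams_data = pd.DataFrame({
    "kenpom_unique": ["Michigan St.", "St. John's"],
    "cbb_ref_name": ["Michigan State", "St. John's (NY)"],
})
games = pd.DataFrame({
    "away_team": ["Michigan State"],
    "home_team": ["St. John's (NY)"],
    "away_score": [70],
    "home_score": [75],
})

# Map College Basketball Reference names onto the KenPom naming
# scheme so the efficiency stats can be joined without losing rows.
name_map = dict(zip(teams_data["cbb_ref_name"], teams_data["kenpom_unique"]))
games["away_team"] = games["away_team"].map(name_map)
games["home_team"] = games["home_team"].map(name_map)

# Randomize which team appears first so the model can't exploit the
# away-then-home ordering (home teams win over 64% of the time).
rng = np.random.default_rng(0)
flip = rng.random(len(games)) < 0.5
games["team1"] = np.where(flip, games["home_team"], games["away_team"])
games["team2"] = np.where(flip, games["away_team"], games["home_team"])
print(games[["team1", "team2"]])
```

The notebook performs the equivalent join against the real files; the point is that every game row ends up keyed by KenPom-style team names with a randomized team order.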

After completing data preprocessing, performing some exploratory data analysis (EDA), and splitting the data into training, validation, and testing, we’re ready for modeling.

In the “Step 4 – Modeling!” section of the Amazon SageMaker notebook, we first reference the data used for training and validation, and then specify the location of the algorithm container used for modeling (in this case, the built-in Amazon SageMaker XGBoost algorithm).

In this exercise we’re building two models: one to predict the expected difference in points between the two teams (referred to as the “difference” model) and another to predict the total points scored (referred to as the “total” model). Creating both a difference and total model implicitly generates an expected score and also allows you to generate an implied probability of winning, which creates fun opportunities for displaying the output. For example, if the difference model predicts 10 and the total model predicts 150, the implied score of the game would be 80-70.
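The arithmetic behind that implied score is simple: the winner scores (total + difference) / 2 points and the loser scores (total - difference) / 2. A quick sketch:

```python
def implied_score(diff_pred: float, total_pred: float) -> tuple:
    """Convert difference and total predictions into an implied score.

    The two predictions pin down both scores: winner gets
    (total + diff) / 2 and loser gets (total - diff) / 2.
    """
    winner = (total_pred + diff_pred) / 2
    loser = (total_pred - diff_pred) / 2
    return winner, loser

# The example from the text: difference 10, total 150 -> 80-70.
print(implied_score(10, 150))  # (80.0, 70.0)
```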

We’ll start the modeling process with the difference model and perform the exact same steps with the total model.

Next, we define parameters for the Amazon SageMaker training run. This is a small dataset, so only one instance is required. To see how this works with a larger dataset, see this example for training XGBoost on multiple instances with Amazon SageMaker. We use the IAM role defined during the Amazon SageMaker setup to allow our notebook instance to access the training and validation data in Amazon S3, and to upload the model output to the same S3 bucket.

Finally, the fit function is called, which actually spins up the required instance and executes the training job. This should take roughly 6-8 minutes total with this amount of data using XGBoost.
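To make the moving pieces concrete, a training run boils down to a handful of settings: S3 input and output locations, instance count and type, and the XGBoost hyperparameters. The sketch below assembles those settings as plain data; the bucket name, prefix, and hyperparameter values are assumptions for illustration, not the notebook’s exact configuration:

```python
# Sketch of the configuration behind a built-in XGBoost training run.
# Bucket/prefix names and hyperparameter values are illustrative.
def build_training_config(bucket: str, prefix: str) -> dict:
    return {
        # Channels read from, and output written to, your own bucket.
        "train_uri": f"s3://{bucket}/{prefix}/train/",
        "validation_uri": f"s3://{bucket}/{prefix}/validation/",
        "output_uri": f"s3://{bucket}/{prefix}/output/",
        # Regression objective with RMSE, matching the evaluation
        # metric reported at the end of training.
        "hyperparameters": {
            "objective": "reg:linear",
            "eval_metric": "rmse",
            "num_round": 200,
            "max_depth": 5,
            "eta": 0.1,
        },
        # Small dataset, so a single training instance suffices.
        "instance_count": 1,
        "instance_type": "ml.m4.xlarge",
    }

config = build_training_config("wp-cbb-mods", "difference-model")
print(config["train_uri"])
```

In the notebook these values are passed to the SageMaker Python SDK’s estimator, whose fit call spins up the instance and runs the training job.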

When the run is completed, you should see output that shows model statistics including the score on the validation set based on the eval_metric that was chosen, in this case RMSE.

After repeating the same steps for the total model, it’s time to deploy our models with hosted Amazon SageMaker endpoints! We’ll use these endpoints to generate predictions on our test set to ensure that the models generalize to new data, and (as seen on the website) they can power user-facing applications. Endpoint creation is simple with Amazon SageMaker: a single function call stands up a persistent endpoint.

A key difference between the endpoint creation and training is that the endpoint will run until it is explicitly deleted, while the modeling instance shuts down immediately after the model run is completed. To avoid an unexpected bill when you’re done using your Amazon SageMaker endpoint, make sure you delete it using the console or the delete_endpoint function. If you delete these endpoints, any applications built on top of them will no longer be supported.

The following code block gets data in the proper format (csv in this case) and generates predictions from the test set using the Amazon SageMaker endpoint.
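As a sketch of the serialization step: the built-in XGBoost endpoint accepts text/csv payloads with one comma-separated row per observation, no header, and no label column. The helper and feature values below are illustrative:

```python
import numpy as np

def to_csv_payload(rows: np.ndarray) -> str:
    """Serialize feature rows as a text/csv payload: one
    comma-separated line per observation, no header or label."""
    return "\n".join(",".join(f"{v:g}" for v in row) for row in rows)

# Two hypothetical, already-scaled feature rows from the test set.
X_test = np.array([[0.91, 0.47, 0.88], [0.35, 0.62, 0.51]])
payload = to_csv_payload(X_test)
print(payload)
```

In the notebook, a payload like this is sent to the endpoint (for example, via the SageMaker predictor or the runtime’s invoke_endpoint call with ContentType text/csv), and the response contains one prediction per input row.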

A series of functions have been created in the notebook to evaluate the performance of the model, including generating a confusion matrix and accuracy percentage for win-loss binary predictions and R squared and RMSE scores for the difference and total models.
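Those evaluation ideas can be sketched with plain NumPy: a game counts as a correct win-loss call when the predicted and actual point differences share a sign, while RMSE and R squared measure the difference model’s numeric error. The toy arrays below stand in for real test-set predictions:

```python
import numpy as np

def evaluate(diff_pred, diff_true):
    """Win-loss accuracy plus RMSE and R^2 for the difference model.

    A game is called correctly when the predicted and actual point
    differences have the same sign."""
    diff_pred = np.asarray(diff_pred, dtype=float)
    diff_true = np.asarray(diff_true, dtype=float)
    accuracy = np.mean(np.sign(diff_pred) == np.sign(diff_true))
    rmse = np.sqrt(np.mean((diff_pred - diff_true) ** 2))
    ss_res = np.sum((diff_true - diff_pred) ** 2)
    ss_tot = np.sum((diff_true - diff_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return accuracy, rmse, r2

# Toy predictions: three of four games called correctly.
acc, rmse, r2 = evaluate([5, -3, 12, 2], [8, -1, 10, -4])
print(acc)  # 0.75
```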

The overall accuracy of the model’s win-loss predictions is 77.7%. Note that because the data reflects end-of-season values, this number is inflated: games early in the season are informed by end-of-season aggregate statistics, which introduces bias. Historic data was not available point-in-time (only end of season), but on data pulled from this year (February/March 2018), overall accuracy was 72% using point-in-time data compared to 74% using end-of-season data, so this impact appears to be relatively small.

This also means that the model does not account for new information well, since it uses season averages. As mentioned in the introduction, one of Virginia’s best players was injured after the end of the season and will miss the NCAA tournament. Season averages will not account for his absence, so the model will likely overrate Virginia’s expected performance. When making selections, it’s important to supplement this model with your own judgment and an understanding of the model’s limitations.

You should expect the accuracy of the model to be slightly different based on new data piping through and the use of different training and test sets. However, without significant changes it should be within a few percentage points of 77.7%.

There are many other improvements that could be made to this model; some options include the following:

Temporal performance. How has a team performed in the most recent week/month, etc.? Since the efficiency statistics are season-long, injuries and changes in eligibility that dictate changes in performance will be captured slowly over time in the aggregate, season-long data. Additionally, using end of season data introduces bias when predicting performance on games earlier in the season.

Neutral game identification. NCAA tournament games have been identified as neutral court, as well as all conference tournament games starting on March 6 this year, but many are not properly identified. In addition, some conference tournaments have home court advantages while others do not. Proper home/away/neutral court identification would undoubtedly improve model performance.

In-season rescaling. Currently teams are rescaled from 0-1 across all years included in the dataset (2011-2018 seasons). This could be altered to just rescale within a given season.

Testing other algorithms.

Appending additional data.

Feel free to be creative and add any solutions that help the model performance!

Model evaluation

The raw model file was pulled down from Amazon S3 to plot the feature importance for both models. It is often extremely useful to plot feature importance to check whether the model matches your intuition about influential variables. In the following chart, the numeric labels represent the column order of the dataset (starting from 1). “2” and “12” reflect the overall efficiency of the two teams (the difference between a team’s offensive and defensive points per 100 possessions), so it makes sense that these are the two most influential predictors in the difference model.

In the total model, feature importance is distributed more evenly, and the two most influential variables in the difference model fall outside the top five in the total model. This is logical, given that efficiency doesn’t necessarily translate to the magnitude of points scored. Plotting feature importance in this fashion can help train intuition, and can also be used for debugging when predictions look illogical.

When evaluating errors, it became quite clear that the difference prediction doesn’t have a linear relationship with a team’s likelihood of winning. The following chart shows the observed winning percentage of games in the test set based on the predicted difference from the difference model, and very closely resembles a sigmoid function.

The next section of code in the sample notebook generates a logistic regression model that converts predicted differences into a team’s win probability, which can be useful when making selections in your bracket.

Converting the predicted differences into probabilities also allows for the generation of the area under the curve, or AUC score, which in this case is 86.1%.
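A sketch of that conversion: fit a one-variable logistic regression mapping predicted difference to win probability, then compute a rank-based AUC. The data below is synthetic, generated from an assumed sigmoid purely for illustration; the notebook fits against real test-set outcomes:

```python
import numpy as np

# Synthetic (predicted difference, actual win) pairs standing in for
# the notebook's test-set data; the 0.2 slope is an assumption.
rng = np.random.default_rng(1)
diffs = rng.normal(0, 10, 500)
true_prob = 1 / (1 + np.exp(-0.2 * diffs))
wins = (rng.random(500) < true_prob).astype(float)

# One-variable logistic regression fit by plain gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(w * diffs + b)))
    w -= 0.01 * np.mean((p - wins) * diffs)
    b -= 0.01 * np.mean(p - wins)

def win_probability(diff):
    """Map a predicted point difference to a win probability."""
    return 1 / (1 + np.exp(-(w * diff + b)))

# Rank-based AUC: how often a winner outscores a loser.
scores = win_probability(diffs)
pos, neg = scores[wins == 1], scores[wins == 0]
auc = np.mean(pos[:, None] > neg[None, :])
print(round(float(auc), 3))
```

The fitted curve recovers the sigmoid shape seen in the observed winning percentages, and the same probabilities drive the AUC calculation.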

2018 March Madness prediction

The final section of the notebook simulates the NCAA tournament based on your dataset, allowing you to simulate every possible matchup for a given team, simulate a specific matchup between two teams in the tournament, and finally simulate the most likely outcome of this year’s tournament based on your model. Of course, upsets will happen in the tournament; some will have extremely long odds while others may be more of a toss-up, and this output should help guide your picks. In addition to printing the results, this code block also saves the simulated NCAA tournament in CSV format to your S3 bucket for potential external use.
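At a high level, the simulation simply advances whichever team the difference model favors in each matchup, round by round. A toy sketch, where `ratings` and `predict_diff` are hypothetical stand-ins for calls to the deployed difference endpoint:

```python
# Illustrative team ratings; in practice the difference model's
# predictions come from the SageMaker endpoint, not a lookup table.
ratings = {"Virginia": 33.4, "Kansas": 26.1, "Duke": 28.9, "Xavier": 22.3}

def predict_diff(team1, team2):
    # Hypothetical stand-in for the difference model's prediction.
    return ratings[team1] - ratings[team2]

def simulate(bracket):
    """Advance the favored team in each matchup until one remains."""
    while len(bracket) > 1:
        bracket = [
            a if predict_diff(a, b) > 0 else b
            for a, b in zip(bracket[::2], bracket[1::2])
        ]
    return bracket[0]

print(simulate(["Virginia", "Xavier", "Duke", "Kansas"]))  # Virginia
```

Swapping the favored-team rule for a draw against the win probabilities from the logistic regression would turn this deterministic bracket into a Monte Carlo simulation.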

This model uses season-long team averages and will subsequently overrate teams that have experienced recent drastic events (such as injuries & transfers) that are likely to change their expected future performance. This is especially relevant for the 2018 tournament. Virginia is the overall top seed in the tournament, and after the season ended one of their best players was lost for the rest of the year due to injury. This will likely overrate Virginia’s expected performance in the tournament, which is important to keep in mind when making selections!

Conclusion

To recap, we walked you through creating working models, with hosted endpoints, that generate the expected difference in points between two teams as well as the total number of points scored in a game, which together yield an implied score. We then built a logistic regression model that transforms those difference predictions into the likelihood that a team wins the game. And we’ve done all of this in a single Jupyter Notebook!

About the Author

Wesley Pasfield is a Data Scientist with AWS Professional Services. He has an MS from Northwestern and has worked on problems across numerous industries including sports, gaming, consumer electronics, and retail. He is currently working to help customers enable machine learning and artificial intelligence use cases on AWS.