Source: Wikimedia Commons

Predicting which NHL games are playoff matches using Python and the FeatureTools library.

One of the main time investments I’ve seen data scientists make when building data products is manually performing feature engineering. While tools such as auto-sklearn and Auto-Keras have been able to automate much of the model fitting process when building a predictive model, determining which features to use as input to the fitting process is usually a manual process. I recently started using the FeatureTools library, which enables data scientists to also automate feature engineering.

FeatureTools can be used on shallow (classic) machine learning problems, where data is available in a structured format. The library provides functionality for performing deep feature synthesis, which approximates the transformations that a data scientist would explore when performing feature engineering. The outcome of using this tool is that you can transform data from a narrow and deep representation to a shallow and wide representation.

I’ve found that this technique works best when you have many records per item that needs a prediction. For example, if you are predicting whether customers will churn, the input could be a collection of session events for each customer. If you only have a single record per user, then deep feature synthesis won’t be very effective. To show how this approach works, I’ll use the NHL data set available on Kaggle. This data set includes a table of game records, and a table of play records that describe each game in more detail. The goal of the predictive model I’m building is to identify which games are playoff matches, based on the plays made during the game. With no domain knowledge applied, I was able to build a logistic regression classifier with high accuracy (94%) at predicting which games were playoff games. The complete Python notebook is available on github here.

The remainder of this post walks through the notebook, showing how to translate the provided Kaggle tables into an input we can use for the FeatureTools library. The first step is to load the necessary libraries. We’ll use pandas to load the tables, framequery to manipulate the data frames, hashlib to translate strings to integers, featuretools to perform deep feature synthesis, and sklearn for model fitting.

import pandas as pd
import framequery as fq
import hashlib
import featuretools as ft
from featuretools import Feature
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

Next, we’ll load the data into pandas data frames and drop string fields that will not be used for building predictions. The result is two data frames: game_df specifies if a game was a regular or playoff match, and plays_df has details about the plays made in each game.

# game data
game_df = pd.read_csv("game.csv")

# play data
plays_df = pd.read_csv("game_plays.csv")

# drop some of the string type fields
plays_df = plays_df.drop(['secondaryType', 'periodType',
                          'dateTime', 'rink_side'], axis=1).fillna(0)

# convert the remaining strings to integer types via hashing
plays_df['event'] = plays_df['event'].apply(hash)
plays_df['description'] = plays_df['description'].apply(hash)
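One caveat with the snippet above: Python’s built-in hash() salts string hashes per process (controlled by PYTHONHASHSEED), so the integer codes will differ between runs of the notebook. Since hashlib is already imported, a deterministic alternative can be sketched as follows; note that stable_hash is a hypothetical helper for illustration, not part of the original notebook:

```python
import hashlib

def stable_hash(value):
    # Deterministic string-to-integer mapping: unlike the builtin hash(),
    # an MD5 digest does not change between Python processes.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** 31)

# The same input always maps to the same integer, run after run.
code = stable_hash("Goal")
```

You could then apply it the same way, e.g. plays_df['event'].apply(stable_hash), which keeps the engineered features reproducible across sessions.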

The plays data is in a narrow but deep format, meaning that each game is composed of a number of different plays with only a few features. Our goal is to transform this data into a shallow but wide format, where each game is described by a single row with hundreds of different attributes. Here are the input data sets.
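To make the narrow-to-wide idea concrete, here is a hand-rolled sketch of the kind of per-game aggregation that deep feature synthesis automates, using pandas on a toy plays table. The column names loosely mirror the Kaggle schema, and the aggregation choices (count, mean, max) are illustrative assumptions, not the exact primitives dfs would generate:

```python
import pandas as pd

# Toy "narrow and deep" plays table: many rows per game, few columns.
plays = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2],
    'event':   [10, 20, 10, 30, 10],
    'period':  [1, 2, 3, 1, 2],
})

# "Shallow and wide": one row per game, several aggregate features per column.
wide = plays.groupby('game_id').agg(['count', 'mean', 'max'])

# Flatten the resulting MultiIndex columns, e.g. ('event', 'mean') -> 'event_mean'.
wide.columns = ['_'.join(col) for col in wide.columns]
```

FeatureTools does this automatically, across related tables and with stacked feature primitives, which is how a handful of play columns expands into hundreds of per-game attributes.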