Installation

The first step in getting started with Deep Learning is setting up an environment. I covered setting up Jupyter on an AWS EC2 instance in my past post. We’ll install two additional libraries for Python: tensorflow and keras. It’s also useful to spin up a larger machine, such as a t2.xlarge, when working on deep learning problems. Here are the steps I used to set up a Deep Learning environment on EC2. Note that this configuration does not support GPU acceleration.

# Jupyter setup

sudo yum install -y python36

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

sudo python36 get-pip.py

pip3 install --user jupyter

# Deep Learning set up

pip3 install --user tensorflow

pip3 install --user keras

pip3 install --user matplotlib

pip3 install --user pandas

# Launch Jupyter

jupyter notebook --ip Your_AWS_Private_IP

Once you have connected to Jupyter, you can test your installation by running the following commands:

import keras

keras.__version__

The output should print that the TensorFlow backend is being used.

Classification with Keras

To get started with deep learning, we’ll build a binary classifier that predicts which users are most likely to purchase a specific game, given past purchases. We’ll use the data set that I presented in my post on recommender systems. Each row in the data set contains a label indicating if the player purchased the game, and a list of other games with values of 0 or 1 indicating purchases of other titles. The goal is to predict which users will purchase the game. The complete notebook for the code presented in this section is available here.

The general process for building models with Keras is:

1. Set up the structure of the model
2. Compile the model
3. Fit the model
4. Evaluate the model

I’ll discuss each of these steps in more detail below. First, we need to include the necessary libraries for keras and plotting:

import pandas as pd

import matplotlib.pyplot as plt

import tensorflow as tf

import keras

from keras import models, layers

keras.__version__

Next, we download the data set and create training and test data sets. I’ve held out 5000 samples that we’ll use as a holdout data set. For the training data set, I split the data frame into input variables (x) and labels (y).



df = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")

train = df[5000:]

test = df[:5000]

x = train.drop(['label'], axis=1)

y = train['label']

Now we can create a model to fit the data. The model below uses three layers of fully-connected neurons with relu activation functions. The input structure is specified in the first layer, which needs to match the width of the input data. The output is specified as a single neuron with a sigmoid activation, since we are performing binary classification.

model = models.Sequential()

model.add(layers.Dense(64, activation='relu', input_shape=(10,)))

model.add(layers.Dropout(0.1))

model.add(layers.Dense(64, activation='relu'))

model.add(layers.Dropout(0.1))

model.add(layers.Dense(64, activation='relu'))

model.add(layers.Dense(1, activation='sigmoid'))
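As a quick sanity check on the structure above, the number of trainable parameters can be worked out by hand. The sketch below is plain Python and does not call Keras; the helper name dense_params is my own. It assumes the standard fully-connected layout, with one weight per input-output pair plus one bias per output, and relies on Dropout layers adding no parameters.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer widths from the model above: 10 inputs -> 64 -> 64 -> 64 -> 1
widths = [10, 64, 64, 64, 1]
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(per_layer)

print(per_layer)  # [704, 4160, 4160, 65]
print(total)      # 9089
```

The total should match the parameter count reported by model.summary().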

Next, we specify how to optimize the model. We’ll use rmsprop for the optimizer and binary_crossentropy for the loss function. Instead of using accuracy for the metric, we’ll use ROC AUC since the data set has a large class imbalance. In order to use this metric, we can use the auc function provided by tensorflow.

# Note: tf.metrics.auc and keras.backend.get_session are TensorFlow 1.x APIs

def auc(y_true, y_pred):
    auc = tf.metrics.auc(y_true, y_pred)[1]
    keras.backend.get_session().run(
        tf.local_variables_initializer())
    return auc

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=[auc])
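To see why ROC AUC is a more informative metric than accuracy when the classes are imbalanced, it helps to compute both by hand on a small example. The sketch below is plain Python rather than the TensorFlow implementation, and the labels and scores are made up for illustration. It uses the rank interpretation of ROC AUC: the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of correct predictions after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

# Hypothetical data: 2 buyers out of 10 players
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that gives every player the same low score
constant = [0.1] * 10
# A model that ranks buyers above most non-buyers, but never crosses 0.5
ranked = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.45, 0.3]

print(accuracy(y_true, constant), roc_auc(y_true, constant))
print(accuracy(y_true, ranked), roc_auc(y_true, ranked))
```

Both sets of scores give an accuracy of 0.8, but the AUC is 0.5 for the constant scores and about 0.91 for the ranked ones, so only AUC shows that the second model separates buyers from non-buyers.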

The last step is to train the model. The code below shows how to fit the model using the training data set, 100 training epochs with a batch size of 100, and a validation split of 20%.

history = model.fit(x,

y,

epochs=100,

batch_size=100,

validation_split = .2,

verbose=0)

The progress of the model will be displayed during training if verbose is set to 1 or 2. To plot the results, we can use matplotlib to display the loss values of the training and validation data sets:

loss = history.history['loss']

val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.figure(figsize=(10,6))

plt.plot(epochs, loss, 'bo', label='Training loss')

plt.plot(epochs, val_loss, 'b', label='Validation loss')

plt.legend()

plt.show()

The resulting plot is shown below. While the loss value for the training data set continued to decrease with more epochs, the loss on the validation data set flattened out after about 10 epochs.

Plotting the loss values for the binary classifier.
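One way to read a curve like this programmatically is to look for the epoch with the lowest validation loss. The sketch below uses made-up loss values rather than the output of the model above, and the helper name is my own; with a real run, the list would come from history.history['val_loss'].

```python
def best_epoch(val_loss):
    """Return the 1-based epoch with the lowest validation loss."""
    lowest = min(range(len(val_loss)), key=lambda i: val_loss[i])
    return lowest + 1

# Hypothetical validation losses that flatten out and then creep upward
val_loss = [0.42, 0.36, 0.33, 0.31, 0.30, 0.30, 0.31, 0.32]

print(best_epoch(val_loss))  # 5 (the first of the tied minima)
```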

We can also plot the value of the AUC metric after each epoch, as shown below. Unlike the loss value, the AUC metric of the model on the validation data set continued to improve with additional training.

Plotting the AUC metric for the binary classifier.

A final step is evaluating the performance of the model on the holdout data set. The loss value and AUC metric can be calculated for the holdout data using the code shown below, which results in an AUC of ~0.82.

x_test = test.drop(['label'], axis=1)

y_test = test['label']

results = model.evaluate(x_test, y_test, verbose = 0)

results

This section discussed building a simple classifier using a deep learning model with the Keras framework. Generally, deep learning won’t perform as well as XGBoost on shallow learning problems like this, but it’s still a useful approach to explore. In the next section, I discuss how custom loss functions can be used to improve model training.

Custom Loss Functions

One of the great features of deep learning is that it can be applied to both deep problems with perceptual data, such as audio and video, and shallow problems with structured data. For shallow learning (classic ML) problems, you can often see improvements over shallow approaches, such as XGBoost, by using a custom loss function that provides a useful signal.

However, not all shallow problems can benefit from deep learning. I’ve found custom loss functions to be useful when building regression models that need to create predictions for data with different orders of magnitude, such as predicting housing prices in an area where the values can range significantly. To show how this works in practice, we’ll use the Boston housing data set provided by Keras.

This data set includes housing prices for a suburb in Boston during the 1970s. Each record has 13 attributes that describe properties of the home, and there are 404 records in the training data set and 102 records in the test data set. In Keras, the data set can be loaded as follows: boston_housing.load_data(). The labels in the data set represent the prices of the homes, in thousands of dollars. The prices range from $5k to $50k, and the distribution of prices is shown in the histogram on the left. The original data set has values with similar orders of magnitude, so custom loss functions may not be useful for fitting this data. The histogram on the right shows a transformation of the labels which may benefit from using a custom loss.

The Boston data set with original prices and the transformed prices.

To transform the data, I converted the labels back into absolute prices, squared the result, and then divided by a large factor. This results in a data set where the difference between the highest and lowest prices is 100x instead of 10x. We now have a prediction problem that can benefit from the use of a custom loss function. The Python code to generate these plots is shown below.

# Original Prices

plt.hist(y_train)

plt.title("Original Prices")

plt.show()

# Transformed Prices

plt.hist((y_train*1000)**2/2500000)

plt.title("Transformed Prices")

plt.show()
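The 10x and 100x figures can be checked with a little arithmetic on the endpoints of the price range. The sketch below is plain Python; the function name transform is my own, and it applies the same expression used in the plotting code to the lowest and highest prices mentioned in the text.

```python
def transform(price_in_thousands):
    """Convert to dollars, square, and divide by 2,500,000."""
    return (price_in_thousands * 1000) ** 2 / 2500000

low, high = 5, 50  # price range in thousands of dollars, from the text

print(high / low)                        # 10.0
print(transform(low), transform(high))   # 10.0 1000.0
print(transform(high) / transform(low))  # 100.0
```

Because the prices are squared, the ratio between the highest and lowest labels is also squared, which is where the jump from 10x to 100x comes from.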

Loss Functions in Keras

Keras includes a number of useful loss functions that can be used to train deep learning models. Approaches such as mean_absolute_error() work well for data sets where the values are of roughly the same order of magnitude. There are also functions such as mean_squared_logarithmic_error() which may be a better fit for the transformed housing data. Here are some of the loss functions provided by Keras:

mean_absolute_error()

mean_absolute_percentage_error()

mean_squared_error()

mean_squared_logarithmic_error()

To really understand how these work we’ll need to jump into the Python losses code. The first loss function we’ll explore is the mean squared error, defined below. This function computes the difference between predicted and actual values, squares the result (which makes all of the values positive), and then calculates the mean value. Note that the function uses backend operations that operate on tensor objects rather than Python primitives.

def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)

The next built-in loss function we’ll explore calculates the error based on the difference between the natural log of the predicted and target values. It is defined here and shown below. The function uses the clip operation to make sure that negative values are not passed to the log function, and adding 1 to the clip result makes sure that all log transformed inputs will have non-negative results. This function is similar to the one we will define.

def mean_squared_logarithmic_error(y_true, y_pred):
    first_log = K.log(K.clip(y_pred, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true, K.epsilon(), None) + 1.)
    return K.mean(K.square(first_log - second_log), axis=-1)
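To make the difference between these two losses concrete, the sketch below reimplements them in plain Python, without the Keras backend, and applies them to two made-up predictions that are each 10% too high: one for a small label and one for a large label. The function names, the values, and the eps default (chosen to match the usual K.epsilon() value of 1e-7) are assumptions for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def msle(y_true, y_pred, eps=1e-7):
    """Mean squared difference of log(value + 1), clipping values below eps."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        first_log = math.log(max(p, eps) + 1.0)
        second_log = math.log(max(t, eps) + 1.0)
        total += (first_log - second_log) ** 2
    return total / len(y_true)

# Each prediction is 10% too high
small_true, small_pred = [10.0], [11.0]
large_true, large_pred = [1000.0], [1100.0]

print(mse(small_true, small_pred), mse(large_true, large_pred))
print(msle(small_true, small_pred), msle(large_true, large_pred))
```

The squared error for the large label is 10,000 times the squared error for the small label, while the two log-based errors are of similar size. This is why a log-based loss can be a better fit when labels span several orders of magnitude.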

The two custom loss functions we’ll explore are defined in the Python code segment below. The first function, mean log absolute error (MLAE), computes the absolute difference between the log transforms of the predicted and actual values, and then averages the result. Unlike the built-in function above, this approach does not square the errors. One other difference from the log function above is that this function applies an explicit scaling factor to the data, to transform the housing prices back to their original values (5,000 to 50,000) rather than (5, 50). This is useful because it reduces the impact of adding 1 to the predicted and actual values.

from keras import backend as K

# Mean Log Absolute Error

def MLAE(y_true, y_pred):
    first_log = K.log(K.clip(y_pred*1000, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true*1000, K.epsilon(), None) + 1.)
    return K.mean(K.abs(first_log - second_log), axis=-1)

# Mean Squared Log Absolute Error

def MSLAE(y_true, y_pred):
    first_log = K.log(K.clip(y_pred*1000, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true*1000, K.epsilon(), None) + 1.)
    return K.mean(K.square(first_log - second_log), axis=-1)

Like the Keras functions, the custom loss functions need to operate on tensor objects rather than Python primitives. In order to perform these operations, you need to get a reference to the backend using the from statement. In my system configuration, this returns a reference to tensorflow.

The second function computes the square of the log error, and is similar to the built-in function. The main difference is that I’m scaling the values, which is specific to the housing data set.
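The effect of the scaling factor can be checked with a little arithmetic. The sketch below is plain Python with a made-up helper; it compares how much the +1 offset shifts the log of a label at the low end of the price range (5, in thousands of dollars) before and after scaling by 1000.

```python
import math

def plus_one_shift(value):
    """How much adding 1 changes the log of a value."""
    return math.log(value + 1.0) - math.log(value)

unscaled = plus_one_shift(5.0)        # label in thousands of dollars
scaled = plus_one_shift(5.0 * 1000)   # label in dollars

print(round(unscaled, 4))  # 0.1823
print(round(scaled, 4))    # 0.0002
```

Without scaling, the offset changes the log of the smallest label by about 0.18, which is large relative to the differences the loss is trying to measure. After scaling, the shift is close to zero.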

Evaluating Loss Functions

We now have four different loss functions whose performance we want to evaluate on the original and transformed housing data sets. This section will walk through loading the data, compiling a model, fitting the model, and evaluating performance. The complete code listing for this section is available on GitHub.

After following the installation steps in the prior section, we’ll load the data set and apply our transformation to skew housing prices. The last two operations can be commented out to use the original housing prices.

# load the data set

from keras.datasets import boston_housing

(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

# transform the training and test labels

y_train = (y_train*1000)**2/2500000

y_test = (y_test*1000)**2/2500000

Next, we’ll create a Keras model for predicting housing prices. I’ve used the network structure from the sample problem in “Deep Learning with R”. The network includes two layers of fully-connected relu activated neurons, and an output layer with no transformation.

# The model as specified in "Deep Learning with R"

model = models.Sequential()

model.add(layers.Dense(64, activation='relu',

input_shape=(x_train.shape[1],)))

model.add(layers.Dense(64, activation='relu'))

model.add(layers.Dense(1))

To compile the model, we’ll need to specify an optimizer, loss function, and a metric. We’ll use the same metric and optimizer for all of the different loss functions. The code below defines a list of loss functions, and for the first iteration the model uses mean squared error.

# Compile the model, and select one of the loss functions

losses = ['mean_squared_error', 'mean_squared_logarithmic_error',
          MLAE, MSLAE]

model.compile(optimizer='rmsprop',
              loss=losses[0],
              metrics=['mae'])

The last step is to fit the model and then evaluate the performance. I used 100 epochs with a batch size of 5, and a 20% validation split. After training the model on the training data set, the performance of the model is evaluated using the mean absolute error on the test data set.

# Train the model with validation

history = model.fit(x_train,

y_train,

epochs=100,

batch_size=5,

validation_split = .2,

verbose=0)

# Calculate the mean absolute error

results = model.evaluate(x_test, y_test, verbose = 0)

results
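For a sense of scale, the sketch below works out how many records end up in each split and how many weight updates happen per epoch with these settings. It assumes Keras takes the validation samples from the end of the training arrays and truncates the split index to an integer, which is my understanding of validation_split rather than something verified here; the helper name is hypothetical.

```python
import math

def split_sizes(n_samples, validation_split, batch_size):
    """Return (training records, validation records, batches per epoch)."""
    n_train = int(n_samples * (1.0 - validation_split))
    n_val = n_samples - n_train
    steps = math.ceil(n_train / batch_size)
    return n_train, n_val, steps

# 404 training records in the Boston housing data set
print(split_sizes(404, 0.2, 5))  # (323, 81, 65)
```

With only about 81 validation records, the validation curves for this data set can be expected to be fairly noisy.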

After training the model, we can plot the results using matplotlib. The plot below shows the loss values for the training and validation data sets.

loss = history.history['loss']

val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.figure(figsize=(10,6))

plt.plot(epochs, loss, 'bo', label='Training loss')

plt.plot(epochs, val_loss, 'b', label='Validation loss')

plt.legend()

plt.show()

Loss values for the training and validation data sets.

I trained four different models with the different loss functions, and applied this approach to both the original housing prices and the transformed housing prices. The results for all of these different combinations are shown below.

Performance of the Loss Functions on the Housing Price Data Sets

On the original data set, applying a log transformation in the loss function actually increased the error of the model. This isn’t really surprising given that the data is somewhat normally distributed and within a single order of magnitude. For the transformed data set, the squared log error approach outperformed the mean squared error loss function. This indicates that custom loss functions may be worth exploring if your data set doesn’t work well with the built-in loss functions.

The model training histories for the four different loss functions on the transformed data set are shown below. Each model used the same error metric (MAE), but a different loss function. One surprising result was that the validation error was much higher for all of the loss functions that applied a log transformation.