Andrew Beam does a great job showing that small datasets are not off limits for current neural net methods. If you use the regularisation methods at hand – ANNs is entirely possible to use instead of classic methods.

Let’s see how this holds up on up on some benchmark datasets.

%matplotlib inline import numpy as np import pandas as pd import matplotlib.pyplot as plt plt.style.use('ggplot') seed = 123456 np.random.seed(seed)

Let’s start with the iris dataset that you nicely can pull with the pandas read_csv function right of the internets.

target_variable = 'species' df = ( pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv') # Rename columns to lowercase and underscores .pipe(lambda d: d.rename(columns={ k: v for k, v in zip( d.columns, [c.lower().replace(' ', '_') for c in d.columns] ) })) # Switch categorical classes to integers .assign(**{target_variable: lambda r: r[target_variable].astype('category').cat.codes}) )

There’s three classes and 150 datapoints. Not really B̫͕̟̱I̜̼͈̖̫G͉ d̙͕a͇͍͕̝̟t̪̝̹̻͉̭ͅa.

df[target_variable].value_counts().sort_index().plot.bar()

We create a feature matrix X and a target y from the Pandas dataframe. And since an ANN needs the features to be normalized, let’s do some min-max-scaling before anything else.

y = df[target_variable].values X = ( # Drop target variable df.drop(target_variable, axis=1) # Min-max-scaling (only needed for the DL model) .pipe(lambda d: (d-d.min())/d.max()).fillna(0) .as_matrix() )

We split into a training set and a test set.

from sklearn.metrics import accuracy_score from sklearn.model_selection import train_test_split, cross_val_score X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.33, random_state=seed )

Import some keras goodness (and perhaps run pip install keras first if you need it).

from keras.models import Sequential from keras.callbacks import EarlyStopping, ModelCheckpoint from keras.layers import Dense, Activation, Dropout from keras import optimizers

And set up a thee layer deep and 128 wide net. Nothing fancy, not even sure this would qualify as deep learning – but throw in some dropout between them to help it to not overfit.

Learning rate for the optimization method Adam might be something to tune on other datasets but here 0.001 seems to work nicely.

m = Sequential() m.add(Dense(128, activation='relu', input_shape=(X.shape[1],))) m.add(Dropout(0.5)) m.add(Dense(128, activation='relu')) m.add(Dropout(0.5)) m.add(Dense(128, activation='relu')) m.add(Dropout(0.5)) m.add(Dense(len(np.unique(y)), activation='softmax')) m.compile( optimizer=optimizers.Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'] )

EarlyStopping helps us stop the training when the validation set is not improving any more – which helps us avoid overfitting. And to keep the checkpoint just before overfitting occurs, ModelCheckpoints let’s us save the best model before decline in validation set performance.

m.fit( # Feature matrix X_train, # Target class one-hot-encoded pd.get_dummies(pd.DataFrame(y_train), columns=[0]).as_matrix(), # Iterations to be run if not stopped by EarlyStopping epochs=200, callbacks=[ # Stop iterations when validation loss has not improved EarlyStopping(monitor='val_loss', patience=25), # Nice for keeping the last model before overfitting occurs ModelCheckpoint( 'best.model', monitor='val_loss', save_best_only=True, verbose=1 ) ], verbose=2, validation_split=0.1, batch_size=256, )

# Load the best model m.load_weights("best.model") # Keep track of what class corresponds to what index mapping = ( pd.get_dummies(pd.DataFrame(y_train), columns=[0], prefix='', prefix_sep='') .columns.astype(int).values ) y_test_preds = [mapping[pred] for pred in m.predict(X_test).argmax(axis=1)]

Now we can assess the performance on the test set. Below is a confusion matrix showing all predictions held up to reality.

pd.crosstab( pd.Series(y_test, name='Actual'), pd.Series(y_test_preds, name='Predicted'), margins=True )

Predicted 0 1 2 All Actual 0 18 0 0 18 1 0 16 0 16 2 0 0 16 16 All 18 16 16 50

print 'Accuracy: {0:.3f}'.format(accuracy_score(y_test, y_test_preds))

Accuracy: 1.000

Actually a perfect score. We will now do the same with an good old xgboost ( conda install xgboost ) with the nice sklearn api.

Finding the right hyperparameters is a task well suited for an Bayesian approach that can test the alternatives in an effective way without any gradient. GridSearch and such takes alot of time – this way we instead give it a parameter space and a “budget”. It will then in it’s most cost effective way test the hyperparameters of xgboost under those constraints.

from xgboost.sklearn import XGBClassifier params_fixed = { 'objective': 'binary:logistic', 'silent': 1, 'seed': seed, } # The space to search space = { 'max_depth': (1, 5), 'learning_rate': (10**-4, 10**-1), 'n_estimators': (10, 200), 'min_child_weight': (1, 20), 'subsample': (0, 1), 'colsample_bytree': (0.3, 1) } reg = XGBClassifier(**params_fixed) def objective(params): """ Wrap a cross validated inverted `accuracy` as objective func """ reg.set_params(**{k: p for k, p in zip(space.keys(), params)}) return 1-np.mean(cross_val_score( reg, X_train, y_train, cv=5, n_jobs=-1, scoring='accuracy') )

For this we use skopt ( pip install scikit-optimize ). I’ve given it 50 iterations to explore the hyperparameter space. Might be some more performance to squeeze out, but probably not.

from skopt import gp_minimize res_gp = gp_minimize(objective, space.values(), n_calls=50, random_state=seed) best_hyper_params = {k: v for k, v in zip(space.keys(), res_gp.x)} print "Best accuracy score =", 1-res_gp.fun print "Best parameters =", best_hyper_params

Best accuracy score = 0.96 Best parameters = {'colsample_bytree': 1.0, 'learning_rate': 0.10000000000000001, 'min_child_weight': 5, 'n_estimators': 45, 'subsample': 1, 'max_depth': 5}

from skopt.plots import plot_convergence plot_convergence(res_gp)

Now let’s fix these hyperparameters and evaluate on the test set. This is exactly the same test set as Keras got to be clear.

params = best_hyper_params.copy() params.update(params_fixed) clf = XGBClassifier(**params)

clf.fit(X_train, y_train) y_test_preds = clf.predict(X_test)

pd.crosstab( pd.Series(y_test, name='Actual'), pd.Series(y_test_preds, name='Predicted'), margins=True )

Predicted 0 1 2 All Actual 0 18 0 0 18 1 0 15 1 16 2 0 2 14 16 All 18 17 15 50

print 'Accuracy: {0:.3f}'.format(accuracy_score(y_test, y_test_preds))

Accuracy: 0.940

The deep (?) net got all datapoints right while xgboost missed three of them. On the other hand if you change the seed and rerun the code it might as well be xgboost coming up on top so I wouldn’t read to much into it.

Let’s generalize the code above so that we can plug in any dataset of choosing and see if this holds for harder problems. While we’re at it, I’ve added a boostrap on the accuracy statistic to give a feel about the uncertanty. I’ve put the code in this gist since it’s more or less just a repetition of the code above.

Telecom churn dataset (n=2325)

compare_on_dataset( 'https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv?cm_mc_uid=06267660176214972094054&cm_mc_sid_50200000=1497209405&cm_mc_sid_52640000=1497209405', target_variable='churn', lr=0.0005, patience=5 )

ANN

Predicted 0 1 All Actual 0 1478 270 1748 1 218 359 577 All 1696 629 2325

Accuracy: 0.790 Boostrapped accuracy 95 % interval 0.770223752151 0.809810671256

Xgboost

Predicted 0 1 All Actual 0 1563 185 1748 1 265 312 577 All 1828 497 2325

Accuracy: 0.806 Boostrapped accuracy 95 % interval 0.78743545611 - 0.825301204819

Churn is a bit harder problem, but both methods do well though.

Three class wine dataset (n=59)

compare_on_dataset( 'https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv', target_variable='wine', lr=0.001 )

ANN

Predicted 0 1 2 All Actual 0 18 1 0 19 1 0 19 0 19 2 0 0 21 21 All 18 20 21 59

Accuracy: 0.983 Boostrapped accuracy 95 % interval 0.931034482759 1.0

Xgboost

Predicted 0 1 2 All Actual 0 19 0 0 19 1 0 19 0 19 2 0 0 21 21 All 19 19 21 59

Accuracy: 1.000 Boostrapped accuracy 95 % interval 1.0 1.0

A very easy dataset that both methods have no problem with. Since the sample size is so small the boostrap will not be much of use to us here.

German Credit Data (n=1000)

compare_on_dataset( 'https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/german_credit.csv', target_variable='creditability', lr=0.001, patience=5 )

ANN

Predicted 0 1 All Actual 0 43 45 88 1 28 214 242 All 71 259 330

Accuracy: 0.779 Boostrapped accuracy 95 % interval 0.727272727273 0.830303030303

Xgboost

Predicted 0 1 All Actual 0 37 51 88 1 25 217 242 All 62 268 330

Accuracy: 0.770 Boostrapped accuracy 95 % interval 0.715151515152 0.824242424242

So sometimes the ANN comes out on top, and sometime it’s xgboost . I think it’s fair to say that ANNs, controlled for overfitting/overtraining works kinda great even small data. At least pair with xgboost .