Thanks to the non-linearity of the activation functions, our model can approximate almost any mapping from inputs to outputs, provided the number of layers and the number of neurons in them are sufficient. The only thing left to do is to find the optimal weights and biases for each layer (collectively called parameters).

As in any machine learning problem, we train the model by minimizing some loss function on the training set. The loss function depends on the predictions from our model, and, using the equations mentioned earlier, we can see that the predictions in turn depend (more indirectly) on the weights and biases.

Because we can define our loss as a function of the parameters, we can also differentiate it with respect to those variables, and that tells us how the loss function changes as we change the parameter values. We can then decrease the loss iteratively by making tiny updates to the parameters, proportional to the derivatives of the loss function with respect to those parameters. This is the idea behind gradient descent.
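As a toy illustration of this update rule (plain gradient descent on a single made-up parameter, not the actual training loop used in the notebook):

```python
# Toy example: minimize L(w) = (w - 3)^2 by gradient descent.
# The derivative is dL/dw = 2 * (w - 3), so each step nudges w towards 3.
w = 0.0
learning_rate = 0.1

for step in range(50):
    grad = 2.0 * (w - 3.0)      # derivative of the loss w.r.t. the parameter
    w -= learning_rate * grad   # tiny update, proportional to the derivative

print(w)  # very close to 3.0, the minimizer of the loss
```

In a neural network the same update is applied to every weight and bias at once, with the derivatives computed by backpropagation.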

However, there is much more to training neural networks, and machine learning models in general, than just minimizing the loss function on the training set (although this alone can be a challenge). We would like the model to generalize well to new, unseen data. This is one of the practical differences between neural nets and the previously mentioned XGBoost model: neural nets are not as good "out of the box" and typically require more effort to make them generalize well.

Another difference is that we need to standardize the input feature values. This is done using the RobustScalerForPandas class (defined in code cell [20]), a thin wrapper around sklearn.preprocessing.RobustScaler that makes sure the transformed frame is returned as a pandas.DataFrame. It would also be interesting to check whether other scaling strategies lead to better results.
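The actual implementation lives in code cell [20]; a minimal sketch of such a wrapper (my own reconstruction, not the notebook's exact code) could look like this:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler


class RobustScalerForPandas(RobustScaler):
    """RobustScaler variant that returns a pandas.DataFrame, keeping columns and index."""

    def transform(self, X):
        scaled = super().transform(X)  # plain NumPy array from the parent class
        return pd.DataFrame(scaled, columns=X.columns, index=X.index)
```

Because sklearn's fit_transform just calls fit followed by transform, overriding transform alone is enough to keep the output a DataFrame. In recent scikit-learn versions (1.2+) a similar effect can be achieved with scaler.set_output(transform="pandas").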

The MLP model is defined in code cell [21]; its architecture is specified by parameters passed at initialization, while the activation functions and the optimization procedure are hard-coded. Also, note that there is a considerable class imbalance in our data set, so it might be worth experimenting with the class_weight argument passed to the model.fit method (see the sketch below).
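A common starting point is to weight each class inversely to its frequency. The sketch below assumes a Keras-style fit signature and hypothetical arrays X_train_scaled and y_train; the remaining fit arguments are only illustrative:

```python
import numpy as np

# Weight each class inversely to its frequency, so the minority class
# contributes more to the loss. y_train is assumed to hold labels {0, 1}.
counts = np.bincount(np.asarray(y_train).astype(int))
total = counts.sum()
class_weight = {cls: total / (len(counts) * count) for cls, count in enumerate(counts)}

model.fit(
    X_train_scaled,
    y_train,
    epochs=20,              # illustrative values, not tuned
    batch_size=256,
    validation_split=0.1,
    class_weight=class_weight,
)
```

scikit-learn's sklearn.utils.class_weight.compute_class_weight("balanced", ...) computes essentially the same weights if you prefer not to do it by hand.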

But even without much tweaking of the hyperparameters, the MLP model performs quite well: we get AUC scores of about 0.90, which is similar to what we got using XGBoost.

3. Simple siamese neural network