This visualization uses TensorFlow.js to train a neural network on the Titanic dataset and shows how the network's predictions evolve after every training epoch. The color of each row indicates the predicted survival probability for that passenger: red indicates a prediction that the passenger died, and green indicates a prediction that the passenger survived. The intensity of the color indicates the magnitude of the predicted probability. For example, a bright green row represents a strong predicted probability of survival. The loss of the objective function is plotted to the left of the table with D3.js. The code for this visualization is hosted on GitHub.
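As a rough illustration of the coloring rule described above, the sketch below maps a survival probability to a red or green color whose opacity grows with confidence. The function name `prediction_color`, the 0.5 threshold, and the use of an alpha channel for intensity are assumptions made for this example; the actual visualization may compute its colors differently.

```python
def prediction_color(p):
    """Map a survival probability in [0, 1] to an (r, g, b, alpha) tuple.

    Assumption for this sketch: 0.5 is the boundary between "died" (red)
    and "survived" (green), and alpha encodes how far p is from 0.5.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    # Distance from the decision boundary, rescaled from [0, 0.5] to [0, 1].
    intensity = abs(p - 0.5) * 2.0
    if p >= 0.5:
        return (0, 255, 0, intensity)  # green: predicted survival
    return (255, 0, 0, intensity)      # red: predicted death


print(prediction_color(0.75))  # moderately confident survival prediction
print(prediction_color(0.0))   # fully confident death prediction
```

Under this scheme a prediction of exactly 0.5 is fully transparent, which matches the idea that an undecided network shows no color at all.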

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy. (Description Source)

1. What features are used in training the model?

I use every column except the ticket and cabin as input features to predict the survival column. The omitted features could improve the predictions, but I left them out to reduce the feature engineering needed to get this up and running. You can see how I preprocessed the data here.

2. Where is the test/validation set?

I don't use one here; I train on the entire dataset. This visualization is about understanding how predictions change while a neural network trains, not about how well the network generalizes. That said, in my experiments an accuracy of 0.78 or below most likely means the network is not overfitting, and anything higher is most likely the result of overfitting.

3. What neural architecture do you use?

I use a neural network with a single hidden layer and a sigmoid output layer. I also use the LeCun normal kernel initializer for more consistent results. You can see more information on the model and training code here.
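To make the architecture concrete, here is a minimal pure-Python sketch of a forward pass through one hidden layer followed by a sigmoid output, with weights drawn using LeCun normal scaling (standard deviation `sqrt(1 / fan_in)`). The hidden size, the number of input features, and the ReLU hidden activation are assumptions for illustration; they are not taken from the actual model code. Keras draws LeCun normal weights from a truncated normal, whereas this sketch uses a plain Gaussian for brevity.

```python
import math
import random


def lecun_normal(fan_in, fan_out, rng):
    """Return a fan_in x fan_out weight matrix with stddev sqrt(1 / fan_in)."""
    std = math.sqrt(1.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(x, weights, biases):
    """Compute x @ weights + biases for a single input vector x."""
    return [
        sum(x_i * row[j] for x_i, row in zip(x, weights)) + biases[j]
        for j in range(len(biases))
    ]


def predict(x, params):
    """Forward pass: input -> hidden layer -> sigmoid survival probability."""
    w1, b1, w2, b2 = params
    # ReLU is an assumed hidden activation for this sketch.
    hidden = [max(0.0, h) for h in dense(x, w1, b1)]
    return sigmoid(dense(hidden, w2, b2)[0])


rng = random.Random(0)
n_features, n_hidden = 7, 16  # hypothetical sizes
params = (
    lecun_normal(n_features, n_hidden, rng),
    [0.0] * n_hidden,
    lecun_normal(n_hidden, 1, rng),
    [0.0],
)
p = predict([0.5] * n_features, params)
print(p)  # an untrained survival probability strictly between 0 and 1
```

Scaling the initial weights by the number of inputs keeps the pre-activations at a similar magnitude regardless of layer width, which is one plausible reason the initializer gives more consistent runs.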

4. What optimizer do you use?

I use Adam with the default Keras parameters.
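For readers unfamiliar with the optimizer, the sketch below shows a single Adam update for one scalar parameter, using a learning rate of 0.001, `beta1 = 0.9`, and `beta2 = 0.999`. The epsilon value of `1e-7` is an assumption, since the default has differed between Keras versions. The function `adam_step` is written for illustration and is not part of Keras or TensorFlow.js.

```python
import math


def adam_step(param, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Apply one Adam update to a scalar parameter.

    m and v are the running first and second moment estimates,
    and t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy usage: minimize f(w) = w^2, whose gradient is 2w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
print(w)  # slightly below 1.0 after three small steps
```

Because the update divides by the root of the second moment, the first step has a size of roughly `lr` regardless of the gradient's magnitude.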

5. Why do the predictions become horrible after I sort the data?

The order in which the data is sorted and displayed is the same order in which it is batched for training. If your batch size is smaller than the entire dataset, you will be using mini-batch gradients. Because this dataset is small and its classes are imbalanced, the network is prone to being led to a region of parameter space from which it cannot escape when the batches are too small and consecutive gradients come from the same class. For example, try sorting by male and changing your batch size to 20. The network will train on batches of 20 males, who mostly died, until it gets to the females. By that point, the representation is so biased towards dead males that it takes many gradient updates to reach a location that also captures female survival.
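The effect of sorting on batch composition can be seen with a small sketch. The rows below are made-up toy data, not the real Titanic records: 60 males of whom 20% survive and 40 females of whom 75% survive. The helper names `batches` and `survival_rate` are hypothetical and exist only for this example.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of rows, in display order."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def survival_rate(batch):
    """Fraction of passengers in the batch labeled as survivors."""
    return sum(r["survived"] for r in batch) / len(batch)


# Made-up toy rows (NOT real Titanic data).
rows = (
    [{"sex": "male", "survived": 1 if i % 5 == 0 else 0} for i in range(60)]
    + [{"sex": "female", "survived": 0 if i % 4 == 0 else 1} for i in range(40)]
)

# Sort so that all males come first, as when sorting the table by male.
sorted_rows = sorted(rows, key=lambda r: r["sex"] == "female")

rates = [survival_rate(b) for b in batches(sorted_rows, 20)]
overall = survival_rate(sorted_rows)
print(rates)    # [0.2, 0.2, 0.2, 0.75, 0.75]
print(overall)  # 0.42
```

The first three mini-batches each see a 20% survival rate, far from the overall 42%, so every early gradient pushes the network towards predicting death. A batch covering the full dataset would see the overall rate on every update.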

6. Where can I find a more thorough analysis of using neural networks on this dataset?

Check out my Python notebook on Kaggle.

7. What's with predicting something that has already happened?

This is often a confusing topic for people new to machine learning. The idea is that the survival of a passenger is a function of the other observed variables. "Given the observed variables, what features of a passenger lead to survival or death?" With our neural network, we aim to learn this function. In real predictive analytics problems, we aim to learn a function $$f(\text{observed variables}) = \text{target variables}$$ that can generalize to new data not seen in the training set. This requires many considerations which are largely related to the bias-variance tradeoff.

Any more questions?

If you have any questions, submit an issue on this repo and I'll get back to you.