A Brief Explanation of Modelling

Intro to Decision Trees

The basic idea here is simple. When learning the training data (usually termed ‘being fitted’ to the training data), the Regression Tree searches over all of the independent variables, and over all the values of each independent variable, to find the variable and value that best split the data into two groups (in mathematical terms, the tree always chooses the split which minimizes the weighted average variance of the two resulting nodes). It then calculates the score (based on the chosen metric) and the average value of the dependent variable for each of the two groups. The tree repeats this process recursively until there are no more splits to perform (unless max_depth is explicitly specified, as in the picture below). Each node at the last level of the tree is called a ‘leaf’, and each leaf is associated with the average value of the dependent variable across all of the observations that ended up in that leaf.
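
To make this concrete, here is a minimal sketch of fitting such a tree, assuming scikit-learn’s DecisionTreeRegressor and a small synthetic stand-in for our actual training set (the data below is made up purely for illustration):

```python
# A minimal sketch of fitting a regression tree, assuming scikit-learn.
# X_train / y_train are a synthetic stand-in for the real training set.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 3))           # three independent variables
y_train = 2 * X_train[:, 0] + rng.normal(0, 1, 200)   # dependent variable

# At every node the tree greedily picks the (feature, threshold) split that
# minimizes the weighted average variance (MSE) of the two child nodes.
tree = DecisionTreeRegressor(max_depth=3)  # limit depth, as in the picture below
tree.fit(X_train, y_train)

# Prediction: each observation traverses the tree down to a leaf and is
# assigned that leaf's average dependent-variable value.
print(tree.predict(X_train[:5]))
```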

As a side note, this is a great example of a ‘greedy’ algorithm: at every split, it checks all the options and then chooses the one that seems best at that point, in the hope of eventually achieving a good overall result. Once the tree has been fitted to the training data, any observation for which we wish to predict a value for the dependent variable simply traverses the tree until it reaches a leaf (an end node) and is assigned that leaf’s corresponding dependent variable value.

An example tree from our dataset, with max_depth set to 3 for ease of visualization.

Let’s take a closer look at the tree shown here: in each node, the first element is the node’s split rule (an independent variable and a threshold value), the second element is the Mean Squared Error (MSE) of all the observations in that node, and the third element (‘samples’) is the number of observations in that node, i.e. the size of the group. The last element, ‘value’, is the node’s prediction: the average of the natural logarithm of our target/dependent variable, ‘SalePrice’, over the observations in that node. As we can see, the greedy approach of making the locally best split at every node does indeed generally decrease the MSE as the tree expands, and each leaf has an associated ‘SalePrice’ value.
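
If you’d like to reproduce a diagram like the one described above, here is a rough sketch using scikit-learn’s plot_tree, continuing from the tree fitted in the previous snippet (the feature names are made up for that synthetic data, not real columns from our dataset):

```python
# Sketch of drawing the tree diagram, assuming scikit-learn and matplotlib,
# and reusing the `tree` fitted in the previous snippet.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(14, 6))
plot_tree(
    tree,                                           # the fitted DecisionTreeRegressor
    feature_names=["feat_0", "feat_1", "feat_2"],   # hypothetical feature names
    filled=True,
    ax=ax,
)
# Each node shows its split rule, its error, the number of samples,
# and 'value' (in the original post, the log of SalePrice).
plt.show()
```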

Bias — Variance Tradeoff

So let’s think back to our objective in supervised learning. On the one hand, we would like our model to capture the relationships between the independent variables and the dependent variable as it gets fitted to the training data, so that it can then make accurate predictions. On the other hand, the data the model will have to predict the dependent variable for will necessarily be different from the data it was trained on; in our case, that unseen data is the Kaggle test set. We would therefore like our model to capture the general relationships between the independent variables and the dependent variable, so that it can generalize to unseen data and predict well. This tension is known as the ‘bias-variance tradeoff’.

Bias — Variance Tradeoff

If our model doesn’t learn enough from the training set, it will have high bias (often called ‘underfitting’): it did not capture all the information available in the training set, and its predictions will suffer for it. However, if our model learns the training data too well, it will capture the specific relationships between the independent variables and the dependent variable in the training set rather than the general ones; it will have high variance (often called ‘overfitting’), it will generalize poorly to unseen data, and again its predictions will not be as good. Clearly, we must seek a balance between the model’s bias and variance.
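
A quick way to see this tradeoff in code (a sketch on synthetic data, not our actual numbers) is to sweep the tree’s max_depth and compare training and validation MSE: shallow trees underfit, while very deep trees drive the training error down as the validation error starts creeping back up.

```python
# Sketch: how underfitting and overfitting show up as max_depth varies.
# Synthetic data stands in for the real training set.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(1000, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, 1000)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

for depth in [1, 3, 5, 10, None]:   # None = keep splitting until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    val_mse = mean_squared_error(y_val, tree.predict(X_val))
    print(f"max_depth={depth}: train MSE={train_mse:.3f}, validation MSE={val_mse:.3f}")
```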

Decision Tree Overfit

Imagine we fit a Regression Tree to our training set, with no limit on its depth. What will the tree look like? As you probably guessed, it will keep splitting until there is only a single observation in every leaf (at which point there are no more splits to perform). In other words, the tree will build a unique path for each observation in the training set, and will give the leaf at the end of that path the dependent variable value of its associated observation.

If I were to then drop the dependent variable from my training set, and ask my tree to predict the dependent variable value for each of the observations in the training set, what would happen? As you might imagine, it would do so perfectly, achieving essentially 0 MSE, as it has already memorized the dependent variable value associated with each observation in the training set.

However, if I were to ask the tree to predict the dependent variable value for unseen observations — ones it was not trained on — it would likely perform poorly, as any unseen observation would end up getting assigned a dependent variable value from a leaf which was constructed for a single specific observation in the training set. This is an example of ‘overfitting’. It is possible to fiddle with the tree’s parameters in order to reduce the overfit — for example, limit the tree’s max_depth — but it turns out there’s a better solution!
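
Here is a small sketch of that overfit, again on synthetic stand-in data rather than our actual training set: a fully grown tree ends up with roughly one leaf per training row, near-zero training MSE, and a noticeably worse validation MSE.

```python
# Sketch: a fully grown tree memorizes the training set (about one
# observation per leaf) but generalizes poorly to held-out data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, 500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=2)

tree = DecisionTreeRegressor().fit(X_train, y_train)   # no max_depth: grow fully

print(tree.get_n_leaves(), "leaves for", len(X_train), "training rows")
print("train MSE:     ", mean_squared_error(y_train, tree.predict(X_train)))  # ~0
print("validation MSE:", mean_squared_error(y_val, tree.predict(X_val)))      # noticeably higher
```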

The Solution — Random Forest

In ML, we often design meta-models that combine the predictions of several smaller models to generate a better final prediction. This is generally called ‘ensembling’. Specifically, decision trees are often ensembled using ‘Bootstrap Aggregating’, or ‘Bagging’ for short, and when each tree is also limited to a random subset of the features at every split, the resulting meta-model is called a ‘Random Forest’.

Random Forests are simple but effective. When being fitted to a training set, many decision trees are constructed, just like the one above, only each tree is fitted on a bootstrap sample of the data (a sample of the same size as the training set, drawn with replacement, so some observations appear more than once and others not at all), and can only consider a random subset of the independent variables (‘features’) at every split. Then, to generate a prediction for a new observation, the Random Forest simply averages the predictions of all its trees and returns that average as its prediction.
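
In code, a minimal sketch of this with scikit-learn’s RandomForestRegressor might look like the following (synthetic data again; the parameter values are illustrative, not the ones used for our actual submission):

```python
# Sketch of bagging with random feature subsets via RandomForestRegressor:
# each tree sees a bootstrap sample of the rows and a random subset of
# the features at every split; predictions are averaged across trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_train = rng.uniform(0, 10, size=(500, 3))
y_train = np.sin(X_train[:, 0]) + 0.5 * X_train[:, 1] + rng.normal(0, 0.3, 500)

forest = RandomForestRegressor(
    n_estimators=300,      # number of trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # draw each tree's sample with replacement
    random_state=3,
)
forest.fit(X_train, y_train)

# The forest's prediction is the average of its trees' predictions.
print(forest.predict(X_train[:3]))
```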

But wait, Oren! All we’re doing is building many weaker trees and then taking their average — why would this work!?

Well, the short answer is that it works very well, and you should try reading up more on Random Forests if you’re interested in the statistical explanation. I’m not very good at statistics, but I’ll try to give a basic explanation — the bootstrap sampling and the feature subset are meant to make the trees as uncorrelated as possible (although they are all still based on the same data set and feature set), allowing each tree to discover slightly different relationships in the data. This results in their average having much less variance — less overfit — than any single tree, and therefore better generalization and prediction overall.
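
If you want a rough numerical feel for that argument (this is an illustration, not a proof), you can simulate many partially correlated ‘tree errors’ and compare the variance of a single one with the variance of their average:

```python
# Sketch: averaging many partially-correlated predictors shrinks variance.
# Each column below plays the role of one tree's prediction error.
import numpy as np

rng = np.random.default_rng(4)
n_trees, n_repeats, rho, sigma = 100, 20000, 0.3, 1.0

# Build errors with pairwise correlation ~rho: a shared component plus noise.
shared = rng.normal(0, 1, size=(n_repeats, 1))
unique = rng.normal(0, 1, size=(n_repeats, n_trees))
errors = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * unique)

print("variance of a single tree:  ", errors[:, 0].var())         # ~ sigma^2
print("variance of the forest mean:", errors.mean(axis=1).var())  # much smaller
```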

In simpler terms: for an unseen observation, each of the decision trees predicts the dependent variable value of the leaf the observation ends up in, meaning the value of the most similar training set observation(s) in that specific tree’s space. As we remember, each tree is constructed differently and on different data, so each tree will define similarity in a different way and predict a different value. For a given unseen observation, the average of all the trees is therefore basically the average of the values of many training set observations which are, in one way or another, similar to it.
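
To make that concrete, here is a small sketch (assuming scikit-learn’s estimators_ attribute, on synthetic data) showing that the forest’s prediction really is just the mean of its individual trees’ leaf values:

```python
# Sketch: a Random Forest's regression prediction is the mean of the
# per-tree predictions, each of which is a leaf value from one tree.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(300, 3))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 300)

forest = RandomForestRegressor(n_estimators=50, random_state=5).fit(X, y)

x_new = rng.uniform(0, 10, size=(1, 3))   # an "unseen" observation
per_tree = np.array([t.predict(x_new)[0] for t in forest.estimators_])

print("mean of the 50 tree predictions:", per_tree.mean())
print("forest.predict():               ", forest.predict(x_new)[0])  # same value, up to rounding
```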

One consequence of this property is that while Random Forests are very good at prediction when the test set is broadly similar to the training set (in the same range of values), which is usually the case, they are terrible when the test set differs from the training set in some fundamental way (a different range of values). Since every prediction is an average of training set target values, a forest can never predict outside the range it saw during training, which is why it struggles with Time Series problems, for example, where the training set comes from one time period and the test set from a later one.
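
A tiny sketch of that failure mode, on made-up data: train a forest on x values between 0 and 10 and then ask it to predict far outside that range; the predictions flatten out near the largest target it saw in training.

```python
# Sketch of the extrapolation problem: a forest trained on x in [0, 10]
# cannot predict target values outside the range it saw during training.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 3 * X_train[:, 0] + rng.normal(0, 0.5, 500)   # targets roughly in [0, 30]

forest = RandomForestRegressor(n_estimators=200, random_state=6).fit(X_train, y_train)

# Inside the training range the forest does well...
print(forest.predict([[5.0]]))     # close to 15
# ...but outside it the prediction flattens near the training maximum,
# because every leaf value is an average of training-set targets.
print(forest.predict([[20.0]]))    # close to 30, nowhere near 60
print(forest.predict([[100.0]]))   # still close to 30
```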

Since in our case the test set and the training set have the same range of values, we should be good to go!