In my opinion, most Machine Learning tutorials aren’t beginner-friendly enough.

Last month, I wrote an introduction to Neural Networks for complete beginners. This post will adopt the same strategy, meaning it again assumes ZERO prior knowledge of machine learning. We’ll learn what Random Forests are and how they work from the ground up.

Ready? Let’s dive in.

1. Decision Trees 🌲

A Random Forest 🌲🌲🌲 is actually just a bunch of Decision Trees 🌲 bundled together (ohhhhh that’s why it’s called a forest). We need to talk about trees before we can get into forests.

Look at the following dataset:

The Dataset

If I told you that there was a new point with an x x x coordinate of 1 1 1, what color do you think it’d be?

Blue, right?

You just evaluated a decision tree in your head:

That’s a simple decision tree with one decision node that tests x < 2 x < 2 x<2. If the test passes ( x < 2 x < 2 x<2), we take the left branch and pick Blue. If the test fails ( x ≥ 2 x \geq 2 x≥2), we take the right branch and pick Green.

The Dataset, split at x=2

Decision Trees are often used to answer that kind of question: given a labelled dataset, how should we classify new samples?

Labelled: Our dataset is labelled because each point has a class (color): blue or green.

Classify: To classify a new datapoint is to assign a class (color) to it.

Here’s a dataset that has 3 classes now instead of 2:

The Dataset v2

Our old decision tree doesn’t work so well anymore. Given a new point ( x , y ) (x, y) (x,y),

If x ≥ 2 x \geq 2 x ≥ 2 , we can still confidently classify it as green.

, we can still confidently classify it as green. If x < 2 x < 2 x < 2 , we can’t immediately classify it as blue - it could be red, too.

We need to add another decision node to our decision tree:

Pretty simple, right? That’s the basic idea behind decision trees.

2. Training a Decision Tree

Let’s start training a decision tree! We’ll use the 3 class dataset again:

The Dataset v2

2.1 Training a Decision Tree: The Root Node

Our first task is to determine the root decision node in our tree. Which feature ( x x x or y y y) will it test on, and what will the test threshold be? For example, the root node in our tree from earlier used the x x x feature with a test threshold of 2 2 2:

Intuitively, we want a decision node that makes a “good” split, where “good” can be loosely defined as separating different classes as much as possible. The root node above makes a “good” split: all the greens are on the right, and no greens are on the left.

Thus, our goal is now to pick a root node that gives us the “best” split possible. But how do we quantify how good a split is? It’s complicated. I wrote an entire blog post about one way to do this using a metric called Gini Impurity. ← I recommend reading it right now before you continue - we’ll be using those concepts later in this post.

Welcome back!

Hopefully, you just read my Gini Impurity post. If you didn’t, here’s a very short TL;DR: We can use Gini Impurity to calculate a value called Gini Gain for any split. A better split has higher Gini Gain.

Back to the problem of determining our root decision node. Now that we have a way to evaluate splits, all we have to do to is find the best split possible! For the sake of simplicity, we’re just going to try every possible split and use the best one (the one with the highest Gini Gain). This is not the fastest way to find the best split, but it is the easiest to understand.

Trying every split means trying

Every feature ( x x x or y y y ).

or ). All “unique” thresholds. We only need to try thresholds that produce different splits.

For example, here are the thresholds we might select if we wanted to use the x x x coordinate:

x Thresholds

Let’s do an example Gini Gain calculation for the x = 0.4 x = 0.4 x=0.4 split.

Split Left Branch Right Branch x = 0.4 x = 0.4 x = 0 . 4

First, we calculate the Gini Impurity of the whole dataset:

G i n i t i a l = ∑ i = 1 3 p ( i ) ∗ ( 1 − p ( i ) ) = 3 ∗ ( 1 3 ∗ 2 3 ) = 2 3 \begin{aligned} G_{initial} &= \sum_{i=1}^3 p(i) * (1 - p(i)) \\ &= 3 * (\frac{1}{3} * \frac{2}{3}) \\ &= \boxed{\frac{2}{3}} \\ \end{aligned} G i n i t i a l ​ ​ = i = 1 ∑ 3 ​ p ( i ) ∗ ( 1 − p ( i ) ) = 3 ∗ ( 3 1 ​ ∗ 3 2 ​ ) = 3 2 ​ ​ ​

Then, we calculate the Gini Impurities of the two branches:

G l e f t = 0 ∗ 1 + 1 ∗ 0 + 0 ∗ 1 = 0 G_{left} = 0 * 1 + 1 * 0 + 0 * 1 = \boxed{0} G l e f t ​ = 0 ∗ 1 + 1 ∗ 0 + 0 ∗ 1 = 0 ​

G r i g h t = 3 8 ∗ 5 8 + 2 8 ∗ 6 8 + 3 8 ∗ 5 8 = 21 32 \begin{aligned} G_{right} &= \frac{3}{8} * \frac{5}{8} + \frac{2}{8} * \frac{6}{8} + \frac{3}{8} * \frac{5}{8} \\ &= \boxed{\frac{21}{32}} \\ \end{aligned} G r i g h t ​ ​ = 8 3 ​ ∗ 8 5 ​ + 8 2 ​ ∗ 8 6 ​ + 8 3 ​ ∗ 8 5 ​ = 3 2 2 1 ​ ​ ​

Finally, we calculate Gini Gain by subtracting the weighted branch impurities from the original impurity:

Gain = G i n i t i a l − 1 9 G l e f t − 8 9 G r i g h t = 2 3 − 1 9 ∗ 0 − 8 9 ∗ 21 32 = 0.083 \begin{aligned} \text{Gain} &= G_{initial} - \frac{1}{9} G_{left} - \frac{8}{9} G_{right} \\ &= \frac{2}{3} - \frac{1}{9} * 0 - \frac{8}{9} * \frac{21}{32} \\ &= \boxed{0.083} \\ \end{aligned} Gain ​ = G i n i t i a l ​ − 9 1 ​ G l e f t ​ − 9 8 ​ G r i g h t ​ = 3 2 ​ − 9 1 ​ ∗ 0 − 9 8 ​ ∗ 3 2 2 1 ​ = 0 . 0 8 3 ​ ​

Confused about what just happened? I told you you should’ve read my Gini Impurity post. It’ll explain all of this Gini stuff.

We can calculate Gini Gain for every possible split in the same way:

Split Left Branch Right Branch Gini Gain x = 0.4 x = 0.4 x = 0 . 4 0.083 0.083 0 . 0 8 3 x = 0.8 x = 0.8 x = 0 . 8 0.048 0.048 0 . 0 4 8 x = 1.1 x = 1.1 x = 1 . 1 0.133 0.133 0 . 1 3 3 x = 1.3 x = 1.3 x = 1 . 3 0.233 0.233 0 . 2 3 3 x = 2 x = 2 x = 2 0.333 0.333 0 . 3 3 3 x = 2.4 x = 2.4 x = 2 . 4 0.191 0.191 0 . 1 9 1 x = 2.8 x = 2.8 x = 2 . 8 0.083 0.083 0 . 0 8 3 y = 0.8 y = 0.8 y = 0 . 8 0.083 0.083 0 . 0 8 3 y = 1.2 y = 1.2 y = 1 . 2 0.111 0.111 0 . 1 1 1 y = 1.8 y = 1.8 y = 1 . 8 0.233 0.233 0 . 2 3 3 y = 2.1 y = 2.1 y = 2 . 1 0.233 0.233 0 . 2 3 3 y = 2.4 y = 2.4 y = 2 . 4 0.111 0.111 0 . 1 1 1 y = 2.7 y = 2.7 y = 2 . 7 0.048 0.048 0 . 0 4 8 y = 2.9 y = 2.9 y = 2 . 9 0.083 0.083 0 . 0 8 3

All Thresholds

After trying all thresholds for both x x x and y y y, we’ve found that the x = 2 x = 2 x=2 split has the highest Gini Gain, so we’ll make our root decision node use the x x x feature with a threshold of 2 2 2. Here’s what we’ve got so far:

Making progress!

2.2: Training a Decision Tree: The Second Node

Time to make our second decision node. Let’s (arbitrarily) go to the left branch. We’re now only using the datapoints that would take the left branch (i.e. the datapoints satisfying x < 2 x < 2 x<2), specifically the 3 blues and 3 reds.

To build our second decision node, we just do the same thing! We try every possible split for the 6 datapoints we have and realize that y = 2 y = 2 y=2 is the best split. We make that into a decision node and now have this:

Our decision tree is almost done…

2.3 Training a Decision Tree: When to Stop?

Let’s keep it going and try to make a third decision node. We’ll use the right branch from the root node this time. The only datapoints in that branch are the 3 greens.

Again, we try all the possible splits, but they all

Are equally good.

Have a Gini Gain of 0 (the Gini Impurity was already 0 and can’t go any lower).

It doesn’t makes sense to add a decision node here because doing so wouldn’t improve our decision tree. Thus, we’ll make this node a leaf node and slap the Green label on it. This means that we’ll classify any datapoint that reaches this node as Green.

If we continue to the 2 remaining nodes, the same thing will happen: we’ll make the bottom left node our Blue leaf node, and we’ll make the bottom right node our Red leaf node. That brings us to the final result:

Once all possible branches in our decision tree end in leaf nodes, we’re done. We’ve trained a decision tree!

3. Random Forests 🌲🌳🌲🌳🌲

We’re finally ready to talk about Random Forests. Remember what I said earlier?

A Random Forest is actually just a bunch of Decision Trees bundled together.

That’s true, but is a bit of a simplification.

3.1 Bagging

Consider the following algorithm to train a bundle of decision trees given a dataset of n n n points:

Sample, with replacement, n n n training examples from the dataset. Train a decision tree on the n n n samples. Repeat t t t times, for some t t t .

To make a prediction using this model with t t t trees, we aggregate the predictions from the individual decision trees and either

Take the majority vote if our trees produce class labels (like colors).

if our trees produce class labels (like colors). Take the average if our trees produce numerical values (e.g. when predicting temperature, price, etc).

This technique is called bagging, or bootstrap aggregating. The sampling with replacement we did is known as a bootstrap sample.

Bagged Decision Trees predicting color

Bagged decision trees are very close to Random Forests - they’re just missing one thing…

3.2 Bagging → Random Forest

Bagged decision trees have only one parameter: t t t, the number of trees.

Random Forests have a second parameter that controls how many features to try when finding the best split. Our simple dataset for this tutorial only had 2 2 2 features ( x x x and y y y), but most datasets will have far more (hundreds or thousands).

Suppose we had a dataset with p p p features. Instead of trying all features every time we make a new decision node, we only try a subset of the features, usually of size p \sqrt{p} p ​ or p 3 \frac{p}{3} 3p​. We do this primarily to inject randomness that makes individual trees more unique and reduces correlation between trees, which improves the forest’s performance overall. This technique is sometimes referred to as feature bagging.

4. Now What?

That’s a beginner’s introduction to Random Forests! A quick recap of what we did:

Introduced decision trees , the building blocks of Random Forests.

, the building blocks of Random Forests. Learned how to train decision trees by iteratively making the best split possible.

Defined Gini Impurity, a metric used to quantify how “good” a split is.

Saw that a random forest = a bunch of decision trees.

Understood how bagging combines predictions from multiple trees.

combines predictions from multiple trees. Learned that feature bagging is the difference between bagged decision trees and a random forest.

A few things you could do from here:

Read about Information Gain, a metric similar to Gini Impurity that can also be used to quantify how “good” a split is.

Experiment with scikit-learn’s DecisionTreeClassifier and RandomForestClassifier classes on real datasets.

Try writing a simple Decision Tree or Random Forest implementation from scratch. I’m happy to give guidance or code review! Just tweet at me or email me.

Read about Gradient Boosted Decision Trees and play with XGBoost, a powerful gradient boosting library.

Read about ExtraTrees, an extension of Random Forests, or play with scikit-learn’s ExtraTreesClassifier class.

That concludes this tutorial. I like writing about Machine Learning (but also other topics), so subscribe if you want to get notified about new posts.

Thanks for reading!