Information Gain, like Gini Impurity, is a metric used to train Decision Trees. Specifically, these metrics measure the quality of a split. For example, say we have the following data:

The Dataset

What if we made a split at $x = 1.5$?

An Imperfect Split

This imperfect split breaks our dataset into these branches:

Left branch, with 4 blues.

Right branch, with 1 blue and 5 greens.

It’s clear this split isn’t optimal, but how good is it? How can we quantify the quality of a split?

That’s where Information Gain comes in.

Confused? Not sure what Decision Trees are or how they’re trained? Read the beginning of my introduction to Random Forests and Decision Trees.

Information Entropy

Before we get to Information Gain, we have to first talk about Information Entropy. In the context of training Decision Trees, Entropy can be roughly thought of as how much variance the data has. For example:

A dataset of only blues would have very low (in fact, zero) entropy.

A dataset of mixed blues, greens, and reds would have relatively high entropy.

Here’s how we calculate Information Entropy for a dataset with $C$ classes:

$$E = -\sum_i^C p_i \log_2 p_i$$

where $p_i$ is the probability of randomly picking an element of class $i$ (i.e. the proportion of the dataset made up of class $i$).

The easiest way to understand this is with an example. Consider a dataset with 1 blue, 2 greens, and 3 reds. Then

$$E = -(p_b \log_2 p_b + p_g \log_2 p_g + p_r \log_2 p_r)$$

We know $p_b = \frac{1}{6}$ because $\frac{1}{6}$ of the dataset is blue. Similarly, $p_g = \frac{2}{6}$ (greens) and $p_r = \frac{3}{6}$ (reds). Thus,

$$\begin{aligned} E &= -\left(\tfrac{1}{6} \log_2\left(\tfrac{1}{6}\right) + \tfrac{2}{6} \log_2\left(\tfrac{2}{6}\right) + \tfrac{3}{6} \log_2\left(\tfrac{3}{6}\right)\right) \\ &= \boxed{1.46} \end{aligned}$$

What about a dataset of all one color? Consider 3 blues as an example. The entropy would be

$$E = -(1 \log_2 1) = \boxed{0}$$
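Both worked examples can be checked in a few lines of Python. The `entropy` helper below is an illustrative implementation, not from any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    # -p * log2(p), summed over the proportion p of each class
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# 1 blue, 2 greens, 3 reds
print(round(entropy(["blue", "green", "green", "red", "red", "red"]), 2))  # 1.46

# all one color: zero entropy
print(entropy(["blue", "blue", "blue"]))  # 0.0
```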

Information Gain

It’s finally time to answer the question we posed earlier: how can we quantify the quality of a split?

Let’s consider this split again:

An Imperfect Split

Before the split, we had 5 blues and 5 greens, so the entropy was

$$\begin{aligned} E_{before} &= -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) \\ &= \boxed{1} \end{aligned}$$

After the split, we have two branches.

Left Branch has 4 blues, so $E_{left} = \boxed{0}$ because it’s a dataset of all one color.

Right Branch has 1 blue and 5 greens, so

$$\begin{aligned} E_{right} &= -\left(\tfrac{1}{6} \log_2\left(\tfrac{1}{6}\right) + \tfrac{5}{6} \log_2\left(\tfrac{5}{6}\right)\right) \\ &= \boxed{0.65} \end{aligned}$$

Now that we have the entropies for both branches, we can determine the quality of the split by weighting the entropy of each branch by how many elements it has. Since Left Branch has 4 elements and Right Branch has 6, we weight them by $0.4$ and $0.6$, respectively:

$$\begin{aligned} E_{split} &= 0.4 \cdot 0 + 0.6 \cdot 0.65 \\ &= \boxed{0.39} \end{aligned}$$

We started with $E_{before} = 1$ entropy before the split and now are down to $0.39$! Information Gain = how much Entropy we removed, so

$$\text{Gain} = 1 - 0.39 = \boxed{0.61}$$
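The whole calculation above can be sketched in Python. The label lists mirror the example’s branches (4 blues on the left, 1 blue and 5 greens on the right), and the `entropy` helper is an illustrative implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

before = ["blue"] * 5 + ["green"] * 5  # full dataset: E = 1
left = ["blue"] * 4                    # left branch:  E = 0
right = ["blue"] + ["green"] * 5       # right branch: E ≈ 0.65

# weight each branch's entropy by its share of the elements
n = len(before)
e_split = len(left) / n * entropy(left) + len(right) / n * entropy(right)

gain = entropy(before) - e_split
print(round(gain, 2))  # 0.61
```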

This makes sense: higher Information Gain = more Entropy removed, which is what we want. In the perfect case, each branch would contain only one color after the split, which would be zero entropy!

Recap

Information Entropy can be thought of as how unpredictable a dataset is.

A set of only one class (say, blue) is extremely predictable: anything in it is blue. This would have low entropy.

A set of many mixed classes is unpredictable: a given element could be any color! This would have high entropy.

The actual formula for calculating Information Entropy is:

$$E = -\sum_i^C p_i \log_2 p_i$$

Information Gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy. When training a Decision Tree using these metrics, the best split is chosen by maximizing Information Gain.
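That training step can be sketched by brute force over candidate thresholds. The x-coordinates below are hypothetical, arranged so that a split at $x = 1.5$ reproduces the 4-blue / 1-blue-5-green example from above:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, labels, t):
    """Entropy removed by splitting the dataset at x < t vs. x >= t."""
    left = [l for x, l in zip(xs, labels) if x < t]
    right = [l for x, l in zip(xs, labels) if x >= t]
    n = len(labels)
    e_split = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - e_split

# hypothetical coordinates: 4 blues left of x = 1.5, 1 blue and 5 greens right of it
xs = [0.4, 0.7, 1.0, 1.3, 2.0, 1.8, 2.3, 2.6, 2.9, 3.2]
labels = ["blue"] * 5 + ["green"] * 5

print(round(information_gain(xs, labels, 1.5), 2))  # 0.61

# a Decision Tree tries every candidate threshold and keeps the best one
best_t = max(sorted(set(xs)), key=lambda t: information_gain(xs, labels, t))
```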

Want to learn more? Check out my explanation of Gini Impurity, a similar metric, or my in-depth guide Random Forests for Complete Beginners.