Imagine you are out trying to buy tomatoes. You go to a store, and in front of you is a bunch of tomatoes in a variety of shapes, sizes, and colors, ranging from small and round to big and oblong, from cherry red to unripe green. How do you decide which tomato is the tastiest? Well, you know from experience what the best tomatoes look, feel, and smell like. You know which qualities matter most when judging a tomato for its tastiness, and based on those qualities you make a decision. Your decision process might look like this:

Decision process for choosing a good tasty tomato

This process is similar to the way Decision Trees work when they are trying to solve classification problems.

In this article, we will go through and discuss the following points:

What is a Decision Tree
How it works on an intuitive and mathematical level
How to build a Decision Tree
Implement a Decision Tree on a simple example
Visualize a trained Decision Tree

What are Decision Trees?

Let’s answer this question from the perspective of Computer Science. A decision tree is an algorithm that works as a simple classifier. It evaluates the attributes of an object and, based on those attributes and the overall data, classifies the object into a category. In the case of selecting tomatoes, the object was a tomato, the attributes were its size and ripeness, the data was the previous experience you had with tomatoes, and the categories were “tasty” and “not tasty”.

What can they do and why are they useful?

As mentioned earlier, Decision Trees can work as classifiers. They can be used to solve binary classification problems, where the task is to classify objects into just 2 categories.

Decision Trees are a useful ML algorithm because, unlike other solutions such as Neural Networks, they are not just a black box. This means that Decision Trees can not only solve a problem, but also show you the reasoning behind the decision process. Neural Networks act as a black box: we can control the inputs and modify the network to get the desired outputs, but we do not know how the Neural Network is solving the problem.

A Basic and Simplified Example

Let us first outline a basic example that you might want to solve with a DT (Decision Tree). Suppose you want to classify the gender of actors based on the number of times they were the protagonist in a movie. This can work if we assume that there are more movies with male protagonists than with female ones, i.e., there is a high probability that an actor with few protagonist roles is female, and vice versa. In this example our “Target Variable” is the gender of the actor; that is what we want to guess. Target variables are also sometimes called “Labels”.

Let’s randomly generate our mock dataset and sort it by “Number of Movies as Protagonist”:

Number of Movies as Protagonist   Gender
4                                 F
5                                 M
7                                 F
8                                 F
10                                F
12                                M
13                                F
15                                M
16                                M
18                                M
20                                M

Let’s visualize our data on a number line:

Next, let’s clarify the concept of “trees”. In computer science, a tree is a structure that encodes a hierarchy, with the topmost level of the tree being the root node and the nodes at the bottom being leaves. DTs work by assigning a rule at each level of the hierarchy and sorting the data based on those rules. You can think of it as reducing the number of possible answers by eliminating some wrong answers at each level. In the image below, the data is split into two sets at every node, and it keeps getting split until a specified depth is reached. The goal is to simplify the data at each level until it is classified with a high degree of confidence.

A simple representation of a Decision Tree
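To make the structure concrete, here is a minimal sketch of how such a tree could be represented in Python (the Node class and its field names are my own illustration, not part of any library):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # A single node in a binary decision tree
    rule: Optional[str] = None        # e.g. "X < 14"; None on leaves
    left: Optional["Node"] = None     # subtree where the rule holds
    right: Optional["Node"] = None    # subtree where the rule fails
    prediction: Optional[str] = None  # class label; set only on leaves

The root node carries the first rule, and following the left or right references down to a leaf yields a prediction.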

Let’s see how a DT would handle our mock dataset:

X: Number of Movies as Protagonist

As you can see above, the DT splits the dataset based on the rule X < 14. It has made this split according to the data itself; we will see how this is done later on. If X < 14 is false, the actors are more likely to be male (in this dataset, if X >= 14, then all actors are male). If X < 14 is true, then the actors are more likely to be female. If we feed new data about an actor to this DT, the DT will predict whether that actor is male or female based on the above split. The height of our mock DT is 1.
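Expressed as code, this one-split tree is nothing more than a single conditional. A quick sketch (the threshold 14 comes from the split described above):

def predict_gender(x):
    # Single split learned from the mock data: X < 14 -> likely female
    return 'F' if x < 14 else 'M'

print(predict_gender(10))  # F
print(predict_gender(16))  # M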



This is an extremely simplified version of a DT. A more formalized example will be detailed later on.

The Math Behind Decision Trees

I’ll try to explain the intuition behind Decision Trees, as well as the math that makes it all tick. If you are not interested in the mathematical side of it, you can skip the math-heavy paragraphs.

As mentioned before, DTs work by increasing the probability of correct classifications by splitting the dataset based on some rules. We want to increase this probability by decreasing the Entropy of the system.

Entropy can be defined as the degree of chaos in a system. Think of it this way: if you randomly chose an actor from our unordered mock dataset and then guessed the gender of that actor, the probability that you guessed correctly would be low, because the entropy of the system within which you are operating is high. Hence, we want to decrease the entropy of the system.

Entropy is formally defined as:

The quantitative measure of disorder or randomness in a system.

By decreasing entropy we increase the chance of a correct classification. This can be thought of as gaining knowledge, and the metric that captures this is known as Information Gain (IG).

Entropy is defined mathematically by the formula:

S = -∑ (i = 1 to N) Pi * log2(Pi)

S: Entropy value

N: The total number of states in the system

Pi: The probability that the system is in the ith state

Let’s look at our mock data to understand the parameters detailed above. In our mock data, a single data point’s target variable can be either male or female. Hence, a data point can be in 1 of 2 states:

N = 2

The probability that a data point is in the male state is:

P1 = 6/11

The probability that a data point is in the female state is:

P2 = 5/11

Insert values into our formula:

S0 = -(P1 * log2(P1) + P2 * log2(P2)) ≈ 0.994

This is the entropy value of our system pre-split; let’s call it S0. This value in itself does not tell us much about the system. Let’s compare it to the entropy values of the data in our 2 split branches (see diagram above).

Split 1 (X < 14): P1 = 2/7, P2 = 5/7, S1 = 0.863

Split 2 (X >= 14): P1 = 4/4, P2 = 0/4, S2 = 0

As you can see, in both splits we managed to reduce entropy. In split 2 we reduced entropy to 0, which means we have perfect knowledge of the state of that split: we can predict the data in it with an accuracy of 100%. This is the ideal state that we want from our DT (in practice, reducing entropy to 0 across the whole DT can lead to overfitting).
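These entropy values are easy to verify in code. A short sketch using only Python’s standard library:

from math import log2

def entropy(probabilities):
    # Shannon entropy: S = -sum(Pi * log2(Pi)), skipping zero terms
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([6/11, 5/11]))  # S0 ≈ 0.994 (pre-split)
print(entropy([2/7, 5/7]))    # S1 ≈ 0.863 (X < 14)
print(entropy([4/4, 0/4]))    # S2 = 0.0 (X >= 14); may print as -0.0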

We still need a way to quantify our progress in the reduction of entropy across our DT. This is where the Information Gain (IG) metric comes in:

IG = S0 - ∑ (i = 1 to q) (Ni / N) * Si
q: number of groups after split

S0: Entropy value of DT before split

Si: The entropy value of the ith split

Ni: The number of datapoints in the ith split

N: The total number of datapoints in the DT before the split

For our example:

q=2

S0 = 0.994

S1 = 0.863

S2 = 0

N = 11

N1 = 7

N2 = 4

Plugging the values into the equation for IG:

IG = 0.994 - (7/11 * 0.863 + 4/11 * 0) ≈ 0.445.

In a perfect world, S1 = S2 = 0. This would mean we have perfect knowledge of our data across our splits.

If S1 = S2 = 0, then IG = S0. We want the value of IG to trend toward S0.

To make a DT work, we need to maximize IG with every split and minimize the entropy S of every split. A sketch of how this search works follows in the next section.
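Before moving on, here is a self-contained sketch that checks the arithmetic above (the function names are my own):

from math import log2

def entropy(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

def information_gain(s0, groups):
    # IG = S0 - sum((Ni / N) * Si), where `groups` is a list of
    # (entropy, size) pairs, one per group produced by the split
    n = sum(size for _, size in groups)
    return s0 - sum(si * size / n for si, size in groups)

s0 = entropy([6/11, 5/11])  # ≈ 0.994
s1 = entropy([2/7, 5/7])    # ≈ 0.863
print(round(information_gain(s0, [(s1, 7), (0.0, 4)]), 3))  # 0.445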

How to Build a Decision Tree

We know the mathematics behind DTs; now we just have to implement it in code. The pseudocode for this is:

def build_DT(S):
    create node t
    if the stop value has been reached:
        assign a predictive model to t
    else:
        find the best split L (based on the values of S and IG)
        t.left = build_DT(L.left)
        t.right = build_DT(L.right)
    return t

The stop value mentioned above can be anything from reaching a desired total entropy value in the DT to achieving a maximum height. Usually this value is based on the maximum height a DT is allowed to grow to, and it is put in place to prevent overfitting.
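For the curious, below is a minimal runnable sketch of this recursion for a single feature, choosing at each node the threshold that maximizes IG. All names here are my own, and real implementations are far more optimized:

from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def best_split(xs, ys):
    # Return the threshold t (rule: x < t) that maximizes IG, or None
    best_t, best_ig = None, 0.0
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        ig = (entropy(ys)
              - (len(left) / len(ys)) * entropy(left)
              - (len(right) / len(ys)) * entropy(right))
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t

def build_DT(xs, ys, depth=0, max_depth=3):
    t = best_split(xs, ys)
    if t is None or depth == max_depth:
        # Stop value reached: make a leaf predicting the majority class
        return {'predict': max(set(ys), key=ys.count)}
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return {'rule': f'x < {t}',
            'left': build_DT([x for x, _ in left], [y for _, y in left],
                             depth + 1, max_depth),
            'right': build_DT([x for x, _ in right], [y for _, y in right],
                              depth + 1, max_depth)}

xs = [4, 5, 7, 8, 10, 12, 13, 15, 16, 18, 20]
ys = ['F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'M']
print(build_DT(xs, ys, max_depth=1))
# {'rule': 'x < 15', 'left': {'predict': 'F'}, 'right': {'predict': 'M'}}

On our mock dataset this recovers the X < 14 split from earlier (the threshold lands at 15 because the sketch only tests observed values, and x < 15 is equivalent to x <= 14 for this data).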

Implementing a Decision Tree

Finally, we can now implement a DT. Instead of building out the functions to implement a DT ourselves, we will use the sklearn library. The full code for this implementation can be found here in my repository.

We will create a mock dataset again. This time we denote male as the value 1 and female as 0 in our “Gender” column, and our feature column will be “P_Movies” (number of movies as protagonist). This time we will increase the range of our feature column and randomize the dataset a bit more:

import pandas as pd

# Build the mock dataset and sort it by the feature column
data = pd.DataFrame({'P_Movies': [17, 64, 18, 20, 38, 49, 55, 25, 29, 31, 33],
                     'Gender': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1]})

data = data.sort_values('P_Movies')





Gender   P_Movies
1        17
1        18
0        20
1        25
1        29
0        31
1        33
1        38
0        49
0        55
0        64

Next, let’s define our DT and our input vectors.

from sklearn.tree import DecisionTreeClassifier

# Define the Decision Tree, using entropy as the split criterion
dt = DecisionTreeClassifier(criterion='entropy')

# Define input vectors
# X holds the features in this dataset
X = data['P_Movies'].values.reshape(-1, 1)

# Y is the vector with our Target Variables
Y = data['Gender'].values

# Start the fitting process
dt.fit(X, Y)



Once the data is fitted, we can make predictions on new data. However, the data that we entered into the DT is highly randomized and the dataset itself is very small, so any predictions made by the DT will seem random as well. We also did not set a height limit for our DT, which means the DT overfits, making its predictions even worse.

import numpy as np

# Number of protagonist roles for 4 new actors
d = np.array([7, 15, 43, 45])
d = d.reshape(-1, 1)

dt.predict(d)

Output: array([1, 1, 1, 0])



In the above code, we define our input array and enter the “P_Movies” information for 4 actors. The DT takes in the array and makes predictions about their gender: 1 corresponds to male, 0 to female. Again, as our data is highly randomized, these predictions will seem random too.

Visualizing the Decision Tree

We can use helper functions to visualize the DT we developed above.
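The helper code I used comes from mlcourse.ai; as a minimal alternative sketch, sklearn’s built-in plot_tree can produce a similar diagram (the feature and class names passed below are assumptions matching our dataset, and dt is the classifier trained above):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the trained tree; filled=True colors nodes by majority class
plot_tree(dt, feature_names=['P_Movies'], class_names=['F', 'M'], filled=True)
plt.show()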

The tree diagram visualizes the DT we just trained. Let’s read the tree to see how it classifies the given data.

Each node above that is not a leaf node signifies data before a split has occurred. In each node the variables are:

P_Movies: The threshold at which the data at that node will be split

Entropy: The entropy value of the system at that node

Samples: The number of samples in that node before the split occurs

Values: How many instances of each class exist in the data at that node. For example, if Value = [5, 6], it means that at that specific node there are 5 data points with classification 0 (female) and 6 with classification 1 (male).

Any datapoint that lands in a brown leaf node is classified as female (0), and any datapoint that lands in a dark blue leaf node is classified as male (1). With that, we can read the tree and, based on the splits, figure out how a new data point would be classified.

You will also notice that the entropy value of all leaf nodes is 0. Our tree only terminated when entropy reached 0 because we did not specify any height limit; usually a height limit is set in place. If entropy across all leaf nodes is 0, it means our DT simply memorized the data rather than learning some underlying trend. We want to set a height limit so that entropy remains small without reaching 0 across all leaf nodes.

Conclusion

DTs are best used when you need to classify data based on discrete values into discrete categories (such as a binary classification of male and female). They offer the value of being transparent in their decision making. But they are less suited to more complex problems where the data is continuous. There are other performance-related reasons to choose or not choose a DT as well, but for now it is important to keep the above facts in mind.

Some of the helper code was provided by mlcourse.ai; I highly recommend that you check them out if you want a more hands-on experience with machine learning.

As always, if you have any questions, suggestions, requests, corrections, anything, please comment below. Any and all feedback is highly appreciated. Thank you for reading and I hope you found this article helpful.