There are two types of nodes in the above figure: nodes with a gray background and nodes with a white background. Nodes with a gray background are terminal (leaf) nodes. Each edge represents a decision, e.g., is the weather sunny or not? In the above figure, we have four possible classes, and each terminal node represents one of them. The features in this example are: work (yes/no), weather (e.g., sunny), and friends busy? (yes/no). Once we have a decision tree (we haven't seen how to construct one yet), we can predict the outcome. For example, if our input is work(no), weather(rainy), friends busy?(no), then we will opt for "Go to movies".
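To make the prediction step concrete, here is a minimal Python sketch that hard-codes the tree from the figure as nested conditionals. The split order (work, then weather, then friends busy?) follows the example above, but the labels of the three classes other than "Go to movies" are placeholders, since they are not named in the text.

```python
def predict(work, weather, friends_busy):
    """Walk the hand-built tree and return a class label for one example."""
    if work == "yes":
        return "Stay in"           # placeholder label for this terminal node
    if weather == "sunny":
        return "Go to beach"       # placeholder label for this terminal node
    if friends_busy == "yes":
        return "Stay in and read"  # placeholder label for this terminal node
    return "Go to movies"

# The example from the text: work(no), weather(rainy), friends busy?(no)
print(predict(work="no", weather="rainy", friends_busy="no"))  # -> Go to movies
```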



But how do we construct a decision tree? We start at the root node and split the dataset on the feature that results in the largest information gain. We repeat this splitting process on the child nodes until we reach nodes that are pure, i.e., they contain samples of only one class. We need a way to compute the impurity at each node, but before going into that, let us first define information gain. Our objective is to maximize the information gain at each split. We define information gain as follows:



\begin{align} IG(D_p,f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j) \end{align}

Here, $f$ is the feature to perform the split on, $D_p$ and $D_j$ are the datasets of the parent and the $j$th child node, $I$ is the impurity measure (e.g., entropy or Gini index), $N_p$ is the total number of samples at the parent node, and $N_j$ is the number of samples in the $j$th child node. Information gain is the difference between the impurity of the parent node and the weighted sum of the child node impurities. However, for simplicity and to reduce the combinatorial search space, most machine learning libraries implement binary decision trees, that is, each parent node is split into two child nodes. If the two child nodes are $D_{left}$ and $D_{right}$, then the information gain becomes:



\begin{align} IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right}) \end{align}
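As a quick sketch (not from the original text), the binary information gain above can be computed directly from the class labels of the parent node and its two children. Gini impurity is used here for $I$ only because it is one of the criteria mentioned below; entropy would work the same way.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array (one possible impurity measure I)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, y_left, y_right, impurity=gini):
    """IG(D_p, f) = I(D_p) - N_left/N_p * I(D_left) - N_right/N_p * I(D_right)."""
    n_p = len(y_parent)
    return (impurity(y_parent)
            - len(y_left) / n_p * impurity(y_left)
            - len(y_right) / n_p * impurity(y_right))

# A split that separates the two classes perfectly yields the maximum gain.
print(information_gain(np.array([0, 0, 1, 1]),
                       np.array([0, 0]),
                       np.array([1, 1])))  # -> 0.5
```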



Commonly used impurity measures or splitting criteria are Gini Impurity, Entropy, and Classification Error. Here, we'll only discuss entropy. For all non-empty classes ($p(i \mid t) \neq 0$), entropy is defined as





\begin{align} I_H(t) = -\sum_{i=1}^{C} p(i \mid t) \log_2 p(i \mid t) \end{align}



Here, $p(i \mid t)$ is the proportion of samples that belong to class $i$ at a particular node $t$, and $C$ is the number of classes. The entropy is 0 if all samples at a node belong to the same class, and the entropy is maximal if the classes are distributed uniformly. For example, if we have 100 samples and 2 classes, the entropy is maximal when 50 samples belong to one class and the remaining 50 belong to the other.
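A small sketch of this behaviour, assuming NumPy: it computes the entropy of a node from its class labels and checks the two extreme cases described above (a 50/50 split of 100 samples, and a pure node).

```python
import numpy as np

def entropy(y):
    """I_H(t) = -sum_i p(i|t) * log2 p(i|t), taken over the non-empty classes."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y_uniform = np.array([0] * 50 + [1] * 50)  # 100 samples, 50 per class
y_pure = np.array([0] * 100)               # all samples in one class

print(entropy(y_uniform))  # 1.0  (maximum entropy for 2 classes)
print(entropy(y_pure))     # -0.0 (zero entropy for a pure node)
```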