A Beginner's Guide to Classification Trees using Sklearn | Decision Trees #python #machinelearning #datascience #sklearn



April 3, 2020 10 mins read



Classification Trees are trees in which we classify values as the output of the model, for example, predicting whether a customer is worthy of being given a loan using data such as salary and credit history. As we already know from our previous discussion on Regression Trees, tree algorithms are greedy in nature, which means they tend to choose the better node now rather than a node that would create a better tree later. Also, in contrast to the regression tree model, where the predicted response of any given node is the mean of all the observations in that region, in classification trees the predicted response is the most commonly occurring class among the observations in the region to which it belongs. Read the previous post on Regression Trees for more details.

In that post, we also discussed RSS (Residual Sum of Squares), which is the value that we want to minimize while building a regression tree. Of course, we can't use RSS in classification trees, but we have similar measures that we can choose from: the Classification Error rate, Entropy, and the Gini Index. These measures are used to quantify the quality of a split when choosing a feature as a node of the tree.

Classification Error Rate for Classification Trees

The classification error rate is simply the fraction of the training observations in a region that do not belong to the most common class. Mathematically,

$E = 1 - \max_{k}(p(k))$

$where\ p(k)\ is\ the\ proportion\ of\ training\ observations\ in\ the\ mth\ region\ that\ are\ from\ the\ kth\ class$

The classification error rate is generally not used because it is not sensitive enough for tree-growing; Entropy or the Gini index is used instead.

Entropy in Classification tree

Entropy is a measure of the amount of uncertainty (randomness) in the data: the higher the uncertainty, the higher the entropy. The value of entropy is zero when there is no uncertainty in an event, for example when we toss a coin that has heads on both sides. Mathematically, entropy is given by

$H(S) =\displaystyle \sum_{x \in X} p(x) \log_2 \frac{1}{p(x)}$

$where\ p(x)\ is\ the\ probability\ of\ occurrence\ of\ event\ x$

For example, for a fair coin toss the probability of each outcome is 1/2, which gives the maximum possible entropy of 1 bit.
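To make this concrete, here is a minimal sketch in Python (the helper name entropy is just for illustration, not from any library) that computes the entropy of a list of outcome probabilities and reproduces the two coin examples above:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: sum of p * log2(1/p), ignoring zero probabilities."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(entropy([1.0]))        # two-headed coin: no uncertainty -> 0.0
print(entropy([0.5, 0.5]))   # fair coin toss -> 1.0 (maximum for two outcomes)
```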

Information Gain in classification trees

Information Gain is the value gained for a given set S when some feature A is selected as a node of the tree. While selecting any node for tree generation, we want to maximize the Information Gain at that point. It is defined as the change in Entropy before and after selecting a given feature as the node of the tree. Mathematically,

$IG(S, A) = H(S) - H(S \mid A)$

$IG(S, A) = H(S) - \displaystyle \sum_{x \in values(A)} P(x) \, H(S_x)$

$where\ H(S)\ is\ the\ Entropy\ of\ the\ entire\ set$

$and\ \displaystyle \sum_{x \in values(A)} P(x) \, H(S_x)\ is\ the\ weighted\ Entropy\ after\ splitting\ on\ feature\ A,\ where\ P(x)\ is\ the\ proportion\ of\ observations\ with\ value\ x$

In other words, information gain is the amount of entropy we remove by adding a node to the tree. We will discuss it further while creating the model using Information Gain.
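As a quick illustration of the formula (the helper names entropy and information_gain are hypothetical, not from any library), information gain can be computed by subtracting the weighted entropy of the subsets produced by a split from the entropy of the parent set:

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def information_gain(parent_labels, subsets):
    """IG = H(parent) - sum over subsets of P(subset) * H(subset)."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Toy example: splitting 4 observations into two pure subsets removes all entropy.
parent = ['Yes', 'Yes', 'No', 'No']
print(information_gain(parent, [['Yes', 'Yes'], ['No', 'No']]))  # -> 1.0
```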

Gini Index in Classification Trees

This is the default criterion that the Sklearn Decision Tree classifier uses, and splits are chosen so that it decreases. It is used to quantify the split made in the tree at any given moment of node selection. Mathematically, the Gini index is given by

$G = \displaystyle \sum_{k=1}^{K} P(k)(1 - P(k))$

$where\ P(k)\ is\ the\ proportion\ of\ training\ instances\ with\ class\ k$

The minimum value that the Gini index can have is 0. For example, a coin having heads on both sides will give a Gini index of 0, since the only class has $P(k) = 1$:

$G = 1 \cdot (1 - 1) = 0$

The Gini index therefore tells us about the purity of a node: the purer the node selected, the lower its Gini index.
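Here is a minimal sketch of the Gini index as a Python function (the helper name gini_index is just for illustration):

```python
def gini_index(probabilities):
    """Gini impurity: sum of p * (1 - p) over the class proportions."""
    return sum(p * (1 - p) for p in probabilities)

print(gini_index([1.0]))        # pure node (two-headed coin) -> 0.0
print(gini_index([0.5, 0.5]))   # fair coin -> 0.5 (maximum for two classes)
```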

Gini Gain in Classification Trees

Just as we have Information Gain in the case of entropy, we have Gini Gain in the case of the Gini index. It is the amount of Gini impurity removed when a node is chosen for the decision tree. Let's take an example to understand these terms in a little more detail. Consider the following data, which records whether we played golf under the given weather conditions:

| Outlook  | Temperature | Humidity | Wind   | Played |
|----------|-------------|----------|--------|--------|
| Sunny    | Hot         | High     | Weak   | No     |
| Sunny    | Hot         | High     | Strong | No     |
| Overcast | Hot         | High     | Weak   | Yes    |
| Rain     | Mild        | High     | Weak   | Yes    |
| Rain     | Cold        | Normal   | Weak   | Yes    |
| Rain     | Cold        | Normal   | Strong | No     |
| Overcast | Cold        | Normal   | Strong | Yes    |
| Sunny    | Mild        | High     | Weak   | No     |
| Sunny    | Cold        | Normal   | Weak   | Yes    |
| Rain     | Mild        | Normal   | Weak   | Yes    |
| Sunny    | Mild        | Normal   | Strong | Yes    |
| Overcast | Mild        | High     | Strong | Yes    |
| Overcast | Hot         | Normal   | Weak   | Yes    |
| Rain     | Mild        | High     | Strong | No     |

Let's use Entropy and the Gini index to create the first node of the decision tree from this data.

Using Entropy and Information Gain to create Decision tree nodes

Calculate the overall entropy

Total Yes cases = 9, Total No cases = 5, Total cases = 14

$H(S) = \frac{9}{14}\log _{2} \frac{14}{9} + \frac{5}{14}\log _{2} \frac{14}{5}$

$= 0.940$

The closer the value of entropy is to 1, the more randomness there is in the data.

As we have already evaluated the total entropy, let's calculate the information gain while choosing each feature separately. Let's start with the Wind feature.

$IG(S, Wind) = H(S) - \displaystyle \sum_{x \in \{Weak,\, Strong\}} P(x) \, H(S_x)$

Total Weak wind cases = 8, Total Strong wind cases = 6, Total cases = 14

Entropy for Weak wind

$H(S_{weak}) = \frac{6}{8} \log_{2}\frac{8}{6} + \frac{2}{8} \log_{2}\frac{8}{2}$

$= 0.811$

Entropy for Strong wind

$H(S_{strong}) = \frac{3}{6} \log_{2}\frac{6}{3} + \frac{3}{6} \log_{2}\frac{6}{3}$

$= 1.00$

The total Information Gain can then be calculated as follows:

$IG(S, Wind) = H(S) - P(S_{weak}) * H(S_{weak}) - P(S_{strong}) * H(S_{strong})$

$= 0.940 - \frac {8}{14} (0.811) - \frac{6}{14}(1.00)$

$= 0.048$

Similarly, we will calculate IG for the other features as well and select the one that produces the highest value of IG. We will continue this process until leaf nodes are reached for every branch created.
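If you want to verify the hand calculation, here is a minimal sketch using pandas (assuming the column names from the table above, with hypothetical helper names); it should give roughly 0.048 for Wind, and the printed values can be used to pick the root node:

```python
import math
import pandas as pd

# The golf data from the table above.
data = pd.DataFrame({
    'Outlook':     ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                    'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cold', 'Cold', 'Cold',
                    'Mild', 'Cold', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity':    ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                    'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind':        ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                    'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'Played':      ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                    'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

def entropy(series):
    """Entropy (in bits) of the class distribution in a pandas Series."""
    proportions = series.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in proportions)

def information_gain(df, feature, target='Played'):
    """H(target) minus the weighted entropy of target after splitting on feature."""
    weighted = sum(len(subset) / len(df) * entropy(subset[target])
                   for _, subset in df.groupby(feature))
    return entropy(df[target]) - weighted

for feature in ['Outlook', 'Temperature', 'Humidity', 'Wind']:
    print(feature, round(information_gain(data, feature), 3))
```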

Using Gini Index and Gini Gain to create Decision tree nodes

Calculate the overall Gini impurity

Total Yes cases = 9, Total No cases = 5, Total cases = 14

$GI(S) = \frac{9}{14}(1 - \frac{9}{14}) + \frac{5}{14}(1 - \frac{5}{14})$

$= 0.46$

As we have already evaluated the total Gini impurity, let's calculate the Gini gain while choosing each feature separately. Let's start with the Wind feature.

Total Weak wind cases = 8, Total Strong wind cases = 6, Total cases = 14

Gini Index for Weak wind

$GI(S_{weak}) = \frac{6}{8}(1 - \frac{6}{8}) + \frac{2}{8}(1 - \frac{2}{8})$

$= \frac{3}{8}$

Gini Index for Strong wind

$GI(S_{Strong}) = \frac{3}{6}(1 - \frac{3}{6}) + \frac{3}{6}(1 - \frac{3}{6})$

$= \frac{1}{2}$

The total Gini Gain can then be calculated as follows:

$GG(S_{wind}) = GI(S) - \frac{8}{14} * GI(S_{weak}) - \frac{6}{14} * GI(S_{strong})$

$GG(S_{wind}) = 0.46 - 0.214 - 0.214$

$\approx 0.031$

Similarly, we will calculate GG for the other features as well and select the one that produces the highest value of GG. This is the basic understanding of Classification Trees.
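The same kind of check works for Gini gain. This sketch (hypothetical helper names; only the Wind and Played columns from the table are needed here) should give roughly 0.031 for Wind:

```python
import pandas as pd

def gini(series):
    """Gini impurity of the class distribution in a pandas Series."""
    proportions = series.value_counts(normalize=True)
    return sum(p * (1 - p) for p in proportions)

def gini_gain(df, feature, target='Played'):
    """GI(parent) minus the weighted Gini impurity after splitting on feature."""
    weighted = sum(len(subset) / len(df) * gini(subset[target])
                   for _, subset in df.groupby(feature))
    return gini(df[target]) - weighted

# Only the Wind and Played columns from the golf table above.
golf = pd.DataFrame({
    'Wind':   ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
               'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'Played': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
               'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

print(round(gini_gain(golf, 'Wind'), 3))  # should print 0.031
```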

Classification Trees using Sklearn

We will now use sklearn to train a model on the breast cancer data.

Import everything

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline

Load data and split it into training and testing data

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Initialize the model with default criterion (Gini Index)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

Make predictions and check the reports

predictions = clf.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

Using Entropy as the criterion for the classification model

clf_entropy = DecisionTreeClassifier(criterion='entropy')
clf_entropy.fit(X_train, y_train)
predictions_entropy = clf_entropy.predict(X_test)
print(confusion_matrix(y_test, predictions_entropy))
print(classification_report(y_test, predictions_entropy))
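Optionally, you can also inspect the tree that a criterion learned. Here is a minimal sketch using sklearn's plot_tree (available in recent sklearn versions), reusing the clf fitted above; limiting max_depth just keeps the plot readable:

```python
from sklearn.tree import plot_tree

# Plot only the top levels of the fitted Gini tree (use clf_entropy to inspect the other one).
plt.figure(figsize=(14, 6))
plot_tree(clf, max_depth=2, filled=True,
          feature_names=list(load_breast_cancer().feature_names))
plt.show()
```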

Comparing the two classification reports, Entropy performed better in this case, but it really depends upon you and your data which criterion you want to choose. Thanks for reading and do subscribe for more such posts.