Greedy nature of decision trees

Regression-tree-based models produce the result (leaf node) as a real value. For example, consider a model that tries to predict the salary of baseball players using data like years of experience and the number of hits that the hitter made in the last season.

Almost all decision tree algorithms use a greedy approach: at every node generation, we don't care about the final tree becoming better; we are only worried about making the best split at any given time. Let's understand this by taking an example. To simplify the calculations, let's assume that we only have three features in the dataset, experience, goals last season, and salary, and that salary is the feature we want to predict.

Striker salary dataset (columns: Experience, Goals last season, Salary)

On this random data, we decided to fit a decision tree model. After initially analyzing the data, we found that experience was the most usable factor, clearly dividing the whole dataset into different parts. So, rather than thinking ahead in time, we directly choose experience as the root node at this point. We will discuss this example in much more detail later in this post.

[Figure: a decision tree with Experience as the root node]

As we already know, the equation of the linear regression model is the equation of a straight line. The regression tree has an equation as well.
Equation of Regression Tree

$f(x) = \displaystyle \sum_{m=1}^{M} C_m \cdot \mathbf{1}(X \in R_m)$

where $R_1, R_2, \ldots, R_M$ are the different regions and $C_m$ is the constant (the mean response) predicted in region $R_m$.
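To make the equation concrete, here is a minimal Python sketch of $f(x)$ for a tree with two hypothetical one-dimensional regions (the split point and constants are made up):

```python
# f(x) = sum_m C_m * 1(X in R_m): exactly one indicator is non-zero,
# so the sum reduces to the constant of the region containing x.
# Hypothetical regions R1, R2 split at x = 4.5 with made-up constants.
regions = [
    (lambda x: x < 4.5, 300.0),   # R1: x < 4.5  -> C_1
    (lambda x: x >= 4.5, 700.0),  # R2: x >= 4.5 -> C_2
]

def f(x):
    return sum(c_m for in_region, c_m in regions if in_region(x))

print(f(2.0))  # 300.0
print(f(6.0))  # 700.0
```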

Predictions in Regression Trees

There are a few different ways in which we can make predictions with a regression tree. In this post, we are going to discuss:

Prediction using stratification of the feature space

Prediction using tree pruning

Prediction using stratification of the feature space

As we have already discussed, while fitting trees we almost always use the greedy approach. We will use the same approach here as well. At any given time of node (branch feature) determination, we want to reduce the value of the Residual Sum of Squares (RSS). Mathematically, the value is given by

$RSS = \displaystyle \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \widehat{y}_{R_j})^{2}$

where the inner sum runs over every $y_i$ in the given region, and $\widehat{y}_{R_j}$ is the mean value of the training observations in the $j$th box.

Let's try to understand it using the hitters dataset. First, we plot the whole dataset onto a plane. For simplicity, let's consider that all the points are confined in a rectangle. In regression trees, at each step we have to determine two values: the branch feature (node) and the cutoff value at which the resulting RSS is minimum.
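As a quick illustration of the formula, the RSS of a partition can be computed directly; the regions and salary values below are made up:

```python
import numpy as np

# Hypothetical salaries already grouped into two regions (boxes).
regions = {
    "R1": np.array([250.0, 300.0, 350.0]),
    "R2": np.array([600.0, 700.0, 800.0]),
}

# RSS: for each region, sum the squared deviations of every y_i
# from that region's mean, then add the regions together.
rss = sum(((y - y.mean()) ** 2).sum() for y in regions.values())
print(rss)  # 25000.0
```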

While working on the hitters dataset, we found that age was dividing the region in a better way, so the branch feature (root node) that we chose was age. Also, we found that a cutoff value of 4.5 years was giving optimal results. We then recurse and apply the same steps again. We now have two regions: $R_1$, with the salaries of hitters with experience < 4.5 years, and $R_2$, with the salaries of hitters with experience >= 4.5 years. We choose from these two available regions and try to divide them further. The only other available feature is the number of hits. We found that there is a clear distinction between the salaries of hitters who made 117+ hits last year in region $R_2$. Finally, we chose to leave $R_1$ as it is, as there was no clear advantage in dividing it further. This is what our regression tree will look like:

[Figure: the fitted regression tree, splitting first on age at 4.5 years and then on number of hits at 117]

Here the leaf nodes show the mean salary of all the hitters in the given region. Along the same lines, the region partition will look something like this:

[Figure: the corresponding rectangular partition of the feature space into regions R1 and R2]

Mathematically, in stratification of the feature space we first select the predictor $X_j$ and the cutoff value $s$ such that splitting the space into $\{X \mid X_j < s\}$ and $\{X \mid X_j \geq s\}$ leads to the maximum reduction in the RSS. So, the equation that we want to minimize is

$\displaystyle \sum_{i:\, x_i \in R_1(j, s)}(y_i - \widehat{y}_{R_1})^{2} + \sum_{i:\, x_i \in R_2(j, s)}(y_i - \widehat{y}_{R_2})^{2}$

This process of dividing one of the available regions into smaller ones is repeated until we reach an endpoint or a proper regression value. Therefore, using this algorithm to create a model and then trying to understand it is quite tough when the number of features is high.
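Here is a brute-force sketch of this greedy search over $(j, s)$ on a tiny made-up dataset; a real implementation would be smarter about candidate cutoffs:

```python
import numpy as np

def best_split(X, y):
    """Return the (feature j, cutoff s, RSS) minimizing the
    two-region RSS from the equation above."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue  # both regions must be non-empty
            rss = (((left - left.mean()) ** 2).sum()
                   + ((right - right.mean()) ** 2).sum())
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Made-up data: columns are [experience, hits], target is salary.
X = np.array([[1, 50], [2, 60], [3, 55],
              [5, 110], [6, 120], [8, 130]], dtype=float)
y = np.array([200.0, 220.0, 210.0, 600.0, 650.0, 700.0])
print(best_split(X, y))  # (0, 5.0, 5200.0): splits on experience at 5
```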

Disadvantages of predicting using stratification

This algorithm might produce good results on the training data, but it is likely to overfit, leading to poor test set performance. This happens because the resulting tree is too complex.

Machine learning models try to find a sweet spot between overfitting and underfitting, so that the error rate is minimized while the model is still able to predict correctly on unseen data.

Predicting using Tree Pruning

Tree pruning isn't only used for regression trees; we make use of it in classification trees as well. As the word itself suggests, the process involves cutting the tree into smaller parts. We can do pruning in two ways:

Pre-pruning or early stopping: stopping before the full tree is even created. We can keep growing the tree until the next node doesn't change the error rate drastically; whenever we see only a very minor change in the error rate after building the next node, we can stop building further.

Post-pruning: creating the full tree up until we have very few data points in each set, and then backtracking such that the change in the error rate is not much.

Since tree pruning is itself a vast topic, I have created a separate post for it; please read it if you want to know more: A practical approach to Tree Pruning using sklearn | Decision Trees.
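Both styles map onto scikit-learn parameters; as a rough sketch (the parameter values below are arbitrary choices, not recommendations):

```python
from sklearn.tree import DecisionTreeRegressor

# Pre-pruning (early stopping): constrain the tree while it grows.
pre_pruned = DecisionTreeRegressor(max_depth=4, min_samples_leaf=5)

# Post-pruning: grow the full tree, then cut it back with
# cost-complexity pruning (larger ccp_alpha means more pruning).
post_pruned = DecisionTreeRegressor(ccp_alpha=0.01)
```

Candidate ccp_alpha values can be obtained from DecisionTreeRegressor().cost_complexity_pruning_path(X_train, y_train).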






Regression Tree analysis using Sklearn

Let's import the required libraries:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
```

We are using the boston data, which is the house pricing data of the Boston region. You can learn more about the data by running:

```python
print(load_boston().DESCR)
```

Split the data into training and testing sets:

```python
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```
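Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippets above only run on older versions; on a newer release you can substitute another regression dataset, for example:

```python
# Alternative for scikit-learn >= 1.2, where load_boston no longer exists.
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True)
```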

Fit the Regression Tree model and get the predictions:

```python
clf = DecisionTreeRegressor()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(mean_squared_error(y_test, predictions))           # MSE
print(np.sqrt(mean_squared_error(y_test, predictions)))  # RMSE
```

Results:

```
18.59688622754491
4.312410721110051
```

Finding the relation between Tree depth and Mean Square Error

We can train the model with different depth values using a simple Python loop and then plot the relation between them:

```python
# Fit one tree per depth, from 1 up to the depth of the unrestricted
# tree, and record the test-set mean squared error for each.
mses = []
for depth in range(1, clf.tree_.max_depth + 1):
    d_tree_reg = DecisionTreeRegressor(max_depth=depth)
    d_tree_reg.fit(X_train, y_train)
    tree_predictions = d_tree_reg.predict(X_test)
    mses.append(mean_squared_error(y_test, tree_predictions))

tree_depths = [depth for depth in range(1, clf.tree_.max_depth + 1)]
plt.figure(figsize=(10, 6))
plt.grid()
plt.plot(tree_depths, mses)
plt.xlabel("Tree Depth")
plt.ylabel("Mean Square Error")
```
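We can also pick the best depth programmatically instead of reading it off the plot; this small addition is not part of the original listing:

```python
# Index of the smallest test MSE gives the best depth.
best_depth = tree_depths[int(np.argmin(mses))]
print(best_depth, min(mses))
```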

[Plot: Mean Square Error versus Tree Depth]

We can clearly see that if we set the depth of the tree to 6, we get the best results and the minimum Mean Square Error. Please share on social media and subscribe to the newsletter to read more such posts.