
Classification trees are known to be unstable with respect to training data. Recently I read an article on the stability of classification trees by Briand et al. (2009). They propose a quantitative similarity measure between two trees. The method is interesting, and it inspired me to prepare a simple test-data-based example showing the instability of classification trees.
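To make the idea of tree similarity concrete, here is a minimal sketch. This is not the measure from Briand et al. (2009), just a crude prediction-based proxy: grow two trees on bootstrap samples and correlate their predicted class probabilities on the original data.

library(party)  # ctree() and treeresponse()
library(Ecdat)  # Participation data set
data(Participation)

set.seed(42)
# two bootstrap samples of the same data
boot1 <- Participation[sample(nrow(Participation), replace = TRUE), ]
boot2 <- Participation[sample(nrow(Participation), replace = TRUE), ]
t1 <- ctree(lfp ~ ., data = boot1)
t2 <- ctree(lfp ~ ., data = boot2)

# predicted probability of the second class for every original observation
p1 <- sapply(treeresponse(t1, newdata = Participation), function(x) x[2])
p2 <- sapply(treeresponse(t2, newdata = Participation), function(x) x[2])
cor(p1, p2)  # a crude similarity score; low values indicate instability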

I compare the stability of logistic regression and a classification tree on the Participation data set from the Ecdat package. The method works as follows:

1. Divide the data into a training and a test data set.
2. Generate a random subset of the training data and build a logistic regression model and a classification tree on it.
3. Apply both models to the test data to obtain predicted probabilities.
4. Repeat steps 2 and 3 many times.
5. For each observation in the test data set, calculate the standard deviation of the obtained predictions for each class of models.
6. For both models, plot a kernel density estimate of the distribution of these standard deviations over the test data set.

The code performing the above steps is as follows:

library(party)  # ctree() and treeresponse()
library(Ecdat)  # Participation data set

data(Participation)
set.seed(1)

# shuffle the data and split it into test and training sets
shuffle <- Participation[sample(nrow(Participation)), ]
test <- shuffle[1:300, ]
train <- shuffle[301:nrow(Participation), ]

reps <- 1000
p.tree <- p.log <- vector("list", reps)

for (i in 1:reps) {
    # draw a random subset of the training data
    train.sub <- train[sample(nrow(train))[1:300], ]
    mtree <- ctree(lfp ~ ., data = train.sub)
    mlog <- glm(lfp ~ ., data = train.sub, family = binomial)
    # predicted probability of the second class for each test observation
    p.tree[[i]] <- sapply(treeresponse(mtree, newdata = test),
                          function(x) x[2])
    p.log[[i]] <- predict(mlog, newdata = test, type = "response")
}

# per-observation standard deviation of predictions across replications
plot(density(apply(do.call(rbind, p.log), 2, sd)),
     main = "", xlab = "sd")
lines(density(apply(do.call(rbind, p.tree), 2, sd)), col = "red")
legend("topright", legend = c("logistic", "tree"),
       col = c("black", "red"), lty = 1)

And here is the generated comparison. As can be clearly seen, logistic regression gives much more stable predictions than the classification tree.
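If a single summary number is preferred over the density plot, the mean per-observation standard deviations can also be compared directly. A small sketch (my addition, reusing the objects computed above):

# mean per-observation standard deviation across the 1000 replications
sd.log <- apply(do.call(rbind, p.log), 2, sd)
sd.tree <- apply(do.call(rbind, p.tree), 2, sd)
round(c(logistic = mean(sd.log), tree = mean(sd.tree)), 4)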