A concise demo of decision trees for machine learning and data analytics in R. Open RStudio and type along with the code below. Learn by practice!

#### A decision tree predicts the output by making a split decision on each
#### variable
#### Importing the required libraries. MASS provides the birthwt dataset
library(MASS)
library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.2.2

#### Setting the seed so that we get the same results each time we run the
#### decision tree
set.seed(123)
#### Storing the data set named "birthwt" into DataFrame
DataFrame <- birthwt
#### To read about the dataset, uncomment and run the following command
#### help("birthwt")
#### Let's check out the structure of the data
str(DataFrame)

## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

#### Check the dimensions of this data frame
dim(DataFrame)

## [1] 189 10

#### Check the first 3 rows
head(DataFrame, 3)

##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557

#### Check the percentage of unique values in each column
apply(DataFrame, 2, function(x) round(length(unique(x)) / nrow(DataFrame), 3) * 100)

##  low  age  lwt race smoke  ptl   ht   ui  ftv  bwt
##  1.1 12.7 39.7  1.6  1.1  2.1  1.1  1.1  3.2 69.3
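The percentages above can drive the factor conversion in the next step automatically. A small sketch of that heuristic (the 5% cutoff is my own assumption, not part of the original demo; tune it for your data):

```r
#### Treat any column whose unique-value ratio is under 5% as a factor candidate
library(MASS)  # provides birthwt

ratio <- apply(birthwt, 2, function(x) length(unique(x)) / nrow(birthwt))
factor_cands <- names(ratio[ratio < 0.05])
factor_cands  # low, race, smoke, ptl, ht, ui, ftv
```

This picks out exactly the seven columns converted manually below, while leaving the genuinely numeric age, lwt and bwt alone.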


#### Variables low, race, smoke, ptl, ht, ui and ftv appear to be categorical,
#### so convert them to factors
cols <- c("low", "race", "smoke", "ptl", "ht", "ui", "ftv")
for (i in cols) {
  DataFrame[, i] <- as.factor(DataFrame[, i])
}
#### Check the data set again
str(DataFrame)

## 'data.frame':    189 obs. of  10 variables:
##  $ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
##  $ ptl  : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
##  $ ftv  : Factor w/ 6 levels "0","1","2","3",..: 1 4 2 3 1 1 2 2 2 1 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

#### Let's create the train and test data sets; the target variable is low
library(caTools)
ind <- sample.split(Y = DataFrame$low, SplitRatio = 0.8)
trainDF <- DataFrame[ind, ]
testDF <- DataFrame[!ind, ]
#### Fitting the model. bwt is excluded because low is defined directly from it
#### (low = 1 when bwt < 2.5 kg), so including it would leak the answer
DecisionTreeModel <- rpart(low ~ . - bwt, data = trainDF, method = 'class')
#### Let's plot the decision tree
prp(DecisionTreeModel)
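Because sample.split stratifies on the variable passed as Y, the class balance of low should be nearly identical in the two subsets. A quick sanity check, continuing the session above (trainDF and testDF as just created):

```r
#### Class proportions of the target in the train and test sets
round(prop.table(table(trainDF$low)), 3)
round(prop.table(table(testDF$low)), 3)
```

If these two tables diverged noticeably, the accuracy measured on the test set later would not be comparable to the training behaviour.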

#### Let's check the summary of the model
summary(DecisionTreeModel)

## Call:
## rpart(formula = low ~ . - bwt, data = trainDF, method = "class")
##   n= 151
##
##              CP nsplit    rel error      xerror         xstd
## 1 0.10638297872      0 1.0000000000 1.000000000 0.1210540342
## 2 0.06382978723      1 0.8936170213 1.127659574 0.1247856487
## 3 0.02127659574      4 0.7021276596 1.085106383 0.1236513199
## 4 0.01000000000      5 0.6808510638 1.085106383 0.1236513199
##
## Variable importance
##   lwt   ptl smoke  race   ftv   age    ui    ht
##    25    23    20    11    10     5     4     2
##
## Node number 1: 151 observations,    complexity param=0.1063829787
##   predicted class=0  expected loss=0.3112582781  P(node) =1
##     class counts:   104    47
##    probabilities: 0.689 0.311
##   left son=2 (126 obs) right son=3 (25 obs)
##   Primary splits:
##       lwt   < 106 to the right, improve=4.995690108, (0 missing)
##       ptl   splits as  LRLL,    improve=4.460222652, (0 missing)
##       smoke splits as  LR,      improve=2.705656281, (0 missing)
##       ftv   splits as  RLLRL-,  improve=2.379528492, (0 missing)
##       ui    splits as  LR,      improve=1.834745110, (0 missing)
##   Surrogate splits:
##       age < 14.5 to the right, agree=0.841, adj=0.04, (0 split)
##       ptl splits as  LLLR,     agree=0.841, adj=0.04, (0 split)
##
## Node number 2: 126 observations,    complexity param=0.06382978723
##   predicted class=0  expected loss=0.253968254  P(node) =0.8344370861
##     class counts:    94    32
##    probabilities: 0.746 0.254
##   left son=4 (78 obs) right son=5 (48 obs)
##   Primary splits:
##       smoke splits as  LR,     improve=3.121031746, (0 missing)
##       ptl   splits as  LRL-,   improve=2.657743458, (0 missing)
##       ui    splits as  LR,     improve=2.657743458, (0 missing)
##       ftv   splits as  RLLRL-, improve=2.298656586, (0 missing)
##       age   < 28.5 to the right, improve=1.569585988, (0 missing)
##   Surrogate splits:
##       ptl splits as  LRL-,    agree=0.659, adj=0.104, (0 split)
##       lwt < 183 to the left,  agree=0.635, adj=0.042, (0 split)
##       ht  splits as  LR,      agree=0.635, adj=0.042, (0 split)
##       ftv splits as  LLLRL-,  agree=0.635, adj=0.042, (0 split)
##
## Node number 3: 25 observations,    complexity param=0.06382978723
##   predicted class=1  expected loss=0.4  P(node) =0.1655629139
##     class counts:    10    15
##    probabilities: 0.400 0.600
##   left son=6 (13 obs) right son=7 (12 obs)
##   Primary splits:
##       race  splits as  LLR,     improve=2.5128205130, (0 missing)
##       lwt   < 100.5 to the left, improve=2.3472222220, (0 missing)
##       age   < 18.5 to the left,  improve=1.9206349210, (0 missing)
##       ui    splits as  RL,      improve=0.5714285714, (0 missing)
##       smoke splits as  RL,      improve=0.2051282051, (0 missing)
##   Surrogate splits:
##       smoke splits as  RL,       agree=0.76, adj=0.500, (0 split)
##       age   < 27 to the right,   agree=0.68, adj=0.333, (0 split)
##       lwt   < 87.5 to the right, agree=0.60, adj=0.167, (0 split)
##       ptl   splits as  LRLL,     agree=0.60, adj=0.167, (0 split)
##       ht    splits as  LR,       agree=0.60, adj=0.167, (0 split)
##
## Node number 4: 78 observations
##   predicted class=0  expected loss=0.1666666667  P(node) =0.5165562914
##     class counts:    65    13
##    probabilities: 0.833 0.167
##
## Node number 5: 48 observations,    complexity param=0.06382978723
##   predicted class=0  expected loss=0.3958333333  P(node) =0.3178807947
##     class counts:    29    19
##    probabilities: 0.604 0.396
##   left son=10 (38 obs) right son=11 (10 obs)
##   Primary splits:
##       ptl  splits as  LRL-,    improve=4.1267543860, (0 missing)
##       ftv  splits as  RRLRL-,  improve=1.7959401710, (0 missing)
##       age  < 28.5 to the right, improve=0.9688596491, (0 missing)
##       race splits as  LRL,     improve=0.5651709402, (0 missing)
##       lwt  < 132 to the right, improve=0.2916666667, (0 missing)
##   Surrogate splits:
##       ui  splits as  LR,      agree=0.833, adj=0.2, (0 split)
##       ftv splits as  LLLLR-,  agree=0.812, adj=0.1, (0 split)
##
## Node number 6: 13 observations
##   predicted class=0  expected loss=0.3846153846  P(node) =0.08609271523
##     class counts:     8     5
##    probabilities: 0.615 0.385
##
## Node number 7: 12 observations
##   predicted class=1  expected loss=0.1666666667  P(node) =0.07947019868
##     class counts:     2    10
##    probabilities: 0.167 0.833
##
## Node number 10: 38 observations,    complexity param=0.02127659574
##   predicted class=0  expected loss=0.2894736842  P(node) =0.2516556291
##     class counts:    27    11
##    probabilities: 0.711 0.289
##   left son=20 (29 obs) right son=21 (9 obs)
##   Primary splits:
##       ftv  splits as  LRLR--,  improve=1.6698931240, (0 missing)
##       race splits as  LRL,     improve=1.3642978410, (0 missing)
##       lwt  < 148.5 to the left, improve=0.5674763833, (0 missing)
##       age  < 26.5 to the right, improve=0.5482456140, (0 missing)
##
## Node number 11: 10 observations
##   predicted class=1  expected loss=0.2  P(node) =0.06622516556
##     class counts:     2     8
##    probabilities: 0.200 0.800
##
## Node number 20: 29 observations
##   predicted class=0  expected loss=0.2068965517  P(node) =0.1920529801
##     class counts:    23     6
##    probabilities: 0.793 0.207
##
## Node number 21: 9 observations
##   predicted class=1  expected loss=0.4444444444  P(node) =0.05960264901
##     class counts:     4     5
##    probabilities: 0.444 0.556
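The CP table at the top of this summary reports the cross-validated relative error (xerror) for each subtree size. A common follow-up, sketched here as a continuation of the session above (it is not part of the original demo), is to prune at the cp value with the lowest xerror:

```r
#### Prune at the complexity parameter with the lowest cross-validated error
best_cp <- DecisionTreeModel$cptable[which.min(DecisionTreeModel$cptable[, "xerror"]), "CP"]
PrunedTree <- prune(DecisionTreeModel, cp = best_cp)
prp(PrunedTree)
```

Note that in the table above xerror is smallest at nsplit = 0, so this rule would prune all the way back to the root, a warning sign that the unpruned splits may not generalise beyond the training data.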


#### Predictions on the test set
PredictionsWithClass <- predict(DecisionTreeModel, testDF, type = 'class')
t <- table(predictions = PredictionsWithClass, actual = testDF$low)
#### Accuracy metric
sum(diag(t)) / sum(t)

## [1] 0.6578947368
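Accuracy can flatter a model on imbalanced data: about 69% of the babies in birthwt are not low-weight, so always predicting 0 would already score roughly 0.69. Two complementary metrics computed from the same confusion table t, continuing the session above:

```r
#### Rows of t are predictions, columns are actual values
sensitivity <- t["1", "1"] / sum(t[, "1"])  # true positives / actual positives
specificity <- t["0", "0"] / sum(t[, "0"])  # true negatives / actual negatives
c(sensitivity = sensitivity, specificity = specificity)
```

For a screening problem like low birth weight, sensitivity (how many at-risk cases the tree actually catches) usually matters more than raw accuracy.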

#### Plotting the ROC curve and calculating the AUC metric
library(pROC)
PredictionsWithProbs <- predict(DecisionTreeModel, testDF, type = 'prob')
auc <- auc(testDF$low, PredictionsWithProbs[, 2])
plot(roc(testDF$low, PredictionsWithProbs[, 2]))
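The auc object above is computed but never displayed. Printing it, along with a confidence interval that pROC can attach (a sketch continuing the session above):

```r
auc                                            # area under the ROC curve
ci.auc(testDF$low, PredictionsWithProbs[, 2])  # 95% CI (DeLong method by default)
```

An AUC whose confidence interval includes 0.5 would mean the tree's ranking of test cases is not reliably better than chance.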