Introduction to Random Forest in R

Let’s learn random forest in R through a concise, hands-on demo for machine learning and data analytics. Open RStudio and type along with the examples below. Learn by practice.

Importing the libraries

We need to import libraries such as randomForest in order to use the random forest algorithm in R.

#### Setting the seed so that we get the same results each time we run randomForest
set.seed(123)
#### Importing the library MASS for the birthwt dataset and the library
#### randomForest for the random forest model
library(MASS, quietly = TRUE)
library(randomForest, quietly = TRUE)

## Warning: package 'randomForest' was built under R version 3.2.2

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

Reading the data

Let’s read the data so that we can implement the random forest algorithm in R.

#### Storing the data set named "birthwt" into a data frame named "DataFrame"
DataFrame <- birthwt
#### Type help("birthwt") to learn about the data set
#### Let's check out the structure of the data
str(DataFrame)

## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

Data Exploration

Before we begin to apply random forest in R, let’s first explore the data set.

#### Check the dimension of this data frame
dim(DataFrame)

## [1] 189 10

#### Check the first 3 rows
head(DataFrame, 3)

##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557

#### Check the summary of the data
summary(DataFrame)

##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl              ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Categorical Variables

We need to convert the categorical variables into factor variables in order to use random forest in R.

#### Check the number of unique values in each column
apply(DataFrame, 2, function(x) length(unique(x)))

##   low   age   lwt  race smoke   ptl    ht    ui   ftv   bwt 
##     2    24    75     3     2     4     2     2     6   131

#### It seems the variables low, race, smoke, ptl, ht, ui and ftv are categorical
#### To convert them into factors, use as.factor
#### Converting into factors
cols <- c("low","race","smoke","ptl","ht","ui","ftv")
for(i in cols){
  DataFrame[,i] <- as.factor(DataFrame[,i])
}
#### Let's check the data set again
str(DataFrame)

## 'data.frame':    189 obs. of  10 variables:
##  $ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
##  $ ptl  : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
##  $ ftv  : Factor w/ 6 levels "0","1","2","3",..: 1 4 2 3 1 1 2 2 2 1 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
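As an aside, the for loop above can be written more compactly. A small sketch using lapply, which produces the same factor columns and is purely a style choice:

```r
#### Compact alternative to the for loop: convert all listed columns at once
library(MASS, quietly = TRUE)

DataFrame <- birthwt
cols <- c("low","race","smoke","ptl","ht","ui","ftv")
DataFrame[cols] <- lapply(DataFrame[cols], as.factor)

#### Confirm the conversion
str(DataFrame[cols])
```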

Data Partition

We need to partition the data into training and testing sets. The testing set is required in order to test the accuracy of the random forest model.

#### Let's create the train and test data sets. The target variable is low
library(caTools)
ind <- sample.split(Y = DataFrame$low, SplitRatio = 0.7)
trainDF <- DataFrame[ind,]
testDF  <- DataFrame[!ind,]
#### Random forest parameters
#### 1. mtry     = number of variables randomly sampled as candidates at each split
#### 2. ntree    = number of trees to grow
#### 3. nodesize = minimum size of terminal nodes

Model Fitting for Random Forest in R

Let’s now fit the random forest model in R using the randomForest function.

#### Fitting the model
modelRandom <- randomForest(low~., data = trainDF, mtry = 3, ntree = 20)
#### Looking at the summary of the model
modelRandom

## 
## Call:
##  randomForest(formula = low ~ ., data = trainDF, mtry = 3, ntree = 20) 
##                Type of random forest: classification
##                      Number of trees: 20
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 3.79%
## Confusion matrix:
##    0  1 class.error
## 0 89  2  0.02197802
## 1  3 38  0.07317073

Variable Importance in Random Forest in R

Let’s check which of the predictor variables in the random forest model have high importance in predictions. We can use the importance function for this purpose.

#### Plotting the importance of each variable
#### A higher mean decrease in accuracy or mean decrease in Gini score implies
#### a more important variable in the model
importance(modelRandom)

##       MeanDecreaseGini
## age         2.43397938
## lwt         1.93633860
## race        0.49444115
## smoke       1.44438665
## ptl         1.77731653
## ht          0.13828860
## ui          0.08424019
## ftv         0.84281661
## bwt        46.56713167

varImpPlot(modelRandom)
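The table above shows only MeanDecreaseGini because randomForest computes the permutation-based accuracy importance only when asked. A minimal sketch, assuming the randomForest package is installed, of refitting with importance = TRUE so that importance() also reports MeanDecreaseAccuracy (the fit here is on the full data set for self-containment, so the numbers will differ from the model above):

```r
library(MASS, quietly = TRUE)
library(randomForest, quietly = TRUE)

set.seed(123)
DataFrame <- birthwt
cols <- c("low","race","smoke","ptl","ht","ui","ftv")
for(i in cols) DataFrame[,i] <- as.factor(DataFrame[,i])

#### importance = TRUE adds permutation-based accuracy importance columns
fit <- randomForest(low ~ ., data = DataFrame, mtry = 3, ntree = 20,
                    importance = TRUE)

#### importance() now returns per-class columns, MeanDecreaseAccuracy
#### and MeanDecreaseGini
importance(fit)
```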

Predictions

Let’s now check what the random forest model predicts for the test data set, and then compare these predicted values with the actual values.

#### Predictions
PredictionsWithClass <- predict(modelRandom, testDF, type = 'class')
t <- table(predictions = PredictionsWithClass, actual = testDF$low)
t

##            actual
## predictions  0  1
##           0 39  0
##           1  0 18

Accuracy Metric

From the above confusion matrix we can calculate the accuracy metric. The diagonal entries are correct predictions and the off-diagonal entries are incorrect predictions made by the random forest model in R.

#### Accuracy metric
sum(diag(t))/sum(t)

## [1] 1
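A word of caution on this perfect score: in the birthwt data, low is defined as bwt < 2500 g, so keeping bwt among the predictors leaks the answer into the model, and near-perfect accuracy is expected rather than impressive. A hedged sketch of refitting without bwt (fit on the full data here for self-containment; exact error rates will vary with the seed and split, but they will no longer be near zero):

```r
library(MASS, quietly = TRUE)
library(randomForest, quietly = TRUE)

set.seed(123)
DataFrame <- birthwt
cols <- c("low","race","smoke","ptl","ht","ui","ftv")
for(i in cols) DataFrame[,i] <- as.factor(DataFrame[,i])

#### Exclude bwt, the variable from which low is derived
modelNoLeak <- randomForest(low ~ . - bwt, data = DataFrame,
                            mtry = 3, ntree = 20)
modelNoLeak
```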

ROC Curve

Let’s plot the ROC curve for the random forest model in R. The auc function from the pROC package calculates the AUC value.

#### Plotting the ROC curve and calculating the AUC metric
library(pROC, quietly = TRUE)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

PredictionsWithProbs <- predict(modelRandom, testDF, type = 'prob')
auc <- auc(testDF$low, PredictionsWithProbs[,2])
plot(roc(testDF$low, PredictionsWithProbs[,2]))

## 
## Call:
## roc.default(response = testDF$low, predictor = PredictionsWithProbs[, 2])
## 
## Data: PredictionsWithProbs[, 2] in 39 controls (testDF$low 0) < 18 cases (testDF$low 1).
## Area under the curve: 1

Best Random Forest Model

Let’s tune the mtry parameter of the random forest model using the tuneRF function in R.

#### To find the best mtry
bestmtry <- tuneRF(trainDF, trainDF$low, ntreeTry = 200, stepFactor = 1.2,
                   improve = 0.01, trace = T, plot = T)

## mtry = 3  OOB error = 0.76% 
## Searching left ...
## Searching right ...

bestmtry

##       mtry    OOBError
## 3.OOB    3 0.007575758
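Note that the call above passes the full trainDF as the predictor matrix, so the response low itself (and the leaking bwt column) sit among the predictors, which is why the reported OOB error is under 1%. A hedged sketch of the more conventional call, with the response and bwt dropped from x (fit on the full data here for self-containment; the OOB errors will be realistic rather than near zero):

```r
library(MASS, quietly = TRUE)
library(randomForest, quietly = TRUE)

set.seed(123)
DataFrame <- birthwt
cols <- c("low","race","smoke","ptl","ht","ui","ftv")
for(i in cols) DataFrame[,i] <- as.factor(DataFrame[,i])

#### Exclude the response (low) and the leaking bwt column from x
bestmtry <- tuneRF(x = DataFrame[, setdiff(names(DataFrame), c("low","bwt"))],
                   y = DataFrame$low,
                   ntreeTry = 200, stepFactor = 1.2,
                   improve = 0.01, trace = TRUE, plot = FALSE)
bestmtry
```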