That’s an intimidating tree for newcomers. Let’s break it down a bit:

Features: X0 = AFP, X1 = CEA, X2 = CA125, X3 = CA50. Root node (layer 1): CEA ≤ 3.25, gini = 0.492, class counts = [144 noncancerous, 111 cancerous].

The root node shows the Gini index of the whole data set, prior to any branching. The lower the Gini score, the purer the data; the most-mixed possible data would give a Gini index of 0.5.

To refresh, there are 144 noncancerous and 111 cancerous patients in our data. The Gini index for this split is 0.492, which means the data is very mixed. But don’t worry, the tree will lower the Gini indices as new branches and nodes are formed.

Gini Index = 1 − ((144/255)² + (111/255)²) = 0.4916
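If you want to verify that number, a quick sanity check in Python (using the class counts quoted above) reproduces it:

#Sanity-checking the root node's Gini index from the class counts in the text
noncancerous, cancerous = 144, 111
total = noncancerous + cancerous
gini = 1 - ((noncancerous / total) ** 2 + (cancerous / total) ** 2)
print(round(gini, 4))  #0.4916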

The regression model told us CEA is the most predictive feature, with the highest coefficient and the lowest p-value. The decision tree agrees, placing CEA at the root node; every other node is derived from the root node’s split. The algorithm chose to split at a CEA level of 3.25 because that point separates the target variable into cancerous and noncancerous more purely than any other cut-off in any other attribute. The instances with CEA values below 3.25 (180 samples) are more likely to be noncancerous; the instances above 3.25 (75 samples) are more likely to be cancerous. Refer to the connecting internal nodes below the root to see how the instances are further divided.
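If you’d like to inspect that root split yourself, here is a minimal sketch. It assumes the X and y arrays from earlier in the article, with columns ordered AFP, CEA, CA125, CA50 (an assumption), and fits a single tree just for inspection:

from sklearn.tree import DecisionTreeClassifier  #a single tree, only for inspecting the splits

feature_names = ['AFP', 'CEA', 'CA125', 'CA50']  #assumed column order of X
single_tree = DecisionTreeClassifier(random_state=0)
single_tree.fit(X, y)

#tree_.feature[0] and tree_.threshold[0] describe the root node's split
root_feature = feature_names[single_tree.tree_.feature[0]]
root_threshold = single_tree.tree_.threshold[0]
print(root_feature, '<=', round(root_threshold, 2))  #expect something like: CEA <= 3.25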

The tree’s second layer analyzes both new buckets of data (the 180 samples below CEA 3.25 and the 75 samples above) in the same way it did the root node:

It runs the tree-building algorithm (scikit-learn uses an optimized version of CART), finds the attribute that divides the target variable with maximum purity, determines the optimal cut-off point, and splits.

The second-layer node on the CEA-above-3.25 branch splits on CA125 levels above 38.65. This split produces another internal node of 72 samples and our first leaf node of 3 samples. That leaf node has a Gini index of 0 because all 3 of its samples are noncancerous. Based on this particular leaf node, the algorithm will classify future data as follows:

If CEA ≥ 3.25 AND CA125 ≥ 38.65 → Patient = noncancerous (0)

The process continues until every branch ends in a leaf node, so there is a decision for every possible series of splits.
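To read all of those decision rules straight off a fitted tree, scikit-learn’s export_text helper prints them in exactly this if/then form. A minimal sketch, reusing the hypothetical single_tree and feature_names from the earlier snippet:

from sklearn.tree import export_text  #prints the learned splits as nested if/then rules

rules = export_text(single_tree, feature_names=feature_names)
print(rules)
#|--- CEA <= 3.25
#|   |--- ...
#|--- CEA >  3.25
#|   |--- CA125 <= 38.65
#|   |--- ...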

Random Forest

Instead of stopping there and basing our model on a single tree’s leaves, we will implement a random forest: drawing random samples of the data, fitting many decision trees, and averaging their predictions to form a more refined model. This model averages the predictions of 1,000 trees.

#Importing
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split as tts

#Dividing into training (70%) and testing (30%)
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.3, random_state=None)

#Fitting the random forest on the training data
treeclass = RandomForestClassifier(n_estimators=1000)
treeclass.fit(X_train, y_train)

#Calculating the accuracy of the trained model on the testing data
y_pred = treeclass.predict(X_test)
y_pred_prob = treeclass.predict_proba(X_test)
accuracy = treeclass.score(X_test, y_test)
print('The accuracy is: ' + str(accuracy * 100) + '%')
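The predicted probabilities in y_pred_prob are not used above, but since metrics is already imported, one optional way to use them is an ROC AUC score. This is a sketch, not part of the original pipeline, and it assumes the cancerous class is labeled 1 (noncancerous is 0, as in the leaf-node rule earlier):

#Optional: scoring the predicted probabilities with ROC AUC
auc = metrics.roc_auc_score(y_test, y_pred_prob[:, 1])  #column 1 = probability of the cancerous class
print('The ROC AUC is: ' + str(round(auc, 3)))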

The random forest’s 71% accuracy falls slightly short of the logistic model’s 74% accuracy.