A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

A decision tree is a simple yet powerful tool in Machine Learning. Let's take a look at an example from becominghuman.ai:

This tree consists of the following components:

Questions/conditions are the nodes.

Yes/No options represent the edges.

End actions are the leaves of the tree.
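
Since a decision tree is, as noted above, just an algorithm of conditional control statements, here is a minimal sketch of what a tiny tree looks like as plain code (a hypothetical weather example, not part of the dataset used below):

def should_play_outside(weather, humidity):
    # Node: question about the weather
    if weather == 'Sunny':
        # Node: follow-up question, reached via the 'Yes' edge
        if humidity == 'High':
            return 'Stay inside'  # Leaf: end action
        return 'Play outside'     # Leaf: end action
    return 'Stay inside'          # Leaf: end action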

In this post, I will use a supervised machine learning method, Decision Tree Classification, to predict whether an adult's income exceeds $50K/year.

Getting the data

The adult census data comes from the University of California, Irvine (the UCI Machine Learning Repository).

Listing of attributes:

income (the label to predict): >50K, <=50K.

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: continuous.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Import dependencies

import pandas as pd
import numpy as np
from sklearn import tree
import graphviz
from sklearn.model_selection import cross_val_score

The decision tree implementation comes from sklearn.

Data preparation

# Load dataset
df = pd.read_csv('adult.csv', sep=',')
len(df)  # 32561

Many rows contain question marks (" ?") as placeholders for missing values. I will start by removing them.

# Remove invalid data from table
df = df[(df.astype(str) != ' ?').all(axis=1)]
len(df)  # 30162

Exactly 2,399 rows were removed.
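
As an aside, an equivalent approach (an alternative sketch, not what this post uses) is to have pandas treat the "?" markers as missing values at load time, then drop the incomplete rows:

# Alternative: treat '?' as NaN while loading, then drop incomplete rows
df = pd.read_csv('adult.csv', sep=',', na_values='?', skipinitialspace=True)
df = df.dropna()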

Next, I will convert the income column to a binary value for prediction purposes, and drop a few columns that don't really contribute to my prediction method.

# Create a new binary income_bi column
df['income_bi'] = df.apply(lambda row: 1 if '>50K' in row['income'] else 0, axis=1)

# Remove redundant columns
df = df.drop(['income', 'fnlwgt', 'capital-gain', 'capital-loss', 'native-country'], axis=1)

To build the prediction model, we also need to transform the categorical values into numeric values.

# Use one-hot encoding on categorical columns
df = pd.get_dummies(df, columns=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex'])
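
To see what one-hot encoding does, here is a minimal sketch on a made-up two-row frame (hypothetical data, purely for illustration):

toy = pd.DataFrame({'age': [39, 50], 'sex': ['Male', 'Female']})
encoded = pd.get_dummies(toy, columns=['sex'])
# The 'sex' column is replaced by one indicator column per category
print(encoded.columns.tolist())  # ['age', 'sex_Female', 'sex_Male']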

Building the Decision Tree model

We have about 30,000 rows; I will split them into a training set and a test set at roughly 80/20.

# Shuffle rows
df = df.sample(frac=1)

# Split training and testing data
d_train = df[:25000]
d_test = df[25000:]

d_train_att = d_train.drop(['income_bi'], axis=1)
d_train_gt50 = d_train['income_bi']
d_test_att = d_test.drop(['income_bi'], axis=1)
d_test_gt50 = d_test['income_bi']
d_att = df.drop(['income_bi'], axis=1)
d_gt50 = df['income_bi']

# Number of income >50K in the whole dataset:
print("Income >50K: %d out of %d (%.2f%%)" % (np.sum(d_gt50), len(d_gt50), 100 * float(np.sum(d_gt50)) / len(d_gt50)))
# Income >50K: 7508 out of 30162 (24.89%)

About 24.89% of the people have a salary greater than $50K/year.
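
As an aside, scikit-learn's train_test_split can perform the shuffle and the 80/20 split in a single call (an alternative sketch, not the approach used above; the random_state value is an arbitrary choice):

from sklearn.model_selection import train_test_split

# Shuffle and split 80/20 in one step
d_train, d_test = train_test_split(df, test_size=0.2, random_state=42)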

Now we can start training the model. I will explain why I chose max_depth=7 at the end of this article.

# Fit a decision tree
t = tree.DecisionTreeClassifier(criterion='entropy', max_depth=7)
t = t.fit(d_train_att, d_train_gt50)
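
For reference, the 'entropy' criterion measures the impurity of a node as H = -Σ p·log2(p) over the class proportions, and splits are chosen to reduce it. A quick sketch of what it evaluates at the root of this dataset, using the class counts printed above:

p = 7508 / 30162  # fraction of >50K labels at the root
root_entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
print(round(root_entropy, 2))  # ~0.81 bits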

If you want to visualize the decision tree, you can use the graphviz tool.

# Visualize tree
dot_data = tree.export_graphviz(t, out_file=None, label='all', impurity=False,
                                proportion=True, feature_names=list(d_train_att),
                                class_names=['lt50K', 'gt50K'], filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph
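
If you are not working in a notebook, the graph can be written to an image file instead (the file name and format here are arbitrary choices):

# Save the rendered tree as a PNG file
graph.render('adult_tree', format='png', cleanup=True)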

After we train the model, we can check its accuracy on the test set. The result shows ~82%, a clear improvement over the ~75% we would get by always predicting <=50K.

t.score(d_test_att, d_test_gt50)
# 0.820030995738086

We can go further by evaluating the score with cross-validation.

scores = cross_val_score(t, d_att, d_gt50, cv=5)

# Show average score and +/- two standard deviations (covering ~95% of scores)
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))
# Accuracy: 0.83 (+/- 0.00)

Start predicting

We can prepare a prediction template by saving the first row of the modified data frame.

# Create a sample csv for prediction
df.iloc[[0]].to_csv('prediction.csv', sep=',', encoding='utf-8', index=False)

Now we have a CSV file with the structure we need to start predicting. You can modify the row values to describe the user profile you want to test.

# Prepare user profile
sample_df = pd.read_csv('prediction.csv', sep=',')
sample_df = sample_df.drop(['income_bi'], axis=1)

# Start predicting
predict_value = sample_df.iloc[0]
y_predict = t.predict([predict_value.tolist()])
y_predict[0]  # 0
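
As a side note, t.predict also accepts the data frame row directly, which keeps the feature names attached (a minor variation on the call above):

# Equivalent prediction, passing the row as a one-row DataFrame
y_predict = t.predict(sample_df.iloc[[0]])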

For the user profile I tested, the predicted salary is less than $50K.

How to choose the right depth for the decision tree

The right method is to test a range of depths in order to find the best max_depth for your model.

for max_depth in range(1, 20):
    t = tree.DecisionTreeClassifier(criterion='entropy', max_depth=max_depth)
    scores = cross_val_score(t, d_att, d_gt50, cv=5)
    print("Max depth: %d, Accuracy: %0.2f (+/- %0.2f)" % (max_depth, scores.mean(), scores.std() * 2))

# Results:

Max depth: 1, Accuracy: 0.75 (+/- 0.00)
Max depth: 2, Accuracy: 0.82 (+/- 0.01)
Max depth: 3, Accuracy: 0.81 (+/- 0.01)
Max depth: 4, Accuracy: 0.82 (+/- 0.01)
Max depth: 5, Accuracy: 0.82 (+/- 0.01)
Max depth: 6, Accuracy: 0.82 (+/- 0.01)
Max depth: 7, Accuracy: 0.83 (+/- 0.00)
Max depth: 8, Accuracy: 0.83 (+/- 0.00)
Max depth: 9, Accuracy: 0.83 (+/- 0.01)
Max depth: 10, Accuracy: 0.82 (+/- 0.01)
Max depth: 11, Accuracy: 0.82 (+/- 0.01)
Max depth: 12, Accuracy: 0.82 (+/- 0.01)
Max depth: 13, Accuracy: 0.82 (+/- 0.01)
Max depth: 14, Accuracy: 0.81 (+/- 0.01)
Max depth: 15, Accuracy: 0.81 (+/- 0.01)
Max depth: 16, Accuracy: 0.81 (+/- 0.01)
Max depth: 17, Accuracy: 0.80 (+/- 0.01)
Max depth: 18, Accuracy: 0.80 (+/- 0.01)
Max depth: 19, Accuracy: 0.80 (+/- 0.00)

As you can see, max depths from 7 to 9 yield the best results (83%). Among those, I chose the smallest, max_depth=7, since a shallower tree is simpler and less prone to overfitting.
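
As an aside, scikit-learn can automate this kind of search with GridSearchCV (an alternative sketch, not what this post uses):

from sklearn.model_selection import GridSearchCV

# Search the same depth range with 5-fold cross-validation
search = GridSearchCV(tree.DecisionTreeClassifier(criterion='entropy'),
                      {'max_depth': range(1, 20)}, cv=5)
search.fit(d_att, d_gt50)
print(search.best_params_)  # e.g. {'max_depth': 7}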

In this post, we learned how to build a Decision Tree model and predict an adult's salary bracket from their characteristics. You can check out the notebook for this project on my GitHub.