"Yeah, they all said that to me...", Bob replied as we were at Starbucks sipping on our dark roast coffee. Bob is a friend of mine and was the owner of a multi-million dollar company, that's right, "m-i-l-l-i-o-n". He used to tell me stories about how his company's productivity and growth has sky rocketed from the previous years and everything has been going great. But recently, he's been noticing some decline within his company. In a five month period, he lost one-fifth of his employees. At least a dozen of them throughout each department made phone calls and even left sticky notes on their tables informing him about their leave. Nobody knew what was happening. In that year, he was contemplating about filing for bankruptcy. Fast-forward seven months later, he's having a conversation with his co-founder of the company. The conversation ends with, "I quit..."

That is the last thing anybody wants to hear from their employees. In a sense, it's the employees who make the company. It's the employees who do the work. It's the employees who shape the company's culture. Long-term success, a healthy work environment, and high employee retention are all signs of a successful company. But when a company experiences a high rate of employee turnover, then something is going wrong. Losing these innovative and valuable employees can lead the company to huge monetary losses.

A healthy organization and culture are good signs of a company's future prosperity. Recognizing and understanding the factors associated with employee turnover will allow companies and individuals to limit it and may even increase employee productivity and growth. These predictive insights give managers the opportunity to take corrective steps to build and preserve their successful business.

Original Notebook found here

"You don't build a business. You build people, and people build the business." - Zig Ziglar

Business Problem

Bob's multi-million dollar company is about to go bankrupt and he wants to know why his employees are leaving.

Client

Bob (The CEO of Company X)

Objective

My goal is to understand what factors contribute most to employee turnover and create a model that can predict if a certain employee will leave the company or not.

OSEMN Pipeline

I'll be following a typical data science pipeline, which is called "OSEMN" (pronounced awesome).

Obtaining the data is the first step in solving the problem.

Scrubbing, or cleaning, the data is the next step. This includes imputing missing or invalid data and fixing column names.

Exploring the data follows and gives further insight into what our dataset contains. Here we look for outliers or odd values and study the relationship each explanatory variable has with the response variable, which we can do with a correlation matrix.

Modeling the data gives us our predictive power on whether an employee will leave.

iNterpreting the data is last. With all the results and analysis of the data, what conclusion is made? What factors contributed most to employee turnover? What relationships between variables were found?

Part 1: Obtaining the Data

The data comes from the "Human Resources Analytics" dataset hosted on Kaggle.

Click Here for my Kaggle Kernel

Note: THIS DATASET IS SIMULATED

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
# Jupyter magic: render plots inside the notebook
%matplotlib inline

# Read the dataset into a DataFrame
df = pd.read_csv('../input/HR_comma_sep.csv', index_col=None)

Part 2: Scrubbing the Data

Typically, cleaning the data requires a lot of work and can be a very tedious procedure. This dataset from Kaggle is super clean and contains no missing values. But still, I will have to examine the dataset to make sure that everything else is readable and that the observation values match the feature names appropriately.

# Check to see if there are any missing values in our data set
df.isnull().any()

df.head()
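The later cells refer to columns such as turnover, satisfaction, and averageMonthlyHours, so the raw column names have to be renamed at some point during scrubbing. That step is not shown in this write-up, so the following is only a sketch. The raw names on the left are my assumption about the headers in the Kaggle file, workAccident and promotion are hypothetical names that nothing else here relies on, and a small made-up frame stands in for the real data.

```python
import pandas as pd

# Assumed raw header names -> names used in the rest of this post.
# The raw names are an assumption about the CSV; "workAccident" and
# "promotion" are hypothetical target names.
RENAME_MAP = {
    "satisfaction_level": "satisfaction",
    "last_evaluation": "evaluation",
    "number_project": "projectCount",
    "average_montly_hours": "averageMonthlyHours",
    "time_spend_company": "yearsAtCompany",
    "Work_accident": "workAccident",
    "promotion_last_5years": "promotion",
    "sales": "department",
    "left": "turnover",
}

# Made-up two-row frame standing in for the real dataset
raw = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80],
    "last_evaluation": [0.53, 0.86],
    "number_project": [2, 5],
    "average_montly_hours": [157, 262],
    "time_spend_company": [3, 6],
    "left": [1, 0],
    "sales": ["sales", "support"],
    "salary": ["low", "medium"],
})

# rename() leaves columns that are not in the mapping (such as salary) untouched
renamed = raw.rename(columns=RENAME_MAP)
print(renamed.columns.tolist())
```

If the real headers differ, only the keys of RENAME_MAP need to change.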

Part 3: Exploring the Data

3a. Statistical Overview:

The dataset has:

About 15,000 employee observations and 10 features

The company had a turnover rate of about 24%

Mean satisfaction of employees is 0.61

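# Average of each numeric feature for employees who stayed (0) and left (1)
turnover_Summary = df.groupby('turnover')
turnover_Summary.mean(numeric_only=True)

# Share of employees who stayed vs. left
turnover_rate = df.turnover.value_counts() / len(df)
turnover_rate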

3b. Correlation Matrix & Heatmap

Stop and Think:

What features affect our target variable the most (turnover)?

What features have strong correlations with each other?

Can we do a more in depth examination of these features?

Summary:

From the heatmap, there is a positive (+) correlation between projectCount, averageMonthlyHours, and evaluation, which could mean that the employees who spent more hours and did more projects were evaluated highly.

For the negative(-) relationships, turnover and satisfaction are highly correlated. I'm assuming that people tend to leave a company more when they are less satisfied.

# Correlation matrix of the numeric features
corr = df.corr(numeric_only=True)
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr
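To answer the first "Stop and Think" question without squinting at the heatmap, one option is to pull out just the turnover column of the correlation matrix and sort it. This is a small sketch on made-up numbers, not the real dataset. The helper name correlation_with_target is my own, and the column names simply follow the ones used in this post.

```python
import pandas as pd

# Tiny made-up sample using this post's column names (not the real data)
sample = pd.DataFrame({
    "satisfaction": [0.9, 0.8, 0.1, 0.4, 0.7, 0.2],
    "evaluation": [0.7, 0.6, 0.9, 0.5, 0.8, 0.5],
    "projectCount": [4, 3, 7, 2, 5, 6],
    "turnover": [0, 0, 1, 1, 0, 1],
})

def correlation_with_target(frame, target="turnover"):
    """Return each numeric feature's correlation with the target, sorted ascending."""
    corr = frame.corr(numeric_only=True)
    return corr[target].drop(target).sort_values()

ranked = correlation_with_target(sample)
print(ranked)
```

The most negative values appear first and the most positive last, so the features at either end of the list are the ones worth a closer look.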

3c. Distribution Plots (Satisfaction & Evaluation & AverageMonthlyHours)

Summary: Let's examine the distribution on some of the employee's features. Here's what I found:

Satisfaction - There is a huge spike for employees with low satisfaction and high satisfaction.

Evaluation - There is a bimodal distribution of employees for low evaluations (less than 0.6) and high evaluations (more than 0.8)

AverageMonthlyHours - There is another bimodal distribution of employees with lower and higher average monthly hours (less than 150 hours & more than 250 hours)

The evaluation and average monthly hour graphs both share a similar distribution.

Employees with lower average monthly hours were evaluated less and vice versa.

If you look back at the correlation matrix, the high correlation between evaluation and averageMonthlyHours does support this finding.

Stop and Think:

Is there a reason for the high spike in low satisfaction of employees?

Could employees be grouped in a way with these features?

Is there a correlation between evaluation and averageMonthlyHours?

# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(15, 6))

# Histogram of each feature, without a KDE overlay
sns.histplot(df.satisfaction, kde=False, color="g", ax=axes[0]).set_title('Employee Satisfaction Distribution')
sns.histplot(df.evaluation, kde=False, color="r", ax=axes[1]).set_title('Employee Evaluation Distribution')
sns.histplot(df.averageMonthlyHours, kde=False, color="b", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')

3d. Salary V.S. Turnover

Summary: This is not unusual. Here's what I found:

Majority of employees who left either had low or medium salary.

Barely any employees left with high salary

Employees with low to average salaries tend to leave the company.

Stop and Think:

What is the work environment like for low, medium, and high salaries?

What made employees with high salaries leave?

f, ax = plt.subplots(figsize=(15, 4))
# Count of employees per salary level, split by turnover
sns.countplot(y="salary", hue='turnover', data=df).set_title('Employee Salary Turnover Distribution');

3e. Department V.S. Turnover

Summary: Let's see more information about the departments. Here's what I found:

The sales, technical, and support departments were the top 3 departments to have employee turnover

The management department had the smallest amount of turnover

Stop and Think:

If we had more information on each department, can we pinpoint a more direct cause for employee turnover?

# Custom color palette for the department bars
color_types = ['#78C850', '#F08030', '#6890F0', '#A8B820', '#A8A878', '#A040A0', '#F8D030', '#E0C068', '#EE99AC', '#C03028', '#F85888', '#B8A038', '#705898', '#98D8D8', '#7038F8']
sns.countplot(x='department', data=df, palette=color_types).set_title('Employee Department Distribution');
plt.xticks(rotation=-45)

# Count of employees per department, split by turnover
f, ax = plt.subplots(figsize=(15, 5))
sns.countplot(y="department", hue='turnover', data=df).set_title('Employee Department Turnover Distribution');

3f. Turnover V.S. ProjectCount

Summary: This graph is quite interesting as well. Here's what I found:

More than half of the employees with 2, 6, and 7 projects left the company

Majority of the employees who did not leave the company had 3, 4, and 5 projects

All of the employees with 7 projects left the company

There is an increase in employee turnover rate as project count increases

Stop and Think:

Why are employees leaving at the lower/higher spectrum of project counts?

Does this mean that employees with project counts of 2 or less are not worked hard enough or are not highly valued, thus leaving the company?

Are employees with 6+ projects getting overworked, thus leaving the company?

# Each bar shows the size of a group as a percentage of all employees
ax = sns.barplot(x="projectCount", y="projectCount", hue="turnover", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")
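The bars above are percentages of the whole workforce, so statements like "more than half of the employees with 2 projects left" are easier to check with a within-group rate. Here is a minimal sketch on made-up rows. The helper turnover_rate_by is a name I'm introducing for illustration, and it assumes turnover is coded 0/1 as in the rest of this post.

```python
import pandas as pd

# Made-up sample using this post's column names (not the real data)
sample = pd.DataFrame({
    "projectCount": [2, 2, 2, 3, 3, 3, 3, 7, 7],
    "turnover":     [1, 1, 0, 0, 0, 0, 1, 1, 1],
})

def turnover_rate_by(frame, column, target="turnover"):
    """Share of employees who left within each value of `column`."""
    # The mean of a 0/1 column is the fraction of ones in each group
    return frame.groupby(column)[target].mean()

rates = turnover_rate_by(sample, "projectCount")
print(rates)
```

The same helper could be pointed at yearsAtCompany or salary to check the other per-group claims.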

3g. Turnover V.S. Evaluation

Summary:

There is a bimodal distribution for those that had a turnover.

Employees with low performance tend to leave the company more

Employees with high performance tend to leave the company more

The sweet spot for employees that stayed is within 0.6-0.8 evaluation

fig = plt.figure(figsize=(15, 4))
# Evaluation density for employees who stayed (blue) and who left (red)
ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'evaluation'], color='b', fill=True, label='no turnover')
ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'evaluation'], color='r', fill=True, label='turnover')
plt.title('Employee Evaluation Distribution - Turnover V.S. No Turnover')
plt.legend()

3h. Turnover V.S. AverageMonthlyHours

Summary:

Another bimodal distribution for employees who left

Employees who had fewer hours of work (~150 hours or less) left the company more

Employees who had too many hours of work (~250 or more) left the company

Employees who left generally were underworked or overworked.

fig = plt.figure(figsize=(15, 4))
# Monthly-hours density for employees who stayed (blue) and who left (red)
ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'averageMonthlyHours'], color='b', fill=True, label='no turnover')
ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'averageMonthlyHours'], color='r', fill=True, label='turnover')
plt.title('Employee AverageMonthly Hours Distribution - Turnover V.S. No Turnover')
plt.legend()

3i. Turnover V.S. Satisfaction

Summary:

There is a tri-modal distribution for employees who left

Employees who had really low satisfaction levels (0.2 or less) left the company more

Employees who had low satisfaction levels (0.3~0.5) left the company more

Employees who had really high satisfaction levels (0.7 or more) left the company more

fig = plt.figure(figsize=(15, 4))
# Satisfaction density for employees who stayed (blue) and who left (red)
ax = sns.kdeplot(df.loc[(df['turnover'] == 0), 'satisfaction'], color='b', fill=True, label='no turnover')
ax = sns.kdeplot(df.loc[(df['turnover'] == 1), 'satisfaction'], color='r', fill=True, label='turnover')
plt.title('Employee Satisfaction Distribution - Turnover V.S. No Turnover')
plt.legend()

3j. Satisfaction VS Evaluation

Summary: This is by far the most compelling graph. This is what I found:

There are 3 distinct clusters for employees who left the company

Cluster 1 (Hard-working and Sad Employee):

Satisfaction was below 0.2 and evaluations were greater than 0.75, which could be a good indication that employees who left the company were good workers but felt horrible at their job.

Question: What could be the reason for feeling so horrible when you are highly evaluated? Could it be working too hard? Could this cluster mean employees who are "overworked"?

Cluster 2 (Bad and Sad Employee):

Satisfaction between about 0.35~0.45 and evaluations below ~0.58. This could be seen as employees who were badly evaluated and felt bad at work.

Question: Could this cluster mean employees who "under-performed"?

Cluster 3 (Hard-working and Happy Employee):

Satisfaction was between 0.7~1.0 and evaluations were greater than 0.8, which could mean that employees in this cluster were "ideal". They loved their work and were evaluated highly for their performance.

Question: Could this cluster mean that employees left because they found another job opportunity?

# Scatter plot of satisfaction vs. evaluation, colored by turnover (no regression line)
sns.lmplot(x='satisfaction', y='evaluation', data=df, fit_reg=False, hue='turnover')
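The three clusters above are described by rough satisfaction and evaluation ranges, so they can be turned into a simple rule-based label. This is only a sketch that hard-codes the approximate cut-offs I read off the graph. The function name and the "unclustered" fallback are my own additions, and a proper clustering algorithm might draw the boundaries differently.

```python
def label_leaver_cluster(satisfaction, evaluation):
    """Assign a departed employee to one of the three clusters described above."""
    # Cluster 1: satisfaction below 0.2, evaluation above 0.75
    if satisfaction < 0.2 and evaluation > 0.75:
        return "hard-working and sad"
    # Cluster 2: satisfaction about 0.35-0.45, evaluation below about 0.58
    if 0.35 <= satisfaction <= 0.45 and evaluation < 0.58:
        return "bad and sad"
    # Cluster 3: satisfaction 0.7-1.0, evaluation above 0.8
    if satisfaction >= 0.7 and evaluation > 0.8:
        return "hard-working and happy"
    # Anything outside the three ranges
    return "unclustered"

# Hypothetical (satisfaction, evaluation) pairs for illustration
examples = [(0.10, 0.90), (0.40, 0.50), (0.85, 0.95), (0.60, 0.70)]
labels = [label_leaver_cluster(s, e) for s, e in examples]
print(labels)
```

Applied to the employees who left, a label like this would make it possible to count how many fall in each cluster and to study each group separately.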

3k. Turnover V.S. YearsAtCompany

Summary: Let's see if there's a point where employees start leaving the company. Here's what I found:

More than half of the employees with 4 and 5 years left the company

Employees with 5 years should highly be looked into

Stop and Think:

Why are employees leaving mostly at the 3-5 year range?

Who are these employees that left?

Are these employees part-time or contractors?

# Each bar shows the size of a group as a percentage of all employees
ax = sns.barplot(x="yearsAtCompany", y="yearsAtCompany", hue="turnover", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")

Part 4: Modeling the Data

I'll be using a logistic regression algorithm to model the data. Since our classes are imbalanced (only about 24% of employees left), accuracy alone can be misleading. Instead, we should be more focused on the precision and recall.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve
from sklearn.metrics import roc_auc_score

# One-hot encode department and salary, dropping one level of each as the baseline
df = pd.get_dummies(df, columns=['department', 'salary'], drop_first=True, dtype=int)

# Separate the features from the target
target_name = 'turnover'
X = df.drop('turnover', axis=1)
y = df[target_name]

# Scale the features with statistics that are robust to outliers
robust_scaler = RobustScaler()
X = robust_scaler.fit_transform(X)

# Hold out 15% of the rows for testing, keeping the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)

# class_weight="balanced" reweights the classes to offset the imbalance
logis = LogisticRegression(class_weight="balanced")
logis.fit(X_train, y_train)

print("\n\n---Logistic Model---")
# AUC computed from the predicted probability of turnover
logit_roc_auc = roc_auc_score(y_test, logis.predict_proba(X_test)[:, 1])
print("Logistic AUC = %2.2f" % logit_roc_auc)
print(classification_report(y_test, logis.predict(X_test)))
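To see why accuracy is a weak yardstick here, consider a "model" that simply predicts that nobody ever leaves. The sketch below uses 100 hypothetical employees with the same roughly 24% turnover rate as the dataset. The helper function is written by hand for illustration rather than taken from scikit-learn.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall, and accuracy for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when nothing is predicted or nothing is positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# 100 hypothetical employees, 24 of whom left
y_true = [1] * 24 + [0] * 76
# A baseline that predicts "stays" for everyone
y_pred = [0] * 100

precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
print("accuracy=%.2f precision=%.2f recall=%.2f" % (accuracy, precision, recall))
```

This baseline is right for 76 out of 100 employees, yet it never flags a single person who actually left, which is exactly the group Bob cares about.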

from sklearn.metrics import roc_curve

# ROC curve built from the predicted probability of turnover
fpr, tpr, thresholds = roc_curve(y_test, logis.predict_proba(X_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, label='ROC Curve (area = %0.2f)' % logit_roc_auc)
# Diagonal reference line for a random classifier
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()

5. Interpreting the Data

With all of this information, this is what Bob should know about his company and why his employees probably left:

Employees generally left when they were underworked (less than 150hr/month or 6hr/day)

Employees generally left when they were overworked (more than 250hr/month or 10hr/day)

Employees with either really high or low evaluations should be taken into consideration for high turnover rate

Employees with low to medium salaries are the bulk of employee turnover

Employees with 2, 6, or 7 projects were at risk of leaving the company

Employee satisfaction is the strongest indicator of employee turnover

Employees with 4 and 5 years at the company are at risk of leaving

Potential Solution

Since satisfaction had the most effect in determining employee turnover, the underlying problem may come down to a personal level. Or the problem may not be with the employees at all, but may persist at a deeper level of the company (its core values and purpose).

Solution 1: Develop learning programs for managers. Then use analytics to gauge their performance and measure progress. Some advice:

Be a good coach

Empower the team and do not micromanage

Express interest in team members' success

Have clear vision / strategy for team

Help team with career development

Solution 2:

We can rank employees by their probability of leaving, then allocate a limited incentive budget to the highest probability instances.

OR, we can allocate our incentive budget to the instances with the highest expected loss, for which we'll need the probability of turnover.
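The two options in Solution 2 can rank the same employees quite differently. Here is a small sketch with four hypothetical employees. The probabilities and the cost of losing each person are made-up numbers, and in practice the probability would come from a model's predicted probability of turnover.

```python
# Hypothetical employees: (name, probability of leaving, cost to the company if they leave)
employees = [
    ("A", 0.90, 10_000),
    ("B", 0.40, 80_000),
    ("C", 0.70, 20_000),
    ("D", 0.10, 50_000),
]

def rank_by_probability(rows):
    """Order employees from most to least likely to leave."""
    return sorted(rows, key=lambda row: row[1], reverse=True)

def rank_by_expected_loss(rows):
    """Order employees by probability of leaving times the cost of losing them."""
    return sorted(rows, key=lambda row: row[1] * row[2], reverse=True)

by_probability = [name for name, _, _ in rank_by_probability(employees)]
by_expected_loss = [name for name, _, _ in rank_by_expected_loss(employees)]

print("By probability:  ", by_probability)
print("By expected loss:", by_expected_loss)
```

Employee A is the most likely to leave, but employee B tops the expected-loss ranking because losing B would cost far more. Which ranking to use depends on how well the cost of losing each employee can be estimated.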

What Now

This problem is about decisions that affect people. We should not use the model's predictions as the sole decider. But we can use them to arm people with better, more relevant information for decision making.

We would have to conduct more experiments or collect more data about the employees in order to reach more accurate findings. I would recommend gathering more variables that could have an impact on employee turnover and satisfaction, such as distance from home, gender, and age.

Reverse Engineer the Problem

After trying to understand what caused employees to leave in the first place, we can form another problem to solve by asking ourselves:

"What features caused employees to stay?" "What features contributed to employee retention?"

There are endless problems to solve!

Any feedback or constructive criticism is greatly appreciated. Thank you :)

"You don't build a business. You build people, and people build the business." - Zig Ziglar

Why do you think employees leave?