HR Analytics : Hackathon Challenge

I participated in the WNS Analytics Wizard hackathon, where the task was to predict whether an employee will be promoted or not. This blog post walks through the solution I submitted, which ranked 138th (top 11%) in the challenge. The leaderboard ranking was decided on the F1-score, which is the harmonic mean of precision and recall.
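As a quick illustration of the metric, the F1-score can be computed directly from confusion-matrix counts. The function name `f1_from_counts` and the counts below are made up for this sketch; they are not competition results.

```python
def f1_from_counts(tp, fp, fn):
    """Return the F1-score (harmonic mean of precision and recall) from counts."""
    if tp == 0:
        # No true positives means precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 60/100 = 0.6, recall = 60/120 = 0.5
example_f1 = f1_from_counts(tp=60, fp=40, fn=60)
print(round(example_f1, 4))
```

Because it is a harmonic mean, the score is pulled towards the smaller of precision and recall, so a model cannot do well by maximizing only one of them.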

About Data

The data-set consists of 54,808 rows, each with 14 attributes including the target variable (i.e. “is_promoted”). There are 4,668 cases where employees have been promoted (8.5%). The data-set is available at the GitHub link here.

Let’s get started building the data analytics pipeline end to end.

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.metrics import confusion_matrix, f1_score, precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings("ignore")
# Set all options
%matplotlib inline
plt.style.use('seaborn-notebook')
plt.rcParams["figure.figsize"] = (20, 3)
pd.options.display.float_format = '{:20,.4f}'.format
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sns.set(context="paper", font="monospace")

User Defined Functions

def convert_categorical_to_dummies(d_convert):
    """
    Author: Abhijeet Kumar
    Description: returns Dataframe with all categorical variables converted into dummies
    Arguments: Dataframe (having categorical variables)
    """
    df = d_convert.copy()
    list_to_drop = []
    for col in df.columns:
        if df[col].dtype == 'object':
            list_to_drop.append(col)
            df = pd.concat([df, pd.get_dummies(df[col], prefix=col, prefix_sep='_', drop_first=False)], axis=1)
    df = df.drop(list_to_drop, axis=1)
    return df

def quality_report(df):
    """
    Author: Abhijeet Kumar
    Description: Displays quality of data in terms of missing values, unique numbers, datatypes etc.
    Arguments: Dataframe
    """
    dtypes = df.dtypes
    nuniq = df.nunique()
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum() / df.isnull().count() * 100).sort_values(ascending=False)
    quality_df = pd.concat([total, percent, nuniq, dtypes], axis=1, keys=['Total', 'Percent', 'Nunique', 'Dtype'])
    display(quality_df)

def score_on_test_set(model, file_name, out_name):
    """
    Author: Abhijeet Kumar
    Description: It runs same steps of preprocessing as in training, scores on the test data provided in hackathon and generates the submission file.
    Arguments: model, test data file, submission file
    """
    test_data = pd.read_csv(file_name)
    # Treating the missing values of education as a separate category
    test_data['education'] = test_data['education'].replace(np.nan, 'NA')
    # Treating the missing values of previous year rating as 0
    test_data['previous_year_rating'] = test_data['previous_year_rating'].fillna(0)
    # Creating dummy variables for all the categorical columns, dropping the original columns
    master_test_data = convert_categorical_to_dummies(test_data)
    # Removing the id attribute
    df_test_data = master_test_data.drop(['employee_id'], axis=1)
    if out_name == "submission_lightgbm.csv":
        # LightGBM: predict with the best iteration found by early stopping
        y_pred = model.predict_proba(df_test_data.values, num_iteration=model.best_iteration_)
    else:
        y_pred = model.predict_proba(df_test_data.values)
    # The submission file holds the predicted probability of promotion
    submission_df = pd.DataFrame({'employee_id': master_test_data['employee_id'], 'is_promoted': y_pred[:,1]})
    submission_df.to_csv(out_name, index=False)
    score = model.predict_proba(df_test_data.values)
    return test_data, score
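One thing to watch with this approach: `pd.get_dummies` is applied separately to the training and test files, so the two frames only end up with the same columns if every category appears in both. Below is a minimal sketch of one way to guard against a mismatch. The helper name `align_to_training_columns` and the toy frames are hypothetical and not part of the submitted solution.

```python
import pandas as pd


def align_to_training_columns(train_dummies, test_dummies):
    """Reorder test columns to match train, adding unseen dummy columns as 0."""
    return test_dummies.reindex(columns=train_dummies.columns, fill_value=0)


# Toy data: the "Legal" department is missing from the test frame
train = pd.get_dummies(pd.DataFrame({"department": ["HR", "Legal", "Finance"]}))
test = pd.get_dummies(pd.DataFrame({"department": ["Finance", "HR"]}))

aligned = align_to_training_columns(train, test)
print(list(aligned.columns))
```

Columns that exist only in the test frame are dropped by `reindex`, which is usually what a model trained on the training columns needs.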

Reading Data

data = pd.read_csv("train.csv")
print("Shape of Data = ", data.shape)
data.sample(5)

Shape of Data = (54808, 14)

Checking the event rate

plt.figure(figsize=(6,3))
sns.countplot(x='is_promoted', data=data)
plt.show()
# Checking the event rate : event is when an employee is promoted
data['is_promoted'].value_counts()

0    50140
1     4668
Name: is_promoted, dtype: int64

Displaying the attributes

# Checking the attribute names
pd.DataFrame(data.columns)

 0  employee_id
 1  department
 2  region
 3  education
 4  gender
 5  recruitment_channel
 6  no_of_trainings
 7  age
 8  previous_year_rating
 9  length_of_service
10  KPIs_met >80%
11  awards_won?
12  avg_training_score
13  is_promoted

Checking Data Quality

# checking missing data
quality_report(data)

Attributes             Total  Percent  Nunique  Dtype
KPIs_met >80%              0   0.0000        2  int64
age                        0   0.0000       41  int64
avg_training_score         0   0.0000       61  int64
awards_won?                0   0.0000        2  int64
department                 0   0.0000        9  object
education               2409   4.3953        3  object
employee_id                0   0.0000    54808  int64
gender                     0   0.0000        2  object
is_promoted                0   0.0000        2  int64
length_of_service          0   0.0000       35  int64
no_of_trainings            0   0.0000       10  int64
previous_year_rating    4124   7.5244        5  float64
recruitment_channel        0   0.0000        3  object
region                     0   0.0000       34  object

Missing Value Treatment

# Treating the missing values of education as a separate category
data['education'] = data['education'].replace(np.nan, 'NA')
# Treating the missing values of previous year rating as 0
data['previous_year_rating'] = data['previous_year_rating'].fillna(0)

Looking at attributes (EDA)

Can we make some inferences from the EDA?

Promotion rates are lowest in the Legal department (5.1%) and highest in the Technology department (10.8%).

Region 9 has the lowest promotion rate (1.9%) and region 4 the highest (14.4%).

Employees with a Master’s degree or above have a higher promotion rate (9.9%) than those with a Bachelor’s (8.2%), but the difference is small.

Employees with a previous year rating of 5 have a much better chance of promotion (16.4%) than others.

Employees who met more than 80% of their KPIs have a good chance of promotion (16.9%).

Employees who won awards are promoted far more often (44%).

for col in data.drop('is_promoted', axis=1).columns:
    # Assumed cutoff: low-cardinality numeric columns are summarized too
    # (any value from 7 to 10 yields the tables shown below)
    if data[col].dtype == 'object' or data[col].nunique() < 10:
        xx = data.groupby(col)['is_promoted'].value_counts().unstack(1)
        per_not_promoted = xx.iloc[:, 0] * 100 / xx.apply(lambda x: x.sum(), axis=1)
        per_promoted = xx.iloc[:, 1] * 100 / xx.apply(lambda x: x.sum(), axis=1)
        xx['%_0'] = per_not_promoted
        xx['%_1'] = per_promoted
        display(xx)

is_promoted               0     1      %_0      %_1
department
Analytics              4840   512  90.4335   9.5665
Finance                2330   206  91.8770   8.1230
HR                     2282   136  94.3755   5.6245
Legal                   986    53  94.8989   5.1011
Operations            10325  1023  90.9852   9.0148
Procurement            6450   688  90.3614   9.6386
R&D                     930    69  93.0931   6.9069
Sales & Marketing     15627  1213  92.7969   7.2031
Technology             6370   768  89.2407  10.7593

is_promoted               0     1      %_0      %_1
region
region_1                552    58  90.4918   9.5082
region_10               597    51  92.1296   7.8704
region_11              1241    74  94.3726   5.6274
region_12               467    33  93.4000   6.6000
region_13              2418   230  91.3142   8.6858
region_14               765    62  92.5030   7.4970
region_15              2586   222  92.0940   7.9060
region_16              1363   102  93.0375   6.9625
region_17               687   109  86.3065  13.6935
region_18                30     1  96.7742   3.2258
region_19               821    53  93.9359   6.0641
region_2              11354   989  91.9874   8.0126
region_20               801    49  94.2353   5.7647
region_21               393    18  95.6204   4.3796
region_22              5694   734  88.5812  11.4188
region_23              1038   137  88.3404  11.6596
region_24               490    18  96.4567   3.5433
region_25               716   103  87.4237  12.5763
region_26              2117   143  93.6726   6.3274
region_27              1528   131  92.1037   7.8963
region_28              1164   154  88.3156  11.6844
region_29               951    43  95.6740   4.3260
region_3                309    37  89.3064  10.6936
region_30               598    59  91.0198   8.9802
region_31              1825   110  94.3152   5.6848
region_32               905    40  95.7672   4.2328
region_33               259    10  96.2825   3.7175
region_34               284     8  97.2603   2.7397
region_4               1457   246  85.5549  14.4451
region_5                731    35  95.4308   4.5692
region_6                658    32  95.3623   4.6377
region_7               4327   516  89.3454  10.6546
region_8                602    53  91.9084   8.0916
region_9                412     8  98.0952   1.9048

is_promoted               0     1      %_0      %_1
education
Bachelor’s            33661  3008  91.7969   8.2031
Below Secondary         738    67  91.6770   8.3230
Master’s & above      13454  1471  90.1441   9.8559
NA                     2287   122  94.9357   5.0643

is_promoted               0     1      %_0      %_1
gender
f                     14845  1467  91.0066   8.9934
m                     35295  3201  91.6849   8.3151

is_promoted               0     1      %_0      %_1
recruitment_channel
other                 27890  2556  91.6048   8.3952
referred               1004   138  87.9159  12.0841
sourcing              21246  1974  91.4987   8.5013

is_promoted               0     1      %_0      %_1
previous_year_rating
0.0000                 3785   339  91.7798   8.2202
1.0000                 6135    88  98.5859   1.4141
2.0000                 4044   181  95.7160   4.2840
3.0000                17263  1355  92.7221   7.2779
4.0000                 9093   784  92.0624   7.9376
5.0000                 9820  1921  83.6385  16.3615

is_promoted               0     1      %_0      %_1
KPIs_met >80%
0                     34111  1406  96.0413   3.9587
1                     16029  3262  83.0906  16.9094

is_promoted               0     1      %_0      %_1
awards_won?
0                     49429  4109  92.3251   7.6749
1                       711   559  55.9843  44.0157
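Because `is_promoted` is a 0/1 flag, the promotion percentage per category is simply the promoted count divided by the group size. A compact sketch of the same kind of summary on a toy frame follows; the helper name `promotion_rate` and the toy rows are illustrative only.

```python
import pandas as pd


def promotion_rate(frame, col, target="is_promoted"):
    """Return group size, promoted count and promotion percentage per value of `col`."""
    summary = frame.groupby(col)[target].agg(count="size", promoted="sum")
    summary["%_promoted"] = 100 * summary["promoted"] / summary["count"]
    # Highest promotion rate first
    return summary.sort_values("%_promoted", ascending=False)


toy = pd.DataFrame({
    "department": ["HR", "HR", "Legal", "Legal", "Legal", "Legal"],
    "is_promoted": [1, 0, 0, 0, 1, 0],
})
rates = promotion_rate(toy, "department")
print(rates)
```

Sorting by the rate makes the best and worst categories easy to spot when a column has many levels, as `region` does.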

Preparing Data for Modeling

# Creating dummy variables for all the categorical columns, dropping the original columns
master_data = convert_categorical_to_dummies(data)
print("Total shape of Data :", master_data.shape)
# Separating the target from the dataset
labels = np.array(master_data['is_promoted'].tolist())
# Removing the target and the id attribute
df_data = master_data.drop(['is_promoted','employee_id'], axis=1)
print("Shape of Data:", df_data.shape)
df = df_data.values

Total shape of Data : (54808, 61)
Shape of Data: (54808, 59)

Model 1 – XGB Classifier

xgb_model = xgb.XGBClassifier()
print(xgb_model)
# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1')
print("F1-score = ", f1_scores, " Mean F1 score = ", np.mean(f1_scores))
# Training the model
xgb_model.fit(df, labels)
# Scoring on test set
test_data, score_xgb = score_on_test_set(xgb_model, "test.csv", "submission_xgb.csv")

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
F1-score = [ 0.4526749 0.41547519 0.43579122 0.43012552 0.43427621] Mean F1 score = 0.433668606717

XGB Classifier : Parameter Tuning

Tuning the XGBoost classifier means searching for the hyperparameter values that let the model learn the task as well as possible.

I patiently ran many iterations, which led to fine-tuning three parameters: n_estimators, max_depth and L1 regularization (reg_alpha). A common practice is to take baby steps while learning (a small learning rate) and tune the other parameters around it. Here, I found that the F1-scores improved with a large number of trees (n_estimators).

# Create parameters to search
params = {
    'learning_rate': [0.01],
    'n_estimators': [900, 1000, 1100],
    'max_depth': [7, 8, 9],
    'reg_alpha': [0.3, 0.4, 0.5]
}
# Initializing the XGBoost classifier
xgb_model = xgb.XGBClassifier()
# Grid search initialization
gsearch = GridSearchCV(xgb_model, params, verbose=True, cv=5, n_jobs=2)
gsearch.fit(df, labels)
# Printing the best chosen params
print("Best Parameters :", gsearch.best_params_)
params = {'objective': 'binary:logistic', 'booster': 'gbtree'}
# Updating the parameters as per grid search
params.update(gsearch.best_params_)
# Initializing the XGBoost classifier with the tuned parameters
xgb_model = xgb.XGBClassifier(**params)
print(xgb_model)
# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1', n_jobs=2)
print("F1_scores per fold : ", f1_scores, "\nMean F1_score= ", np.mean(f1_scores))
# Fitting model on tuned parameters
xgb_model.fit(df, labels)
# Scoring on test set
test_data, score_xgb_tuned = score_on_test_set(xgb_model, "test.csv", "submission_xgb_tuned.csv")

Fitting 5 folds for each of 1 candidates, totalling 5 fits

[Parallel(n_jobs=2)]: Done 5 out of 5 | elapsed: 13.0min finished

Best Parameters :{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 1000, 'reg_alpha': 0.4}
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=8, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0.4, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
F1_scores per fold : [ 0.51014041 0.48657188 0.49528302 0.53054911 0.51130164]
Mean F1_score= 0.506769210361
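The grid search above does not set `scoring`, so scikit-learn falls back to the estimator's own `score` method, which is mean accuracy for classifiers. Since the leaderboard metric is F1, it may be worth passing `scoring='f1'` so the search optimizes the same thing. The sketch below runs on synthetic data and uses a `DecisionTreeClassifier` purely as a fast stand-in for XGBoost; the grid values are placeholders, not the ones tuned in this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 10% positives), standing in for the HR data
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

# Placeholder grid; the estimator is a stand-in for xgb.XGBClassifier
param_grid = {"max_depth": [2, 4, 6]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      scoring="f1", cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV F1:", search.best_score_)
```

With `scoring="f1"`, `best_score_` is the mean cross-validated F1 of the winning candidate, which makes it directly comparable to the `cross_val_score` numbers reported in this post.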

XGB Classifier : Setting threshold

How does the XGBoost classifier predict the class (‘promoted’ or ‘not promoted’)? It predicts a probability between 0 and 1 for each unseen case, then assigns class 1 if the probability is greater than 0.5 and class 0 otherwise. With an imbalanced data-set like this one, the default 0.5 threshold can be a poor choice, because the rare event seldom receives such a high probability.

We can replace the default threshold of 0.5 with the threshold that maximizes the F1-score.

We need to find the threshold where the F1-score is highest.

I tried submissions with a few cut-offs near the optimum to get the best possible F1-score.
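The idea can be sketched without any model: given true labels and predicted probabilities, try a set of candidate cut-offs and keep the one with the highest F1. The function name `best_f1_threshold` and the toy arrays are illustrative only.

```python
import numpy as np


def best_f1_threshold(y_true, proba, candidates):
    """Return (threshold, f1) for the candidate cut-off with the highest F1."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    best_t, best_f1 = None, -1.0
    for t in candidates:
        # Predict class 1 when the probability reaches the cut-off
        pred = (proba >= t).astype(int)
        tp = int(np.sum((pred == 1) & (y_true == 1)))
        fp = int(np.sum((pred == 1) & (y_true == 0)))
        fn = int(np.sum((pred == 0) & (y_true == 1)))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there is nothing to count
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy example: the rare positives receive only moderate probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.45, 0.15, 0.40, 0.60]
threshold, f1 = best_f1_threshold(y_true, proba, candidates=[0.3, 0.4, 0.5])
print(threshold, f1)
```

In this toy case the 0.5 cut-off misses one of the two positives, while a lower cut-off catches both at the cost of one false positive, which is the trade-off the F1-score balances.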

The following Python code splits the data 90:10 and trains the XGBoost classifier with the tuned parameters on the 90% part. It calculates precision and recall at different thresholds on the held-out 10% and plots the precision-recall curve. We then calculate the F1-score at each threshold from those precision and recall values.

# Holding out 10% of the data to pick the threshold on unseen cases
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.10, stratify=labels)
xgb_model = xgb.XGBClassifier(**params)
# Training the model
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict_proba(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred[:,1])
# precision and recall have one more element than thresholds; pad so the lengths match
thresholds = np.append(thresholds, 1)
f1_scores = 2*(precision*recall)/(precision+recall)
plt.step(recall, precision, color='b', alpha=0.4, where='post')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve')
plt.show()

Getting optimal threshold

We plot the F1-score against the threshold on the x-axis to find the peak. The Python code below gets the threshold value at which the F1-score was highest.

scrs = pd.DataFrame({'precision': precision, 'recall': recall, 'thresholds': thresholds, 'f1_score': f1_scores})
print("Threshold cutoff: ", scrs.loc[scrs['f1_score'] == scrs.f1_score.max(), 'thresholds'].iloc[0])
print("Max F1-score at cut-off : ", scrs.f1_score.max())
scrs.plot(x='thresholds', y='f1_score')

Threshold cutoff: 0.340377241373
Max F1-score at cut-off : 0.53791130186

Once you get the optimal threshold, use it for test set probability predictions as a cutoff to predict class labels 0 and 1 for the final submission.
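Note that `score_on_test_set` writes raw probabilities to the submission file, so the cut-off still has to be applied before submitting class labels. A minimal sketch of that last step is shown below, assuming `score` is the probability array returned by `score_on_test_set`. Here a small hard-coded array stands in for it, and the employee ids are made up.

```python
import numpy as np
import pandas as pd

threshold = 0.34  # rounded from the cut-off found above

# Stand-in for the (n_samples, 2) array returned by predict_proba
score = np.array([[0.90, 0.10], [0.60, 0.40], [0.20, 0.80]])
employee_ids = [101, 102, 103]  # made-up ids for illustration

# Column 1 holds the probability of promotion; label 1 when it reaches the cut-off
labels_pred = (score[:, 1] >= threshold).astype(int)
submission = pd.DataFrame({"employee_id": employee_ids,
                           "is_promoted": labels_pred})
print(submission)
```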

What did not work?

I tried the following other techniques, which did not work; hence my final submissions were based on the single “XGBoost classifier” model described in this post.

I tried logistic regression and SVM; the F1-score was low (less than 0.4).

I tried Random Forest. The F1-score was comparatively low.

I tried a LightGBM model. With default settings it gave an F1-score of 0.50, but it did not improve with parameter tuning. There was a small improvement when early_stopping_rounds was used, taking the best iteration for predictions on the test set.

I created some interaction variables like if previous_year_rating == 5 and KPI > 80 == 1 then 1 else 0, and if awards_won? == 1 and KPI > 80 == 1 then 1 else 0. It did not help.

Finally, I took the best tuned parameters of all three models (RF, XGBoost and LightGBM) and stacked them with logistic regression as the classifier. It did not give a better F1-score than the individual XGB Classifier model.
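For reference, the two interaction flags described above can be built along these lines. The source column names follow the data-set; the new column names `top_rating_and_kpi` and `award_and_kpi` are just labels chosen for this sketch, and the toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "previous_year_rating": [5.0, 5.0, 3.0, 0.0],
    "KPIs_met >80%": [1, 0, 1, 1],
    "awards_won?": [0, 1, 1, 0],
})

# 1 when the employee had the top rating and also met more than 80% of KPIs
toy["top_rating_and_kpi"] = (
    (toy["previous_year_rating"] == 5) & (toy["KPIs_met >80%"] == 1)
).astype(int)

# 1 when the employee won an award and also met more than 80% of KPIs
toy["award_and_kpi"] = (
    (toy["awards_won?"] == 1) & (toy["KPIs_met >80%"] == 1)
).astype(int)

print(toy)
```

Tree-based models can often learn such conjunctions on their own through successive splits, which may be one reason these flags did not add much here.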

At the End

Readers are encouraged to download the data-set and check whether they can reproduce the results. I would also love to hear in the comments if you can surpass the F1-score achieved in this blog post. Here are a few other things one can try.

Generally, stacking improves scores when there are a lot of models. One can train, say, hundreds of XGBoost and LightGBM models (with slightly different parameters) and then apply logistic regression on top of them (I tried with only 3 models and it failed).

Also, one can try an interaction variable that captures the total score achieved in training (no_of_trainings * avg_training_score).

One can try setting “early_stopping_rounds” in XGBoost classifier training, which I did not try. It helps prevent over-fitting and can improve results.
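The total-training-score idea from the list above is a one-liner in pandas. The new column name `total_training_score` is hypothetical and the rows are toy values.

```python
import pandas as pd

toy = pd.DataFrame({
    "no_of_trainings": [1, 2, 3],
    "avg_training_score": [60, 75, 50],
})

# Interaction feature: number of trainings times the average training score
toy["total_training_score"] = toy["no_of_trainings"] * toy["avg_training_score"]
print(toy)
```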

The full implementation of this approach, along with a LightGBM model example (Jupyter notebook), can be downloaded from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy data analytics 🙂
