
Contents

In this document, we'll walk through a step-by-step solution to the given case study.
Here are the document's contents:
1. Data Importing
2. Data Merging
3. Detection of Missing Values
4. Exploratory Data Analysis
5. EDA Summary and Business Recommendation
6. Data Cleaning
7. Correlation Analysis
8. Machine Learning Modelling
Preface
Although the page/slide count is high, I have made sure
that each slide uses a large font and contains only one key point
(or even less, as a single key point is sometimes spread over two
slides).
This is so that the step-by-step process is explained clearly
and thoroughly.
For a very short summary, please read the following 4 slides.
Short Summary
Short Summary
General steps taken in this project:
1. Load given files
2. Merge features seen in separate files into a unified dataframe
(training_data and test_data) & engineer features from in_time and
out_time files
3. Conduct EDA:
a. Visualization: Box Plot, Histogram, Bar Plot
b. Statistical Test: Chi-Square Test of Independence for categorical variables,
Mann-Whitney U-Test for comparing two sample means of numerical variables
Short Summary
4. Fill in missing values with KNN Imputer
5. Define false positives and false negatives, as well as the metric used to measure
model performance (recall)
6. Conduct modelling:
a. ML Model: Light GBM, CatBoost
b. Baseline Recall: 68% (CatBoost) and 81% (LGBM)
c. Techniques:
i. Hyperparameter Tuning
ii. Random Over Sampling
iii. Model Stacking
d. Final Model: Recall 85%, Accuracy 97%
Short Summary

Employees that are most likely to leave are:
young, single, working in the HR department, with low
satisfaction ratings (job, environment, work-life balance),
more overtime, and zero training sessions.
Problem Statement
Problem Statement
Company XYZ, which has around 4,000 employees, experiences 15% employee
attrition at any given time.
Management wants to curb employee attrition, as it's more desirable to
keep employees than to find new ones.
We are given a few files that contain employee data and their attrition status,
and we're asked to:
- Find potential factors that encourage attrition
- Give input on how to reduce attrition
- Build a model to predict employee attrition
1. Data Importing
1. Data Importing
Here are the files that were given to us and their usage:

File Name                  Details
data_dictionary.xlsx       Contains explanations for the training_data.csv columns
employee_survey_data.csv   Employee satisfaction survey
manager_survey_data.csv    Employees' performance ratings
in_time.csv                Time when each employee starts work
out_time.csv               Time when each employee finishes work
training_data.csv          Training data (has the Attrition column)
test_data.csv              Data to predict (has no Attrition column)


2. Data Merging
2. Data Merging
Problem: our features are not all in the same DataFrame.
This makes it difficult to conduct a thorough EDA and
machine learning modelling.
Solution: use pandas to join the features in
employee_survey_data and manager_survey_data onto
training_data and test_data. We do the join on EmployeeID, as sketched below.
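A minimal sketch of these joins in pandas, using the file names listed in the Data Importing section (the exact loading code in the notebook may differ):

```python
import pandas as pd

training_data = pd.read_csv("training_data.csv")
employee_survey = pd.read_csv("employee_survey_data.csv")
manager_survey = pd.read_csv("manager_survey_data.csv")

# Left-join the survey features onto the training data by EmployeeID,
# so every training row keeps its Attrition label
training_data = (training_data
                 .merge(employee_survey, on="EmployeeID", how="left")
                 .merge(manager_survey, on="EmployeeID", how="left"))

# The same joins are applied to test_data
```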
2. Data Merging
Problem: engineering features from in_time and out_time
Solution:
1. Drop columns that are entirely NaN (this implies the day is a
holiday, and no employees are present)
2. Subtract (out_time - in_time) to get the duration of work
they spend per day
3. Convert the duration into decimal hours (e.g. 7 hours 30
minutes becomes 7.5)
2. Data Merging
Problem: engineering features from in_time and out_time
Solution: (cont.)
4. Get the mean_duration of each employee by averaging their
work duration across all days (columns).
5. Get q1, q2, q3, and max_duration by applying functions
row-wise to get the Q1, median, Q3, and maximum work
duration that each employee has over the course of the
recorded data.
2. Data Merging
Problem: engineering features from in_time and out_time
Solution: (cont.)
6. Get the iqr_duration of each employee by calculating q3_duration -
q1_duration
7. Get the overtime_count (the number of overtime days) by counting how
many days in each row (each employee) have a value of > 8 hours
8. Get the number_of_leaves by counting how many days in each row have
a value of 0 (meaning the employee took a leave on that day)
2. Data Merging
Problem: engineering features from in_time and out_time
Solution: (cont.)
9. Merge our engineered features (q1, q2, q3, mean, max, and iqr
duration, as well as overtime_count and number_of_leaves)
into training_data and test_data. A sketch of steps 1-9 follows below.
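A minimal sketch of steps 1-9, assuming in_time.csv and out_time.csv have EmployeeID as their first column and one column per calendar day, and that a day with no punches is a leave day (the notebook's exact handling may differ):

```python
import pandas as pd

in_time = pd.read_csv("in_time.csv", index_col=0)
out_time = pd.read_csv("out_time.csv", index_col=0)

# 1. Drop columns that are entirely NaN (company-wide holidays)
in_time = in_time.dropna(axis=1, how="all")
out_time = out_time.dropna(axis=1, how="all")

# 2.-3. Daily work duration, converted to decimal hours (7h30m -> 7.5)
duration = out_time.apply(pd.to_datetime) - in_time.apply(pd.to_datetime)
duration = duration.apply(lambda col: col.dt.total_seconds() / 3600)
duration = duration.fillna(0)  # assumption: no punches on a day = leave day

# 4.-6. Row-wise summary statistics per employee
features = pd.DataFrame({
    "mean_duration": duration.mean(axis=1),
    "q1_duration": duration.quantile(0.25, axis=1),
    "q2_duration": duration.quantile(0.50, axis=1),
    "q3_duration": duration.quantile(0.75, axis=1),
    "max_duration": duration.max(axis=1),
})
features["iqr_duration"] = features["q3_duration"] - features["q1_duration"]

# 7.-8. Overtime days (> 8 hours) and leave days (0 hours)
features["overtime_count"] = (duration > 8).sum(axis=1)
features["number_of_leaves"] = (duration == 0).sum(axis=1)

# 9. Merge onto the main data (assuming the index holds EmployeeID)
training_data = training_data.merge(features, left_on="EmployeeID",
                                    right_index=True, how="left")
```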
2. Data Merging
Result: a single unified DataFrame per dataset (the merged preview is shown in the notebook).
3. Detection of Missing Values
3. Detection of Missing Values
Calling .info() on the pandas DataFrame shows that
there are 5 columns with missing values (see the sketch after this list):
- NumCompaniesWorked
- TotalWorkingYears
- EnvironmentSatisfaction
- JobSatisfaction
- WorkLifeBalance
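A quick way to surface these columns, as a sketch:

```python
# Overview of dtypes and non-null counts per column
training_data.info()

# Alternatively: the share of missing values per column, descending
missing_share = training_data.isna().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0])
```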
3. Detection of Missing Values
At most, the share of missing values in each of these
columns is less than 1.25% of all rows, so it's a very
small amount.
However, a problem arises because missing values are also
observed in test_data.csv. This means that if we want to
include these columns to make predictions, we have to fill
in the missing values.
3. Detection of Missing Values
We'll fill in the missing values after EDA, so that we
understand our data better first.
4. Exploratory Data Analysis
4. Exploratory Data Analysis
Before we do any EDA, we must have an aim in mind.
We are doing EDA to understand the relationship of each
variable to the employees' Attrition status.
The target variable is Attrition, which has the following
values:
- No, if the employee does not quit the company
- Yes, if the employee quits the company
4. Exploratory Data Analysis
- 84% of employees (in training_data) have No as their Attrition value
- 16% of employees have Yes as their Attrition value
This is in line with the problem described in the given guideline.
4. Exploratory Data Analysis
We have a total of 26 predictor variables to analyze, and 8
additional predictor variables derived from in_time and
out_time data.
We are not going to run through all of them one by one.
Instead, we're going to:
1. First, discuss how the EDA was done on our end
2. Second, discuss the predictor variables which are 'more
influential' on the Attrition status of each employee
4. Exploratory Data Analysis
For numerical features, we'll visualize the column using a box
plot and a histogram to see the distribution.

Example: Age
4. Exploratory Data Analysis
Our hypotheses would be:
Hypothesis 0 (H0): the means of the "Attrition-No" and
"Attrition-Yes" groups are the same
Hypothesis 1 (H1): there is a statistically significant difference
between the means of the two groups
4. Exploratory Data Analysis
For numerical features, we'll also do a two-means hypothesis
test.
Why?
Suppose we discover that the mean Age of Attrition-No
employees is 40, and the mean Age of Attrition-Yes employees
is 25.
Then we'd be fairly confident in saying that there is a statistically
significant difference.
4. Exploratory Data Analysis
But what if we discover that the mean Age of
Attrition-No employees is 27, and the mean Age of
Attrition-Yes employees is 25?
Could we be sure that both groups have statistically
different mean Ages? What measure lets us draw such a
conclusion with confidence?
That's why we do a two-means test.
4. Exploratory Data Analysis
The main two-means test is the t-test (two independent
sample t-test). However, it requires both samples to be normally
distributed.
What if our distribution is not normal?
Then we go with the Mann-Whitney U-Test. This is a
non-parametric test that compares the distributions of two
groups to detect a difference in central tendency.
4. Exploratory Data Analysis
It turns out that most of our numerical features do not satisfy
the normality assumption, so we're going to conduct the
Mann-Whitney U-test.
Reference:
https://www.statisticshowto.com/mann-whitney-u-test/
4. Exploratory Data Analysis
Conducting the Mann-Whitney U-test requires the scipy package in
Python, and we assume an alpha of 0.05:
- If the p-value < 0.05, we can reject H0
- If the p-value >= 0.05, we cannot reject H0
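A minimal sketch with scipy, using Age as the example column (variable names are illustrative):

```python
from scipy.stats import mannwhitneyu

no_group = training_data.loc[training_data["Attrition"] == "No", "Age"].dropna()
yes_group = training_data.loc[training_data["Attrition"] == "Yes", "Age"].dropna()

stat, p_value = mannwhitneyu(no_group, yes_group, alternative="two-sided")
if p_value < 0.05:
    print(f"p = {p_value:.4f}: reject H0 -- the two groups differ significantly")
else:
    print(f"p = {p_value:.4f}: cannot reject H0")
```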
4. Exploratory Data Analysis
For categorical features (and some numerical features with
only a few distinct values, e.g. the satisfaction survey ratings,
which take only the values 1.0, 2.0, 3.0, 4.0), we'll do a bar plot
visualization.
4. Exploratory Data Analysis
The next slide will inform us on how to interpret the visualization.
4. Exploratory Data Analysis

For employees who don't travel, 92% don't quit and 8% quit.
In terms of pure quantity, there are more employees who rarely
travel than who frequently travel.
4. Exploratory Data Analysis
We also do a statistical test to find out whether a categorical
feature is related to the Attrition variable (which is also a
categorical variable). For this, we use the Chi-Square Test of
Independence (a sketch follows the reference below).
Reference:
https://www.jmp.com/en_au/statistics-knowledge-portal/chi-square-test/chi-square-test-of-independence.html
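A minimal sketch using scipy, with BusinessTravel as the example feature (names are illustrative):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of the categorical feature vs Attrition
table = pd.crosstab(training_data["BusinessTravel"], training_data["Attrition"])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # p < 0.05 -> reject independence (H0)
```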
4. Exploratory Data Analysis
Here's how we conduct the EDA:
1. Select a predictor variable
2. Visualize it
3. Test whether the variable has a statistically significant 'relation'
with Attrition
4. Draw conclusions
The following slides show variables of note that we
thought were important to highlight.
4. Exploratory Data Analysis
To shorten some sentences in the next slides,
when we say that a certain variable 'is statistically
significant', it means that:
- If it is a numerical variable, we reject H0 on the
Mann-Whitney U-test: there is a significant difference
between the Attrition-No and Attrition-Yes groups.
- If it is a categorical variable, we reject H0 on the Chi-Square
Test of Independence.
4. Exploratory Data Analysis
Age
- Employees who quit tend to be younger.
- This variable is statistically significant.
4. Exploratory Data Analysis
Business Travel
- The less frequently an employee travels, the less likely they are to quit.
- This variable is statistically significant.
4. Exploratory Data Analysis

Department and Education Field
- Both are statistically significant
- High attrition is found in Human Resources
4. Exploratory Data Analysis
Marital Status
- Single employees are more likely to quit
- Most of our employees are married
- This variable is statistically significant
4. Exploratory Data Analysis
Monthly Income
- Employees who quit tend to have lower salaries
- This variable is almost statistically significant, with a p-value of 0.054 on the means test
4. Exploratory Data Analysis
Number of Companies Worked
- High attrition is found in employees who have worked at 1 or at more than 5 companies
- Low attrition is found in employees who have worked at 2-4 companies
- This variable is statistically significant
4. Exploratory Data Analysis

- Employees who quit tend to be those with less experience (lower age, fewer total
working years, fewer years at the company, less time with their current manager,
and fewer years since their last promotion).
- These five variables have high correlation with each other and
are statistically significant.
4. Exploratory Data Analysis

- Employees who give low ratings on the Environment Satisfaction, Job
Satisfaction, and Work-Life Balance surveys are more likely to quit
- These features are statistically significant, yet are not
highly correlated with one another
4. Exploratory Data Analysis
Performance Rating
- The only values present are 3 and 4 (which is odd, because ratings usually go from 1-5)
- This variable is not statistically significant with respect to the Attrition status of employees
4. Exploratory Data Analysis
Number of Training Times Last Year
- Employees who received no training have the highest attrition rate
- This variable is statistically significant
4. Exploratory Data Analysis

- Employees who leave the company have a higher average work duration
per day and more overtime days, but fewer days of leave.
4. Exploratory Data Analysis

The following features are removed because they either
contain a single value or have no predictive power to
differentiate between the Attrition statuses of employees:
- EmployeeCount
- StandardHours
- Gender
- Over18
4. Exploratory Data Analysis

The following features are removed because their distributions
are very similar to mean_duration, and can therefore be
represented by that one feature:
- Q1_duration
- Q2_duration
- Q3_duration
- Max_duration
- IQR_duration
4. Exploratory Data Analysis

If we want to see the visualizations and statistical test results for
other columns not mentioned in this document, we can open the
first Jupyter notebook and use Ctrl+F to find the column we're
looking for.
5. EDA Summary and Business Recommendation
5. EDA Summary and Business Recommendation

What factors can the company control to prevent
employees from quitting?
- Reduce travel requirements/frequency for employees. Better yet,
use more digitization to reduce the number of employees
required to travel
- Give employees more competitive income
- Improve employees' job and environment satisfaction, as well as
their work-life balance
5. EDA Summary and Business Recommendation
What factors can the company control to prevent employees from
quitting?
- (cont.)
- Give employees more training
- Give employees more flexibility in taking leave, as employees who
quit tend to be the ones who take fewer leaves
- Adjust workloads, or give employees training in productivity/task
management, so they can reduce their overtime days and their
daily work duration (hours)
5. EDA Summary and Business Recommendation

What kind of employees should the company keep an eye
on, because they are the most likely to quit, due to
factors that are rather difficult to control?
- Young, single employees
- Employees who work in the Human Resources department
(and have HR as their education field)
- Employees who have worked at either 1 or more than 5 other companies
5. EDA Summary and Business Recommendation

What kind of employees should the company keep an eye
on, because they are the most likely to quit, due to
factors that are rather difficult to control?
- (cont.)
- Employees who are at the beginning of their careers,
and have only spent a short time with the company
- Employees who have just been promoted
5. EDA Summary and Business Recommendation

These 'young' employees are perhaps:
- More eager to find new adventures
- Not yet 'firm' in their career / company choice
- Still highly sought after in the job market due to their 'youth'
- Not yet seeking the 'stability' that marriage brings
- Prioritizing wellbeing (work-life balance, etc.)
And so they would naturally be more likely to change companies.
6. Data Cleaning
6. Data Cleaning

Here we weigh the pros and cons of each option for handling the missing values:

Removing the missing values
- Pros: very easy to do
- Cons: we cannot do this, because there are missing values in the 1000 test rows that we have to predict

Median/mean imputation
- Pros: explainable
- Cons: the data becomes unnatural (all missing values receive the same imputed value)

KNN imputation
- Pros: imputed values are calculated from the data points most similar to the one with the missing value
- Cons: more 'abstract' and less explainable than median/mean imputation
6. Data Cleaning
After dropping the unnecessary columns, we impute the missing
values using a KNN imputer (a sketch follows below).
These steps are done both on training_data.csv and
test_data.csv.
So, at the end of this step, both training_data.csv and
test_data.csv have:
- No missing values
- The same exact features, except for the target variable (which
the test data does not have)
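A minimal sketch with scikit-learn's KNNImputer, fit on the training features and then applied to both files so that nothing leaks from the test data (n_neighbors=5 is an assumption; the notebook's setting may differ):

```python
from sklearn.impute import KNNImputer

# Impute using the numeric feature columns shared by both datasets
numeric_cols = test_data.select_dtypes("number").columns

imputer = KNNImputer(n_neighbors=5)
training_data[numeric_cols] = imputer.fit_transform(training_data[numeric_cols])
test_data[numeric_cols] = imputer.transform(test_data[numeric_cols])
```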
7. Correlation Analysis
7. Correlation Analysis

A correlation heatmap can be seen in the Jupyter notebook, as
it would be too big to include here and still be readable on the
slides (a sketch of how to produce one follows below).
Most features have little to no correlation with one another,
which is generally a good thing in a supervised learning task,
especially if we want to try linear regression models.
(But since we are not using linear regression, this should not
be a concern either way.)
7. Correlation Analysis

These columns are quite strongly correlated with one another:
- Age
- TotalWorkingYears
- YearsAtCompany
- YearsSinceLastPromotion
- YearsWithCurrManager
7. Correlation Analysis

A high correlation is also seen between the mean_duration and
overtime_count variables.
However, these two variables have a negative correlation with
number_of_leaves.
This again confirms that employees who do more overtime
tend to take fewer leaves. This could be a sign of heavy
workload.
8. Machine Learning Modelling - Preparation
8. Machine Learning Modelling

Let's review the challenges we face in this dataset:
- We have more than a couple of categorical variables
- If we do One-Hot Encoding, especially for the
'Education Field' and 'Job Role' features, we will get a
lot of columns filled with 0 and 1, on top of our
already large number of predictors
- We have a few highly correlated variables
- There are almost no linear relationships
8. Machine Learning Modelling
If we use distance-based models, we need to do feature scaling. If
we use sklearn tree models, we need to do one-hot encoding.
Fortunately, there are two models that are a great fit for our
problem.
They are LightGBM and CatBoost: gradient-boosted tree models that
can handle categorical variables without one-hot encoding and
have proven high performance in Kaggle competitions and other
benchmark datasets.
8. Machine Learning Modelling

- Gradient-boosted trees are tree-based models, so they do
not require feature scaling.
- LightGBM and CatBoost have built-in support for
categorical variables, so no encoding is needed before
passing in the data (see the sketch below).
- Multicollinearity is also not an issue in decision tree
models, as stated in:
https://sci-hub.se/https://link.springer.com/article/10.1007/s10462-011-9272-4 (page 8)
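A minimal sketch of how each library consumes categorical features directly (the column list is a hypothetical subset; the notebook's choices may differ):

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

cat_cols = ["BusinessTravel", "Department", "EducationField",
            "JobRole", "MaritalStatus"]

# LightGBM: cast categoricals to pandas 'category' dtype;
# LGBMClassifier detects them automatically at fit time
x_lgbm = x_train.copy()
for col in cat_cols:
    x_lgbm[col] = x_lgbm[col].astype("category")
lgbm = LGBMClassifier()
lgbm.fit(x_lgbm, y_train)

# CatBoost: pass the categorical column names via cat_features
cat_model = CatBoostClassifier(verbose=0)
cat_model.fit(x_train, y_train, cat_features=cat_cols)
```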
8. Machine Learning Modelling

Therefore, we are going to build a classification model to
predict employee attrition with these two algorithms.
But first, we need to define how we're going to evaluate the
performance of our models.
8. Machine Learning Modelling

In classification tasks, we use the following metrics:
- Precision
- Recall
- F1-Score
- Accuracy
8. Machine Learning Modelling

In the case of employee Attrition, let's identify the two types of
'false', where Positive = Attrition-Yes and Negative = Attrition-No:
- False Positive: we predict that the employee will leave the
company, but they actually stay
- False Negative: we predict that the employee will NOT
quit, but they actually quit
8. Machine Learning Modelling

A False Negative is the more 'dangerous' error, and we would like
our model to produce as few false negatives as possible.
If we think an employee is doing fine, but they quit and
we're not prepared - that would be disastrous.
Therefore, the metric to highlight is recall = TP / (TP + FN), as it
is the one tied to false negatives (a sketch follows below).
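As a sketch, recall and the false-negative count can be read off scikit-learn's confusion matrix (y_true and y_pred are illustrative names, with 1 = Attrition-Yes):

```python
from sklearn.metrics import confusion_matrix, recall_score

# For binary labels, ravel() unpacks the 2x2 confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("false negatives:", fn)
print("recall:", recall_score(y_true, y_pred))  # = tp / (tp + fn)
```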
8. Machine Learning Modelling
Last but not least, we split training_data.csv into:
- 65% Train
- 15% Validation, and
- 20% Test
We train the model on the train set, and use the validation set to evaluate
the various model-improvement techniques.
When we have a final model, we test it one last time on the test set to
get its estimated performance. A sketch of this split follows below.
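A minimal sketch of the 65/15/20 split with two calls to train_test_split (stratification and the random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# First carve off the 20% test set ...
x_temp, x_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# ... then split the remaining 80% into 65/15 (0.15 / 0.80 = 0.1875)
x_train, x_val, y_train, y_val = train_test_split(
    x_temp, y_temp, test_size=0.1875, stratify=y_temp, random_state=42)
```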
8. Machine Learning Modelling - LGBM and CatBoost Modelling
8. Machine Learning Modelling

First Experiment: Baseline Model
Passing the predictors as x_train and the target as y_train, then
fitting them to our models with default settings, gives the following
performance:
- CatBoostClassifier: Recall 68%, Accuracy 93%
- LGBMClassifier: Recall 81%, Accuracy 95%
8. Machine Learning Modelling

Second Experiment: Hyperparameter Tuning
Using randomized search CV (sketched below), we search for the best
hyperparameters for our models while optimizing for recall. Result:
- CatBoostClassifier: Recall 84%, Accuracy 96%
- LGBMClassifier: Recall 83%, Accuracy 94%
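A minimal sketch with scikit-learn's RandomizedSearchCV, shown for LightGBM with a hypothetical search space (assuming y_train is encoded 0/1 with 1 = Attrition-Yes, so the 'recall' scorer applies directly):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMClassifier

# Hypothetical search space -- the notebook's actual ranges may differ
param_dist = {
    "n_estimators": randint(100, 1000),
    "num_leaves": randint(16, 128),
    "learning_rate": uniform(0.01, 0.3),
}

search = RandomizedSearchCV(
    LGBMClassifier(), param_dist, n_iter=50,
    scoring="recall", cv=5, random_state=42)
search.fit(x_train, y_train)

print(search.best_params_, search.best_score_)
```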
8. Machine Learning Modelling

Third Experiment: Random Over Sampling
In this step, we randomly duplicate rows whose Attrition value is
Yes until that class reaches 50% of the size of the Attrition-No
class (a sketch with imbalanced-learn follows below).
- CatBoostClassifier: Recall 84%, Accuracy 96%
- LGBMClassifier: Recall 82%, Accuracy 94%
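A minimal sketch using imbalanced-learn's RandomOverSampler (the random seed is an assumption):

```python
from imblearn.over_sampling import RandomOverSampler

# sampling_strategy=0.5: duplicate minority (Attrition-Yes) rows until the
# minority class is 50% the size of the majority (Attrition-No) class
ros = RandomOverSampler(sampling_strategy=0.5, random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)
```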
8. Machine Learning Modelling

Fourth Experiment: Stacking
Stacking here means averaging the predictions from both
models (sometimes called soft voting or blending). Instead of
.predict(), we call .predict_proba() on both models.
The output of .predict_proba() is the probability that the
prediction is 0 (Attrition-No) or 1 (Attrition-Yes): not an exact
0 or 1, but a decimal indicating the probability.
8. Machine Learning Modelling

Fourth Experiment: Stacking (cont.)
We take the 'probabilities' from CatBoost and LGBM and
average them. Then, from that result:
- If the average predicted probability for class 0
(Attrition-No) is greater than for class 1, then the final
prediction is 0 (and vice versa), as sketched below
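A minimal sketch of this probability averaging (model and variable names are illustrative):

```python
# Column 1 of predict_proba() is P(Attrition = Yes) for each row
proba_cat = cat_model.predict_proba(x_test)[:, 1]
proba_lgbm = lgbm.predict_proba(x_test)[:, 1]

avg_proba = (proba_cat + proba_lgbm) / 2

# Predict class 1 whenever it is the more probable class
y_pred = (avg_proba > 0.5).astype(int)
```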
8. Machine Learning Modelling

Final Test Set Evaluation
Using tuned hyperparameters, and trained on random-oversampled data:
- CatBoostClassifier: Recall 0.81, Accuracy 0.96
- LGBMClassifier: Recall 0.82, Accuracy 0.96
Stacking the first two models:
- Stacked Model: Recall 0.85, Accuracy 0.97
8. Machine Learning Modelling

Conclusion
Our final model is a stack of two classifiers, a
CatBoostClassifier and an LGBMClassifier, both previously
tuned and trained on a dataset that has been resampled (to
reduce imbalance).
On our test set, the estimated recall is 85% and the accuracy is
97%.
8. Machine Learning Modelling

Warning
Our estimated performance may differ if you use a different
random seed during any step of the process. However, it
should still be in a similar range.
The submitted predictions for the 1000 rows of test_data.csv are
also expected to reach about 97% accuracy, but it could be more or less.
Final Conclusion
Conclusion
If the company wants to curb attrition, it needs to create a working environment that:
- Requires little to no travel
- Prioritizes wellbeing (work-life balance, satisfaction) and employee development
(training), as well as giving competitive benefits (salary)
- Has a productivity system that reduces overtime days
The company should watch out for employees who are:
- Young and single
- From an HR educational background / working in the HR department
- Recently promoted / have spent only a short time working at the company
as they are the most likely to leave.
Conclusion

Our Machine Learning solution uses advanced modelling
techniques, involving:
- Modelling using gradient-boosted tree models
- Hyperparameter Tuning
- Random Oversampling for the minority class
- Model Stacking
Conclusion

Our Machine Learning solution, if implemented and assumed
to have a recall of 85%, would be able to decrease the attrition
rate from 15% to (15% - 0.85 × 15%) = 2.25%.

Having a recall of 85% means that out of 100 employees who
will definitely churn, we can catch 85 of them.
Thank you
