
Contents

In this document, we'll walk through a step-by-step solution to the given case study.
Here are the document's contents:
1. Data Importing
2. Data Merging
3. Detection of Missing Values
4. Exploratory Data Analysis
5. EDA Summary and Business Recommendation
6. Data Cleaning
7. Correlation Analysis
8. Machine Learning Modelling
Preface
Although the page/slide count is high, I have made sure
that each slide uses a large font and contains only one key point
(or even less, as a single key point is sometimes spread over two
slides).
This is so that the step-by-step process is explained clearly
and thoroughly.
For a very short summary, please read the following 4 slides.
Short Summary
Short Summary
General steps taken in this project:
1. Load given files
2. Merge features seen in separate files into a unified dataframe
(training_data and test_data) & engineer features from in_time and
out_time files
3. Conduct EDA:
a. Visualization: Box Plot, Histogram, Bar Plot
b. Statistical Test: Chi-Square Test of Independence for categorical variables,
Mann-Whitney U-Test for comparing two sample means of numerical variables
Short Summary
4. Fill in missing values with KNN Imputer
5. Define false positives and false negatives, as well as the metric used to measure
model performance (recall)
6. Conduct modelling:
a. ML Model: Light GBM, CatBoost
b. Baseline Recall: 68% (CatBoost) and 81% (LGBM)
c. Techniques:
i. Hyperparameter Tuning
ii. Random Over Sampling
iii. Model Stacking
d. Final Model: Recall 85%, Accuracy 97%
Short Summary

Employees that are most likely to leave are:
young, single, working in the HR department, with low
satisfaction ratings (job, environment, work-life balance),
more overtime, and zero training sessions.
Problem Statement
Problem Statement
Company XYZ, which has around 4,000 employees, experiences 15% employee
attrition at any given time.
Management wants to curb employee attrition, as it's more desirable to
keep employees than to find new ones.
We are given a few files that contain employee data and their attrition status,
and we're asked to:
- Find potential factors that encourage attrition
- Give input on how to reduce attrition
- Build a model to predict employee attrition
1. Data Importing
1. Data Importing
Here are the files that were given to us and their usage:

File Name                  Details
data_dictionary.xlsx       Contains explanations for the training_data.csv columns
employee_survey_data.csv   Employee satisfaction survey
manager_survey_data.csv    Employees' performance ratings
in_time.csv                Time when each employee starts work
out_time.csv               Time when each employee finishes work
training_data.csv          Training data (has the Attrition column)
test_data.csv              Data to predict (has no Attrition column)


2. Data Merging
2. Data Merging
Problem: our features are not all in the same DataFrame.
This makes it difficult to conduct a thorough EDA and
machine learning modelling.
Solution: use pandas to join the features in
employee_survey_data and manager_survey_data onto
training_data and test_data. We do the join on EmployeeID, as sketched below.
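A minimal sketch of these joins in pandas, using the file names listed in the Data Importing section (the exact loading code in the notebook may differ):

```python
import pandas as pd

training_data = pd.read_csv("training_data.csv")
employee_survey = pd.read_csv("employee_survey_data.csv")
manager_survey = pd.read_csv("manager_survey_data.csv")

# Left-join the survey features onto the training data by EmployeeID,
# so every training row keeps its Attrition label
training_data = (training_data
                 .merge(employee_survey, on="EmployeeID", how="left")
                 .merge(manager_survey, on="EmployeeID", how="left"))

# The same joins are applied to test_data
```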
2. Data Merging
Problem: engineering features from in_time and out_time
Solution:
1. Drop columns that are entirely NaN (this implies the day is a
holiday, and no employees are present)
2. Subtract (out_time - in_time) to get the duration of work
they spend per day
3. Convert the duration into decimal hours (e.g. 7 hours 30
minutes becomes 7.5)
2. Data Merging
Problem: engineering features from in_time and out_time
Solution: (cont.)
4. Get the mean_duration of each employee by averaging their
work duration across all days (columns).
5. Get q1, q2, q3, and max_duration by applying functions
row-wise to get the Q1, median, Q3, and maximum work
duration that each employee has over the course of the
recorded data.
2. Data Merging
Problem: engineering features from in_time and out_time
Solution: (cont.)
6. Get the iqr_duration of each employee by calculating q3_duration -
q1_duration
7. Get the overtime_count (the number of overtime days) by counting how
many days in each row (each employee) have a value of > 8 hours
8. Get the number_of_leaves by counting how many days in each row have
a value of 0 (meaning the employee took a leave on that day)
2. Data Merging
Problem: engineering features from in_time and out_time
Solution: (cont.)
9. Merge our engineered features (q1, q2, q3, mean, max, and iqr
duration, as well as overtime_count and number_of_leaves)
into training_data and test_data. A sketch of steps 1-9 follows below.
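A minimal sketch of steps 1-9, assuming in_time.csv and out_time.csv have EmployeeID as their first column and one column per calendar day, and that a day with no punches is a leave day (the notebook's exact handling may differ):

```python
import pandas as pd

in_time = pd.read_csv("in_time.csv", index_col=0)
out_time = pd.read_csv("out_time.csv", index_col=0)

# 1. Drop columns that are entirely NaN (company-wide holidays)
in_time = in_time.dropna(axis=1, how="all")
out_time = out_time.dropna(axis=1, how="all")

# 2.-3. Daily work duration, converted to decimal hours (7h30m -> 7.5)
duration = out_time.apply(pd.to_datetime) - in_time.apply(pd.to_datetime)
duration = duration.apply(lambda col: col.dt.total_seconds() / 3600)
duration = duration.fillna(0)  # assumption: no punches on a day = leave day

# 4.-6. Row-wise summary statistics per employee
features = pd.DataFrame({
    "mean_duration": duration.mean(axis=1),
    "q1_duration": duration.quantile(0.25, axis=1),
    "q2_duration": duration.quantile(0.50, axis=1),
    "q3_duration": duration.quantile(0.75, axis=1),
    "max_duration": duration.max(axis=1),
})
features["iqr_duration"] = features["q3_duration"] - features["q1_duration"]

# 7.-8. Overtime days (> 8 hours) and leave days (0 hours)
features["overtime_count"] = (duration > 8).sum(axis=1)
features["number_of_leaves"] = (duration == 0).sum(axis=1)

# 9. Merge onto the main data (assuming the index holds EmployeeID)
training_data = training_data.merge(features, left_on="EmployeeID",
                                    right_index=True, how="left")
```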
2. Data Merging
Result: a single unified DataFrame per dataset (the merged preview is shown in the notebook).
3. Detection of Missing Values
3. Detection of Missing Values
Calling .info() on the pandas DataFrame shows that
there are 5 columns with missing values (see the sketch after this list):
- NumCompaniesWorked
- TotalWorkingYears
- EnvironmentSatisfaction
- JobSatisfaction
- WorkLifeBalance
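A quick way to surface these columns, as a sketch:

```python
# Overview of dtypes and non-null counts per column
training_data.info()

# Alternatively: the share of missing values per column, descending
missing_share = training_data.isna().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0])
```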
3. Detection of Missing Values
At most, the share of missing values in each of these
columns is less than 1.25% of all rows, so it's a very
small amount.
However, a problem arises because missing values are also
observed in test_data.csv. This means that if we want to
include these columns to make predictions, we have to fill
in the missing values.
3. Detection of Missing Values
We'll fill in the missing values after EDA, so that we
understand our data better first.
4. Exploratory Data Analysis
4. Exploratory Data Analysis
Before we do any EDA, we must have an aim in mind.
We are doing EDA to understand the relationship of each
variable to the employees' Attrition status.
The target variable is Attrition, which has the following
values:
- No, if the employee does not quit the company
- Yes, if the employee quits the company
4. Exploratory Data Analysis
- 84% of employees (in training_data) have No as their Attrition value
- 16% of employees have Yes as their Attrition value
This is in line with the problem described in the given guideline.
4. Exploratory Data Analysis
We have a total of 26 predictor variables to analyze, and 8
additional predictor variables derived from in_time and
out_time data.
We are not going to run through all of them one by one.
Instead, we're going to:
1. First, discuss how the EDA was done on our end
2. Second, discuss the predictor variables which are 'more
influential' on the Attrition status of each employee
4. Exploratory Data Analysis
For numerical features, we'll visualize the column using a box
plot and a histogram to see the distribution.

Example: Age
4. Exploratory Data Analysis
Our hypotheses would be:
Hypothesis 0 (H0): the means of the "Attrition-No" and
"Attrition-Yes" groups are the same
Hypothesis 1 (H1): there is a statistically significant difference
between the means of the two groups
4. Exploratory Data Analysis
For numerical features, we'll also do a two-means hypothesis
test.
Why?
Suppose we discover that the mean Age of Attrition-No
employees is 40, and the mean Age of Attrition-Yes employees
is 25.
Then we'd be fairly confident in saying that there is a statistically
significant difference.
4. Exploratory Data Analysis
But what if we discover that the mean Age of
Attrition-No employees is 27, and the mean Age of
Attrition-Yes employees is 25?
Could we be sure that both groups have statistically
different mean Ages? What measure lets us draw such a
conclusion with confidence?
That's why we do a two-means test.
4. Exploratory Data Analysis
The main two-means test is the t-test (two independent
sample t-test). However, it requires both samples to be normally
distributed.
What if our distribution is not normal?
Then we go with the Mann-Whitney U-Test. This is a
non-parametric test that compares the distributions of two
groups to detect a difference in central tendency.
4. Exploratory Data Analysis
It turns out that most of our numerical features do not satisfy
the normality assumption, so we're going to conduct the
Mann-Whitney U-test.
Reference:
https://www.statisticshowto.com/mann-whitney-u-test/
4. Exploratory Data Analysis
Conducting the Mann-Whitney U-test requires the scipy package in
Python, and we assume an alpha of 0.05:
- If the p-value < 0.05, we can reject H0
- If the p-value >= 0.05, we cannot reject H0
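A minimal sketch with scipy, using Age as the example column (variable names are illustrative):

```python
from scipy.stats import mannwhitneyu

no_group = training_data.loc[training_data["Attrition"] == "No", "Age"].dropna()
yes_group = training_data.loc[training_data["Attrition"] == "Yes", "Age"].dropna()

stat, p_value = mannwhitneyu(no_group, yes_group, alternative="two-sided")
if p_value < 0.05:
    print(f"p = {p_value:.4f}: reject H0 -- the two groups differ significantly")
else:
    print(f"p = {p_value:.4f}: cannot reject H0")
```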
4. Exploratory Data Analysis
For categorical features (and some numerical features with
only a few distinct values, e.g. the satisfaction survey ratings,
which take only the values 1.0, 2.0, 3.0, 4.0), we'll do a bar plot
visualization.
4. Exploratory Data Analysis
The next slide will inform us on how to interpret the visualization.
4. Exploratory Data Analysis

For employees who don't travel, 92% don't quit and 8% quit.
In terms of pure quantity, there are more employees who rarely
travel than who frequently travel.
4. Exploratory Data Analysis
We also do a statistical test to find out whether a categorical
feature is related to the Attrition variable (which is also a
categorical variable). For this, we use the Chi-Square Test of
Independence (a sketch follows the reference below).
Reference:
https://www.jmp.com/en_au/statistics-knowledge-portal/chi-square-test/chi-square-test-of-independence.html
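A minimal sketch using scipy, with BusinessTravel as the example feature (names are illustrative):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of the categorical feature vs Attrition
table = pd.crosstab(training_data["BusinessTravel"], training_data["Attrition"])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # p < 0.05 -> reject independence (H0)
```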
4. Exploratory Data Analysis
Here's how we conduct the EDA:
1. Select a predictor variable
2. Visualize it
3. Test whether the variable has a statistically significant 'relation'
with Attrition
4. Draw conclusions
The following slides show variables of note that we
thought were important to highlight.
4. Exploratory Data Analysis
To shorten some sentences in the next slides,
when we say that a certain variable 'is statistically
significant', it means that:
- If it is a numerical variable, we reject H0 on the
Mann-Whitney U-test: there is a significant difference
between the Attrition-No and Attrition-Yes groups.
- If it is a categorical variable, we reject H0 on the Chi-Square
Test of Independence.
4. Exploratory Data Analysis
Age
- Employees who quit tend to be younger.
- This variable is statistically significant.
4. Exploratory Data Analysis
Business Travel
- The less frequently an employee travels, the less likely they are to quit.
- This variable is statistically significant.
4. Exploratory Data Analysis

Department and Education Field
- Both are statistically significant
- High attrition is found in Human Resources
4. Exploratory Data Analysis
Marital Status
- Single employees are more likely to quit
- Most of our employees are married
- This variable is statistically significant
4. Exploratory Data Analysis
Monthly Income
- Employees who quit tend to have lower salaries
- This variable is almost statistically significant, with a p-value of 0.054 on the means test
4. Exploratory Data Analysis
Number of Companies Worked
- High attrition is found in employees who have worked at 1 or at more than 5 companies
- Low attrition is found in employees who have worked at 2-4 companies
- This variable is statistically significant
4. Exploratory Data Analysis

- Employees who quit tend to be those with less experience (lower age, fewer total
working years, fewer years at the company, less time with their current manager,
and fewer years since their last promotion).
- These five variables have high correlation with each other and
are statistically significant.
4. Exploratory Data Analysis

- Employees who give low ratings on the Environment Satisfaction, Job
Satisfaction, and Work-Life Balance surveys are more likely to quit
- These features are statistically significant, yet are not
highly correlated with one another
4. Exploratory Data Analysis
Performance Rating
- The only values present are 3 and 4 (which is odd, because ratings usually go from 1-5)
- This variable is not statistically significant with respect to the Attrition status of employees
4. Exploratory Data Analysis
Number of Training Times Last Year
- Employees who received no training have the highest attrition rate
- This variable is statistically significant
4. Exploratory Data Analysis

- Employees who leave the company have a higher average work duration
per day and more overtime days, but fewer days of leave.
4. Exploratory Data Analysis

The following features are removed because they either
contain a single value or have no predictive power to
differentiate between the Attrition statuses of employees:
- EmployeeCount
- StandardHours
- Gender
- Over18
4. Exploratory Data Analysis

The following features are removed because their distributions
are very similar to mean_duration, and can therefore be
represented by that one feature:
- Q1_duration
- Q2_duration
- Q3_duration
- Max_duration
- IQR_duration
4. Exploratory Data Analysis

If we want to see the visualizations and statistical test results for
other columns not mentioned in this document, we can open the
first Jupyter notebook and use Ctrl+F to find the column we're
looking for.
5. EDA Summary and Business Recommendation
5. EDA Summary and Business Recommendation

What factors can the company control to prevent
employees from quitting?
- Reduce travel requirements/frequency for employees. Better yet,
use more digitization to reduce the number of employees
required to travel
- Give employees more competitive income
- Improve employees' job and environment satisfaction, as well as
their work-life balance
5. EDA Summary and Business Recommendation
What factors can the company control to prevent employees from
quitting?
- (cont.)
- Give employees more training
- Give employees more flexibility in taking leave, as employees who
quit tend to be the ones who take fewer leaves
- Adjust workloads, or give employees training in productivity/task
management, so they can reduce their overtime days and their
daily work duration (hours)
5. EDA Summary and Business Recommendation

What kind of employees should the company keep an eye
on, because they are the most likely to quit, due to
factors that are rather difficult to control?
- Young, single employees
- Employees who work in the Human Resources department
(and have HR as their education field)
- Employees who have worked at either 1 or more than 5 other companies
5. EDA Summary and Business Recommendation

What kind of employees should the company keep an eye
on, because they are the most likely to quit, due to
factors that are rather difficult to control?
- (cont.)
- Employees who are at the beginning of their careers,
and have only spent a short time with the company
- Employees who have just been promoted
5. EDA Summary and Business Recommendation

These 'young' employees are perhaps:
- More eager to find new adventures
- Not yet 'firm' in their career / company choice
- Still highly sought after in the job market due to their 'youth'
- Not yet seeking the 'stability' that marriage brings
- Prioritizing wellbeing (work-life balance, etc.)
And so they would naturally be more likely to change companies.
6. Data Cleaning
6. Data Cleaning

Here we weigh the pros and cons of each option for handling the missing values:

Removing the missing values
- Pros: very easy to do
- Cons: we cannot do this, because there are missing values in the 1000 test rows that we have to predict

Median/mean imputation
- Pros: explainable
- Cons: the data becomes unnatural (all missing values receive the same imputed value)

KNN imputation
- Pros: imputed values are calculated from the data points most similar to the one with the missing value
- Cons: more 'abstract' and less explainable than median/mean imputation
6. Data Cleaning
After dropping the unnecessary columns, we impute the missing
values using a KNN imputer (a sketch follows below).
These steps are done both on training_data.csv and
test_data.csv.
So, at the end of this step, both training_data.csv and
test_data.csv have:
- No missing values
- The same exact features, except for the target variable (which
the test data does not have)
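A minimal sketch with scikit-learn's KNNImputer, fit on the training features and then applied to both files so that nothing leaks from the test data (n_neighbors=5 is an assumption; the notebook's setting may differ):

```python
from sklearn.impute import KNNImputer

# Impute using the numeric feature columns shared by both datasets
numeric_cols = test_data.select_dtypes("number").columns

imputer = KNNImputer(n_neighbors=5)
training_data[numeric_cols] = imputer.fit_transform(training_data[numeric_cols])
test_data[numeric_cols] = imputer.transform(test_data[numeric_cols])
```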
7. Correlation Analysis
7. Correlation Analysis

A correlation heatmap can be seen in the Jupyter notebook, as
it would be too big to include here and still be readable on the
slides (a sketch of how to produce one follows below).
Most features have little to no correlation with one another,
which is generally a good thing in a supervised learning task,
especially if we want to try linear regression models.
(But since we are not using linear regression, this should not
be a concern either way.)
7. Correlation Analysis

These columns are quite strongly correlated with one another:
- Age
- TotalWorkingYears
- YearsAtCompany
- YearsSinceLastPromotion
- YearsWithCurrManager
7. Correlation Analysis

A high correlation is also seen between the mean_duration and
overtime_count variables.
However, these two variables have a negative correlation with
number_of_leaves.
This again confirms that employees who do more overtime
tend to take fewer leaves. This could be a sign of heavy
workload.
8. Machine Learning Modelling - Preparation
8. Machine Learning Modelling

Let's review the challenges we face in this dataset:
- We have more than a couple of categorical variables
- If we do One-Hot Encoding, especially for the
'Education Field' and 'Job Role' features, we will get a
lot of columns filled with 0 and 1, on top of our
already large number of predictors
- We have a few highly correlated variables
- There are almost no linear relationships
8. Machine Learning Modelling
If we use distance-based models, we need to do feature scaling. If
we use sklearn tree models, we need to do one-hot encoding.
Fortunately, there are two models that are a great fit for our
problem.
They are LightGBM and CatBoost: gradient-boosted tree models that
can handle categorical variables without one-hot encoding and
have proven high performance in Kaggle competitions and other
benchmark datasets.
8. Machine Learning Modelling

- Gradient-boosted trees are tree-based models, so they do
not require feature scaling.
- LightGBM and CatBoost have built-in support for
categorical variables, so no encoding is needed before
passing in the data (see the sketch below).
- Multicollinearity is also not an issue in decision tree
models, as stated in:
https://sci-hub.se/https://link.springer.com/article/10.1007/s10462-011-9272-4 (page 8)
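A minimal sketch of how each library consumes categorical features directly (the column list is a hypothetical subset; the notebook's choices may differ):

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

cat_cols = ["BusinessTravel", "Department", "EducationField",
            "JobRole", "MaritalStatus"]

# LightGBM: cast categoricals to pandas 'category' dtype;
# LGBMClassifier detects them automatically at fit time
x_lgbm = x_train.copy()
for col in cat_cols:
    x_lgbm[col] = x_lgbm[col].astype("category")
lgbm = LGBMClassifier()
lgbm.fit(x_lgbm, y_train)

# CatBoost: pass the categorical column names via cat_features
cat_model = CatBoostClassifier(verbose=0)
cat_model.fit(x_train, y_train, cat_features=cat_cols)
```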
8. Machine Learning Modelling

Therefore, we are going to build a classification model to
predict employee attrition with these two algorithms.
But first, we need to define how we're going to evaluate the
performance of our models.
8. Machine Learning Modelling

In classification tasks, we use the following metrics:
- Precision
- Recall
- F1-Score
- Accuracy
8. Machine Learning Modelling

In the case of employee Attrition, let's identify the two types of
'false', where Positive = Attrition-Yes and Negative = Attrition-No:
- False Positive: we predict that the employee will leave the
company, but they actually stay
- False Negative: we predict that the employee will NOT
quit, but they actually quit
8. Machine Learning Modelling

A False Negative is the more 'dangerous' error, and we would like
our model to produce as few false negatives as possible.
If we think an employee is doing fine, but they quit and
we're not prepared - that would be disastrous.
Therefore, the metric to highlight is recall = TP / (TP + FN), as it
is the one tied to false negatives (a sketch follows below).
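As a sketch, recall and the false-negative count can be read off scikit-learn's confusion matrix (y_true and y_pred are illustrative names, with 1 = Attrition-Yes):

```python
from sklearn.metrics import confusion_matrix, recall_score

# For binary labels, ravel() unpacks the 2x2 confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("false negatives:", fn)
print("recall:", recall_score(y_true, y_pred))  # = tp / (tp + fn)
```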
8. Machine Learning Modelling
Last but not least, we split training_data.csv into:
- 65% Train
- 15% Validation, and
- 20% Test
We train the model on the train set, and use the validation set to evaluate
the various model-improvement techniques.
When we have a final model, we test it one last time on the test set to
get its estimated performance. A sketch of this split follows below.
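A minimal sketch of the 65/15/20 split with two calls to train_test_split (stratification and the random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

# First carve off the 20% test set ...
x_temp, x_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# ... then split the remaining 80% into 65/15 (0.15 / 0.80 = 0.1875)
x_train, x_val, y_train, y_val = train_test_split(
    x_temp, y_temp, test_size=0.1875, stratify=y_temp, random_state=42)
```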
8. Machine Learning Modelling - LGBM and CatBoost Modelling
8. Machine Learning Modelling

First Experiment: Baseline Model
Passing the predictors as x_train and the target as y_train, then
fitting them to our models with default settings, gives the following
performance:
- CatBoostClassifier: Recall 68%, Accuracy 93%
- LGBMClassifier: Recall 81%, Accuracy 95%
8. Machine Learning Modelling

Second Experiment: Hyperparameter Tuning
Using randomized search CV (sketched below), we search for the best
hyperparameters for our models while optimizing for recall. Result:
- CatBoostClassifier: Recall 84%, Accuracy 96%
- LGBMClassifier: Recall 83%, Accuracy 94%
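A minimal sketch with scikit-learn's RandomizedSearchCV, shown for LightGBM with a hypothetical search space (assuming y_train is encoded 0/1 with 1 = Attrition-Yes, so the 'recall' scorer applies directly):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMClassifier

# Hypothetical search space -- the notebook's actual ranges may differ
param_dist = {
    "n_estimators": randint(100, 1000),
    "num_leaves": randint(16, 128),
    "learning_rate": uniform(0.01, 0.3),
}

search = RandomizedSearchCV(
    LGBMClassifier(), param_dist, n_iter=50,
    scoring="recall", cv=5, random_state=42)
search.fit(x_train, y_train)

print(search.best_params_, search.best_score_)
```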
8. Machine Learning Modelling

Third Experiment: Random Over Sampling
In this step, we randomly duplicate rows whose Attrition value is
Yes until that class reaches 50% of the size of the Attrition-No
class (a sketch with imbalanced-learn follows below).
- CatBoostClassifier: Recall 84%, Accuracy 96%
- LGBMClassifier: Recall 82%, Accuracy 94%
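A minimal sketch using imbalanced-learn's RandomOverSampler (the random seed is an assumption):

```python
from imblearn.over_sampling import RandomOverSampler

# sampling_strategy=0.5: duplicate minority (Attrition-Yes) rows until the
# minority class is 50% the size of the majority (Attrition-No) class
ros = RandomOverSampler(sampling_strategy=0.5, random_state=42)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)
```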
8. Machine Learning Modelling

Fourth Experiment: Stacking
Stacking here means averaging the predictions from both
models (sometimes called soft voting or blending). Instead of
.predict(), we call .predict_proba() on both models.
The output of .predict_proba() is the probability that the
prediction is 0 (Attrition-No) or 1 (Attrition-Yes): not an exact
0 or 1, but a decimal indicating the probability.
8. Machine Learning Modelling

Fourth Experiment: Stacking (cont.)
We take the 'probabilities' from CatBoost and LGBM and
average them. Then, from that result:
- If the average predicted probability for class 0
(Attrition-No) is greater than for class 1, then the final
prediction is 0 (and vice versa), as sketched below
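A minimal sketch of this probability averaging (model and variable names are illustrative):

```python
# Column 1 of predict_proba() is P(Attrition = Yes) for each row
proba_cat = cat_model.predict_proba(x_test)[:, 1]
proba_lgbm = lgbm.predict_proba(x_test)[:, 1]

avg_proba = (proba_cat + proba_lgbm) / 2

# Predict class 1 whenever it is the more probable class
y_pred = (avg_proba > 0.5).astype(int)
```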
8. Machine Learning Modelling

Final Test Set Evaluation
Using tuned hyperparameters, and trained on random-oversampled data:
- CatBoostClassifier: Recall 0.81, Accuracy 0.96
- LGBMClassifier: Recall 0.82, Accuracy 0.96
Stacking the first two models:
- Stacked Model: Recall 0.85, Accuracy 0.97
8. Machine Learning Modelling

Conclusion
Our final model is a stack of two classifiers, a
CatBoostClassifier and an LGBMClassifier, both previously
tuned and trained on a dataset that has been resampled (to
reduce imbalance).
On our test set, the estimated recall is 85% and the accuracy is
97%.
8. Machine Learning Modelling

Warning
Our estimated performance may differ if you use a different
random seed during any step of the process. However, it
should still be in a similar range.
The submitted predictions for the 1000 rows of test_data.csv are
also expected to reach about 97% accuracy, but it could be more or less.
Final Conclusion
Conclusion
If the company wants to curb attrition, it needs to create a working environment that:
- Requires little to no travel
- Prioritizes wellbeing (work-life balance, satisfaction) and employee development
(training), as well as giving competitive benefits (salary)
- Has a productivity system that reduces overtime days
The company should watch out for employees who are:
- Young and single
- From an HR educational background / working in the HR department
- Recently promoted / have spent only a short time working at the company
as they are the most likely to leave.
Conclusion

Our Machine Learning solution uses advanced modelling
techniques, involving:
- Modelling using gradient-boosted tree models
- Hyperparameter Tuning
- Random Oversampling for the minority class
- Model Stacking
Conclusion

Our Machine Learning solution, if implemented and assumed
to have a recall of 85%, would be able to decrease the attrition
rate from 15% to (15% - 0.85 × 15%) = 2.25%.

Having a recall of 85% means that out of 100 employees who
will definitely churn, we can catch 85 of them.
Thank you
