In this document, we'll walk through a step-by-step solution to the given case study.
Here are the document's contents:
1. Data Importing
2. Data Merging
3. Detection of Missing Values
4. Exploratory Data Analysis
5. EDA Summary and Business Recommendation
6. Data Cleaning
7. Correlation Analysis
8. Machine Learning Modelling
Preface
Although the page/slide count is high, I have made sure
that each slide uses a large font and contains at most 1 key point
(some key points are even spread over 2 slides).
This is so that the step-by-step process is explained clearly
and thoroughly.
For a very short summary, please read the following 4 slides.
Short Summary
General steps taken in this project:
1. Load given files
2. Merge features seen in separate files into a unified dataframe
(training_data and test_data) & engineer features from in_time and
out_time files
3. Conduct EDA:
a. Visualization: Box Plot, Histogram, Bar Plot
b. Statistical Tests: Chi-Square Test of Independence for categorical variables,
Mann-Whitney U-Test for comparing two groups on numerical variables
4. Fill in missing values with KNN Imputer
5. Define false positive and false negative as well as metric to measure model
performance (recall)
6. Conduct modelling:
a. ML Models: LightGBM, CatBoost
b. Baseline Recall: 68% (CatBoost) and 81% (LGBM)
c. Techniques:
i. Hyperparameter Tuning
ii. Random Over Sampling
iii. Model Stacking
d. Final Model: Recall 85%, Accuracy 97%
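To make the two reported metrics concrete, here is a minimal sketch of how recall and accuracy are computed from predictions. It assumes that the positive class (1) means "Attrition-Yes", so a false negative is an employee who quits but was predicted to stay. The labels below are made up for illustration and are not the study-case data; `recall_and_accuracy` is a hypothetical helper name.

```python
def recall_and_accuracy(y_true, y_pred):
    """Return (recall, accuracy) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # quitters correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # quitters the model missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayers correctly predicted
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return recall, accuracy


# Made-up labels: 1 = Attrition-Yes (assumed positive class), 0 = Attrition-No.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

recall, accuracy = recall_and_accuracy(y_true, y_pred)
print(f"recall = {recall:.2f}, accuracy = {accuracy:.2f}")
```

Recall only looks at the employees who actually quit, which is why it can be much lower than accuracy when quitters are a small share of the data.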
Short Summary
Example: Age
4. Exploratory Data Analysis
Our hypotheses would be:
Hypothesis 0 (H0): the means of “Attrition-No employees”
and “Attrition-Yes employees” are the same
Hypothesis 1 (H1): there is a statistically significant difference
between the means of both groups
4. Exploratory Data Analysis
For numerical features, we'll also do a two-means hypothesis
test.
Why?
Suppose we discover that the mean Age of Attrition-No
employees is 40, and the mean Age of Attrition-Yes employees
is 25.
With a gap that large, we'd be fairly confident in saying that there is a
statistically significant difference.
4. Exploratory Data Analysis
But what if we discover that the mean Age of
Attrition-No employees is 27, and the mean Age of
Attrition-Yes employees is 25?
Could we be sure that both groups have a statistically
different mean Age? What measure lets us draw such a
conclusion with confidence?
That's why we do a two-means test.
4. Exploratory Data Analysis
The main two-means test is the t-test (two independent
samples t-test). However, it assumes that both samples are
normally distributed.
What if our distribution is not normal?
Then we go with the Mann-Whitney U-Test. This is a
non-parametric alternative. Strictly speaking, it tests whether
values in one group tend to be larger than in the other (a
difference in distributions) rather than a difference in means.
4. Exploratory Data Analysis
It turns out that most of our numerical features do not follow
the normality assumption, so we're going to conduct the
Mann-Whitney U-Test.
Reference:
https://www.statisticshowto.com/mann-whitney-u-test/
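The slides do not say how normality was checked. As one possible approach, the sketch below applies the Shapiro-Wilk test from scipy to a made-up, right-skewed feature. The data, the variable names, and the choice of Shapiro-Wilk are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import shapiro

ALPHA = 0.05  # assumed significance level

# Made-up right-skewed values standing in for a numerical feature.
rng = np.random.default_rng(0)
skewed_feature = rng.exponential(scale=5.0, size=500)

# Shapiro-Wilk: H0 says the sample comes from a normal distribution.
stat, p_value = shapiro(skewed_feature)

looks_normal = p_value >= ALPHA
print(f"W = {stat:.3f}, p-value = {p_value:.3g}, looks normal: {looks_normal}")
```

A feature for which H0 is rejected here would be a candidate for a non-parametric test rather than the t-test.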
4. Exploratory Data Analysis
Conducting the Mann-Whitney U-Test in Python requires the scipy
package, and we assume an alpha of 0.05.
- If the p-value < 0.05, we reject H0
- If the p-value >= 0.05, we fail to reject H0
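The decision rule above can be sketched with `scipy.stats.mannwhitneyu`. The ages below are synthetic and made up for illustration (they are not the study-case data), and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

ALPHA = 0.05  # significance level stated in the slides

# Synthetic ages, for illustration only.
rng = np.random.default_rng(0)
age_attrition_no = rng.normal(loc=40, scale=9, size=300)   # hypothetical stayers
age_attrition_yes = rng.normal(loc=30, scale=9, size=60)   # hypothetical quitters

# Two-sided test: H0 says neither group tends to have larger values.
u_stat, p_value = mannwhitneyu(
    age_attrition_no, age_attrition_yes, alternative="two-sided"
)

reject_h0 = p_value < ALPHA
print(f"U = {u_stat:.1f}, p-value = {p_value:.3g}, reject H0: {reject_h0}")
```

With real data, the two arrays would be the values of one numerical feature split by the Attrition label.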
4. Exploratory Data Analysis
For categorical features (and some numerical features with a low
variety of values, e.g. Satisfaction Survey Rating, which is only 1.0,
2.0, 3.0, 4.0), we'll do a bar plot visualization.
4. Exploratory Data Analysis
The next slide will inform us on how to interpret the visualization.
4. Exploratory Data Analysis
For employees who don't travel, 92% do not quit and 8% quit.
In terms of pure quantity, there are more employees who rarely
travel than employees who frequently travel.
4. Exploratory Data Analysis
We also do a statistical test to find out if a categorical
feature is related to the Attrition variable (which is also a
categorical variable). To achieve this, we use the Chi-Square
Test of Independence.
Reference:
https://www.jmp.com/en_au/statistics-knowledge-portal/chi-square-test/chi-square-test-of-independence.html
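A minimal sketch of the Chi-Square Test of Independence, assuming `scipy.stats.chi2_contingency` and the same 0.05 alpha used for the numerical features. The contingency table is made up for illustration and does not contain the study-case counts.

```python
import numpy as np
from scipy.stats import chi2_contingency

ALPHA = 0.05  # assumed to match the alpha used for the numerical features

# Made-up counts. Rows: travel category. Columns: Attrition-No, Attrition-Yes.
observed = np.array([
    [92, 8],     # hypothetical employees who don't travel
    [600, 100],  # hypothetical employees who rarely travel
    [150, 50],   # hypothetical employees who frequently travel
])

# H0: the row variable and the column variable are independent.
chi2, p_value, dof, expected = chi2_contingency(observed)

reject_h0 = p_value < ALPHA
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.3g}, reject H0: {reject_h0}")
```

In practice, the table of counts would come from cross-tabulating a categorical feature against the Attrition column.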
4. Exploratory Data Analysis
Hereʼs how we conduct EDA:
1. Select a predictor variable
2. Visualize it
3. Test if the variable has a statistically significant 'relation'
with Attrition
4. Draw conclusions
The following slides show the variables that we think are
most worth noticing.
4. Exploratory Data Analysis
To shorten the sentences in the next slides,
when we say that a certain variable 'is statistically
significant', it means that:
- If it is a numerical variable, we reject H0 on the
Mann-Whitney U-Test. The values in the Attrition-No and
Attrition-Yes groups differ significantly.
- If it is a categorical variable, we reject H0 on the
Chi-Square Test of Independence. The variable and Attrition
are not independent.
4. Exploratory Data Analysis
Age
- Employees who quit tend to be younger.
- This variable is statistically significant.
4. Exploratory Data Analysis
Business Travel
- The less frequently an employee travels, the less likely he/she will quit.
- This variable is statistically significant.
4. Exploratory Data Analysis
- Employees who quit tend to be less experienced (younger, fewer total working years,
fewer years at the company, less time with their current manager, and fewer years
since their last promotion).
- The five variables mentioned above are highly correlated with each other and
are statistically significant.
4. Exploratory Data Analysis
Conclusion
Our final model is a stack of two classifiers,
CatBoostClassifier and LGBMClassifier, which were
previously tuned and then trained on a dataset that has been
resampled (to reduce class imbalance).
On our test set, the estimated recall is 85% and the accuracy is
97%.
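The overall shape of such a pipeline (random oversampling of the minority class, then a stack of two classifiers) can be sketched as follows. This is not the project's actual code: to keep the sketch free of extra dependencies, scikit-learn's GradientBoostingClassifier and RandomForestClassifier stand in for the tuned CatBoostClassifier and LGBMClassifier, the logistic-regression meta-model is an assumption, the oversampling is done by hand with numpy, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 15% positives), for illustration only.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Random oversampling: duplicate minority rows (training split only)
# until both classes have the same number of rows.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y_train == 1)
majority_idx = np.flatnonzero(y_train == 0)
extra_idx = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
resampled_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_res, y_res = X_train[resampled_idx], y_train[resampled_idx]

# Stand-in base models; the project used CatBoostClassifier and LGBMClassifier.
stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # assumed meta-model
    cv=5,
)
stack.fit(X_res, y_res)

y_pred = stack.predict(X_test)
test_recall = recall_score(y_test, y_pred)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"recall = {test_recall:.2f}, accuracy = {test_accuracy:.2f}")
```

Only the training split is resampled, so the test metrics are still measured on the original class balance.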
8. Machine Learning Modelling
Warning
Our estimated performance might differ if you use a
different random seed at any stage of the process. However, it
should still be in a similar range.
The 1000 predictions submitted for test_data.csv are also
expected to have about 97% accuracy, but it could be more or less.
Final Conclusion
Conclusion
If the company wants to curb attrition, it needs to create a working environment that:
- Requires little to no travel
- Prioritizes wellbeing (work-life balance, satisfaction) and employee development
(training), and offers competitive benefits (salary)
- Has a productivity system that reduces overtime days
The company should watch out for employees who:
- Are young and single
- Have an educational background in HR or work in the HR department
- Have just been promoted or have spent a short time working at the company
as they are the most likely to leave.
Conclusion