
Machine Learning Applications

Assignment Report

Submitted to
Prof. Dinesh Kumar

Submitted by: Group - A


Aanchal 2017008
Arjun Mehra 2017010
Madhur Butke 2017030
Raja Ram Pandey 2017011
Sanjoy Podder 2017019
Table of Contents

Executive Summary
Data Pre-processing
Exploratory Data Analysis
Logistic Regression
Decision Tree
Random Forest
Sampling Techniques
Final Recommendations
Exhibit 1: Distribution of explanatory variables

Executive Summary

Since 2010, Kramerica Industries has been facing a high attrition rate, which has led to
high costs for talent management and acquisition. In addition, each time an employee
left there were indirect costs due to the impact on sales and gaps in knowledge transfer.
Bob Sacamano, head of talent management, was searching for a way to reduce attrition and
the associated costs with the help of machine learning algorithms.
Employee data for employees working in different departments was collected for two years,
2014 and 2015, covering a number of variables that could impact the attrition rate. The aim
was to predict the probability of an employee leaving the organisation.
The dataset consisted of data for more than 16,000 past and present employees of the
organisation. The data was first explored using summary statistics, and each variable's
relation with the dependent variable (attrition) was studied. After rigorous inspection,
certain variables were identified that should not be used in the model. Hypothesis tests
were then run on the remaining variables to understand their significance in predicting
attrition.
With the help of these variables a logistic regression model was built, which achieved an
overall accuracy of 91.6% on the testing data. It was then observed that, due to imbalance
in the data, this model predicted 0's with a much lower error rate than 1's, so the optimal
cut-off was determined using Youden's Index to increase the model's sensitivity. Next, a
decision tree model was built, which provided an overall accuracy of 85.1% on the testing
data and identified Ratio Difference as the primary variable for classification.
Random Forest was used among the ensemble techniques, increasing the accuracy to 86.9%
on the testing data; it ranked Ratio Difference, Change in CTC and Leaves in 2015 as the
top three important variables.
Based on these three models, the logistic model was judged to be the most accurate and
effective for deployment. Additionally, to tackle the problem of imbalance in the data,
various sampling techniques were employed, of which stratified sampling proved to be the best.

Data Pre-processing

The following measures were taken to avoid inconsistency issues in the data.

a) Removing entries for employees who left before 2014 or in 2016.
The provided dataset contains data for 2014 and 2015. For employees who left the
organization before 2014, the data was inconsistent for variables like Leaves in 2014-15,
CTC in 2014-15 and so on, so those entries were deleted. There were also 9 entries for
people who left in 2016; these were deleted as well, since their number was very small
compared to the whole dataset.

b) Creation of new variables


a. Sum of leaves
The sum of Casual Leaves, Personal Leaves and Sick Leaves in the years 2014 and 2015.
b. Ratio of leaves
The ratio of 'Sum of leaves in 2015' to 'Total working days in 2015'. A similar variable
was also added for the year 2014.
c. Difference of leaves ratio
The difference between 'Ratio of leaves 2015' and 'Ratio of leaves 2014'.
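A minimal R sketch of how these derived variables might be constructed, assuming the data frame is called emp; the underlying leave and working-day column names below are illustrative assumptions, not the exact labels in the original spreadsheet:

    # Assumed column names; the original spreadsheet may label these differently.
    emp$Sum.of.leaves.2015 <- emp$Casual.Leaves.2015 + emp$Personal.Leaves.2015 + emp$Sick.Leaves.2015
    emp$Sum.of.leaves.2014 <- emp$Casual.Leaves.2014 + emp$Personal.Leaves.2014 + emp$Sick.Leaves.2014
    emp$Ratio.of.leaves.2015 <- emp$Sum.of.leaves.2015 / emp$Total.working.days.2015
    emp$Ratio.of.leaves.2014 <- emp$Sum.of.leaves.2014 / emp$Total.working.days.2014
    emp$Ratio.Difference <- emp$Ratio.of.leaves.2015 - emp$Ratio.of.leaves.2014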

c) Removing entries for people who left in 2014


Certain derived variables like 'Difference of leaves ratio' and 'Difference in CTC' require
2015 data, which is not available for people who left in 2014, so those entries were
deleted.

d) Infinity values in variables


A few variables like F_to_M, Ratio of leaves, and Leaves over last year were defined as the
ratio of two variables. There were instances where the denominator was zero, resulting in
an infinite value of the ratio. Such values were imputed with appropriate measures.
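One hypothetical way to perform this imputation in R; the report does not state the exact imputation rule, so the median fill below is an assumption, as are the column names:

    ratio.cols <- c("F_to_M", "Ratio.of.leaves.2015", "leaves_over_last_year")  # assumed names
    for (col in ratio.cols) {
      x <- emp[[col]]
      x[is.infinite(x)] <- NA                 # flag division-by-zero results
      x[is.na(x)] <- median(x, na.rm = TRUE)  # impute with the column median (assumed rule)
      emp[[col]] <- x
    }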

e) Encoding of categorical variables


One-hot encoding was applied to the categorical variables.
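A sketch of one-hot encoding with base R's model.matrix, using Department as an illustrative categorical column (assuming it is stored as a factor):

    # One dummy column per level of Department; dropping the intercept keeps all levels.
    dept.dummies <- model.matrix(~ Department - 1, data = emp)
    emp <- cbind(emp, dept.dummies)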

Exploratory Data Analysis
1. Identify variables that should not be used for building the ML model among the data provided in
the spreadsheet. Justify your answer (2 points)

The following variables should not be used, for the reasons mentioned below:

Variable Name                 Reason


X.U.FEFF.                     Just an identifier; it has no meaningful implication.
Employee Code                 Just an identifier; it has no meaningful implication.
Zodiac Sign                   Cannot be used for prediction, as it can erroneously introduce
                              bias against certain zodiac signs.
Days worked in last quarter   People who left the company even earlier have this parameter
                              equal to zero. It gives insight only for people who left the
                              company in the last quarter of 2015, and even then without
                              statistical significance, so it should be excluded.
High Education Degree         The presence of more than 170 categories in this variable
                              would make the model highly complex.
Leaves over last year         Can take an 'infinite' value if no leaves were taken in the
                              previous year, so it cannot be used in its present form; the
                              formulation would have to be modified to use this information.
Last Working Date             Cannot be used for prediction, as it becomes available only
                              after an employee has left the company.
Working Status                Same as the dependent variable.
Table 1.1

Apart from the above-mentioned variables, we also need to statistically test all remaining
variables for their significance in building the model. For continuous variables we conducted
ANOVA, while for categorical variables we performed the Chi-square test of independence. Below
is a summary of the tests, along with comments on whether each variable is significant at the
5% significance level.

Variable Name                            p-value    Significant or not


Department                               2.97e-22   Significant
Employee Category                        6.77e-17   Significant
Gender                                   0.5928     Not significant
Marital Status                           1.32e-22   Significant
Paternity.maternity.leave.last.quarter   4.43e-05   Significant
Joining bonus                            0.0025     Significant
Degree type                              1.74e-09   Significant
Highest degree                           0.0006     Significant
Working in native place                  0.7858     Not significant
Employment type                          9.66e-55   Significant
Confirmation status                      8.77e-55   Significant
Working status                           0.0        Significant
Table 1.2: Chi-square test of independence for categorical variables

For continuous variables:

Variable Name               p-value     Significant or not


Employee.Current.Age        < 2.2e-16   Significant
Employee.Age.During.DOJ     0.0013      Significant
Prior.Work.Exp              0.1229      Not significant
Avg.Weighted.Performance    0.0001      Significant
Ratio.of.leaves.2015        1.0         Not significant
Ratio.of.leaves.2014        0.4621      Not significant
Ratio.Difference            0.1742      Not significant
joining.bonus               0.0373      Significant
Working.in.Native.Place     0.8904      Not significant
RM.Age                      0.0002      Significant
RM.Reportees.Count          1.0000      Not significant
RM.Reportee.Male            0.5311      Not significant
changeCTC                   0.0050      Significant
emp_manage                  0.3123      Not significant
leaves_prev_year            0.6804      Not significant
leaves_current_year         < 2.2e-16   Significant
Table 1.3: ANOVA for continuous variables
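A sketch of how these two tests can be run in R, assuming the data frame emp contains a 0/1 Attrition column; Department and Employee.Current.Age stand in as representative predictors:

    # Chi-square test of independence for a categorical predictor
    chisq.test(table(emp$Department, emp$Attrition))

    # One-way ANOVA for a continuous predictor across attrition classes
    summary(aov(Employee.Current.Age ~ factor(Attrition), data = emp))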

Logistic Regression

2. Build a Logistic Regression Model on the given data. Perform diagnostic tests and interpret
the results. Comment on the accuracy of the model on test data. (2 Points).

Ans. We generated the first logistic regression model with almost all the variables, except
those deemed unnecessary during exploratory data analysis. After that, based on the p-values
of Wald's test performed in Rattle (we considered a significance cut-off of 0.05), we selected
the following variables, with which we built the final model:
a. EmployeeCurrentAge
b. EmployeeAgeDuringDOJ
c. PriorWorkExp
d. Leavesin2015
e. Ratioofleaves2015
f. AutoParts_Department
g. None_DegreeType

All of these variables were significant under Wald's test. The results of these tests are given
in the Rattle models attached with the report.
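A sketch of the final fit with R's glm; the variable names follow the list above, and the exact Rattle configuration may differ:

    fit <- glm(Attrition ~ EmployeeCurrentAge + EmployeeAgeDuringDOJ + PriorWorkExp +
                 Leavesin2015 + Ratioofleaves2015 + AutoParts_Department + None_DegreeType,
               family = binomial, data = train)
    summary(fit)   # coefficient estimates with Wald z-tests and p-values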

Fig.: Final logistic regression model

According to the coefficient estimates, the chance of attrition increases with variables
carrying positive coefficients, such as RatioOfLeaves2015 and EmployeeCurrentAge. The most
impactful of these is RatioOfLeaves2015, defined as the total number of leaves taken in 2015
divided by the number of working days of an employee in 2015. The chance of attrition
decreases with increases in EmployeeAgeDuringDOJ, PriorWorkExp, AutoParts and
EducationLevelNone. So people who joined at a higher age or with more prior work experience
tend not to leave the organisation, and the same goes for people working in the Auto Parts
department and for employees who are neither undergraduates, graduates nor postgraduates.

The error matrices for the above model on the training and testing data are shown below
(cut-off probability = 0.5):

Training (counts):

           Predicted
Actual       0      1   Error (%)
0         3727      2        0.1
1          414   1006       29.2

Training (% of observations):

           Predicted
Actual       0      1   Error (%)
0         72.4    0.0        0.1
1          8.0   19.5       29.2

Overall error: 8.1%; averaged class error: 14.65%

So, on the training data, the sensitivity and specificity are 99.9% and 70.8% respectively,
the precision is 99.8%, and the F-score is 99.84%.

Testing (counts):

           Predicted
Actual       0      1   Error (%)
0         1636      1        0.1
1          185    386       32.4

Testing (% of observations):

           Predicted
Actual       0      1   Error (%)
0         74.1    0.0        0.1
1          8.4   17.5       32.4

Overall error: 8.4%; averaged class error: 16.25%

So, on the testing data, the sensitivity and specificity are 99.9% and 67.6% respectively,
the precision is 99.74%, and the F-score is 99.82%.
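These error matrices and metrics can be reproduced with a short R sketch, assuming the fitted model fit from above and a testing frame test with a 0/1 Attrition column:

    p    <- predict(fit, newdata = test, type = "response")   # predicted probabilities
    pred <- ifelse(p > 0.5, 1, 0)                              # classify at cut-off 0.5
    cm   <- table(Actual = test$Attrition, Predicted = pred)   # error matrix
    overall.error <- 1 - sum(diag(cm)) / sum(cm)
    class.errors  <- 1 - diag(cm) / rowSums(cm)                # per-class error rates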

The ROC Curve and the lift charts for the training and testing data are shown below:

ROC:
Training:

Testing:

For the testing data, the Area Under the ROC Curve (AUC) is 89%, signifying that for a
randomly selected pair of positive and negative observations, the probability of ranking
them correctly is 0.89.

Lift:
Training:

Testing:

Calculation of Youden’s index for optimal cut-off probability:

Cut-off   Sensitivity   Specificity   YI


0.05      0.283         0.956         0.194
0.1       0.573         0.883         0.456
0.15      0.782         0.82          0.601
0.2       0.889         0.781         0.6705
0.25      0.952         0.767         0.719
0.3       0.979         0.732         0.711
0.35      0.993         0.721         0.7148
0.4       0.998         0.7           0.6987
0.45      0.999         0.688         0.6876
0.55      0.999         0.669         0.668
0.6       0.999         0.665         0.665
0.65      0.999         0.6602        0.6596
0.7       0.999         0.651         0.651
0.75      0.999         0.644         0.6438
0.8       0.999         0.634         0.6333
0.85      0.999         0.6296        0.6236
0.9       0.999         0.6147        0.6141
0.95      0.999         0.588         0.5878

[Figure: Youden's Index vs. cut-off probability]

We computed the sensitivity and specificity at each cut-off probability from the observed
classes and predicted probabilities, then calculated Youden's Index (sensitivity +
specificity - 1) for each. The optimal cut-off probability is 0.25, where the index reaches
its highest value of 0.719.
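A sketch of this cut-off sweep in R, reusing the predicted probabilities p from the earlier sketch:

    cutoffs <- seq(0.05, 0.95, by = 0.05)
    yi <- sapply(cutoffs, function(c) {
      pred <- ifelse(p > c, 1, 0)
      sens <- mean(pred[test$Attrition == 1] == 1)   # true positive rate
      spec <- mean(pred[test$Attrition == 0] == 0)   # true negative rate
      sens + spec - 1                                # Youden's Index
    })
    cutoffs[which.max(yi)]   # 0.25 in our run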

The overall accuracy of the model at this cut-off probability is (TP+TN)/(TP+TN+FP+FN) =
(438+1559)/(438+1559+78+133) = 90.44%

Decision Tree
Question 3: Construct a simple decision tree (classification tree) classifier. Use the decision
tree to generate at least two features. Check whether the newly derived features have a
statistically significant relationship with the outcome variable (2 Points)?

We built the decision tree using the variables found significant in Question 1. The resulting
tree uses the variables 'Ratio.Difference', 'changeCTC', 'Confirmed', 'Leaves.in.2015' and
'leaves_prev_year'.
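A sketch of the fit with rpart, the engine Rattle uses for decision trees; tuning settings such as the complexity parameter are left at defaults and may differ from the original run:

    library(rpart)
    tree <- rpart(Attrition ~ Ratio.Difference + changeCTC + Confirmed +
                    Leaves.in.2015 + leaves_prev_year,
                  data = train, method = "class")
    rattle::asRules(tree)   # prints business rules like those listed below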

Business Rules derived from decision tree

Rule number 1: 23 [Attrition=1 cover=75 (1%) prob=1.00]

0.065 <= Ratio.Difference < 0.105, changeCTC < 0.03, Leaves.in.2015 < 23.5

Rule number 2: 87 [Attrition=1 cover=51 (1%) prob=1.00]

0.025 <= Ratio.Difference < 0.065, changeCTC < 0.03, Confirmed >= 0.5, Leaves.in.2015 < 9.5

Rule number 3: 3 [Attrition=1 cover=444 (9%) prob=0.96]

Ratio.Difference >= 0.105

Rule number 4: 45 [Attrition=1 cover=90 (2%) prob=0.77]

0.065 <= Ratio.Difference < 0.105, changeCTC < 0.03, Leaves.in.2015 >= 23.5, leaves_prev_year >= 28.5

Rule number 5: 86 [Attrition=0 cover=365 (7%) prob=0.37]

0.025 <= Ratio.Difference < 0.105, changeCTC < 0.03, Confirmed >= 0.5, Leaves.in.2015 >= 9.5

Rule number 6: 44 [Attrition=0 cover=251 (5%) prob=0.31]

0.065 <= Ratio.Difference < 0.105, changeCTC < 0.03, Leaves.in.2015 >= 23.5, leaves_prev_year < 28.5

Rule number 7: 42 [Attrition=0 cover=620 (12%) prob=0.23]

0.025 <= Ratio.Difference < 0.105, changeCTC < 0.03, Confirmed >= 0.5

Rule number 8: 20 [Attrition=0 cover=825 (16%) prob=0.17]

0.065 <= Ratio.Difference < 0.105, changeCTC < 0.03, Confirmed < 0.5

Rule number 9: 4 [Attrition=0 cover=2428 (47%) prob=0.12]

Ratio.Difference < 0.105, changeCTC >= 0.03

In the decision tree model, Ratio.Difference and changeCTC are found to be the most significant
features. The significance of these derived features was checked using ANOVA: both have
p-values < 0.05 and therefore have a statistically significant relationship with the outcome
variable.

The ROC curve and lift charts for the training and testing data are shown below:

ROC:
Training:
AUC = 0.7873

Testing:
AUC = 0.8086

Lift:
Training:

Testing:

Error matrix for the above model:

Training (counts):

           Predicted
Actual       0      1   Error (%)
0         3692     37        1.0
1          797    623       56.1

Training (% of observations):

           Predicted
Actual       0      1   Error (%)
0         71.7    0.7        1.0
1         15.5   12.1       56.1

Overall error: 16.2%; averaged class error: 28.55%

Testing (counts):

           Predicted
Actual       0      1   Error (%)
0         1620     17        1.0
1          312    259       54.6

Testing (% of observations):

           Predicted
Actual       0      1   Error (%)
0         73.4    0.8        1.0
1         14.1   11.7       54.6

Overall error: 14.9%; averaged class error: 27.8%

              Training   Testing
Sensitivity   43.87%     45.35%
Specificity   99%        98.96%
Precision     94.39%     93.84%
F-Score       59.9       61.15

Random Forest
Question 4: Develop models based on ensemble methods. What insights can you get based on
ensemble methods? Did the ensemble method improve accuracy? Rank variables based on their
importance.

Ans. We generated the Random Forest model with almost all the variables, except those deemed
unnecessary from the exploratory data analysis. Based on the significance tests performed
earlier, we selected the following variables, with which we built the final model:

1. Ratio.of.leaves.2014
2. Gender
3. RM.Reportee.Female
4. Ratio.of.leaves.2015
5. RM.Age
6. joining.bonus
7. Avg.Weighted.Performance
8. age_diff
9. Department
10. emp_manage
11. changeCTC
12. leaves_current_year
13. RM.Reportee.Male
14. Leaves.in.2015
15. leaves_prev_year
16. RM.Reportees.Count
17. Prior.Work.Exp
18. Employee.Age.During.DOJ
19. f_to_m
20. Marital.Status
21. Working.in.Native.Place
22. Start.Date.2015

All these variables passed the significance test; the results are explained in Question 1.
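A sketch of the fit with the randomForest package; the seed and the package's default of 500 trees are assumptions, and Rattle's settings may differ:

    library(randomForest)
    set.seed(42)                                   # for reproducibility (assumed seed)
    vars <- c("Ratio.of.leaves.2014", "Gender", "RM.Reportee.Female",
              "Ratio.of.leaves.2015", "RM.Age", "joining.bonus")   # ...plus the rest of the 22 listed above
    rf <- randomForest(x = train[, vars], y = factor(train$Attrition),
                       importance = TRUE)          # store both importance measures
    rf                                             # prints the out-of-bag error matrix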

The error matrices for the above model for training and testing data are shown below:

Training:

           Predicted
Actual             0             1   Error (%)
0        3721 (72.4)       0 (0.0)           0
1           0 (0.0)    1418 (27.6)           0

Overall error: 0%; averaged class error: 0%

So, on the training data, the sensitivity, specificity, precision and F-score are all 100%.

Testing:

           Predicted
Actual             0            1   Error (%)
0         789 (71.7)     17 (1.5)         2.1
1         127 (11.5)   167 (15.2)        43.2

Overall error: 13.1%; averaged class error: 22.65%

So, on the testing data, the sensitivity and specificity are 56.8% and 97.9% respectively,
the precision is 90.8%, and the F-score is 69.9%.

Overall, we see a slight decrease in accuracy for Random Forest compared to logistic
regression; however, it is higher than that of the decision tree.

The ROC curve and lift charts for the training and testing data are shown below:

ROC:

Training:

Testing:

For the training data the Area Under the ROC Curve (AUC) is 100%, and for the testing data
the AUC is 87%, signifying that for a randomly selected pair of positive and negative
observations from the testing data, the model assigns the higher predicted probability to
the positive observation 87% of the time.
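A sketch of the AUC computation with the pROC package, assuming the rf model and testing frame test from above:

    library(pROC)
    prob <- predict(rf, newdata = test, type = "prob")[, "1"]   # P(Attrition = 1)
    auc(roc(test$Attrition, prob))                              # ~0.87 in our run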

Lift:

Training:

Testing:

Variable Importance

From the model we obtain the Mean Decrease in Accuracy and the Mean Decrease in Gini, which
are used to rank the variables by their importance.
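A sketch of extracting and ranking the two measures; the column names follow the randomForest package's importance() output:

    imp <- importance(rf)   # includes MeanDecreaseAccuracy and MeanDecreaseGini columns
    imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]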

Rank   Variable                    Mean Decrease Accuracy   Mean Decrease Gini

1      Ratio.of.leaves.2015        34.76                    328.78
2      changeCTC                   15.83                    77.15
3      leaves_prev_year            15.22                    60.55
4      leaves_current_year         12.91                    84.07
5      Avg.Weighted.Performance    12.12                    59.09
6      Employee.Current.Age        10.86                    67.93
7      Ratio.of.leaves.2014        10.46                    39.57
8      Prior.Work.Exp              9.64                     61.75
9      Employee.Age.During.DOJ     7.56                     57.22
10     RM.Reportee.Male            6.31                     40.93
11     RM.Age                      5.72                     61.53
12     Department                  5.71                     24.13
13     RM.Reportees.Count          5.47                     44
14     Marital.Status              4.68                     14.7
15     f_to_m                      4.68                     37.51
16     age_diff                    4.42                     61.75
17     emp_manage                  3.32                     2.41
18     RM.Reportee.Female          2.87                     27.41
19     Gender                      2.48                     5
20     joining.bonus               1.01                     0.11
21     Start.Date.2015             0                        0
22     Working.in.Native.Place     -0.81                    7.87

Sampling Techniques

Question 5: Based on previous questions, what kind of modelling problems would Julia expect
when the classes are not represented adequately (imbalanced data)? Suggest ways to handle
these problems (2 points).
Problem with imbalanced data: Since most machine learning algorithms are designed to improve
accuracy by reducing error, a model trained on imbalanced data will typically do so by
developing a bias towards the majority class. Accuracy as a measure of model performance
therefore becomes misleading, and other measures such as precision, recall and F-score have
to be examined. Such an imbalance also forces a stronger trade-off between precision and
recall.

One way to address the imbalance problem is to use the F-score or specificity as the model
selection criterion. However, industry practice suggests adopting techniques such as
undersampling, oversampling, SMOTE and stratified sampling to remove the imbalance from the
data, as sketched below.
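A minimal R sketch of three of these resampling approaches; Rattle and caret offer equivalent built-ins, and SMOTE needs a dedicated package so it is omitted here:

    ones  <- train[train$Attrition == 1, ]
    zeros <- train[train$Attrition == 0, ]

    # Random oversampling: resample the minority class up to the majority size
    over  <- rbind(zeros, ones[sample(nrow(ones), nrow(zeros), replace = TRUE), ])

    # Random undersampling: shrink the majority class down to the minority size
    under <- rbind(ones, zeros[sample(nrow(zeros), nrow(ones)), ])

    # Stratified sampling: a split that preserves the class proportions
    idx   <- caret::createDataPartition(factor(train$Attrition), p = 0.7, list = FALSE)
    strat <- train[idx, ]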

Question 6: Use various sampling techniques that are best suited for the data based on model
accuracy (4 Points).

Sampling Technique       Accuracy (LR)   Accuracy (Decision Tree)   Accuracy (Random Forest)


Simple Random Sampling   83.75%          85.1%                      85.6%
Stratified Sampling      93%             84.5%                      83.3%
Random Oversampling      84.85%          71.5%                      76.9%
Random Undersampling     83%             71.5%                      75.2%
SMOTE                    85.6%           77.6%                      74.5%

From the previous analysis, we concluded that logistic regression gave the best F-score on
the test data among all the techniques mentioned, so we shall use that model for predictions.
As the table above shows, stratified sampling works best for logistic regression, giving a
93% overall model accuracy. Therefore, the company should use this technique when predicting
attrition.

Final Recommendations
Question 7: Based on the different model results, what would be your final recommendation
to Kramerica Industries? (4 points)

The coefficients obtained from the logistic regression model for the significant variables
are given below:

Employee Current Age        8.9782
Employee Age During DOJ    -8.6259
Prior Work Exp             -0.4014
Leaves in 2015             -0.3816
Ratio of leaves 2015      136.4236
Auto Parts                 -0.7359

From the table above, we can conclude that the attrition problem is more significant in the
other departments of the company than in Auto Parts. Older employees and those with a long
tenure at Kramerica have a greater tendency to leave the organisation, although the negative
coefficient on prior work experience indicates that employees who brought more experience
from outside are somewhat less likely to leave. Interestingly, from the decision tree, we can
see that people experiencing a change in CTC of more than 3% have a lower chance of leaving
the company.

From these insights, it may be concluded that the company is not providing enough salary
hikes and growth opportunities for employees in senior positions, and is therefore
experiencing attrition there. We suggest that the company revise its workforce strategy for
senior employees to control the problem.

Appendix

Exhibit 1: Distribution of explanatory variables.

We plotted histograms for continuous variables and box plots for categorical variables across
classes to visually assess whether each variable is significant.

