
Employee Attrition Prediction
Introduction
For this exercise, we study the factors that lead to employee attrition. The dataset is fictional and
includes features such as Age, Employee Role, Daily Rate, Job Satisfaction, Years At Company,
and Years In Current Role. Attrition is a problem that impacts all businesses, irrespective of
geography, industry, or company size. It imposes significant costs on a business, including the
cost of business disruption and of hiring and training new staff. As such, there is great business
interest in understanding the drivers of staff attrition and in minimizing it.
In this context, classification models that predict whether an employee is likely to quit could
greatly increase HR's ability to intervene on time and remedy the situation before attrition occurs.
While such a model can be run routinely to identify the employees most likely to quit, the key
driver of success is the human element: reaching out to the employee, understanding their
current situation, and acting on the controllable factors that can prevent their departure.
Abstract
Employee attrition prediction has become a major problem for organizations,
especially when trained, technical, and key employees leave for better
opportunities elsewhere. This results in the financial cost of replacing a trained
employee. We therefore use current and past employee data to analyze the
common reasons for employee attrition. To help prevent it, we apply well-known
classification methods, namely decision trees, logistic regression, SVM, KNN,
random forests, and naive Bayes, to the human resource data. We also apply
feature selection to the data and analyze the results. This can help companies
predict employee attrition and support their economic growth by reducing their
human resource costs.
Literature Survey
Employee attrition refers to the gradual loss of employees over time. Most literature
on employee attrition categorizes it as either voluntary or involuntary. Involuntary
attrition is initiated by the organization, which dismisses the employee for various
reasons. Voluntary attrition occurs when the employee leaves the organization of
their own will. This paper focuses on voluntary attrition. A meta-analytic review of
voluntary attrition found that its strongest predictors include age, pay, and job
satisfaction [1]. Other studies showed that several further features, such as working
conditions and growth potential, also contribute to voluntary attrition. Organizations
try to prevent employee attrition by using machine learning algorithms to predict
the risk of an employee leaving, and then taking proactive steps to prevent it.
The Process for Classification
1. Create an estimation sample and two validation samples by splitting the data into three
groups.
2. Set up the dependent variable, employee attrition (as a categorical 0/1 variable).
3. Estimate the classification model using the estimation data, and interpret the results.
4. Assess the accuracy of classification in the first validation sample, possibly repeating
steps 2-4 a few times, changing the classifier in different ways to increase performance.
5. Finally, assess the accuracy of classification in the second validation sample. You should
eventually use and report all relevant performance measures and plots on this second
validation sample only.
Step 1: Split the data
We split the data into an estimation sample and two validation samples using a
randomized splitting technique. The second validation sample mimics out-of-sample
data, and performance on it is a better approximation of the performance one should
expect in practice from the selected classification method. The split used here is
80% estimation, 10% validation, and 10% test data. The exact proportions can depend
on the number of observations: when there is a lot of data, you may keep only a few
hundred observations for the validation and test sets and use the rest for estimation.
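A minimal sketch of this split, assuming the full dataset has been loaded into a pandas DataFrame named hr_data (the file name and random seed below are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

# hr_data is assumed to be the full HR dataset, loaded elsewhere,
# e.g. hr_data = pd.read_csv('hr_attrition.csv')

# First carve out the 80% estimation sample...
estimation, holdout = train_test_split(hr_data, test_size=0.2, random_state=42)

# ...then split the remaining 20% evenly into validation and test (10% each)
validation, test = train_test_split(holdout, test_size=0.5, random_state=42)

print(len(estimation), len(validation), len(test))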
Step 2: Set up the dependent variable
The original data file did not encode attrition as a categorical variable, so we changed the
column “Attrition” to 0 and 1 values.
In our estimation sample, the number of 0/1's is as follows:
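The recoding and the class counts can be produced with something like the following sketch (reusing the estimation/validation/test frames from step 1):

# Recode the target: 'Yes' -> 1 (left the company), 'No' -> 0 (stayed)
recode = {'Yes': 1, 'No': 0}
estimation = estimation.assign(Attrition=estimation['Attrition'].map(recode))
validation = validation.assign(Attrition=validation['Attrition'].map(recode))
test = test.assign(Attrition=test['Attrition'].map(recode))

# Class balance (the number of 0/1's) in the estimation sample
print(estimation['Attrition'].value_counts())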
Step 3: Simple Analysis
We produce a simple table to visualize the data for the employees who attrited.
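One way to build such a table, assuming the recoded frames from the previous steps, is a quick group-by summary comparing leavers with stayers (the column selection is illustrative):

# Average feature values for leavers (Attrition == 1) vs. stayers (Attrition == 0)
summary = estimation.groupby('Attrition')[['Age', 'DailyRate',
                                           'JobSatisfaction',
                                           'YearsAtCompany']].mean()
print(summary)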
Step 4: Classification and Interpretation
We use a number of classification methods to develop a model that discriminates
between the different classes. Here we consider logistic regression and
classification and regression trees (CART).
Logistic Regression: Logistic Regression is a method similar to linear regression
except that the dependent variable is discrete (e.g., 0 or 1). Linear logistic
regression estimates the coefficients of a linear model using the selected
independent variables while optimizing a classification criterion. For example,
these are the logistic regression parameters estimated for our data:
Given a set of independent variables, the output of the estimated logistic
regression (the sum of the products of the independent variables with the
corresponding regression coefficients) can be used to assess the probability that
an observation belongs to one of the classes. Specifically, the regression output
can be transformed into a probability of belonging to, say, class 1 for each
observation. The estimated probability that a validation observation belongs to
class 1 (e.g., the estimated probability that the employee quits) for the first
few validation observations, using the logistic regression above, is:
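A sketch of how these probabilities can be obtained with scikit-learn (the feature list is illustrative; the full model would use all selected predictors):

from sklearn.linear_model import LogisticRegression

# Illustrative numeric predictors
features = ['Age', 'DailyRate', 'DistanceFromHome', 'JobSatisfaction']

# Fit on the estimation sample only
logreg = LogisticRegression(max_iter=1000)
logreg.fit(estimation[features], estimation['Attrition'])

# The linear score (sum of coefficients times variables, plus intercept) is
# mapped to a probability by the logistic function 1 / (1 + exp(-score));
# predict_proba performs this transformation internally
probs = logreg.predict_proba(validation[features])[:, 1]
print(probs[:5])  # estimated P(Attrition = 1) for the first few validation rows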
Step 5: Validation accuracy
Using the predicted class probabilities of the validation data, as outlined
above, we can generate some measures of classification performance.
1. Hit ratio
This is the percentage of the observations that have been correctly
classified (i.e., the predicted class and the actual class are the same). For a
probability threshold of 45%, the hit ratio is as follows:
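Computed from the validation probabilities above, the hit ratio at the 45% threshold might be obtained like this (a sketch reusing the probs array from step 4):

# Classify as 'will leave' when the predicted probability exceeds 45%
predicted = (probs > 0.45).astype(int)

# Hit ratio: share of validation observations classified correctly
hit_ratio = (predicted == validation['Attrition']).mean()
print(f'Hit ratio: {hit_ratio:.1%}')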
2. Confusion matrix
The confusion matrix shows for each class the number (or percentage) of the data that
are correctly classified for that class. For example, for the method above with the highest
hit rate in the validation data (among logistic regression and the 2 CART models), and
for probability threshold 45%, the confusion matrix for the validation data is:
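A sketch of the corresponding computation, reusing the predicted labels from the hit-ratio step:

from sklearn.metrics import confusion_matrix

# Rows are actual classes (0 = stayed, 1 = left), columns are predicted classes
print(confusion_matrix(validation['Attrition'], predicted))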
Step 6. Test Accuracy

Having iterated steps 2-5 until we are satisfied with the performance of our selected
model on the validation data, in this step the performance analysis outlined in step 5
is repeated on the test sample.
Let's see how the hit ratio, confusion matrix, ROC curve, gains chart, and profit
curve look for our test data. For the hit ratio and the confusion matrix we use
45% as the probability threshold for classification.
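A sketch of the test-set evaluation for the AUC and the ROC curve, assuming the fitted logreg model and feature list from the step 4 sketch (the gains chart and profit curve would be built from the same test_probs):

from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Score the untouched test sample with the final model
test_probs = logreg.predict_proba(test[features])[:, 1]
print('Test AUC:', roc_auc_score(test['Attrition'], test_probs))

# ROC curve: true-positive rate vs. false-positive rate across all thresholds
fpr, tpr, _ = roc_curve(test['Attrition'], test_probs)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()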
Step 7. Data Analysis
After running the model multiple times and iterating to find the best configuration, we reached some conclusions:
1. The model is biased towards predicting non-attrition.
2. There is a tension between the probability threshold and the number of employees who are accurately predicted as
potential leavers: a high probability threshold results in many missed leavers. Since the business goal is to
predict attrition well, rather than non-attrition, a lower probability threshold is chosen (see the threshold sweep below).
3. The confusion matrix shows that, of all the people who are going to leave the company, our algorithm identifies
about 42% accurately. While not ideal, this is a huge improvement on random sampling, which would have identified
only about 16% (the actual attrition rate). On the other hand, there is a cost to wrongly flagging employees who
would not leave, resulting in inefficiencies in resource allocation.
4. Logistic regression is the best model, as it consistently yields a higher area under the curve and a better confusion matrix.
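The threshold trade-off can be made concrete with a small sweep over candidate thresholds (a sketch reusing the probs array computed on the validation sample earlier; the threshold grid is illustrative):

# Recall of true leavers at different probability thresholds
actual = validation['Attrition'].to_numpy()
for threshold in (0.25, 0.35, 0.45, 0.55):
    predicted = (probs > threshold).astype(int)
    recall = ((predicted == 1) & (actual == 1)).sum() / (actual == 1).sum()
    print(f'threshold {threshold:.2f}: recall of leavers = {recall:.1%}')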
Data Analysis
Let's have a look at the data and see how the features contribute to attrition. By
default we can only see the distribution of numerical/continuous values in a dataset.
To take a peek into categorical/object values, we have to bind them to a numeric
variable to see their relevance to the dataset, or replace the categorical variables
with dummies.
For this exercise, our aim is to predict employee attrition, and it is important to see
which variables contribute the most to it. But before that, we need to know whether
the variables are correlated; if they are, we might want to avoid them in the
model-building process.
There are many continuous variables. We could look at their distributions and create a
grid of pair plots, but that would be too much code just to see the correlations, as there
are a lot of variables. Instead, we can create a seaborn heatmap of the numeric variables
and inspect the correlations. We will pick the variables that are only weakly correlated
with each other (i.e., correlation values tending towards 0) and move forward with them,
leaving out the ones that are strongly correlated (i.e., correlation values tending towards 1).
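A minimal version of that heatmap, assuming the numeric columns live in a DataFrame named hr_data_num as in the snippet below:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation heatmap of the numeric columns; pairs with values near 1
# are strongly correlated and are candidates for removal
plt.figure(figsize=(12, 10))
sns.heatmap(hr_data_num.corr(), annot=True, fmt='.1f', cmap='coolwarm')
plt.show()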
#Let's remove the strongly correlated variables and keep the rest
#(hr_data_num holds the numeric columns of the HR dataset)
hr_data_uc = hr_data_num[['Age', 'DailyRate', 'DistanceFromHome',
                          'EnvironmentSatisfaction', 'HourlyRate',
                          'JobInvolvement', 'JobLevel',
                          'JobSatisfaction',
                          'RelationshipSatisfaction',
                          'StockOptionLevel',
                          'TrainingTimesLastYear']].copy()
hr_data_uc.head()
Let's now handle the categorical columns, first copying them out and then replacing Yes and No in Attrition with 1 and 0.

#Copy categorical data (including the target Attrition)
hr_data_cat = hr_data[['Attrition', 'BusinessTravel', 'Department',
                       'EducationField', 'Gender', 'JobRole',
                       'MaritalStatus',
                       'Over18', 'OverTime']].copy()
hr_data_cat.head()
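One plausible way to carry out that conversion (the Attrition replace plus dummy encoding for the remaining columns) is the following sketch:

# Replace Yes/No in Attrition with 1/0
hr_data_cat['Attrition'] = hr_data_cat['Attrition'].map({'Yes': 1, 'No': 0})

# One-hot encode the remaining categorical columns with dummies
hr_data_cat = pd.get_dummies(hr_data_cat,
                             columns=['BusinessTravel', 'Department',
                                      'EducationField', 'Gender', 'JobRole',
                                      'MaritalStatus', 'Over18', 'OverTime'])
hr_data_cat.head()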
Now that we have all the data in numerical format, we can combine hr_data_num and hr_data_cat.

hr_data_final = pd.concat([hr_data_num, hr_data_cat], axis=1)
hr_data_final.head()
Conclusion
Throughout this post, we saw that data matters (as it does in most places). We saw how
to avoid using correlated variables and why it is important not to use them when
modelling. We used a random forest and learned how advantageous it can be. Most of
all, we found the factors that matter most to employees and that, if not addressed,
might lead to attrition.
Bibliography
[1] Cotton, J.L. and Tuttle, J.M., 1986. "Employee turnover: A meta-analysis
and review with implications for research". Academy of Management Review,
pp. 55-70.
[2] Liu, D., Mitchell, T.R., Lee, T.W., Holtom, B.C. and Hinkin, T.R., 2012.
"When employees are out of step with coworkers: How job satisfaction
trajectory and dispersion influence individual- and unit-level voluntary
turnover". Academy of Management Journal, pp. 1360-1380.
[3] Heckert, T.M. and Farabee, A.M., 2006. "Turnover intentions of the
faculty at a teaching-focused university". Psychological Reports, pp. 39-45.
[4] Rish, I., "An empirical study of the naive Bayes classifier". IJCAI
Workshop on Empirical Methods in AI.
[5] Freedman, D.A., Statistical Models: Theory and Practice. Cambridge
University Press, p. 128.
[6] Rosenblatt, F., Principles of Neurodynamics: Perceptrons and the Theory
of Brain Mechanisms.
[7] Fawcett, T., 2006. "An introduction to ROC analysis".
