
DEPARTMENT OF BUSINESS ECONOMICS

UNIVERSITY OF DELHI

Using Logistic Regression to Predict the Attrition of Various Employees for a Company

SUBMITTED TO: Prof Sunil Kumar

BY: Bandeep Bharaj
Bhanu Gupta
Devendra Pratap
Dheeraj Tiwari
Parmeet Singh
Surabhi Gupta

Table of Contents

Introduction
Data Description
Objective
Methodology
    Logistic Regression Intuition
    Performance Measures
        Confusion Matrix
        Overall accuracy
        Sensitivity
        Specificity
        Precision and Recall
        ROC Curve
Results
Final Conclusion and Future Implications

Introduction
Companies in India, as in other countries, face the formidable challenge of recruiting and
retaining talent while also managing talent loss through attrition, whether due to industry
downturns or voluntary individual turnover. Losing employees results in performance losses
that can have long-term negative effects on a company, especially if the departing talent
leaves gaps in its execution capability and human resource functioning. These losses include
not only lost productivity but possibly also a loss of work-team harmony and social goodwill.

With attrition rates being a bane of every industry, companies are devising innovative business
models for effective retention of talent. There are a lot of factors responsible for attrition and
employers are getting increasingly conscious of the factors that can keep an employee
committed.

Attrition may be defined as a gradual reduction in membership or personnel, as through
retirement, resignation or death. In other words, attrition can be defined as the number of
employees leaving the organization, which includes both voluntary and involuntary separation.
The employee gradually reduces his or her ties with the company rather than complaining about
the underlying factors causing attrition. It is symptomatic of a much deeper malaise that cuts
into the innards of organizations.

Attrition rates vary from sector to sector and from industry to industry. Apart from the
unavoidable causes like resignation, retirement, death or disability, the causes are found to be
many and varied. They vary according to the nature of the business, the level of the employees
and the nature of the responsibility shouldered by them. The most common reasons are the
‘ergonomic discomfort’ experienced by the employee and the ‘functional incompatibility’ between
the corporate management and the employees. Very often an employee finds himself among
colleagues and superiors he is unable to cope with, or he finds himself out of tune with the
employer’s functional requirements, failing to rise to the employer’s expectations. Another
important reason is that the employee’s remuneration is not sufficient to meet the demands of
his family and social life. Employee retention refers to the policies and practices companies
use to prevent valuable employees from leaving their jobs.

As it becomes necessary for HR managers to understand the factors that prompt employees to
quit an organization, firms are adopting many retention strategies to combat the attrition
problem. In this project, we have built a model that uses logistic regression to predict
whether a particular employee is likely to leave the company. Personalized and specialized
steps can then be taken to look after the needs of the employees who are more likely to leave
the organization.

Data Description
Dependent Variable:

Variable                   Description
Attrition                  Binary variable: 1 for a person who has left the company, 0 otherwise

Independent Variables:

Variable                   Description
Age                        In years
BusinessTravel             Categorical variable (employees who travel frequently, rarely, or not at all)
Department                 HR, R&D and Sales
DistanceFromHome           In km
EducationField             HR, Life Sciences, Marketing, Medical and others
EnvironmentSatisfaction    Based on survey, on a scale of 1-5
Gender                     Male or female
JobInvolvement             Based on survey, on a scale of 1-5
JobRole                    Their particular role in the company
JobSatisfaction            Based on survey, on a scale of 1-5
MaritalStatus              Divorced, married or single
MonthlyIncome              In USD
NumCompaniesWorked         Number of companies the employee has previously worked for
OverTime                   Binary variable (Yes or No)
PerformanceRating          Given by the manager, on a scale of 1-5
RelationshipSatisfaction   Based on survey, on a scale of 1-4
TotalWorkingYears          Total work experience, in years
TrainingTimesLastYear      Number of times training was given to the employee in the last year
WorkLifeBalance            Based on survey, on a scale of 1-4
YearsAtCompany             In years
YearsInCurrentRole         In years
YearsSinceLastPromotion    In years
YearsWithCurrManager       In years
PercentSalaryHike          In %

Objective
To build a predictive model that predicts whether an employee will leave the company in a
particular year, because attrition has a direct impact on the total cost incurred by a company.

The variables mentioned above have been used in order to build the predictive model.

Methodology
Logistic Regression Intuition:
Logistic regression is a specialized form of regression used to predict and explain a categorical
dependent variable. It works best when the dependent variable is a binary categorical variable.
One advantage of logistic regression is that it is not restricted by the normality assumption,
which is a basic assumption in linear regression analysis. The technique can also accommodate
non-metric variables, such as nominal or categorical variables, by coding them as dummy
variables. Another advantage of logistic regression is that it directly predicts the
probability of an event occurring. To make sure that the predicted probability is bounded
between zero and one, logistic regression defines a relationship between the dependent and
independent variables that follows an S-shaped curve. The coefficients are estimated by an
iterative process that finds their ‘most likely’ values; that is, the equation is fitted by
maximizing a ‘likelihood’ function rather than by the sum of squares approach of linear
regression. The quantity modelled as a linear function of the independent variables is the
logarithm of the odds of a specific observation belonging to a particular group or category,
and the predicted probability is recovered from it. (Srinivasan, V. & Valk, R. 2008)
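
The S-shaped relationship described above can be illustrated with a short Python sketch. The coefficients `b0` and `b1` below are hypothetical values chosen only for illustration; they are not taken from the fitted model in this report.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p: float) -> float:
    """Inverse of the sigmoid: the logit ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for a one-variable model (assumed, not from the report).
b0, b1 = -1.0, 0.5

for x in (-10, 0, 2, 10):
    z = b0 + b1 * x  # linear predictor = log-odds
    print(f"x={x:>3}  log-odds={z:+.2f}  probability={sigmoid(z):.4f}")
```

However large or small the linear predictor becomes, the predicted probability stays between zero and one, which is the property that motivates the logistic form.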

Performance Measures:

Confusion Matrix:
A confusion matrix is a table that is often used to describe the performance of a
classification model (or “classifier”) on a set of test data for which the true values
are known. It allows the visualization of the performance of an algorithm.
It allows easy identification of confusion between classes e.g. one class is
commonly mislabeled as the other. Most performance measures are computed
from the confusion matrix.
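
As a minimal sketch of how the four cells of a confusion matrix are counted, the following Python snippet tallies them from two lists of labels. The labels are made up for illustration (1 = left, 0 = stayed), and the helper name `confusion_counts` is hypothetical.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count tn, fp, fn, tp for binary labels (1 = left, 0 = stayed)."""
    pairs = Counter(zip(actual, predicted))  # keys are (actual, predicted)
    return {
        "tn": pairs[(0, 0)],  # stayed, predicted to stay
        "fp": pairs[(0, 1)],  # stayed, predicted to leave
        "fn": pairs[(1, 0)],  # left, predicted to stay
        "tp": pairs[(1, 1)],  # left, predicted to leave
    }

# Made-up labels for eight employees, for illustration only.
actual = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 0, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```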

5
Overall accuracy: the proportion of all predictions, matches and non-matches alike, that the
model gets right. This measure can be misleading when the classes are not balanced.

Accuracy = (# true positives + # true negatives) / (total # of predictions)

Sensitivity: given that a result is truly an event, the probability that the model will
predict an event.

Sensitivity = (# true positives) / (# true positives + # false negatives)

Specificity: given that a result is truly not an event, the probability that the model
will predict a negative result.

Specificity = (# true negatives) / (# true negatives + # false positives)

Precision and Recall: precision is the share of predicted events that are truly events; recall
is the share of true events that the model predicts, and is identical to sensitivity.

Precision = (# true positives) / (# true positives + # false positives)

Recall = (# true positives) / (# true positives + # false negatives)

ROC Curve:

With two classes, the Receiver Operating Characteristic (ROC) curve can be used to
estimate performance using a combination of sensitivity and specificity: it plots
sensitivity against (1 - specificity) as the classification threshold is varied. The area
under the ROC curve is a common metric of performance.
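
A minimal sketch of how the ROC curve and the area under it can be computed by hand is given below. The labels and predicted probabilities are made up for illustration, and the helper names `roc_points` and `auc` are hypothetical; a statistics library would normally be used for this instead.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step through every distinct score.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up labels and predicted probabilities, for illustration only.
actual = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
print(round(auc(roc_points(actual, scores)), 3))
```

An area of 0.5 corresponds to random guessing and an area of 1.0 to a model that ranks every event above every non-event.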

Steps to building the model
1. Choosing the variables and specifying a model: The relevant independent variables were
chosen to specify the model.
2. Data Cleaning
a. Removing redundant variables: There were some redundant variables like the
employee id which were removed from the independent variable matrix.
b. Treating the missing values: Observations with more than 60% of their values
missing were dropped. For the others, missing values were replaced with the
median for numerical variables and the mode for categorical variables.
c. Converting into dummy variables: There were a lot of categorical variables like
the department of the employee, their position level etc, which were converted
into dummy variables. The total number of independent variables increased to
55 after accounting for such dummy variables.

3. Creating the matrix of features (independent variables ‘X’) and the dependent variable
vector (‘y’): These were created so that we could run the logistic regression model

ln(p / (1 - p)) = β0 + β1·x1 + ... + βk·xk

where p is the predicted probability of the event, p/(1-p) is the odds in favour of the
event, and β0, ..., βk are the coefficients to be estimated.

4. Splitting the dataset into train set and test set: In all we had 1470 observations, out of
which we used 70% to train the model and 30% to evaluate the performance of the
model.
5. Modelling: Then we ran the logistic regression on the train set and interpreted the
results.
6. Predicting: Using the above model, we predicted the attrition for the test set.
7. Measuring Performance: Then we compared the predicted values of the test set to the
actual values of the test set to measure the performance of our model. The
performance measures that have been specified above were used for this.
8. Optimizing: Then based on the performance of the model on the test set, we have tried
to optimize our model for better predictions.
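
Steps 2b, 2c, 3 and 4 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the ten records are made up, the helper names `impute` and `to_features` are hypothetical, and dropping the first category as the reference level is an assumption (the report does not say how the dummies were coded).

```python
import random
from statistics import median, mode

# Tiny made-up records; None marks a missing value. Column names follow the report.
rows = [
    {"Age": 41, "Department": "Sales", "Attrition": 1},
    {"Age": None, "Department": "R&D", "Attrition": 0},
    {"Age": 33, "Department": None, "Attrition": 0},
    {"Age": 29, "Department": "R&D", "Attrition": 1},
    {"Age": 50, "Department": "HR", "Attrition": 0},
    {"Age": 38, "Department": "R&D", "Attrition": 0},
    {"Age": 45, "Department": "Sales", "Attrition": 0},
    {"Age": 27, "Department": "R&D", "Attrition": 1},
    {"Age": 36, "Department": "HR", "Attrition": 0},
    {"Age": 52, "Department": "R&D", "Attrition": 0},
]

def impute(rows, column, fill):
    """Replace missing values in one column with fill(observed values)."""
    value = fill([r[column] for r in rows if r[column] is not None])
    for r in rows:
        if r[column] is None:
            r[column] = value

impute(rows, "Age", median)       # step 2b: median for numerical variables
impute(rows, "Department", mode)  # step 2b: mode for categorical variables

# Step 2c: dummy variables; the first category is dropped as the reference (assumed).
categories = sorted({r["Department"] for r in rows})

def to_features(r):
    """Turn one record into a numeric feature row: Age plus department dummies."""
    return [r["Age"]] + [int(r["Department"] == c) for c in categories[1:]]

X = [to_features(r) for r in rows]  # step 3: matrix of features
y = [r["Attrition"] for r in rows]  # step 3: dependent variable vector

# Step 4: shuffle the row indices, then keep 70% for training and 30% for testing.
indices = list(range(len(rows)))
random.Random(0).shuffle(indices)
cut = len(indices) * 7 // 10
train_idx, test_idx = indices[:cut], indices[cut:]
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]
print(len(X_train), len(X_test))
```

Fixing the random seed makes the split reproducible, so the performance measures can be compared across runs.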

Results
Interpretation of the coefficients of the Logistic Regression Model

Since there were a lot of independent variables in our model, we interpret only a few of
them here.

Logistic Regression Results:

Age: As age increases by one year, the odds in favour of an employee leaving the job are
multiplied by exp(-0.0267) = 0.97, i.e., they decrease by about 3%, holding the other
variables constant.

However, this variable is not statistically significant at the 1% level of significance.

Gender: The odds in favour of a male employee leaving the job are exp(0.3207) = 1.38 times
the odds in favour for a female employee, i.e., about 38% higher.

This variable is statistically significant at the 5% level of significance.

Monthly Income: As monthly income increases by one USD, the odds in favour of an employee
leaving the job are multiplied by exp(-0.000078), which is approximately 1, i.e., they are
practically unchanged.

This variable is not statistically significant even at the 10% level of significance.

Total Working Years: As the total working years increase by one year, the odds in favour of an
employee leaving the job are multiplied by exp(-0.0569) = 0.945, i.e., they decrease by about
5.5%.

However, this variable is not statistically significant at the 1% level of significance.

Training times last year: As the number of trainings in the last year increases by one, the
odds in favour of an employee leaving the job are multiplied by exp(-0.0853) = 0.918, i.e.,
they decrease by about 8.2%.

However, this variable is not statistically significant at the 1% level of significance.

Percent Salary Hike: As the percent salary hike increases by one percentage point, the odds in
favour of an employee leaving the job are multiplied by exp(-0.0690) = 0.933, i.e., they
decrease by about 6.7%.

However, this variable is not statistically significant at the 1% level of significance.

We can interpret all the other continuous and dummy variables in a similar fashion and check
their level of significance as well.
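
The conversion from a logistic regression coefficient to a change in the odds can be checked with a few lines of Python. The coefficients are the ones quoted in this section; the helper name `odds_ratio` is hypothetical.

```python
import math

# Coefficients as quoted in the Results section.
coefficients = {
    "Age": -0.0267,
    "Gender (male)": 0.3207,
    "MonthlyIncome": -0.000078,
    "TotalWorkingYears": -0.0569,
    "TrainingTimesLastYear": -0.0853,
    "PercentSalaryHike": -0.0690,
}

def odds_ratio(beta: float) -> float:
    """Factor by which the odds are multiplied for a one-unit increase."""
    return math.exp(beta)

for name, beta in coefficients.items():
    ratio = odds_ratio(beta)
    change = (ratio - 1.0) * 100.0  # percentage change in the odds
    print(f"{name:<22} odds ratio = {ratio:.3f}  ({change:+.1f}% in the odds)")
```

An odds ratio below 1 means the odds of leaving fall as the variable increases, and a ratio above 1 means they rise.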

Performance of the model on the test set:
Confusion Matrix:
                        Predicted: Stay (0)    Predicted: Leave (1)
Actual: Stay (0)                368                     12
Actual: Leave (1)                43                     18

From the matrix, we can see that:

True Negatives (tn) = 368, these are the people for whom the model has predicted that they
won’t leave their job and the people have actually not left.

False Positives (fp) = 12, these are the people for whom the model has predicted that they will
leave their job but the people have actually not left.

False Negatives (fn) = 43, these are the people for whom the model has predicted that they
won’t leave their job but the people have actually left.

True Positives (tp) = 18, these are the people for whom the model has predicted that they will
leave their job and the people have actually left.

Accuracy:

Accuracy = (# true positives + # true negatives) / (total # of predictions)

Thus, accuracy= 87.52%

However, the dataset is imbalanced, and thus the performance of the model is not as good as
it seems. Although the logistic regression achieved an accuracy of 87.52%, it was biased
towards the majority class (stayed) and identified only 18 of the 61 employees in the minority
class (left). These results should therefore be used with caution when checking which factors
influence an employee's leave status.

So, there are other performance measures that will be considered to measure the performance
of the model.

Precision and recall:

Precision= (# true positives) / (# true positives + # false positives)

Recall = (# true positives) / (# true positives +# false negatives)

Precision = 18 / (18 + 12) = 60% and recall = 18 / (18 + 43) = 29.5%

Precision measures, out of the positively predicted values, how many are actually positive.
Thus a precision of 60% means that, out of all the employees that the model predicted would
leave the job, only 60% actually left.

Recall, on the other hand, measures, out of the actual positive values, how many are
predicted by the model. Thus a recall of 29.5% means that, out of all the employees that
actually left the job, only 29.5% were predicted by the model.

Thus, we can say that accuracy alone is not an adequate measure of a model with a binary
dependent variable, particularly when one of the two values, 0 and 1, is much rarer than the
other.
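
The figures reported in this section can be reproduced from the four confusion matrix counts. The sketch below also computes the accuracy of a trivial model that predicts "stays" for every employee; this baseline is not part of the original analysis and is added here only to show why accuracy is misleading on imbalanced data.

```python
# Counts reported in the confusion matrix for the test set.
tn, fp, fn, tp = 368, 12, 43, 18

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)       # identical to sensitivity
specificity = tn / (tn + fp)

# Accuracy of a trivial model that predicts "stays" for everyone (illustrative baseline).
baseline_accuracy = (tn + fp) / total

print(f"total predictions  = {total}")
print(f"accuracy           = {accuracy:.4f}")
print(f"precision          = {precision:.4f}")
print(f"recall             = {recall:.4f}")
print(f"specificity        = {specificity:.4f}")
print(f"baseline accuracy  = {baseline_accuracy:.4f}")
```

The baseline reaches about 86% accuracy without identifying a single leaver, so the model's accuracy of about 87.5% adds little on its own; precision and recall show what the model actually contributes.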

ROC (Receiver Operating Characteristic) Curve:

The ROC curve shows that the area under the curve is 0.79, which indicates that the model is
decent but not very good at predicting attrition. In the following part, we propose ways to
improve the performance of the model.

Ways to improve the performance of the model:

1. Since the dataset is imbalanced (far fewer people have left the job than have stayed),
we can oversample the minority class, for example using k-means, to balance the dataset.
2. Since logistic regression is not performing very well, we can use other algorithms such as
a random forest classifier or an ANN classifier to increase the performance of the model.
However, they are more difficult to interpret.
3. We can collect more data on other relevant variables that we might have missed.
4. We can try changing the threshold for the decision making of the classifier. Here, the
model predicts a probability; if the probability is less than 0.5, it classifies the
observation as 0, and if the probability is greater than 0.5, it classifies it as 1. Thus,
the threshold is 0.5 here. We can try other values for this threshold, e.g., 0.4 or 0.6,
and check the performance again.
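
The effect of changing the classification threshold (point 4) can be sketched as follows. The outcomes and predicted probabilities are made up for illustration, the helper names are hypothetical, and a probability exactly equal to the threshold is classified as 1 here, which is an assumption since the report does not cover that case.

```python
def classify(probabilities, threshold=0.5):
    """Label an observation 1 (leaves) when its probability reaches the threshold."""
    return [int(p >= threshold) for p in probabilities]

def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels, using 0.0 when undefined."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up outcomes and predicted probabilities for ten employees.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probabilities = [0.70, 0.45, 0.35, 0.55, 0.42, 0.30, 0.20, 0.15, 0.10, 0.05]

for threshold in (0.4, 0.5, 0.6):
    precision, recall = precision_recall(actual, classify(probabilities, threshold))
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold flags more employees as likely to leave, which tends to raise recall at the cost of precision; the right trade-off depends on the relative cost of a missed leaver and an unnecessary retention effort.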

Final Conclusion and Future Implications:
How to retain valuable employees is one of the biggest problems that plague companies in the
competitive marketplace. Not too long ago, companies accepted the “revolving door policy” as
part of doing business and were quick to fill a vacant job with another eager candidate.
Nowadays, businesses often find that they spend considerable time, effort, and money to train
an employee only to have them develop into a valuable commodity and leave the company for
greener pastures. In order to create a successful company, employers should consider as many
options as possible when it comes to retaining employees, while at the same time securing
their trust and loyalty so they have less of a desire to leave in the future.

Employees need to be retained because good, faithful, trained and hardworking employees are
required to run a business. Over time they acquire good product knowledge, and a trained
employee can handle customers better and also solve the problems of peers who are new to the
organization. When an employee leaves, he takes with him company information such as details
of ongoing projects. A high employee turnover rate damages the goodwill of the company, and
competitors begin trying to recruit its best talent. Efficiency of work is hampered to a large
extent. For example, if an employee leaves in the middle of an ongoing project, it is very
difficult to fill that vacuum, and a new employee cannot easily replace an experienced and
talented one; this leads to delayed completion of projects and lower work satisfaction among
the other team members.

As it becomes necessary for HR managers to understand the factors that prompt employees to
quit an organization, firms are adopting many retention strategies to combat the attrition
problem. Companies can therefore develop a model like this one to predict the probability of
a particular employee leaving the job. Personalized and specialized steps can then be taken to
look after the needs of the employees who are more likely to leave the organization. In this
way, the retention rate of the company can be improved while the needs of the employees are
taken care of at the same time.
