Employees don’t leave the company; they leave their managers
Nayana Kumari
Published in Web Mining [IS688, Spring 2…21] · 8 min read · Apr 30, 2021
A ‘K-means’ clustering approach to identify attrition
Jeffery Hamilton/Getty Images
It is never pleasant to hear ‘I quit,’ but companies can use evidence and pattern analysis to understand the reasons behind it. This article analyzes employee churn to help companies understand the pattern and predict who might leave in the future. A good workplace always looks for ways to improve employee retention and satisfaction.
Almost all companies face high attrition at some point in their lifetime. Attrition is a regular phenomenon and helps an organization bring in talent from other companies without growing the workforce, but various factors contribute to this churn (attrition). I will apply what I have learned about clustering and distance measures between data points to identify a few critical factors that drive a high attrition rate.
For this analysis, I will use a dataset that contains employee attrition along with employee profiles.
The codebase for this analysis can be found here: https://github.com/nt27web/WebMining-Clustering
Data Exploration
Let’s look at the data at hand. The data is in .csv format.
Python with the following libraries will be used for this analysis:
from IPython.display import display
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
pandas: for data extraction, cleansing, and manipulation
sklearn: for k-means and related calculations
matplotlib: to plot graphs of the various datasets and calculations
Data Attributes:
First, I loaded the CSV file into a pandas DataFrame:
data = pd.read_csv('HR-Employee-Attrition.csv')
display(data.shape)
The shape of the dataset is (1470, 35), i.e., 1470 rows and 35 columns.
Let’s check for the null and empty values:
display(data.isnull().sum())
All the columns are filled, with no null or empty cells, which is good.
After checking the unique values of some categorical attributes and the summary statistics of the numerical attributes, I found the columns below appropriate for the analysis.
Columns: Age, DailyRate, and EducationField.
display(data['Age'].describe())
display(data['DailyRate'].describe())
display(data['EducationField'].unique())
Column Stats and values:
Along with the above columns, the columns below are distributed on a similar scale, which will help identify the areas of attrition.
Columns: YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager.
display(data['YearsAtCompany'].describe())
display(data['YearsInCurrentRole'].describe())
display(data['YearsSinceLastPromotion'].describe())
display(data['YearsWithCurrManager'].describe())
Result:
Some fields are categorical and have only two or three unique values, e.g., Gender and MaritalStatus. I removed them from the dataset because they will not help identify the clusters clearly.
Instead, I kept the categorical column with the highest density of values: EducationField.
I will have to convert this field to numbers for my analysis, since k-means does not work on categorical attributes.
display(len(data))
Number of records: 1470
The final set of attributes chosen for the analysis:
‘Attrition’, ‘DailyRate’, ‘EducationField’, ‘YearsAtCompany’, ‘YearsInCurrentRole’, ‘YearsSinceLastPromotion’, ‘YearsWithCurrManager’
I created my dataset with the above-mentioned fields:
f_data = pd.DataFrame(data, columns=[
    'Attrition', 'DailyRate', 'EducationField', 'YearsAtCompany',
    'YearsInCurrentRole', 'YearsSinceLastPromotion',
    'YearsWithCurrManager'
])
Now let us check the dataset:
display(f_data.head())
I also filtered the dataset down to the rows where Attrition is ‘Yes,’ since I am interested only in the records where the employee has left the company.
m_data = f_data[f_data['Attrition'] == 'Yes']
f_data = m_data.drop(['Attrition'], axis=1)
display(len(f_data))
Total number of records after filtering: 237
I will encode the categorical field EducationField as integers, since all my columns need to be numeric for k-means.
# work on a copy so the filtered frame is not modified in place
X = f_data.copy()
y = f_data['EducationField']
le = LabelEncoder()
X['EducationField'] = le.fit_transform(X['EducationField'])
# y holds the same integer codes as the encoded column
y = le.transform(y)
Result:
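To make the encoding step concrete, here is a minimal sketch on a few illustrative category values (the sample values are assumptions for this example, not rows taken from the dataset). LabelEncoder sorts the distinct categories and assigns each one an integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative category values (assumed for this sketch, not real rows)
fields = ['Life Sciences', 'Medical', 'Marketing', 'Medical']

le = LabelEncoder()
codes = le.fit_transform(fields)

# classes_ holds the sorted distinct categories; the position of each
# category in classes_ is its integer code
print(list(le.classes_))
print(list(codes))
```

The codes are only identifiers: their order follows alphabetical sorting, not any ranking of the fields.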
I need to rescale the features so that they contribute on a comparable scale to k-means and so that the scatter plots are easy to draw.
# keep the column names; fit_transform returns a plain array
cols = X.columns
ms = MinMaxScaler()
X = ms.fit_transform(X)
X = pd.DataFrame(X, columns=cols)
Result:
As you may have noticed, the values in the “YearsXXX” columns (YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager) have changed to decimals, because every column has been rescaled to the same range, 0 to 1.
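MinMaxScaler rescales each column with x' = (x - min) / (max - min). The sketch below applies it to a small made-up column of tenures (hypothetical values, not from the dataset) and compares the result with the formula computed by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical tenure values in years, as a single column
years = np.array([[0.0], [5.0], [10.0], [40.0]])

scaled = MinMaxScaler().fit_transform(years)

# The same result from the formula (x - min) / (max - min)
manual = (years - years.min()) / (years.max() - years.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.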
Before we find the k-means clusters, we need to supply k, the number of clusters we are interested in. There are several methods for finding the optimal k. The most frequently used is the “elbow method.”
Elbow Method to find k-value

If we plot the number of clusters along the x-axis and the clustering error along the y-axis, the resulting curve shows how the error changes with the number of clusters. At first, the error declines steeply as clusters are added; the point where that steep decline levels off is called the ‘elbow.’ Beyond that point, adding clusters no longer reduces the error sharply. We take the number of clusters at that point and also test some values above and below it to find the optimal k.
Let’s take a look at the elbow plot -
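The code for the elbow plot is not shown in this article, so the following is only a sketch of how such a curve could be produced. It runs KMeans for k = 1 to 6 on synthetic two-group data (an assumption for illustration; in the analysis the input would be the scaled DataFrame X) and records inertia_, the within-cluster sum of squared distances.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two well-separated groups of 50 points each
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(50, 2)),
])

ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    # inertia_ is the sum of squared distances to the nearest centroid
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting ks against inertias, for example with plt.plot(ks, inertias, marker='o'), gives the elbow curve. For this synthetic data the sharp bend is at k = 2.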
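As you can see from this graph, the optimal number is 2. We will also test the values 3, 4, and 5 and check the accuracy of each.
Here, accuracy means the share of rows whose cluster label matches the encoded EducationField value (y). The check below is run for each number of clusters (k).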
k_means = KMeans(n_clusters=2, random_state=0)
y_k_means = k_means.fit_predict(X)
labels = k_means.labels_
# check how many of the samples were correctly labeled
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled."
      % (correct_labels, y.size))
print('Accuracy score: {0:0.2f}%'.format(
    (correct_labels * 100) / float(y.size)))
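Since the same check is repeated for several values of k, it can be wrapped in a loop. The sketch below is one way to do that; the helper name label_agreement and the small random inputs are hypothetical stand-ins for the scaled X and encoded y above. Cluster IDs are arbitrary integers, so this score depends on how k-means happens to number the clusters.

```python
import numpy as np
from sklearn.cluster import KMeans


def label_agreement(features, targets, k):
    """Percentage of rows whose cluster label equals the target value."""
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(features)
    return 100.0 * np.mean(labels == targets)


# Hypothetical stand-in data: 60 rows, 3 scaled features, 6 category codes
rng = np.random.default_rng(0)
features = rng.random((60, 3))
targets = rng.integers(0, 6, size=60)

scores = {k: label_agreement(features, targets, k) for k in (2, 3, 4, 5)}
for k, score in scores.items():
    print('k=%d: %.2f%%' % (k, score))
```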
When K=3
The accuracy is very low, so a k value of 3 is not optimal.
Let’s take a look at the scatter plot with YearsAtCompany on the x-axis and YearsWithCurrManager on the y-axis.
plt.scatter(X['YearsAtCompany'], X['YearsWithCurrManager'],
            c=y_k_means, cmap='rainbow')
plt.show()
Clearly, it’s tough to identify the clusters, and the error is high. This also means that the elements of the same cluster are not located close together in the plot.
When K=4
This is a decent accuracy value considering the number of rows in the dataset. But we will validate it using the scatter plot again.
As in the earlier scatter plots, it’s tough to identify the clusters. So, we will continue to test other k values.
When K=5
The accuracy is lower than for K=4, and the scatter plot below suggests the same.
Finally, we will test with k = 2
Clearly, the accuracy here is the highest among the values tested around the ‘elbow.’
Let us validate the same with a scatter plot using different columns on the x and y-axis.
X = DailyRate, Y = YearsAtCompany
As we can see, the data points are scattered all over, and we cannot conclusively identify which area has had higher attrition.

Let us validate the same with a scatter plot using different columns on the x and y-axis.
X = EducationField, Y = YearsAtCompany
As you can see, this is another inconclusive scatter plot without a clear segment or cluster.

Let us validate the same with a scatter plot using different columns on the x and y-axis.
X = YearsAtCompany, Y = YearsWithCurrManager
Clearly, K=2 gives us the best result in terms of accuracy and clear clusters (represented by two distinct colors). Also, the plot of YearsAtCompany against YearsWithCurrManager gives us some degree of clear segments.
As you can see, the first segment (cluster in purple) represents employees who left the organization after a long tenure with the company, having had the same manager for a long period.
Employees who have stayed with the company longer but constantly worked with different managers have not left the company.
The second segment (cluster in red) is where employees in their early years at the company have left the organization, even though they had worked with their managers for only a few years or had changed managers regularly.
People who have stayed with their manager for a longer time but have not been with the company for long have mostly stayed with the company.
Based on this analysis, an organization may derive areas of improvement as follows. It may choose to focus on employees in their early days at the company and pay extra attention to whether they are growing and feel accomplished.
Similarly, employees who have been with the company for a longer period might need a change in their role, project, or manager.
