
Employees don’t leave the company; they leave their managers
Nayana Kumari
Published in Web Mining [IS688, Spring 2…21] · 8 min read · Apr 30, 2021


A ‘K-means’ clustering approach to identify attrition

Jeffery Hamilton/Getty Images

It is not a pleasant thing to hear ‘I Quit,’ but companies can gain insight through evidence and pattern analysis to determine the reasons. Here is an analysis of employee churn. This analysis aims to help companies understand the pattern and predict who might leave the company in the future. A good workplace always looks for ways to improve employee retention and employee satisfaction.

Almost all companies face high attrition at some point in their lifetime. Attrition is a regular phenomenon, and it helps an organization bring in talent from other companies without ramping up the workforce, but various factors contribute to this churn (attrition). I will try to apply what I have learned about clustering and distance measurements between data points to identify a few critical factors that drive a high attrition rate.

For this analysis, I will use a dataset that contains employee attrition along with employee profiles.

The codebase for this analysis can be found here:

Data Exploration
Let’s look at the data at hand. The data is in .csv format.

Python with the following libraries will be used for this analysis:

from IPython.display import

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

pandas: for data extract, cleansing, and


sklearn: for the k-means and related


matplotlib: to plot the graphs of various datasets and calculations

Data Attributes:
First, I loaded the CSV file into a pandas DataFrame:

data = pd.read_csv('HR-Employee-


The shape of the dataset looks like this: (1470, 35), i.e., 1470 rows and 35 columns.

Let’s check for the null and empty values:


All the columns are filled, with no null or empty cells, which is good.
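A check like this can be sketched in pandas as follows. The small DataFrame is made-up stand-in data, not the HR dataset, and the variable names are only for illustration. `isnull().sum()` counts missing cells per column; the second check looks for empty strings, which pandas does not treat as null.

```python
import pandas as pd

# Made-up stand-in rows; the real analysis would use the `data` DataFrame.
sample = pd.DataFrame({
    "Age": [41, 49, None],
    "EducationField": ["Life Sciences", "", "Medical"],
})

# Number of null (NaN/None) cells in each column.
null_counts = sample.isnull().sum()

# Number of empty-string cells in each column.
empty_counts = (sample == "").sum()

print(null_counts)
print(empty_counts)
```

If both counts are zero for every column, there is nothing to impute or drop before clustering.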

After checking some of the categorical attributes for their unique values and the statistics for the numerical attributes, we found the columns below appropriate for our analysis.

Columns: Age, Daily rate, and Education



Column Stats and values:

Along with the above columns, the below

columns are distributed on a similar scale,
which will help identify the areas of

Columns: Years at company, Years In Current Role, Years Since Last Promotion, Years with Current Manager.



Some fields are categorical and have only two or three unique values, e.g., Gender and Marital Status. We will have to remove them from the dataset, as they will not help identify the clusters clearly.

So, I have taken the categorical column with the widest range of values: Education Field.

I will have to vectorize this field for my analysis, since k-means works on numeric distances and cannot use categorical attributes directly.


Number of records: 1470

The final set of attributes chosen for the


‘Attrition’, ‘DailyRate’, ‘EducationField’,

‘YearsAtCompany’, ‘YearsInCurrentRole’,

I created my dataset with the above-mentioned fields:

f_data = pd.DataFrame(data,
'DailyRate', 'EducationField',
, 'YearsInCurrentRole',
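Since the column list above is cut off, here is a minimal sketch of the selection step on made-up rows. The `columns` list is an assumption: it contains only names that appear elsewhere in this article, and the real list may include more.

```python
import pandas as pd

# Made-up stand-in for the full HR table (the real one has 35 columns).
data = pd.DataFrame({
    "Attrition": ["Yes", "No", "Yes"],
    "DailyRate": [1102, 279, 1373],
    "EducationField": ["Life Sciences", "Life Sciences", "Other"],
    "Gender": ["Female", "Male", "Male"],
    "YearsAtCompany": [6, 10, 0],
    "YearsInCurrentRole": [4, 7, 0],
    "YearsWithCurrManager": [5, 7, 0],
})

# Assumed column list: only names that appear elsewhere in this article.
columns = [
    "Attrition", "DailyRate", "EducationField",
    "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager",
]

# Selecting with a list keeps just those columns; .copy() means later
# changes to f_data do not touch the original table.
f_data = data[columns].copy()
print(f_data.shape)
```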

Now let us check the dataset-


I have also filtered the dataset to the rows where attrition is ‘Yes,’ because I am interested only in the records where the employee has left the company.

m_data = f_data[f_data['Attrition'] == 'Yes']
f_data = m_data.drop(['Attrition'], axis=1)

Total number of records after filtering: 237

I will vectorize the categorical field EducationField, since all my columns need to be numeric for k-means to compute distances.

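X = f_data
y = f_data['EducationField']
le = LabelEncoder()
# Fit the encoder and replace each field name with its integer code
X['EducationField'] = le.fit_transform(X['EducationField'])
y = le.transform(y)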
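To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a text column. The field values are made-up examples rather than rows from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

fields = ["Medical", "Life Sciences", "Other", "Medical"]

le = LabelEncoder()
codes = le.fit_transform(fields)  # learn the classes, then encode

# classes_ holds the distinct values in sorted order; each value's
# integer code is its position in that array.
print(list(le.classes_))
print(list(codes))

# inverse_transform maps the codes back to the original text.
print(list(le.inverse_transform(codes)))
```

The integer codes imply an ordering that the categories do not really have, which is worth keeping in mind when k-means measures distances along this column.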


I need to rescale the features to a common range so that no single column dominates the k-means distance calculation, and so that the scatter plots are easier to draw.

s = X.columns
ms = MinMaxScaler()
X = ms.fit_transform(X)
# fit_transform returns a NumPy array, so restore the column names
X = pd.DataFrame(X, columns=s)


As you may have noticed, the values in the “YearsXXX” columns (YearsAtCompany, YearsWithCurrManager) have changed to decimals between 0 and 1, since every column now shares the same scale.
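The effect of the scaler can be checked on a tiny example. This is a sketch with made-up numbers, not rows from the HR dataset; it compares `MinMaxScaler` with its default range of 0 to 1 against the same calculation done by hand.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values, not rows from the HR dataset.
X = pd.DataFrame({
    "DailyRate": [100, 600, 1100],
    "YearsAtCompany": [0, 5, 20],
})

ms = MinMaxScaler()  # default feature_range is (0, 1)
scaled = pd.DataFrame(ms.fit_transform(X), columns=X.columns)

# Each value becomes (x - column_min) / (column_max - column_min).
manual = (X - X.min()) / (X.max() - X.min())

print(scaled)
print(manual)
```

Both tables should agree: the smallest value in each column maps to 0, the largest to 1, and everything else falls in between.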

Before we find the k-means clusters, we need to supply a value for k, i.e., the number of clusters we are interested in. There are several methods available to find the optimal k. The most frequently used is called the “Elbow method.”

Elbow Method to find k-value

If we plot the number of clusters along the x-axis and the error (typically the within-cluster sum of squared distances) along the y-axis, the resulting curve shows the relationship between the number of clusters and the error. The error drops steeply at first as clusters are added, and then the curve bends. That bend is called the ‘elbow.’ Beyond that point, adding more clusters no longer reduces the error steeply. We take the number of clusters at that point. We also test some values above and below that number to find the optimal k value.

Let’s take a look at the elbow plot -
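The numbers behind such a plot come from fitting k-means for a range of k values and recording the error each time. Here is a minimal sketch on synthetic data (not the HR dataset), assuming scikit-learn's `inertia_` attribute as the error measure; `n_init` and `random_state` are assumed settings chosen for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two well-separated blobs in 2-D.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.2, scale=0.05, size=(50, 2)),
    rng.normal(loc=0.8, scale=0.05, size=(50, 2)),
])

# Fit k-means for k = 1..6 and record the inertia (within-cluster
# sum of squared distances), which is the "error" on the y-axis.
ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(points)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 3))
```

Plotting `ks` against `inertias`, for example with `plt.plot(ks, inertias, marker='o')`, gives the elbow curve. With two blobs, the error should fall sharply from k=1 to k=2 and only slightly after that.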

As you can see from this graph, the optimum number is 2. We will also test the values 3, 4, and 5. Here we check the accuracy for each number of clusters (k).

k_means = KMeans(n_clusters=2,
y_k_means =
labels = k_means.labels_
# check how many of the samples were correctly labeled
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
print('Accuracy score: {0:0.2f}%'.format((correct_labels * 100) / y.size))
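Because parts of the snippet above are cut off, here is a complete sketch of the same kind of check on synthetic data with known reference labels. The `random_state` and `n_init` arguments and the use of `fit_predict` are assumptions, not necessarily what the original code used.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two blobs with a known reference label each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.2, scale=0.05, size=(50, 2)),
    rng.normal(loc=0.8, scale=0.05, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# random_state and n_init are assumed settings, chosen for repeatability.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0)
y_k_means = k_means.fit_predict(X)  # cluster index for every row
labels = k_means.labels_            # same assignments, stored on the model

# Count the rows whose cluster index equals the reference label.
correct_labels = int(np.sum(y == labels))
print("Result: %d out of %d samples were correctly labeled."
      % (correct_labels, y.size))
print("Accuracy score: {0:0.2f}%".format(correct_labels * 100 / y.size))
```

k-means numbers its clusters arbitrarily, so the same grouping can score very differently depending on which cluster happens to be called 0. A score like this is probably best read as a rough comparison between k values rather than as accuracy in the supervised sense.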

When K=3

The accuracy is very low so, k value 3 is not


Let’s take a look at the scatter plot with the x-axis as YearsAtCompany and Y-axis as

c=y_k_means, cmap='rainbow')
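A scatter plot of this kind can be sketched as follows. The data is random stand-in data, and the y-axis column (`YearsWithCurrManager`) is an assumption, since the axis name is cut off above; `c=` colors each point by its cluster index and `cmap` picks the palette.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two made-up scaled columns standing in for the real features.
years_at_company = rng.random(100)
years_with_manager = rng.random(100)
X = np.column_stack([years_at_company, years_with_manager])

# n_init and random_state are assumed settings for repeatability.
y_k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

fig, ax = plt.subplots()
# One point per employee, colored by the cluster it was assigned to.
ax.scatter(X[:, 0], X[:, 1], c=y_k_means, cmap="rainbow")
ax.set_xlabel("YearsAtCompany (scaled)")
ax.set_ylabel("YearsWithCurrManager (scaled)")
```

In a notebook the figure is displayed automatically; in a script, `plt.show()` or `fig.savefig(...)` would be needed.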

Clearly, it’s tough to identify the clusters,

and the number of errors is high. This also
means the elements which are part of the
same cluster are not geographically located

When K=4

This is a decent accuracy value considering the number of rows in the dataset. But we will validate using the scatter plot again.

Like the earlier scatter plots, it’s tough to identify the clusters. So, we will continue to test other k values.

When K=5

The accuracy is lower than K=4 and the scatter plot below also suggests the same.

Finally, we will test with k = 2

Clearly, we found that the accuracy level is

highest among other values around the

Let us validate the same with a scatter plot

using different columns on the x and y-

X= DailyRate,

As we can see, the data points are scattered

all over and can not be conclusively
identified which area has had higher

Let us validate the same with a scatter plot

using different columns on the x and y-

X= EducationField,

As you can see, another inconclusive

scatter plot without a clear

Let us validate the same with a scatter plot

using different columns on the x and y-

X= NumberofYearsinCompany,

Clearly, K=2 gives us the best result in terms of accuracy and clear clusters (represented by two distinct colors). Also, the plot of YearsAtCompany against YearsWithCurrManager gives us reasonably clear segments.

As you can see, the first segment (the cluster in purple) represents employees who left the organization after a long tenure with the company, much of it under the same manager.

Employees who have stayed with the

company longer but constantly worked
with different managers have not left the

The second segment (the cluster in red) contains employees who left the organization in their early years with the company, even though they had worked with their managers for only a few years or had changed managers regularly.

People who have been with their manager for a long time, but not with the company for long, have mostly stayed with the company.
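One way to back up a reading like this is to compare the per-cluster averages of each feature. The sketch below uses made-up scaled values and an assumed `cluster` column; it does not reproduce the article's results.

```python
import pandas as pd

# Made-up scaled rows with an assumed cluster column; not real results.
df = pd.DataFrame({
    "YearsAtCompany":       [0.05, 0.10, 0.08, 0.60, 0.75, 0.70],
    "YearsWithCurrManager": [0.04, 0.12, 0.06, 0.55, 0.80, 0.65],
    "cluster":              [1, 1, 1, 0, 0, 0],
})

# Average of every feature within each cluster.
profile = df.groupby("cluster").mean()
# Number of employees that fell into each cluster.
sizes = df["cluster"].value_counts().sort_index()

print(profile)
print(sizes)
```

A cluster whose averages are high on both columns would correspond to the long-tenure, same-manager segment, and one with low averages to the early-tenure segment.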

Based on this analysis, an organization may derive the following areas of improvement. It may choose to focus on employees in their early days at the company and pay extra attention to whether they are growing and feel accomplished.

Similarly, employees who have been with the company for a longer period might need a change in their role, project, and manager.
