
Employees don’t leave the company; they leave their managers
Nayana Kumari
Published in Web Mining [IS688, Spring 2021] · 8 min read · Apr 30, 2021


A ‘K-means’ clustering approach to identify attrition

Jeffery Hamilton/Getty Images

It is not pleasant to hear ‘I quit,’ but companies can gain insight into the reasons through evidence and pattern analysis. Here is an analysis of employee churn. It aims to help companies understand the pattern and predict who might leave the company in the future. A good workplace always looks for ways to improve employee retention and employee satisfaction.

Almost all companies face high attrition at some point in their lifetime. Attrition is a regular phenomenon and helps an organization bring in talent from other companies without ramping up the workforce, but various factors contribute to this churn (attrition). I will apply what I have learned about clustering and distance measurements between data points to identify a few critical factors that drive a high attrition rate.

For this analysis, I will use a dataset that contains employee attrition along with employee profiles.

The codebase for this analysis can be found here:

https://github.com/nt27web/WebMining-Clustering

Data Exploration
Let’s look at the data at hand. The data is in .csv format.

Python with the following libraries will be used for this analysis:

from IPython.display import display
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

pandas: for data extraction, cleansing, and manipulation

sklearn: for k-means and related calculations

matplotlib: to plot graphs of the various datasets and calculations

Data Attributes:
First, I loaded the CSV file into a pandas DataFrame:

data = pd.read_csv('HR-Employee-Attrition.csv')

display(data.shape)

The shape of the dataset looks like this: (1470, 35), i.e., 1470 rows and 35 columns.

Let’s check for the null and empty values:

display(data.isnull().sum())

All the columns are filled, with no null or empty cells, which is good.

After checking the unique values of some of the categorical attributes and the statistics of the numerical attributes, we found the columns below appropriate for our analysis.

Columns: Age, Daily Rate, and Education Field.

display(data['Age'].describe())
display(data['DailyRate'].describe())
display(data['EducationField'].unique())

Column Stats and values:

Along with the above columns, the columns below are distributed on a similar scale, which will help identify the areas of attrition.

Columns: Years At Company, Years In Current Role, Years Since Last Promotion, Years With Current Manager.

display(data['YearsAtCompany'].describe())
display(data['YearsInCurrentRole'].describe())
display(data['YearsSinceLastPromotion'].describe())
display(data['YearsWithCurrManager'].describe())

Result:

Some fields are categorical and have only two or three unique values, e.g., Gender and Marital Status. We will remove them from the dataset, as they will not help identify the clusters clearly.

So, I have taken the categorical column with the most distinct values: Education Field.
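As a rough sketch of how such a comparison could be made, the snippet below counts the distinct values in each categorical column of a small toy frame. The values are made up for illustration and the column list is an assumption; only the idea of comparing unique-value counts comes from the text above.

```python
import pandas as pd

# Toy frame with made-up values; it only mimics a few columns of the dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Male"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single", "Married"],
    "EducationField": ["Life Sciences", "Medical", "Marketing",
                       "Technical Degree", "Other"],
    "Age": [41, 49, 37, 33, 27],
})

# Columns treated as categorical in this sketch (listed by hand).
categorical_cols = ["Gender", "MaritalStatus", "EducationField"]

# Number of distinct values per categorical column, largest first.
cardinality = toy[categorical_cols].nunique().sort_values(ascending=False)
print(cardinality)

# Name of the categorical column with the most distinct values.
richest = cardinality.idxmax()
print(richest)
```

On the real data, the same two lines applied to `data` would show which categorical columns have only two or three values.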

I will have to vectorize this field for my analysis, since k-means does not work on categorical attributes.
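To make the idea of vectorizing concrete, here is a minimal sketch of label encoding on a short list of field names. The sample values are illustrative only, not rows from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative field names (not actual rows from the dataset).
fields = ["Life Sciences", "Medical", "Marketing", "Medical", "Other"]

le = LabelEncoder()
codes = le.fit_transform(fields)

# classes_ holds the distinct labels in sorted order;
# each label is replaced by its position in that list.
print(list(le.classes_))   # ['Life Sciences', 'Marketing', 'Medical', 'Other']
print(codes.tolist())      # [0, 2, 1, 2, 3]

# The mapping can be reversed to recover the original labels.
decoded = le.inverse_transform(codes)
print(list(decoded))
```

The integer codes simply follow the alphabetical order of the labels.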

display(len(data))

Number of records: 1470

The final set of attributes chosen for the analysis:

‘Attrition’, ‘DailyRate’, ‘EducationField’, ‘YearsAtCompany’, ‘YearsInCurrentRole’, ‘YearsSinceLastPromotion’, ‘YearsWithCurrManager’

I created my dataset with the above-mentioned fields:

f_data = pd.DataFrame(data, columns=[
    'Attrition', 'DailyRate', 'EducationField',
    'YearsAtCompany', 'YearsInCurrentRole',
    'YearsSinceLastPromotion', 'YearsWithCurrManager'
])

Now let us check the dataset:

display(f_data.head())

I have also filtered the dataset to the rows where Attrition is ‘Yes,’ since I am interested only in the records where the employee has left the company.

m_data = f_data[f_data['Attrition'] == 'Yes']
f_data = m_data.drop(['Attrition'], axis=1)
display(len(f_data))

Total number of records after filtering: 237

I will vectorize the categorical field EducationField, since all my columns need to be numeric for k-means.

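# Work on a copy so the filtered frame is not modified in place
X = f_data.copy()
y = f_data['EducationField']
le = LabelEncoder()
# Replace the text labels with integer codes
X['EducationField'] = le.fit_transform(X['EducationField'])
y = le.transform(y)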

Result:

I need to rescale the features to a common range before passing them to the k-means function, which also makes the scatter plots easier to draw.

cols = X.columns
ms = MinMaxScaler()
X = ms.fit_transform(X)
X = pd.DataFrame(X, columns=cols)

Result:

As you may have noticed, the values in the “YearsXXX” columns (YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager) have changed to decimals, since every column has been rescaled to the same range (0 to 1, the MinMaxScaler default).
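A small sketch of what this rescaling does, using a made-up column of tenure values: each value x becomes (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1. The numbers are illustrative, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up tenure values in years (one column, four rows).
years = np.array([[0.0], [2.0], [5.0], [10.0]])

scaler = MinMaxScaler()          # default feature_range is (0, 1)
scaled = scaler.fit_transform(years)

# The same result computed by hand: (x - min) / (max - min).
col_min = years.min(axis=0)
col_max = years.max(axis=0)
manual = (years - col_min) / (col_max - col_min)

print(scaled.ravel())            # values 0.0, 0.2, 0.5, 1.0
print(np.allclose(scaled, manual))
```

Because every column ends up in the same 0 to 1 range, a column with large raw values such as DailyRate no longer dominates the distance calculation.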

Before we find the k-means clusters, we need to supply a value for k, i.e., the number of clusters we are interested in. There are several methods available to find the optimal k. The most frequently used is called the “Elbow method.”

Elbow Method to find k-value

If a graph is plotted with the number of clusters along the x-axis and the error along the y-axis (for k-means, typically the within-cluster sum of squared distances), a curve is formed which shows the relationship between the number of clusters and the error. At first the error declines steeply as clusters are added; the point at which the curve leaves this steep decline is called the ‘elbow.’ Beyond that point, adding clusters no longer reduces the error steeply. We take the number of clusters at that point, and also test some values above and below it to find the optimal k value.
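The elbow curve can be sketched roughly as follows. This is not the code from the repository; it assumes the error is measured by scikit-learn's `inertia_` (the within-cluster sum of squared distances) and uses synthetic data from `make_blobs` so that it runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups, standing in for the scaled features.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0,
                       random_state=0)

ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X_demo)
    # inertia_ is the sum of squared distances of samples
    # to their closest cluster center.
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `ks` against `inertias` (for example with `plt.plot(ks, inertias, marker='o')`) gives the elbow curve; the k after which the drop flattens is the candidate value.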

Let’s take a look at the elbow plot:

As you can see based on this graph, the optimum number is 2. We will also test the values 3, 4, and 5 and check the accuracy for each number of clusters (k).

k_means = KMeans(n_clusters=2, random_state=0)
y_k_means = k_means.fit_predict(X)
labels = k_means.labels_
# check how many of the samples were correctly labeled,
# i.e. the cluster label equals the encoded EducationField value in y
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
print('Accuracy score: {0:0.2f}%'.format((correct_labels * 100) / float(y.size)))
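To repeat this check for several k values without editing the code each time, the same calculation can be wrapped in a loop. The sketch below is an assumption about how that might look; `label_agreement` is a hypothetical helper, and the data is synthetic so the snippet runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data and reference labels, standing in for X and y above.
X_demo, y_demo = make_blobs(n_samples=200, centers=2, random_state=0)

def label_agreement(X, y, k):
    """Percentage of samples whose cluster label equals the reference label.

    Hypothetical helper mirroring the check above. Cluster ids are assigned
    arbitrarily by k-means, so this raw agreement depends on which id each
    cluster happens to receive.
    """
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return 100.0 * np.mean(labels == y)

scores = {k: label_agreement(X_demo, y_demo, k) for k in (2, 3, 4, 5)}
for k, score in scores.items():
    print(f"k={k}: {score:.2f}% agreement")
```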

When K=3

The accuracy is very low, so a k value of 3 is not optimal.

Let’s take a look at the scatter plot with YearsAtCompany on the x-axis and YearsWithCurrManager on the y-axis.

plt.scatter(X['YearsAtCompany'], X['YearsWithCurrManager'],
            c=y_k_means, cmap='rainbow')
plt.show()

Clearly, it’s tough to identify the clusters, and the number of errors is high. This also means that the elements of the same cluster are not located close together in the plot.

When K=4

This is a decent accuracy value considering the number of rows in the dataset. But we will validate using the scatter plot again.

Like the earlier scatter plots, it’s tough to identify the clusters. So, we will continue to test other k values.

When K=5

The accuracy is lower than K=4 and the scatter plot below also suggests the same.

Finally, we will test with k = 2

Clearly, the accuracy is the highest among the values tested around the ‘elbow.’

Let us validate the same with a scatter plot using different columns on the x- and y-axes.

X = DailyRate, Y = YearsAtCompany

As we can see, the data points are scattered all over, and it cannot be conclusively determined which area has had higher attrition.

Let us validate the same with a scatter plot using different columns on the x- and y-axes.

X = EducationField, Y = YearsAtCompany

As you can see, another inconclusive scatter plot without a clear segment/cluster.

Let us validate the same with a scatter plot using different columns on the x- and y-axes.

X = YearsAtCompany, Y = YearsWithCurrManager

Clearly, K=2 gives us the best result in terms of accuracy and clear clusters (represented by two distinct colors). Also, the plot of YearsAtCompany against YearsWithCurrManager gives us some degree of clear segments.
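One way to put numbers on the two segments is to convert the cluster centers back to the original units. The sketch below uses made-up tenure values rather than the dataset, and only illustrates how `inverse_transform` on the scaler could be used for this.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Made-up tenure values in years; not rows from the dataset.
toy = pd.DataFrame({
    "YearsAtCompany":       [1, 2, 1, 3, 2, 10, 12, 9, 11, 14],
    "YearsWithCurrManager": [0, 1, 0, 2, 1, 7, 8, 7, 9, 10],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# cluster_centers_ are in the scaled 0-1 space; map them back to years.
centers = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=toy.columns,
)
print(centers.round(1))
```

Each row of `centers` is the average profile of one cluster, which makes it easier to say which cluster is the long-tenure one.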

As you can see, the first segment (the cluster in purple) represents the employees who left the organization after a longer tenure with the company, having had the same manager for a long period.

Employees who have stayed with the company longer but constantly worked with different managers have not left the company.

The second segment (the cluster in red) covers employees in their early years at the company who have left the organization, even though they had worked with their managers for only a few years or had changed managers regularly.

People who have stayed with their manager for a longer time and have not been with the company long have mostly stayed with the company.

Based on this analysis, an organization may derive the areas of improvement as follows. It may choose to focus on employees in their early days in the company and pay extra attention to whether they are growing and feel accomplished.

Similarly, employees who have been with the company for a longer period might need a change in their role, project, and manager.
