
Using HR Analytics to Support Managerial Decisions: A Case Study
Liyuan Liu∗
lliyuan@students.kennesaw.edu
Analytics and Data Science Institute
Kennesaw State University
Kennesaw, Georgia, USA

Sanjoosh Akkineni∗
sakkinen@students.kennesaw.edu
Analytics and Data Science Institute
Kennesaw State University
Kennesaw, Georgia, USA

Paul Story∗
pstory@kennesaw.edu
Analytics and Data Science Institute
Kennesaw State University
Kennesaw, Georgia, USA

Clay Davis†
clayton.davis@gmail.com
Novelis Inc.
Atlanta, Georgia, USA

ABSTRACT
More and more organizations are becoming employee-centric in the 21st century. The employment and workforce industry has become crucial because human capital value is directly linked to the organization’s profitability. Human Resource (HR) Analytics enables HR professionals to make strategic contributions and support managerial decisions. However, in most industries, HR professionals have not yet adopted data analysis as they should. There are several challenges: HR data is large, messy, and imbalanced; it is hard to harness both structured and unstructured data; some HR managers lack data mining skills; and there is a lack of related empirical research that gives a detailed analytics guideline. The contribution of this study is a framework that supports an industrial aluminum company in making decisions and improving strategy execution. The framework includes descriptive analysis, predictive analysis, and entity sentiment analysis. We analyzed an industrial aluminum company’s HR data as a case study and found some actionable issues using descriptive analysis. We then employed machine learning algorithms to predict employees’ turnover rates and find risk factors. Moreover, we applied entity sentiment analysis to the unstructured data collected from the employees’ engagement survey.

CCS CONCEPTS
• Computer systems organization → Machine Learning; Modelling;

KEYWORDS
HR Analytics, Turnover Prediction, Sentiment Analysis, Imbalanced Data

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ACMSE 2020, April 2–4, 2020, Tampa, FL, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7105-6/20/03. . . $15.00
https://doi.org/10.1145/3374135.3385281

1 INTRODUCTION
With the tremendous development of data science and business intelligence, analytics has become a critical tool in many fields, such as marketing, finance, and natural language processing. The essential purpose of business analytics for an organization is to understand how to leverage existing data to support managerial decisions and create business value. The employment and workforce industry has become crucial in an organization because human capital value is directly linked to the organization’s effectiveness [24]. Solving critical people problems using human resource data is one of the main tasks for organizations. According to the Economist Intelligence Unit report, 82% of organizations intended to either start or develop their HR Analytics before the end of 2018 [27].
HR analytics is the systematic identification and quantification of the people drivers of business outcomes [30]. Many researchers have suggested using data mining and artificial intelligence to solve people analytics problems [4, 28]. At the same time, advanced analytics algorithms could bring organizations benefits such as preventing the voluntary turnover of high talents, detecting human capital risk, and increasing hiring efficiency. For example, the Google People Analytics team optimized their HR survey questionnaires and got a response rate above 90%, which is a highly satisfying response rate for any questionnaire [26]. Even though much HR analytics work is taking place today, most of it is still at the concept stage, and the implementation of HR analytics in the real world still meets many challenges.
(1) There is a lack of guidelines for HR professionals to analyze the data, and HR professionals lack data analytics skills: As HR Analytics is a new area for organizations, it requires experienced teams or guide materials to conduct the human resource data analysis process. HR analytics is still an open research question. In human resource management education, students may not have the same quantitative rigor as MBA students. The modeling or coding required of HR professionals in data analysis could be another challenge for the development of HR Analytics.
(2) The broad data: Many organizations may encounter a problem that they have data, but they do not know how to use
them. Alternatively, they plan to do people analytics but do not know what kind of data they should collect. The messy and extensive real-world human data puts many organizations in a dilemma.
(3) The data ethics problem caused by HR analytics: Most personal data is sensitive and private [14, 15, 20]; it may include personal health information, sexual orientation, employees’ identity information, etc. This private information can put HR analytics in murky territory when an appropriate privacy-preserving strategy is lacking.
(4) Big data is not big enough: Many small organizations keep asking the same question. Because of their size, it is not easy for them to collect big data to analyze. How can HR analytics be employed in small organizations?
These challenges are critical when organizations start to analyze human data. Therefore, we designed an HR Analytics framework that supports HR professionals in using data analytics to make managerial decisions, and that promotes the popularization of HR Analytics in different industries. We separated the framework into three stages. The pre-analysis stage includes data collection, data cleaning, and data preprocessing. We divided human data into six aspects that help organizations collect human data efficiently and accurately. We also address how to employ statistical methods to solve the imbalance problem when the collected data is imbalanced. The analysis stage includes descriptive analysis, which not only summarizes the distributions of datasets but also helps organizations find insights from data. Descriptive analysis can also help small companies find patterns in data and support strategic decisions. In the predictive analysis stage, different data mining models can be created to predict human trends, such as turnover and performance. Nowadays, more and more companies have unstructured data, for example, engagement survey data. How to employ models to deal with unstructured data becomes another crucial question. In this study, we also propose a method to analyze the text data collected from the employees’ engagement survey. These text data can also be analyzed using machine learning algorithms and sentiment analysis [22]. The post-analysis stage is the final report and decision-making stage. In this stage, data visualization is required. The envisioned dashboards or reports are established, and managers can make management decisions based on them. In this study, we employed a real-world dataset collected from an industrial aluminum company. The case study confirms that our proposed HR analytics framework is suitable for the real world.
The rest of this study is organized as follows. Section 2 discusses the current advanced research, Section 3 introduces the overview of our proposed HR analytics pipeline, Section 4 discusses the methodology of the predictive analysis stage, and Section 5 shows the insights we found and how the company could use the findings to make managerial decisions. Finally, we give the conclusions in Section 6.

2 RELATED WORK
With the development of computers, the Human Resource Information System (HRIS) has become a fundamental system in enterprises. Research on HRIS has developed since the 1990s. Berry claimed that the marriage of modern HRIS technology with time-honored concepts of job satisfaction and employee motivation could improve companies’ performance and productivity [3]. In 2005, Lippert et al. proposed a model to explore the relationship between HRIS and technology trust [19]. Maier et al. [25] produced a model and showed that an indirect effect of attitudes towards the HRIS on turnover intention is fully mediated by employees’ satisfaction.
However, an HRIS requires that information be input in a standard form; otherwise, it is hard to utilize when organizations make their managerial decisions. Also, with the development of big data, more and more data is generated every day and everywhere. Research related to HR Analytics grew rapidly after 2010. HR analytics is still the subject of several debates but continues to develop in the big data era. Angrave et al. discussed their conclusions in their paper [1]. They argued that HR analytics will not necessarily develop into a “must-have capability.” The first reason is that HR professionals lack analytics and coding skills. Data mining skills were not required in most HR degree curriculums in academia. The result was a situation in which HR professionals do not understand analytics, while analytics teams do not understand HR. However, they pointed out that day-to-day metrics, measures, analysis platforms, and frameworks were necessary to upgrade HR professionals’ analytics capabilities. Most researchers were in favor of developing HR analytics. As Bassi mentioned [2], HR analytics is an evidence-based approach that helps organizations make better decisions. Even though some organizations claimed HR analytics is not one of their core competencies, it still helps HR professionals reduce their workload while improving the profit of organizations.
Descriptive analysis and predictive analysis are the two primary components of HR analytics. Descriptive analysis mainly uses data to describe what has already happened in organizations. It could help us find the business insights behind data, and a well-designed report or dashboard can help reveal the underlying causes of business events. Compared with descriptive analysis, predictive analysis is a data-driven approach that uses statistical techniques, machine learning methods, and data mining models to extract future trends from historical data [21, 28, 34]. Mishra et al. divided predictive analysis into several aspects. Turnover modeling is used to predict future voluntary or involuntary turnover of the business in specific functions, business units, geographies, and countries using a variety of human data such as tenure, performance management, and merit. Response modeling develops predictive models from previous job advertising data and helps to target the appropriate candidates or platforms to recruit employees. After the turnover prediction, a retention model can also be employed to create the employees’ profile and identify the high-risk turnover factors. They also mentioned that risk modeling and talent forecast modeling could help organizations seeking talents [28].
The related research on predictive analysis is shown in Table 1. Several abbreviations need to be clarified in Table 1. Decision Tree denoted as DT, Random Forest denoted as RF, logistic regression denoted as LR, neural network denoted as NN, naive Bayes denoted as NB, K-nearest neighbors denoted as KNN, support vector machines denoted as SVM, Linear model denoted as LM, gradient boosting tree denoted as GRT, extreme gradient boosting trees denoted as
XGB, Linear discriminant analysis denoted as LDA, and the weighted random forest denoted as WQRF. Most of the studies focus on using different machine learning algorithms to predict HR problems. However, the articles in Table 1 all focused on structured data and did not give an example of how to analyze unstructured data, such as employees’ comments from the engagement survey. Few of them address the imbalanced data problem, even though most real-world HR data is imbalanced, for example in the ratio of employees who leave to active employees. In this study, we not only show how to use structured data to complete the HR analysis pipeline but also provide a method of analyzing unstructured data in HR analytics using real-world datasets.

| Publication | Prediction Field | Models | Data Type | Model Recommended | Balanced Strategy |
|---|---|---|---|---|---|
| Kane-Sellers 2007 [17] | Turnover | LG | Structured | LG | No |
| Hamidah et al. 2011 [12] | Performance | DT, RF, NN | Structured | DT | No |
| Sikaroudi et al. 2012 [29] | Turnover | DT, RF, NB, KNN, SVM, NN | Structured | RF | No |
| Faliagka et al. 2012 [8] | Recruit | LM, DT, SVM | Structured | DT | No |
| Di Mitri et al. 2017 [7] | Performance | LM | Structured | LM | No |
| Zhao et al. 2018 [33] | Turnover | DT, RF, GB, XGB, LR, SVM, NB, NN, LDA, KNN | Structured | XGB | No |
| Gao et al. 2019 [9] | Turnover | DT, NN, RF, LR, WQRF | Structured | WQRF | Yes |

Table 1: Related Research of Predictive Analysis

3 ANALYTICS FRAMEWORK OVERVIEW
In this section, we introduce an overview of the HR analytics framework we proposed for real industry use. As Figure 1 shows, there are three stages in the analysis process: the pre-analysis stage, the analysis stage, and the post-analysis stage.

[Figure 1: Overview of Analysis Framework. Pre-analysis stage: define business needs and get data (demography, payroll, promoted, engagement survey, training, performance). Analysis stage: explore and model data (descriptive analysis, predictive analysis, data visualization, etc.). Post-analysis stage: organization strategic decision making (give priority to the investments, promote good-performance employees who have high turnover risk, change merit rate, etc.).]

The first stage is the pre-analysis stage, which aims to define the business needs and collect the necessary data. A common problem for organizations is that they do not know which kind of data is suitable for analysis. To implement HR analytics effectively and efficiently, it is critical to accumulate data from different operations and departments within the organization. We group the data needed for analysis into six categories, which can also serve as a suggestion for organizations that are struggling with HR data collection.
• Demography data: It includes the personal information of employees such as age, gender, tenure, department, etc.
• Payroll data: Any data related to payroll such as merit, actual vs. target pay, etc.
• Performance data: It could consist of the performance rating score, self-rating score, manager rating score, etc.
• Promotion data: If the employee was transferred, promoted, or demoted, these data could all be included.
• Training and development data: For example, the training method, training time, training type, etc.
• Engagement survey data: It could be either structured or unstructured data. For instance, the rating for each survey question is structured data, and the employees’ text comments are unstructured data.
The second stage is the analysis stage. The two main analysis methods are descriptive analysis and predictive analysis. In the descriptive analysis, the main task is understanding the organization’s HR problems and using statistical metrics or understandable visualizations to show the insights behind the data. Descriptive analysis is also the primary analysis technique for small organizations that do not have enough data to create a predictive model. In this process, we recommend incorporating some HR metrics or measures and presenting the results visually. Predictive analysis plays the role of game-changer in the HR industry. It could improve employer-employee interactions in advance. There are three main methods for predictive analysis: statistics-based analysis, machine learning-based analysis, and deep learning-based analysis. We found three principal problems that can be solved using predictive analysis.
• Turnover prediction: This method could give the most important factors that affect the voluntary turnover rates of high talents in organizations. It also enables organizations to identify the employees who may leave in the future.
• Recruit prediction: Nowadays, organizations use social media or advertising to recruit talents. A predictive model could also help organizations create an improved and streamlined hiring strategy and target appropriate employees by analyzing the candidates’ behavior.
• Profile prediction: Using advanced predictive analysis, organizations could use clustering or predict the common factors for specific employees, such as finding the general factors for employees with high turnover risk.
Post-analysis is the third stage in our analysis framework. In this stage, managerial strategies are made based on the analysis results from the second stage. We recommend including the actionable results in the final visualizations. An organization could then make decisions such as promoting excellent performers who have a high turnover risk according to the predictive turnover analysis.

4 METHODOLOGY
In this section, we mainly discuss the methodologies of the predictive analysis. The predictive analysis has two parts: one is a set of machine learning models built on the structured data; the other is the entity sentiment analysis using the Google Cloud Natural Language Processing API [11]. Since our dataset is imbalanced, we also applied three balancing strategies to address the imbalance, which could also improve the models’ performance. The three balancing strategies are random oversampling, SMOTE, and ADASYN. In order to prevent the overfitting problem,
stratified cross-validation is employed in the data split process. We tested five machine learning models: K-nearest Neighbors (KNN), Logistic Regression (LG), Random Forest (RF), Gradient Boosting Tree (GRB), and Decision Tree (DT). In this study, 5-fold stratified cross-validation is employed.

4.1 Data Description
The data is collected from an aluminum company. There are 1,866 employees with 22 variables in the final cleaned dataset. The outcome is voluntary departure or active, coded as 1 and 0 in the target variable. The dataset is imbalanced since 1,345 employees are active, and only 251 employees voluntarily departed. The details of each variable are shown in Table 2. We used the first 20 variables in the table for the descriptive analysis and predictive analysis. The last two variables, which are unstructured text, were used to conduct an entity sentiment analysis and to identify the aspects of the company that employees viewed most positively and most negatively.

| Variable | Type | Category | Description |
|---|---|---|---|
| Age | Numeric | Demography | Employees’ physical age |
| Gender | Nominal | Demography | Female-0, Male-1 |
| Highest Education Level | Ordinal | Demography | High School-1, Diploma-2, Bachelor-3, Master-4, Doctoral-5 |
| Marital Status | Nominal | Demography | Married, Single, Divorced |
| Tenure | Numeric | Demography | Years the employee stayed in the company |
| Tooth and Tail | Nominal | Demography | Tooth-0, Tail-1. Department categories defined by the department function |
| Management Level | Ordinal | Demography | Junior-1, Middle-2, Senior-3, Top-4. Defined based on job band |
| Match | Nominal | Demography | No-0, Yes-1. Whether the personal job band matches the position job band |
| Termination Status | Nominal | Demography | Active-0, Voluntary-1 |
| Final Rating | Ordinal | Performance | Far below expectation-1, Below expectation-2, Meet expectation-3, Exceed expectation-4, Far exceed expectation-5 |
| Actual vs. Target Pay | Numeric | Payroll | The percentage of actual vs. target payroll |
| Merit | Numeric | Payroll | The percentage of merit |
| Prorated | Nominal | Payroll | No-0, Yes-1 |
| Before 2005? | Nominal | Payroll | No-0, Yes-1. Whether the employee was hired before 2005, since different packages were used before and after 2005 |
| Action | Nominal | Promotion | Promotion, Transferred, Demotion, No Promotion, Rehire |
| Growth | Numeric | Engagement Survey | Growth score from 1 to 5 collected from engagement survey |
| Positive Leadership | Numeric | Engagement Survey | Positive leadership score from 1 to 5 collected from engagement survey |
| Autonomy | Numeric | Engagement Survey | Autonomy score from 1 to 5 collected from engagement survey |
| Competence | Numeric | Engagement Survey | Competence score from 1 to 5 collected from engagement survey |
| Pride | Numeric | Engagement Survey | Pride score from 1 to 5 collected from engagement survey |
| Things most like | Text | Engagement Survey | Employees’ comments about things they most like in the company |
| Things most dislike | Text | Engagement Survey | Employees’ comments about things they most dislike in the company |

Table 2: Variable description

4.2 K-nearest Neighbors Algorithm
The K-nearest neighbors algorithm is based on similarity functions. It is a simple algorithm that stores all available cases and classifies a new case using a similarity measure or distance function to find the cases nearest to it.
We use K = 5 in this study, based on the elbow curve computed for this case [18].

4.3 Logistic Regression
Logistic regression is an important machine learning algorithm. The goal is to model the probability that a random variable Y is 0 (stay) or 1 (turn over) given experimental data. Each $y_i \in Y$ represents the outcome of one employee, and the set $X = \{x_1, x_2, \ldots, x_n\}$ represents the features input to the model [23]. The generalized linear model function of logistic regression can be defined with the parameter $\theta$:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}} \tag{1}$$

Logistic regression seeks to predict the effect of a series of variables on a binary response variable. It therefore works much like multiple regression with several independent variables and one binary dependent variable. It can also be used to classify observations by estimating the probability that an observation is in a particular category. The relationship between the inputs $(x_1, x_2, \ldots, x_n)$ and the predicted probability P of the classes can be defined as:

$$\log\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n \tag{2}$$

We use L2 regularization with the primal formulation to mitigate the multicollinearity problem in this study.

4.4 Decision Tree
The decision tree is a popular supervised learning algorithm that is widely used in both industry and academia [10]. Chi-squared automatic interaction detection (CHAID), classification and regression trees (CART), C4.5, and C5.0 are some of the most common tree methods. Information gain and entropy are used to create the trees [32]. These algorithms produce a set of rules that can be employed for prediction through a repeated process of splitting. The splitting rule is determined by a specific metric. Classification error, information gain, and Gini are the three values most often used to measure how well a given attribute separates the training examples. Gini and entropy could be calculated using Equation ?? and 4. The information gain of a variable X relative to a collection of examples S, where $S_v$ is the subset of S in which X takes the value v, is defined as:

$$\mathrm{Gain}(S, X) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{value}(X)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) \tag{3}$$
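As a rough illustration of the information gain in Equation 3, together with the entropy of Equation 4 on which it relies, the following is a minimal Python sketch. The function names and the toy data (a `promoted` and a `gender` attribute with eight employees) are hypothetical and chosen only for illustration; they are not taken from the company dataset or from the authors' implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels (Equation 4)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy(S) minus the size-weighted entropy of each subset S_v (Equation 3)."""
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 1 = voluntary departure, 0 = active.
status = [1, 1, 0, 0, 0, 0, 1, 0]
promoted = ["yes", "yes", "no", "no", "no", "no", "yes", "no"]  # splits the classes cleanly
gender = ["f", "m", "f", "m", "f", "m", "f", "m"]               # barely informative

gain_promoted = information_gain(promoted, status)
gain_gender = information_gain(gender, status)
print(gain_promoted > gain_gender)
```

In this toy example the attribute that separates leavers from active employees perfectly recovers the full entropy of the target, so a tree built with this criterion would split on it first.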
Entropy is a measure of homogeneity that can be defined as below. In Equation 4, m denotes the number of classes, and $N_i$ denotes the proportion of sample S that belongs to class i.

$$\mathrm{Entropy}(S) = \sum_{i=1}^{m} -N_i \log_2 N_i \tag{4}$$

[Figure 2: Data Visualization Before and After Different Balancing Strategies. Four panels: Original Data, SMOTE Data, ADASYN Data, and Random Oversampling Data, each plotting PC2 against PC1.]

4.5 Random forests
The random forest algorithm is one of the most popular and most powerful supervised machine learning algorithms and is capable of performing both regression and classification tasks. The algorithm creates a forest of several decision trees. In general, the more trees in the forest, the more robust the prediction and thus the higher the accuracy.
Random forest has several advantages. It can be used for both classification and regression tasks. It can handle missing values and maintain accuracy when a large proportion of the data is missing. Adding more trees to the forest does not make the model overfit. It also has the power to handle large data sets with high dimensionality.
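The idea of building a forest from several trees and combining their votes can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the implementation used in this study: each "tree" is reduced to a depth-one split (a stump), every tree sees a bootstrap sample and a random subset of features, and the toy rows (tenure in years, actual vs. target pay) are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_indices):
    """Depth-one tree: choose the (feature, threshold) with the fewest errors."""
    best = None
    for f in feature_indices:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda row: left_label if row[f] <= t else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a random subset of features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample rows with replacement
        feats = rng.sample(range(n_features), k)         # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all trees in the forest."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]


# Hypothetical toy data: (tenure in years, actual vs. target pay); 1 = voluntary departure.
rows = [(1, 0.80), (2, 0.85), (3, 0.90), (8, 1.10), (9, 1.20), (10, 1.15)]
labels = [1, 1, 1, 0, 0, 0]

forest = fit_forest(rows, labels)
print(predict(forest, (1, 0.80)), predict(forest, (10, 1.20)))
```

A single stump trained on an unlucky bootstrap sample can be wrong, but the majority vote over many stumps is much more stable, which is the intuition behind using more trees.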
4.6 Gradient Boosting
The gradient boosting tree has shown reliable performance in many studies. It is a member of the ensemble learning family, which combines weak learners into strong learners. The gradient boosting tree combines gradient descent and boosting to optimize the loss function and find the optimal solution. In each stage, a weak learner is introduced to compensate for the shortcomings of the existing weak learners. Depending on the characteristics of the target variable, it can be used for regression or classification. More background can be found in Chen’s research [6].

4.7 Balancing Strategies
We implement three balancing strategies to deal with the imbalance problem in this study: random oversampling, the synthetic minority over-sampling technique (SMOTE), and the adaptive synthetic (ADASYN) sampling approach.
• Random over-sampling is a straightforward strategy that generates new minority samples by randomly sampling, with replacement, from the existing minority samples.
• SMOTE is an advanced over-sampling strategy. It was developed in 2002 [5]. The main idea of this approach is to over-sample the minority class by creating new “synthetic” samples instead of simply over-sampling at random with replacement. Experimental results show that this method performs better than random oversampling [13, 31].
• ADASYN was proposed after SMOTE. In ADASYN, a weighted distribution is generated according to how difficult the different minority samples are to learn. The details of ADASYN can be found in [16].
Figure 2 shows the visualization of employees in our dataset before and after each balancing strategy. The red dots represent voluntary employees, and green dots denote active employees.

4.8 Evaluation Metrics
Evaluation metrics need to be well designed when we compare the performance of different models. In this study, four evaluation metrics are employed: ROC AUC, accuracy, recall, and precision. After the model is fitted on the training dataset, it is applied to the testing data, and the four evaluation metrics are measured.

4.9 Google Cloud NLP API: Entity Sentiment Analysis
The Google Cloud Natural Language API provides different advanced text mining services. It enables us to complete various tasks such as sentiment analysis, entity analysis, and entity sentiment analysis. Entity-based sentiment analysis is employed in this study. The entities mentioned in each employee’s comments are detected first. Then a sentiment analysis model predicts the sentiment expressed about each entity in the employees’ comments. For example, suppose an employee wrote the comment “IT system is not very useful for documents sharing.” With entity sentiment analysis, the IT system is first detected as the entity of this answer. Then a negative score is given since the comment expresses the employee’s negative attitude towards the IT system in the organization.

5 RESULTS AND DISCUSSION
In this section, we show our results from the analysis stage and discuss the insights from our findings. We test five models, as mentioned in the last section: logistic regression (LR), KNN, decision tree (CART), random forests (RF), and gradient boosting tree (GB). The models’ performance under stratified cross-validation is displayed in Table 3. From the results, we can see that gradient boosting and random forest have the best performance with random oversampling. When we rank the features, we find that the most critical factors that affect attrition are Actual vs. Target pay, Tenure, and Age, as Figure 3 shows. Logistic regression also could
produce the feature ranking. From the odds ratios, we can understand the most critical factors that cause employees’ voluntary departure. These factors contribute to the profile of employees who have a high risk of voluntary departure. Figure 4 shows the odds ratios greater than 1, which indicate a positive relationship between each feature and voluntary departure. Performance rating is the most critical factor that affects turnover. The experimental results also show that higher merit is associated with a higher risk of voluntary departure, which alerts the company about payroll management. We also fitted segment models by region since the company has branches all over the world. Not surprisingly, different locations have different factors that contribute to voluntary departure, as Figure 5 shows. As the essential factors show, there are exceptional cases in North America and R&T, where top management employees are more likely to depart voluntarily than in other regions. This calls the company’s attention to making managerial strategies and taking actions to keep the top managers who also have high performance.

| Balancing Strategy | Model | AUC | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| Random Over-sampling | Logistic Regression | 0.60 | 0.58 | 0.24 | 0.62 |
| Random Over-sampling | KNN | 0.53 | 0.43 | 0.18 | 0.61 |
| Random Over-sampling | Decision Tree | 0.54 | 0.21 | 0.24 | 0.77 |
| Random Over-sampling | Random Forest | 0.62 | 0.31 | 0.43 | 0.83 |
| Random Over-sampling | Gradient Boosting | 0.68 | 0.63 | 0.28 | 0.73 |
| SMOTE | Logistic Regression | 0.58 | 0.54 | 0.26 | 0.61 |
| SMOTE | KNN | 0.55 | 0.48 | 0.22 | 0.64 |
| SMOTE | Decision Tree | 0.61 | 0.38 | 0.24 | 0.75 |
| SMOTE | Random Forest | 0.56 | 0.18 | 0.21 | 0.60 |
| SMOTE | Gradient Boosting | 0.61 | 0.32 | 0.25 | 0.69 |
| ADASYN | Logistic Regression | 0.62 | 0.56 | 0.23 | 0.66 |
| ADASYN | KNN | 0.55 | 0.51 | 0.18 | 0.60 |
| ADASYN | Decision Tree | 0.59 | 0.33 | 0.25 | 0.76 |
| ADASYN | Random Forest | 0.60 | 0.25 | 0.38 | 0.83 |
| ADASYN | Gradient Boosting | 0.61 | 0.31 | 0.25 | 0.81 |

Table 3: Model Performance with Different Evaluation Metrics

[Figure 3: Factor Importance From Random Forest. Relative importance (y-axis) of the variables Actual vs Target, Tenure, Age, Before 2005, Final Rating, Prorated, Single, Married, Education Level, Merit, Tooth and Tail, Management Level, and Rehire.]

[Figure 4: Overall Most Important Factors Affect Turnover. Odds ratio (y-axis) of the variables FBE, BE, Merit, Transfer, Junior Management, Demotion, Promotion, and Senior Management.]

[Figure 5: Most Important Factors Affect Turnover by Region. Panels: a. South America, b. Asia, c. Corporate, d. Europe, e. North America, f. R&T.]

We mainly answered four questions via descriptive and predictive analysis. The first question is, “What suggestions could we make for monitoring high-risk employees?” We make several suggestions that could support the company in making managerial decisions. Firstly, the outputs of logistic regression show that employees who have been transferred or promoted are more likely to leave, especially those with lower merit payouts. Secondly, there is a critical period for departure, at around three to six years of tenure. Organizations should pay more attention to employee care for employees with three to six years of tenure. Thirdly, organizations should make sure there is no mismatch between job bands (position and personal). Fourthly, we found that employees who leave voluntarily are consistently performing higher (constant merit increases) or lower (constant merit decreases). Finally, we suggest companies ensure that there are no discrepancies among incentives: employees with higher merit scores but lower actual vs. target payouts are the most likely to depart voluntarily, as Figure 6 shows. The average merit for active employees is 0.034, which is lower than the average merit of 0.038 for voluntary employees. However, actual vs. target pay has the opposite trend. Voluntary employees have lower actual vs. target pay compared with active employees. The average actual vs. target pay for active and voluntary employees is 1.16 and 0.92, respectively.
The second question we answered is, “Does the company retain high-performers?” We mainly use descriptive analysis to solve it. In
1.0
Active
1.0 Voluntary
0.8

0.8

Odds Ratio
0.6
0.6

0.4
0.4

0.2 0.2

0.0
Merit Actual vs Target 0.0

Growth

Pride
Positive Leadership

Autonomy

Competence
Figure 6: Average Merit and Actual vs Target Pay by Termi-
nation Variables

general, the answer is “Yes”. High performers have a lower voluntary attrition rate than average and low performers in every region except Europe, where top performers have a higher voluntary attrition rate than average performers. With the descriptive analysis, we found that Europe has the highest voluntary attrition ratio for high performers (24.24%) among all regions, and R&T has the lowest (7.14%). The ratios for the other regions are Asia 23.17%, Corporate 21.18%, North America 17.9%, and South America 9.38%. We also profiled the high performers who have a high risk of voluntary departure: high performers who leave voluntarily tend to be younger, have higher actual vs. target pay and higher merit, have 2-3 years of tenure, and have been promoted or transferred. We also defined three categories of employees by performance rating: employees rated FBE or BE are defined as low performers, employees rated ME are average performers, and employees rated EE or FEE are high performers. Table 4 compares the voluntary ratio among the three performer categories across regions.

Region          Low performers   Average performers   High performers
Overall         41.98%           21.25%               18.29%
Asia            52%              23.43%               23.17%
Corporate       40%              25.34%               21.18%
Europe          37.50%           19.84%               23.23%
North America   29.28%           21.58%               17.90%
R&T             22.22%           14.46%               7.14%
South America   50%              18.57%               9.38%
Table 4: Voluntary Ratio of Different Regions by Performers Type

The third question is, “Does previous year merit and payout drive higher performance for the following year?” With the descriptive analysis, we found that, in general, high merit drives high performance, but high actual vs. target payout does not. In Asia, increased merit and actual vs. target payout both lead to higher performance. In R&T and South America, increased merit and actual vs. target payouts both lead to decreased performance in the following year.

The fourth question is, “How to analyze engagement survey, including structured and unstructured data?” The engagement survey contains two types of data: structured data are the scores of each survey question answered by employees, and unstructured data are collected from employees' comments. All the comments answer two questions: “Things that excited you working in the company?” and “Thing that doesn't excite you in the company?” For the structured data, we separated the survey questions into five categories: Growth, Positive Leadership, Autonomy, Competence, and Pride. We linked the survey results to the termination type to find whether engagement is related to termination. The odds ratios from the logistic regression are shown in Figure 7. We found that higher engagement scores in growth, positive leadership, and autonomy are associated with higher voluntary attrition, while higher competence and pride scores are associated with lower voluntary attrition.

[Bar chart; y-axis: Odds Ratio; x-axis: Growth, Pride, Positive Leadership, Autonomy, Competence]
Figure 7: Odds Ratio with Engagement Survey Variables

For the unstructured data, we performed entity sentiment analysis using the Google Cloud NLP API. Figure 8 shows the top 15 entities that most excited employees in the company, found using text mining. Employees had the most positive attitudes towards development, growth, and opportunity; environment and atmosphere; and teamwork and group relationships.

Figure 8: Entity Sentiment Analysis for Most Excite Things

Figure 9 shows the top 15 entities that least excited employees. With these analyses, we found that employees had
Molecular Biosystems 5, 12 (2009), 1593–1605.
[11] Google. [n. d.]. Cloud Natural Language | Cloud Natural Language API | Google
Cloud. https://cloud.google.com/natural-language/. ([n. d.]).
[12] J Hamidah, H AbdulRazak, and AO Zulaiha. 2011. Towards applying data mining
techniques for talent managements. In 2009 International Conference on Computer
Engineering and Applications, IPCSIT, Vol. 2.
[13] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. 2005. Borderline-SMOTE: a
new over-sampling method in imbalanced data sets learning. In International
conference on intelligent computing. Springer, 878–887.
[14] Meng Han, Dongjing Miao, Jinbao Wang, and Liyuan Liu. 2018. Defend the
clique-based attack for data privacy. In International Conference on Combinatorial
Optimization and Applications. Springer, 262–280.
[15] Meng Han, Dongjing Miao, Jinbao Wang, and Liyuan Liu. 2020. A balm: defend
the clique-based attack from a fundamental aspect. Journal of Combinatorial
Optimization (2020), 1–22.
[16] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. 2008. ADASYN: Adaptive
synthetic sampling approach for imbalanced learning. In 2008 IEEE International
Joint Conference on Neural Networks (IEEE World Congress on Computational
Intelligence). IEEE, 1322–1328.
[17] Marjorie Laura Kane-Sellers. 2007. Predictive models of employee voluntary
turnover in a North American professional sales force using data-mining anal-
ysis. Texas A&M University.
the most negative attitudes towards leadership and managers; training, development, and opportunity; and work-life balance.

Figure 9: Entity Sentiment Analysis for Most Not Excite Things

6 CONCLUSION
In this study, we performed HR analytics on a real-world dataset. We proposed our analysis framework as a three-stage analysis approach. We combined descriptive analysis with machine-learning-based models to find the insights behind the data. We not only quantitatively analyzed the dataset but also gave a case study that could guide HR professionals in investigating data by following the pipeline. Our findings show that HR analytics has many benefits for organizations throughout the decision-making process and confirm that it plays a central role in linking HR strategy to business outcomes.
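The odds ratios used in the results above to rank features can be illustrated with a minimal sketch. It assumes a single binary feature (a hypothetical "promoted or transferred" flag) and invented counts; the paper does not state how its models were fitted, so the Newton-Raphson fit below is only one possible approach. For one binary feature, the exponentiated coefficient equals the odds ratio of the 2x2 table, which the last lines check.

```python
import math

def fit_logistic_1d(x, y, iters=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate the gradient (g) and the X^T W X matrix (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += (X^T W X)^{-1} g, using the 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical counts (not from the paper's dataset):
# flag=1: 30 left voluntarily, 70 stayed; flag=0: 20 left voluntarily, 80 stayed.
x = [1] * 30 + [1] * 70 + [0] * 20 + [0] * 80
y = [1] * 30 + [0] * 70 + [1] * 20 + [0] * 80

b0, b1 = fit_logistic_1d(x, y)
odds_ratio = math.exp(b1)                 # odds ratio for the flagged group
table_odds_ratio = (30 * 80) / (70 * 20)  # same quantity from the 2x2 table

print(f"odds ratio from logistic regression: {odds_ratio:.4f}")
print(f"odds ratio from 2x2 table:           {table_odds_ratio:.4f}")
```

An odds ratio above 1, as in this toy example, means the flagged group has higher odds of voluntary departure. With several features, a library implementation of logistic regression would normally replace the hand-written fit.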
REFERENCES [26] Steffen Maier. 2016. How Google Uses People Analytics to Create a Great Work-
[1] David Angrave, Andy Charlwood, Ian Kirkpatrick, Mark Lawrence, and Mark place. https://www.entrepreneur.com/article/284550. (Nov 2016).
Stuart. 2016. HR and analytics: why HR is set to fail the big data challenge. [27] Bernard Marr. 2019. Why Data Is HR’s Most Important Asset. https://www.forbes.
Human Resource Management Journal 26, 1 (2016), 1–11. com/sites/bernardmarr/2018/04/13/why-data-is-hrs-most-important-asset/
[2] Laurie Bassi. 2011. Raging debates in HR analytics. People and Strategy 34, 2 #1f94dada6b0f. (N 2019).
(2011), 14. [28] Sujeet N Mishra, Dev Raghvendra Lama, and Yogesh Pal. 2016. Human Resource
[3] William E Berry. 1993. HRIS can improve performance, empower and motivate Predictive Analytics (HRPA) for HR management in organizations. International
âĂIJknowledge workersâĂİ. Employment Relations Today 20, 3 (1993), 297–303. Journal of Scientific & Technology Research 5, 5 (2016), 33–35.
[4] Karl J Cama, Norbert Herman, and Daniel T Lambert. 2015. Human Resource [29] Esmaieeli Sikaroudi, Amir Mohammad, Rouzbeh Ghousi, and Ali Sikaroudi. 2015.
Analytics Engine with Multiple Data Sources. (July 23 2015). US Patent App. A data mining approach to employee turnover prediction (case study: Arak
14/159,906. automotive parts manufacturing). Journal of Industrial and Systems Engineering
[5] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 8, 4 (2015), 106–121.
2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial [30] Sjoerd Van Den Heuvel and Tanya Bondarouk. 2016. The rise (and fall?) of HR
intelligence research 16 (2002), 321–357. analytics: The future application, value, structure, and system support. In Acad-
[6] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. emy of Management Proceedings, Vol. 2016. Academy of Management Briarcliff
In Proceedings of the 22nd acm sigkdd international conference on knowledge Manor, NY 10510, 10908.
discovery and data mining. ACM, 785–794. [31] Juanjuan Wang, Mantao Xu, Hui Wang, and Jiwu Zhang. 2006. Classification of
[7] Daniele Di Mitri, Maren Scheffel, Hendrik Drachsler, Dirk Börner, Stefaan Ternier, imbalanced data by using the SMOTE algorithm and locally linear embedding.
and Marcus Specht. 2017. Learning pulse: a machine learning approach for In 2006 8th international Conference on Signal Processing, Vol. 3. IEEE.
predicting performance in self-regulated learning using multimodal data. In [32] Qing Ren Wang and Ching Y Suen. 1984. Analysis and design of a decision tree
Proceedings of the seventh international learning analytics & knowledge conference. based on entropy reduction and its application to large character set recognition.
ACM, 188–197. IEEE Transactions on Pattern Analysis and Machine Intelligence 4 (1984), 406–417.
[8] Evanthia Faliagka, Kostas Ramantas, Athanasios Tsakalidis, and Giannis Tzimas. [33] Yue Zhao, Maciej K Hryniewicki, Francesca Cheng, Boyang Fu, and Xiaoyu Zhu.
2012. Application of machine learning algorithms to an online recruitment 2018. Employee turnover prediction with machine learning: A reliable approach.
system. In Proc. International Conference on Internet and Web Applications and In Proceedings of SAI intelligent systems conference. Springer, 737–758.
Services. Citeseer. [34] Yiyun Zhou, Meng Han, Liyuan Liu, Jing Selena He, and Yan Wang. 2018. Deep
[9] Xiang Gao, Junhao Wen, and Cheng Zhang. 2019. An Improved Random For- learning approach for cyberattack detection. In IEEE INFOCOM 2018-IEEE Con-
est Algorithm for Predicting Employee Turnover. Mathematical Problems in ference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE,
Engineering 2019 (2019). 262–267.
[10] Pierre Geurts, Alexandre Irrthum, and Louis Wehenkel. 2009. Supervised learn-
ing with decision tree-based methods in computational and systems biology.

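The breakdown reported in Table 4 can be sketched as a simple grouping step. The rating-to-category mapping (FBE and BE as low, ME as average, EE and FEE as high) follows the text. The records, the `voluntary_ratio` helper, and the definition of the ratio as voluntary leavers divided by all employees in the group are assumptions made for illustration.

```python
from collections import defaultdict

# Mapping from performance rating to performer category, as described in the text.
RATING_TO_CATEGORY = {
    "FBE": "low", "BE": "low",
    "ME": "average",
    "EE": "high", "FEE": "high",
}

def voluntary_ratio(records):
    """Return {(region, category): share of employees who left voluntarily}.

    Assumes ratio = voluntary leavers / all employees in the group.
    """
    left = defaultdict(int)
    total = defaultdict(int)
    for region, rating, status in records:
        key = (region, RATING_TO_CATEGORY[rating])
        total[key] += 1
        if status == "voluntary":
            left[key] += 1
    return {key: left[key] / total[key] for key in total}

# Hypothetical records: (region, performance rating, termination status).
records = [
    ("Europe", "EE", "voluntary"),
    ("Europe", "FEE", "active"),
    ("Europe", "ME", "active"),
    ("Europe", "ME", "active"),
    ("Asia", "BE", "voluntary"),
    ("Asia", "FBE", "active"),
    ("Asia", "EE", "active"),
]

ratios = voluntary_ratio(records)
for (region, category), ratio in sorted(ratios.items()):
    print(f"{region:8s} {category:8s} {ratio:.2%}")
```

Comparing the high-performer ratio with the average-performer ratio within each region is the check that flags Europe as the exception in the discussion of the second question.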