You are on page 1of 123

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/376682505

HR Analytics: Association with Big Data and Machine Learning Algorithms. An


Empirical Study

Thesis · December 2023


DOI: 10.13140/RG.2.2.35545.49768

CITATIONS READS

0 327

1 author:

Mohammad Ataur Rahman


Hochschule Bremen
1 PUBLICATION 0 CITATIONS

SEE PROFILE

All content following this page was uploaded by Mohammad Ataur Rahman on 20 December 2023.

The user has requested enhancement of the downloaded file.


"HR Analytics: Association with Big Data and Machine Learning Algorithms. An
Empirical Study"

MASTER’S THESIS
Submitted to the
School of International Business (SiB)
of Bremen University of Applied Sciences
In Partial Fulfillment of the Requirements

For the Degree

Master of Business Administration (MBA)

Submitted by: Mohammad Ataur Rahman


Afrikanische Str. 94
13351 Berlin

Matriculation No.: 5020389

First Examiner: Prof. Dr. Vera de Hesselle

Second Examiner: Prof. Dr. Armin Varmaz

Due Date: 12.10.2023


I dedicate my thesis to my beloved mother.
“The impact of a mother on the lives of her children is immeasurable.” - James
E. Faust

i
Abstract
In the contemporary landscape of Human Resources (HR) management, the
convergence of data analytics, big data technologies, and machine learning
algorithms has given rise to HR analytics, a transformative approach that
empowers organizations to make data-driven decisions in talent acquisition,
retention, and workforce optimization. This thesis presents an empirical
investigation into the integration of HR analytics, big data, and machine learning
algorithms within the HR domain, with a specific focus on three pivotal HR tasks:
resume screening, employee turnover prediction, and sentiment analysis.

The study leverages a diverse dataset, including resumes extracted from


LinkedIn, employee turnover questionnaire responses, and sentiment analysis
feedback collected via LinkedIn and WhatsApp. Methodologically, the research
entails data preprocessing the application of advanced machine learning
algorithms, and performance metric evaluation in Python, a general-purpose
programming language.

The findings reveal that machine learning algorithms significantly enhance the
accuracy and efficiency of resume screening, leading to more precise candidate
selection. Moreover, predictive models effectively identify employees at risk of
turnover and sentiment analysis uncovers valuable insights into employee
satisfaction and engagement, enabling organizations to address areas of concern
and enhance overall workplace conditions.

As organizations increasingly recognize the importance of HR analytics in their


strategic endeavors, this study serves as a foundational resource for HR
professionals, managers, and organizational leaders seeking to leverage big data
and machine learning to optimize their HR practices and ultimately enhance
organizational performance.

ii
Acknowledgment
“Praise be to Allah who is the Most Gracious, the Most Merciful!”

I express my gratitude to the Almighty, who has endowed me with the strength
and ability to successfully complete my thesis. I am deeply thankful to Allah for
His countless blessings upon me.

I extend my heartfelt appreciation to my primary mentor, Prof. Dr. Vera de


Hesselle, and secondary supervisor, Prof. Dr. Armin Varmaz for their dedicated
guidance, unflagging support, and invaluable insights throughout the course of
this research. I am grateful to all my professors who imparted their knowledge
during this program, enriching my understanding of their respective subjects.

I wish to acknowledge the contributions of my former program coordinator, Ms.


Regine Hink, and present coordinator Ms. Astrid Decker, as well as all the
members of the IGC, for their support and guidance from the moment I embarked
on this master's journey.

Lastly, my profound and heartfelt gratitude goes out to my parents, family, and
friends for their enduring and unparalleled love, assistance, and encouragement.

iii
Table of Contents

Abstract ................................................................................................................... ii
Acknowledgment .................................................................................................... iii
List of Figures ...........................................................................................................ix
List of Pictures .......................................................................................................... x
List of Abbreviations .................................................................................................xi
Chapter 1: Introduction .......................................................................................... 13
1.1 Background and context of the research ............................................................ 13
1.2 Research problem statement ............................................................................. 13
1.3 Research objectives........................................................................................... 14
1.4 Research questions ........................................................................................... 15
1.5 Significance of the study .................................................................................... 16
1.6 Scope of the study ............................................................................................ 16
1.7 Limitations of the study ..................................................................................... 16
1.8 Organization of the thesis .................................................................................. 17
Chapter 2: Literature Review .................................................................................. 18
2.1 HR Analytics: Concept and Evolution .................................................................. 18
2.1.1 Definition of HR Analytics .......................................................................... 19
2.1.2 Role of HR Analytics in decision-making ..................................................... 20
2.1.3 Historical Overview of HR Analytics ............................................................ 21
2.1.3.1 Early Beginnings ............................................................................... 22
2.1.3.2 Technological Advancements ............................................................. 22
2.1.3.3 Strategic HR Analytics ....................................................................... 22
2.1.3.4 Employee Experience and Engagement .............................................. 22
2.1.3.5 Predictive Analytics and Machine Learning ......................................... 23
2.1.4 HR Analytics Tools and Technologies .......................................................... 23
2.1.5 Challenges and Limitations of HR Analytics ................................................. 23
2.2 Big Data in HR: Applications and Challenges ....................................................... 25
2.2.1 Definition of Big Data in the context of HR.................................................. 26
2.2.2 The Three V's of Big Data in Human Resources ........................................... 28
2.2.2.1 Volume ............................................................................................. 28
2.2.2.2 Velocity ............................................................................................ 28
2.2.2.3 Variety ............................................................................................. 28
2.2.3 Applications of Big Data in HR .................................................................... 29

iv
2.2.4 Challenges in Implementing Big Data in HR................................................. 31
2.2.4.1 Data Privacy and Security .................................................................. 31
2.2.4.2 Data Quality and Integration ............................................................. 31
2.2.4.3 Skills and Resources .......................................................................... 31
2.2.5 Future Trends and Implications .................................................................. 32
2.2.5.1 Artificial Intelligence (AI) in HR ........................................................... 32
2.2.5.2 Machine Learning (ML) in HR ............................................................. 32
2.2.5.3 Impact of Emerging Technologies ....................................................... 32
2.3 Machine Learning Algorithms in HR Analytics ..................................................... 33
2.3.1 Introduction to HR Analytics and Machine Learning .................................... 34
2.3.2 Overview of machine learning and its applications in HR ............................. 35
2.3.3 Data Collection and Preprocessing ............................................................. 35
2.3.4 Feature Selection and Engineering ............................................................. 36
2.3.5 Classification Algorithms for HR Analytics ................................................... 37
2.3.6 Clustering Algorithms for HR Analytics ....................................................... 37
2.3.7 Ethical and Privacy Considerations ............................................................ 37
2.3.8 Future Trends and Challenges .................................................................... 38
2.3.8.1 Emerging trends in machine learning for HR analytics ......................... 38
2.3.8.2 Integrating natural language processing and sentiment analysis ......... 38
2.3.8.3 AI-powered talent acquisition and candidate screening ....................... 39
2.4 Resume Screening: Trends and Techniques ......................................................... 39
2.4.1 Evolution of Resume Screening .................................................................. 39
2.4.2 Digital Platforms and Resume Collection .................................................... 41
2.4.3 The Role of PDF Processing in Resume Extraction ....................................... 41
2.4.4 Natural Language Processing (NLP) in Resume Screening ............................ 42
2.4.5 The Algorithmic Approach ......................................................................... 43
2.5 Predicting Employee Turnover: Past Studies and Frameworks .............................. 43
2.5.1 Understanding Employee Turnover ............................................................ 44
2.5.2 Data Collection in Turnover Prediction ....................................................... 44
2.5.3 Data Manipulation and Analysis with Pandas .............................................. 44
2.5.4 RandomForestClassifier in Turnover Prediction ........................................... 45
2.5.5 Model Evaluation Metrics .......................................................................... 45
2.5.6 The Role of Visualization in Turnover Prediction.......................................... 46
2.6 Sentiment Analysis in Employee Feedback.......................................................... 46
2.6.1 Significance of Employee Feedback ............................................................ 47

v
2.6.2 An Introduction to Sentiment Analysis ....................................................... 47
2.6.3 The Power of TextBlob in Sentiment Analysis .............................................. 48
2.6.4 Support Vector Machine in Text Analysis .................................................... 48
2.6.5 Importance of Vectorization in Text Data .................................................... 48
2.6.6 Evaluating Sentiment Models .................................................................... 49
Chapter 3: Methodology ........................................................................................ 50
3.1 Resume Screening ............................................................................................. 50
3.1.1 Data Collection ......................................................................................... 50
3.1.2 Data Collection Process ............................................................................. 50
3.1.3 Text Preprocessing .................................................................................... 50
3.1.4 Cosine Similarity Calculation ...................................................................... 51
3.1.6 Visualization ............................................................................................. 51
3.1.7 Code Implementation ............................................................................... 51
3.2 Predicting Employee Turnover ........................................................................... 51
3.2.1 Data Collection ......................................................................................... 51
3.2.2 Questionnaire Design ................................................................................ 51
3.2.3 Sample Size .............................................................................................. 54
3.2.4 Data Preprocessing ................................................................................... 54
3.2.4.1 Data Cleaning ................................................................................... 54
3.2.4.2 Feature Engineering .......................................................................... 54
3.2.5 Pilot Study ................................................................................................ 54
3.2.6 Machine Learning Model ........................................................................... 55
3.2.6.1 Model Selection ................................................................................ 55
3.2.6.2 Model Training ................................................................................. 55
3.2.6.3 Model Validation .............................................................................. 55
3.2.7 Visualization ............................................................................................. 55
3.3 Sentiment Analysis ............................................................................................ 55
3.3.1 Data Collection ......................................................................................... 55
3.3.2 Survey question ........................................................................................ 55
3.3.3 Pilot Survey (Question Validation) .............................................................. 56
3.3.4 Data Preprocessing ................................................................................... 56
3.3.4.1 Data Cleaning ................................................................................... 56
3.3.4.2 Feature Extraction............................................................................. 57
3.3.5 Sentiment Analysis Model ......................................................................... 57
3.3.5.1 Model Selection ................................................................................ 57

vi
3.3.5.2 Model Training ................................................................................. 57
3.3.5.3 Test Set ............................................................................................. 57
3.3.5.4 Performance Metrics ......................................................................... 57
3.3.6 Sentiment Prediction on New Data ............................................................ 57
3.3.6.1 New Data Source .............................................................................. 57
3.3.6.2 Data Preprocessing for New Data ...................................................... 57
3.3.6.3 Sentiment Prediction ......................................................................... 58
Chapter 4: Results and Discussion .......................................................................... 59
4.1 Resume Screening ............................................................................................. 59
4.1.1 Similarity Percentages for All Resumes ....................................................... 59
4.1.1.1 Interpretation of Results .................................................................... 60
4.1.1.2 Discussion......................................................................................... 60
4.1.2 Top 10 Resumes by Similarity Percentage ................................................... 61
4.1.2.1 Interpretation of Results .................................................................... 61
4.1.2.2 Discussion......................................................................................... 62
4.1.2.3 Enhancing Efficiency and Effectiveness ............................................... 62
4.1.2.4 Future Directions............................................................................... 62
4.1.3 Cosine Similarity Heatmap of Top 10 Resumes ............................................ 63
4.1.3.1 Interpretation of Results .................................................................... 63
4.1.3.2 Discussion......................................................................................... 64
4.1.3.3 Enhancing Recruitment Strategy ........................................................ 64
4.1.3.4 Future Directions............................................................................... 64
4.1.4 Overlapping Words of Top 10 resumes with the Job Requirement................ 65
4.1.4.1 Implications for Candidate Evaluation ................................................ 66
4.1.4.2 Discussion......................................................................................... 66
4.1.4.3 Future Directions............................................................................... 66
4.2 Employee Turnover Prediction ........................................................................... 67
4.2.1 Performance Metrics for Testing Data ........................................................ 67
4.2.1.1 Implications and Significance ............................................................. 68
4.2.1.2 Model Robustness and Generalization ................................................ 68
4.2.1.3 Ethical Considerations ....................................................................... 68
4.2.1.4 Future Research and Development ..................................................... 69
4.2.2 Demographic Analysis ............................................................................... 69
4.2.2.1 Age Distribution ................................................................................ 69
4.2.2.2 Gender Distribution ........................................................................... 70

vii
4.2.2.3 Department Distribution.................................................................... 70
4.2.3 Distribution of Predicted Labels ................................................................. 71
4.2.3.1 Discussion......................................................................................... 72
4.2.3.2 Implications and Significance ............................................................. 73
4.2.3.2 Alignment with Turnover Prevention .................................................. 73
4.2.3.4 Ethical Considerations ....................................................................... 73
4.2.3.5 Future Research and Development ..................................................... 73
4.2.4 Employee Turnover Prediction in Gender Distribution ................................. 73
4.2.4.1 Implications and Future Strategies ..................................................... 74
4.2.4.2 Discussion......................................................................................... 75
4.2.5 Top 10 Important Features for Employees at Risk ........................................ 75
4.3 Sentiment Analysis ............................................................................................ 77
4.3.1 Performance Metrics for Labeled Dataset................................................... 77
4.3.1.1 Interpretation of Results .................................................................... 78
4.3.1.2 Discussion......................................................................................... 79
4.3.1.3 Limitations and Future Directions....................................................... 80
4.3.2 Distribution of Predicted Sentiments.......................................................... 80
4.3.2.1 Implications for Organizational Strategy ............................................ 81
4.3.3 Distribution of Sentiment Polarity and Subjectivity ..................................... 82
4.3.3.1 Sentiment Polarity............................................................................. 83
4.3.3.2 Sentiment Subjectivity ....................................................................... 83
4.3.3.3 Implications for Organizational Strategy ............................................ 84
4.3.4 Word Cloud Analysis ................................................................................. 84
4.3.4.1 Interpretation of Results .................................................................... 85
4.3.4.2 Implications for Organizational Strategy ............................................ 86
Chapter 5: Conclusion and Recommendation.......................................................... 87
5.1 Conclusion ........................................................................................................ 87
5.2 Recommendations for Practitioners ................................................................... 87
5.3 For Future Research .......................................................................................... 88
5.4 Final Thoughts .................................................................................................. 88
References ............................................................................................................. 90
Appendix ..............................................................................................................108
Declaration of Honor .............................................................................................122

viii
List of Figures

Figure 2-1: The Evolution of HR Technology


Figure 2-2: A Guide To The 4 Types of HR Analytics
Figure 2-3: HR Analytics and Predictive Decision-making model
Figure 2-4: 7 Common People Analytics Challenges
Figure 2-5: What Data Does an HR Analytics Tool Need?
Figure 2-6: Applications of Data Science in HR Analytics
Figure 2-7: The Relationship between AI, ML, and the Three Broad Types of ML
Figure 2-8: Data Preprocessing
Figure 2-9: Manually Review Resumes
Figure 2-10: How Does a Resume Parser Work?
Figure 4-1: Similarity Percentage for All Resumes
Figure 4-2: Similarity Percentage for Top 10 Resumes
Figure 4-3: Cosine Similarity Heatmap of Top 10 Resumes
Figure 4-4: Age distribution of the surveyed employees
Figure 4-5: Gender distribution of the surveyed employees
Figure 4-6: Department distribution of the surveyed employees
Figure 4-7: Distribution of Predicted Labels of Employees at Risk
Figure 4-8: Gender Distribution for Employees at Risk
Figure 4-9: Top 10 Important Features for Employees at Risk
Figure 4-10: Distribution of Predicted Sentiments
Figure 4-11: Distribution of Sentiment Polarity and Subjectivity

ix
List of Pictures

Picture 4-1: Overlapping Words of Top 10 resumes


Picture 4-2: Performance Metrics for Testing Data
Picture 4-3: Performance Metrics for Labeled Dataset
Picture 4-4: Word Cloud of Comments

x
List of Abbreviations

HR Human Resources

AI Artificial Intelligence

ML Machine Learning

SVM Support Vector Machine

NLP Natural Language Processing

NLTK Natural Language Toolkit

CSV Comma Separated Values

SQL Structured Query Language

ROC-AUC Receiver Operating Characteristic - Area Under the Curve

PDF Portable Document Format

JSS Job Satisfaction Survey

JDI Job Descriptive Index


JIG Job in General
PSQ Pay Satisfaction Questionnaire
CSI Career Satisfaction Inventory
TF-IDF Term Frequency - Inverse Document Frequency

xi
Chapter 1: Introduction
1.1 Background and context of the research
The field of Human Resources (HR) has undergone a significant transformation
in recent years, driven by the rapid advancements in data analytics, big data
technologies, and machine learning algorithms. Traditional HR practices have
evolved into a data-driven discipline, commonly referred to as HR analytics. HR
analytics leverages data to make informed decisions about recruitment,
employee retention, performance evaluation, and overall workforce
management. It represents a paradigm shift in HR management, allowing
organizations to gain deeper insights into their human capital and make more
strategic and evidence-based decisions.

In the era of big data, organizations are inundated with vast amounts of
information, including resumes of job applicants, employee feedback,
performance metrics, and more. This influx of data presents both challenges and
opportunities for HR professionals. To harness the power of big data and make
meaningful predictions and decisions, HR departments are increasingly turning
to machine learning algorithms and advanced analytics techniques. This thesis
explores the intersection of HR, big data, and machine learning, aiming to
uncover insights and practical applications that can enhance HR practices.

1.2 Research problem statement


The modern HR landscape is characterized by a dynamic and diverse workforce
where talent acquisition, retention, and engagement have become paramount for
organizational success. Traditional HR methodologies, reliant on manual
processes and subjective decision-making, are increasingly falling short of
meeting the demands of this evolving environment. In this context, the adoption
of HR analytics, powered by big data and machine learning, offers a promising
avenue to revolutionize how HR professionals address critical challenges.

This research investigates the potential of machine learning algorithms to


revolutionize various aspects of human resources management. It delves into the
efficacy of utilizing these algorithms in resume screening, aiming to enhance
efficiency, accuracy, and reduce bias in candidate selection. Additionally, the
13
study seeks to leverage predictive models to anticipate employee turnover,
offering early insights that empower HR departments to implement retention
strategies effectively. Furthermore, it explores sentiment analysis applied to
employee feedback data as a means to discern workforce sentiments and
perceptions, ultimately uncovering actionable insights to enhance workplace
satisfaction, boost employee morale, and improve overall organizational
performance.

The overarching research problem, therefore, revolves around evaluating the


real-world impact and applicability of HR analytics solutions within the specific
contexts of resume screening, employee turnover prediction, and sentiment
analysis. By addressing this problem, this research aims to advance the
understanding of HR analytics' role in reshaping HR practices and decision-
making in contemporary organizations.

1.3 Research objectives


This study revolves around the application of machine learning and big data
techniques to HR analytics for a variety of purposes. Here are some objectives
for your thesis:

Resume Screening:

• To develop a machine learning model that can efficiently screen resumes


against specified job requirements.
• To assess the accuracy, efficiency, and practicality of automated resume
screening in comparison to traditional methods.

Predicting Employee Turnover:

• To design and test a predictive algorithm that can identify employees at


high risk of leaving the organization.
• To examine the factors influencing employee turnover based on the
collected survey data.

14
Sentiment Analysis:

• To gauge employees' sentiments regarding performance acknowledgment


in their respective companies.
• To apply sentiment analysis algorithms to categorize employee feedback
as positive, negative, or neutral.

1.4 Research questions


Hare research questions tailored to each of the study areas:

Resume Screening

RQ1: How does the integration of big data and machine learning
algorithms in HR analytics impact the resume screening process, and what
insights can be derived to enhance candidate selection practices?

RQ2: How do machine learning algorithms contribute to mitigating bias in


the resume screening process?

Predicting Employee Turnover

RQ3: How accurate are machine learning models in predicting employee


turnover based on the questionnaire responses, and what are the implications for
HR decision-making?

RQ4: What are the most influential features and factors that contribute to
the prediction of employees at risk of turnover, and how can this knowledge
inform HR strategies for retention?

Sentiment Analysis

RQ5: How do employees perceive the acknowledgment of their


performance within the company, and what are the prevailing sentiments
(positive, negative, or neutral) among the survey respondents?

RQ6: What are the most frequently mentioned words or phrases in


employee comments, and how do these insights inform HR strategies for
improving workplace satisfaction and morale?

15
1.5 Significance of the study
This study has far-reaching ramifications for both academics and practise. From
an academic standpoint, it adds to the developing discipline of HR analytics by
giving actual proof of the usefulness of big data and machine learning approaches
in HR procedures. It also contributes to the expanding body of knowledge on the
interface of HR, data science, and technology.

In practise, the study's results can advise HR professionals, managers, and


organisational leaders on the potential benefits of implementing HR analytics
systems. The findings of the study might help HR departments make data-driven
choices to improve recruitment, retention, and employee happiness.

1.6 Scope of the study


This research focuses on the specific domains of resume screening, employee
turnover prediction, and sentiment analysis within the realm of HR analytics. It
examines the application of big data and machine learning algorithms to enhance
the accuracy and efficiency of these HR processes. While these areas represent
critical facets of HR management, it's important to note that HR analytics
encompasses a wider spectrum of activities. The study aims to provide in-depth
insights and empirical evidence in these chosen domains, shedding light on the
practical implications of HR analytics in real-world scenarios.

1.7 Limitations of the study


Several limitations should be acknowledged in this study. The data sources
primarily consist of information collected from LinkedIn, WhatsApp platforms.
While these sources offer valuable insights, they may not fully represent the
diversity of the global workforce, and potential biases in the data should be
considered. Generalizability is another concern, as the findings are based on
specific datasets and samples. Ethical considerations surrounding data privacy
and consent may also affect data collection and analysis. The effectiveness of
machine learning models and data analytics techniques may vary depending on
the specific tools and technologies used. Additionally, the temporal factor needs
to be considered, as the data reflects a particular point in time and HR dynamics

16
can change. These limitations should be taken into account when interpreting and
applying the study's findings.

1.8 Organization of the thesis


The organization of this thesis is divided into four main chapters.

Chapter 2 provides a comprehensive literature review, examining the key


concepts related to HR analytics, big data, machine learning, and the specific HR
tasks under investigation.

Chapter 3 outlines the methodology used in this research, including data


collection, preprocessing, and analysis techniques.

Chapters 4 delve into the empirical study's three tasks: resume screening,
predicting employee turnover, and sentiment analysis, respectively, and offer a
detailed discussion of the results, their implications.

Chapter 5 concludes the thesis by providing recommendations and insights for


HR professionals.

17
Chapter 2: Literature Review
2.1 HR Analytics: Concept and Evolution
Human resources analytics, often known as HR analytics, is a rapidly growing
discipline that focuses on using data and analytics to make educated decisions
in the field of human resources management, according to Harris and Dulebohn
(2021). It comprises obtaining, assessing, and analysing various HR-related data
to get insights into workforce trends, employee performance, recruiting
strategies, and overall organisational effectiveness.

Human resources (HR) technology has evolved dramatically, from paper-based


and manual systems in the pre-1990s to today's use of talent analytics and big
data (Figure 2-1). Factors such as the rising complexity of HR roles, the
proliferation of data, and the need for improved HR efficiency have all contributed
to this process. HR technology progressed gradually, beginning with early HR
software programs in the 1990s, combining HR data with enterprise resource
planning (ERP) systems in the 2000s, and progressing further with human capital
management (HCM) systems in the 2010s. The present phase, in the 2020s, is
distinguished by the broad adoption of talent management suites, which provide
complete employee lifecycle management. Today's HR technology uses talent
analytics and big data to improve decision-making, discover performance
patterns, and forecast employee behaviour, resulting in more efficient and
informed HR procedures that contribute to organisational success.

Figure 2-1: The Evolution of HR Technology

Source: BEAJAMES. (2014)

18
HR analytics is the combination of HR data, statistical analysis, and sophisticated
analytics approaches to extract important insights and inform strategic HR
decision-making. To fully realise the promise of HR data and analytics, Searle et
al. (2020) emphasise the rising need for HR practitioners to strengthen their
analytical skills and engage with data scientists.

2.1.1 Definition of HR Analytics


HR analytics, according to Jac Fitz-enz (2014), uses statistical tools and
approaches to analyse HR data and improve employee performance. HR
analytics, according to Lawler (2019), is the use of data and analytical
methodologies to analyse, forecast, and enhance HR-related outcomes and
processes. It entails analysing human resource data in order to acquire insights
into workforce patterns, identify areas for improvement, and make data-driven
choices. HR Analytics, by leveraging the power of data, helps companies to
evaluate and analyze workforce trends, anticipate future results, and identify
areas for development in order to maximize human capital initiatives and achieve
overall organizational performance.

HR analytics is divided into four stages (Figure 2-2) of data analysis: descriptive,
diagnostic, predictive, and prescriptive analytics. The fundamental step of
descriptive analytics dives into prior data trends to acquire insights, relying on
statistical tools to summarise historical data without making future predictions.
Diagnostic analytics expands on this by attempting to explain these data patterns,
discovering causal links and variables behind trends, and applying techniques
such as data mining, regression analysis, and correlation analysis. The next
phase is predictive analytics, which projects future outcomes by finding patterns
and correlations in past and present data, assisting HR choices like as recruiting
and talent retention. Finally, prescriptive analytics elevates predictive insights to
the next level by providing focused suggestions and actions based on predictive
discoveries, employing techniques like as machine learning and artificial
intelligence to foresee situations and ideal interventions for improved decision-
making.

19
Figure 2-2: A Guide To The 4 Types of HR Analytics

Source: Boatman, A. (2023)

Furthermore, Davenport, Harris, and Shapiro (2010) define HR analytics as the


process of measuring, forecasting, and enhancing organisational performance
and employee outcomes via the use of quantitative and qualitative HR data. This
involves employing advanced statistical approaches, predictive modelling, and
data visualisation to assist HR decision-making and support strategic objectives.

2.1.2 Role of HR Analytics in decision-making


HR Analytics, according to Lengnick (2019), serves as a strategic decision-
making partner by leveraging data to support evidence-based HR practises,
resulting in improved workforce planning, talent acquisition, and retention. HR
Analytics is important in decision-making because it uses data-driven insights to
advise and steer strategic human resource efforts. HR practitioners may extract
important information from huge volumes of employee data, including as
performance measurements, engagement surveys, and demographic
information, by applying advanced analytical approaches such as predictive
modelling and machine learning.

The figure (Figure 2-3) depicts the important components of efficient decision-
making in HR analytics, emphasising the importance of predictive analytics. It
distinguishes three basic elements: data, which serves as the basis and includes
employee information like as performance reports and training history;

20
algorithms, which are used for data analysis and pattern identification; and the
judgements that are generated from algorithmic insights. This paradigm
highlights the broad range of predictive analytics applications in HR, such as
forecasting employee performance, attrition, and recruiting interventions, as well
as identifying training needs. Finally, the graphic emphasises how data-driven HR
decision-making connects with organisational goals, hence improving the quality
of HR-related choices and initiatives.

Figure 2-3: HR Analytics and Predictive Decision-making model

Source: Abdul, Q. (2019)

According to Boudreau and Cascio (2017), HR Analytics offers organisations with


insights into the causes of employee performance, allowing them to spot skill
shortages, develop focused training programmes, and more efficiently deploy
resources. According to Minbaeva (2020), data-driven HR decision-making
provided by HR Analytics helps organisations identify and reduce personnel risks,
optimise worker productivity, and integrate HR strategy with business goals.
According to Marr (2019), organisations may utilise HR Analytics to find patterns
and trends in employee behaviour, which can lead to more successful employee
engagement tactics, improved job satisfaction, and lower attrition rates.

2.1.3 Historical Overview of HR Analytics


The evolution of human resources (HR) practises and their integration with data
analytics is covered in the historical overview of HR analytics. Previously, HR was
largely concerned with administrative responsibilities; however, with the

21
advancement of technology and the understanding of the value of data, HR
analytics developed as a strong tool. It began with fundamental data like as
employee turnover and worker demographics and progressed to predictive
analytics and machine learning algorithms for talent acquisition, performance
management, and employee engagement. This historical trajectory depicts the
evolution of human resources from a reactive and administrative role to a
strategic partner who uses data-driven insights to make informed decisions and
drive organisational success.

2.1.3.1 Early Beginnings


HR Analytics has its origins in early uses of data analysis in the subject of human
resources. HR analytics, according to Edward E. Lawler III (2008), arose in the
late 1970s and early 1980s when HR practitioners began utilising data to analyse
HR outcomes and assess the impact of HR programmes.

2.1.3.2 Technological Advancements


HR Analytics gained traction as technological technology and data collecting
methods improved. Kyle Lagunas (2018) emphasises the importance of
technological improvements in enabling HR Analytics to progress beyond simple
reporting and towards predictive and prescriptive analytics.

2.1.3.3 Strategic HR Analytics


As organisations recognised the strategic benefits of HR Analytics, they
transitioned from a descriptive to a more strategic approach. According to Jac
Fitz-enz (2018), HR Analytics has grown into a critical tool for aligning HR
practises with organisational goals and achieving business performance.

2.1.3.4 Employee Experience and Engagement


HR Analytics was also useful in assessing employee satisfaction and
engagement. According to Nigel Guenole (2017), HR Analytics is crucial for
monitoring and enhancing employee engagement, which is critical for productivity
and retention.

22
2.1.3.5 Predictive Analytics and Machine Learning
HR Analytics has been further revolutionised by the incorporation of predictive
analytics and machine learning algorithms. Pasha Roberts (2017) show how
predictive models in HR Analytics may assist organisations in forecasting
employee behaviour and making educated decisions.

2.1.4 HR Analytics Tools and Technologies


According to Laumer et al. (2019), HR analytics solutions use sophisticated
analytics technologies such as machine learning and predictive modelling to help
HR managers detect patterns and trends for improving employee performance
and retention. These technologies analyse massive volumes of HR data, such as
employee demographics, performance indicators, and engagement surveys,
using powerful algorithms and machine learning approaches. According to Harel
et al. (2016), integrating HR analytics tools with talent management systems and
using data-driven insights helps organisations improve recruiting and selection
processes, allowing them to find the best individuals. Cascio (2018) underlines,
these technologies enable organisations to analyse the performance of HR
interventions and initiatives by facilitating the study and evaluation of HR Key
Performance Indicators (KPIs).

Furthermore, as Schramm and Wiesche (2018) explain, incorporating natural


language processing and sentiment analysis into HR analytics systems enables
organisations to glean insights from employee feedback and sentiment data,
leading to higher engagement and satisfaction. Finally, predictive analytics
approaches in human resources, as stated by Boudreau and Cascio (2017),
enable organisations to foresee future talent requirements, recognise flight risks,
and build proactive strategies for recruiting and keeping top personnel.

2.1.5 Challenges and Limitations of HR Analytics


Employee experience has recently acquired prominence (Figure 2-4), notably
during the global financial crisis, when remote work became the norm. As a result
of this transition, organisations are placing a larger focus on understanding
employee well-being through periodic surveys. As moving towards a mixed work
style, the necessity for trustworthy surveys and the translation of their findings

23
into actionable insights becomes critical. This technique not only promotes
employee mental health, but also enables People Analytics teams to link findings
with important business KPIs such as absenteeism and retention.

Concurrently, the pandemic has hastened the demand for reskilling and upskilling
in order to increase organisational value and productivity. However, many
businesses fail to keep complete skill data, a concern that will demand their
attention in the coming years. People analytics may help by maintaining up-to-
date skill inventories and enabling more effective staff development. Furthermore,
these divisions should adapt to contribute to the broader strategy of the
organisation, transitioning from HR-centric to business-focused approaches.
Data-driven insights should guide decision-making in this endeavour, with a focus
on seamless integration of People Analytics into the boardroom to deliver
improved business outcomes. Furthermore, knowing the skill set that leads to
high-performance teams and efficiently translating data into usable business
language are critical. This dual strategy has the potential to improve recruiting,
team development, and total company value. Finally, encouraging data-driven
decision-making and providing self-service access to information will enable HR
and business colleagues to make educated decisions.

Figure 2-4: 7 Common People Analytics Challenges

Source: Lucassen, J.-P. (2023)

24
Finally, the changing People Analytics environment in 2022 and beyond includes
a variety of problems and possibilities, ranging from employee experience and
skill development to business connectivity and data accessibility. Addressing
these concerns can help organisations make more informed, data-driven
decisions, improving overall performance and flexibility.

Despite its promise to revolutionise human resource management, HR analytics


confronts a number of obstacles and constraints. One of the major issues in HR
analytics, according to Kavanagh, Thite, and Johnson (2019), is the quality and
availability of data. Human resource data is frequently dispersed across different
systems, resulting in inconsistencies and inaccuracies that impede the extraction
of meaningful insights. According to Lawler III and Levenson (2019), interpreting
HR analytics necessitates domain expertise as well as a grasp of the
organisational environment, since HR analysts must have a thorough
understanding of HR procedures and practises in order to generate relevant
insights from data. To begin with, data quality remains a big challenge, since HR
data frequently lacks standardisation, completeness, and correctness.
Inadequate data can result in inaccurate analysis and wrong conclusions.
Second, when dealing with sensitive employee data, privacy problems and
ethical considerations arise, necessitating careful respect to data protection
standards. According to Fitz-enz (2016), reaching high prediction accuracy in HR
analytics is difficult owing to the complexity and subjectivity of human behaviour,
because elements other than HR data, such as individual motives and external
influences, can influence employee actions and decisions. Redman and
Holmström (2016) emphasise the difficulty that organisations implementing HR
analytics have due to a lack of historical data, which can restrict forecast accuracy
and make identifying long-term trends and patterns challenging.

2.2 Big Data in HR: Applications and Challenges


Big Data analytics, according to Laumer, Eckhardt, and Weitzel (2018), has the
potential to revolutionise HR practises by giving organisations with unparalleled
insights into their workforce, allowing data-driven decision-making and strategic
planning. According to Marler and Boudreau (2017), using Big Data in HR allows
organisations to analyse massive amounts of data from many sources in order to
uncover patterns, trends, and correlations relating to talent acquisition,

25
performance management, and employee engagement. According to Alavi,
Antons, and Ditschler (2018), by embracing Big Data, HR practitioners may
estimate future workforce demands, forecast turnover rates, and build proactive
talent acquisition and retention strategies, resulting in increased organisational
performance. According to Vaiman and Scullion (2019), Big Data analytics allows
HR departments to personalise employee experiences, customise training
programmes, and optimise organisational procedures, resulting in increased
employee happiness and productivity.

2.2.1 Definition of Big Data in the context of HR


According to Marler and Fisher (2013), Big Data in HR involves the processing
and analysis of massive volumes of data, such as employee profiles,
performance indicators, social media activity, and external market trends. It
entails the use of vast and complicated data sets to get insights about employee
behaviour, performance, and engagement, as stated by Rogers and Lindauer
(2016). It entails the collection and analysis of massive amounts of structured and
unstructured data, such as personnel profiles, recruiting data, social media
activity, and sentiment analysis, as stated by Marler and Boudreau (2017).

Data in the field of HR analytics may be classified (Figure 2-5) as coming from
internal or external sources. Internal data comes from an organization's human
resources department and includes measures such as employee tenure,
remuneration, training records, performance reviews, and more. The difficulty
stems from the possibility of data fragmentation, which might impair
dependability. Data scientists may help organise and consolidate this dispersed
information into useable buckets for analyses. External data, on the other hand,
requires coordination with other departments and provides a larger view.
Financial data for measures such as revenue per employee, organization-specific
data relevant to the organization's core offers, passive data from workers (e.g.,
social media activity and feedback surveys), and historical data reflecting global
events impacting employee behaviour are all included. These internal and
external data sources together give a solid platform for HR analysis and decision-
making.

26
Figure 2-5: What Data Does an HR Analytics Tool Need?

Source: Lalwani, P. (2023)

Bondarouk, Rul, and van der Heijden (2017) define it as the processing and
analysis of enormous volumes of data in order to derive important insights that
drive HR decision-making. It refers to the large quantity of data created by HR
procedures and systems, such as employee demographics, performance
metrics, training records, and feedback, according to Davenport (2014). It entails
the integration and analysis of massive and different data sets, such as employee
data, social media activity, and external market data, as defined by Bissola and
Imperatori (2018). According to Mellahi, Demirbag, and Riddle (2018), it entails
the processing and analysis of massive volumes of employee data in order to
derive relevant insights for strategic HR decision-making. According to Schramm
and Rocco (2017), it entails the collection and analysis of huge amounts of data,
such as employee records, performance data, and data from other sources.

27
2.2.2 The Three V's of Big Data in Human Resources
To grasp the core of Big Data in HR, consider the three essential aspects often
connected with Big Data, sometimes known as the three V's:

2.2.2.1 Volume
Chou (2016) defines big data in HR as "vast amounts of employee-related
information generated from various sources such as HR systems, employee
surveys, and social media platforms." According to Bondarouk et al. (2019), the
abundance of HR data offers organisations with both a burden and an opportunity.
HR practitioners may acquire deeper insights into workforce dynamics and make
data-driven choices by successfully analysing enormous amounts of employee
data. Furthermore, according to Parry and Tyson (2018), the abundance of HR
data enables organisations to do in-depth analysis and obtain insights into
employee demographics, performance trends, and talent management
strategies. This data can help HR practitioners make evidence-based decisions
and contribute to organisational success.

2.2.2.2 Velocity
The velocity of HR data, according to Martin (2019), relates to the quick rate at
which information is created, updated, and analysed in real-time. This helps HR
professionals to make fast choices based on the most recent information.
According to Al-Dhaafri et al. (2019), real-time analytics in human resources
enables organisations to proactively detect and manage issues such as
employee disengagement and high turnover rates. HR data velocity enables HR
practitioners to take prompt actions, increasing total worker productivity.
Furthermore, according to Jiang et al. (2019), the velocity of HR data helps
organisations to change from reactive to proactive HR practises. HR
professionals may use real-time data analytics to discover emerging trends,
forecast future workforce demands, and take early action to address difficulties
and optimise HR strategy.

2.2.2.3 Variety
According to Marler and Fisher (2013), HR data includes both organised and
unstructured data, allowing HR managers to analyse employee sentiment and
28
engagement beyond typical measures. According to Kumar and Singh (2018),
this diversified data includes unstructured data from sources like as social media
and employee feedback, providing important insights into employee feelings,
preferences, and levels of engagement. They also point out that sophisticated
analytics approaches may assist organisations in making sense of this data and
driving more successful HR policies. Furthermore, Kwon and Adler (2014)
emphasise that a diverse set of HR data sources gives a full perspective of the
workforce, covering both quantitative and qualitative characteristics.
Organisations may acquire a greater knowledge of employee experiences, views,
and behaviours by analysing various data sources, leading to more focused HR
interventions and improved employee outcomes.

2.2.3 Applications of Big Data in HR


Big data has transformed several businesses, including human resources (HR).
Its uses in human resources have proved helpful, allowing organisations to make
data-driven choices and optimise workforce management techniques. Big data
analytics in HR, according to Chen and Huang (2014), may give important
insights into employee behaviour, allowing organisations to make data-driven
choices in areas such as recruiting, talent management, and employee
engagement. According to Li, Liang, and Li (2017), the application of big data
analytics in HR has the potential to revolutionise the recruiting process by finding
the most qualified applicants using sophisticated data mining techniques, hence
lowering the time and expense associated with traditional approaches.

Analytics plays a critical role in tackling important difficulties and optimising many
elements of workforce management (Figure 2-6) in the field of human resources.
Employee retention is a vital area where analytics may assist firms detect attrition
patterns, employee traits associated with longer tenures, reasons for departures,
and even anticipate employee performance. Companies may apply methods to
improve retention rates, cut recruiting and training expenses, and improve overall
organisational performance by utilising data-driven insights. Furthermore,
analytics assists in the selection of optimal applicants for work openings by
providing tools to examine market data, necessary skills, recruiting strategies,
and anticipate candidate performance, expediting the hiring process and assuring
a more accurate match between job roles and prospects.

29
Figure 2-6: Applications of Data Science in HR Analytics

Source: Kumar, M. B. (2023)

Compensation management is another critical area where HR analytics has a


huge influence. Businesses may identify competitive and financially viable wage
structures that satisfy their workforce while successfully controlling expenses by
analysing market trends, employee preferences, and offer acceptance rates.
Furthermore, analytics has an impact on sales incentives, staff engagement, and
CEO remuneration, allowing organisations to link rewards with performance and
retain leadership. Finally, by addressing skill gaps and fostering employee
development, analytics enables businesses to identify training needs, efficiently
allocate resources, and improve the qualifications and skills of their workforce,
ultimately improving overall company performance and market competitiveness.
Effective HR analytics is now required for talent acquisition, retention, and
organisational success in today's corporate market.

Big data in HR helps with the recruiting process by analysing massive volumes
of candidate data, discovering patterns, and forecasting successful hiring. It
enables people management by analysing employee performance, engagement,
and satisfaction, allowing organisations to establish personalised growth
programmes and increase retention rates. According to Kaushik, Chahal, and
Bansal (2019), HR managers may obtain insights into employee sentiment,
30
identify reasons impacting employee engagement, and proactively address
issues to enhance overall organisational performance and productivity by
employing big data analytics. According to Shukla, Kumar, and Gopal (2019),
using big data analytics in talent management allows organisations to identify
high-potential individuals, create customised development programmes, and
increase talent retention through focused interventions.

2.2.4 Challenges in Implementing Big Data in HR


2.2.4.1 Data Privacy and Security
According to Marler and Boudreau (2017), the use of Big Data analytics in human
resources raises issues about data privacy and security since organisations
manage sensitive employee information. According to Bondarouk, Parry, and
Furtmueller (2017), human resources departments confront difficulties in
reconciling the potential benefits of Big Data analytics with the ethical issues of
managing personal employee information.

2.2.4.2 Data Quality and Integration


According to Zheng, Yang, and Wang (2019), assuring data quality and
integration is critical to the effective deployment of Big Data in HR. Data
integration issues emerge owing to the variety of HR data sources, including
multiple formats, systems, and databases, as highlighted by Raghuram, Garud,
and Wiesenfeld (2019). To solve these issues, deploying data integration
solutions like as data warehouses or data lakes can aid in the consolidation of
HR data, allowing for complete analysis and insights for HR decision-making.
Furthermore, Rynes, Giluk, and Brown (2007) emphasise the necessity of data
validation, standardisation, and cleansing processes to improve the dependability
of HR analytics and enable informed decision-making based on trustworthy data.

2.2.4.3 Skills and Resources


According to Hekler et al. (2016), when it comes to applying Big Data analytics,
HR departments frequently encounter a skills gap, which impedes the efficient
use of HR data owing to a lack of analytical skills and resources. According to
Pardo del Val et al. (2018), organisations must invest in HR analytics skills by
providing resources, infrastructure, and training programmes to enable HR
31
professionals to successfully use big data. Kehoe and Wright (2013) note the
difficulty of hiring and maintaining individuals with sophisticated data analytics
abilities in HR departments, recommending that organisations engage in creating
a data-driven culture by offering training programmes, mentorship, and career
development opportunities.

2.2.5 Future Trends and Implications


According to Carlson and Kavanagh (2020), using big data in HR helps
organisations to estimate employee attrition, detect flight risks, and design
proactive retention measures. According to Buuren and Steijn (2017), big data
analytics gives HR managers with insights into employee engagement levels,
assisting organisations in improving employee happiness and productivity.

2.2.5.1 Artificial Intelligence (AI) in HR


AI technology, such as natural language processing and chatbots, can automate
mundane HR processes, improve employee self-service, and streamline HR
operations, according to Parry and McCarthy (2017). Furthermore, according to
Parry and McCarthy (2017), AI-enabled recruiting solutions use algorithms and
predictive analytics to improve applicant sourcing, screening, and matching,
hence increasing the efficiency and efficacy of the hiring process.

2.2.5.2 Machine Learning (ML) in HR


Machine learning algorithms, according to Rosenbaum and Wong (2021), may
analyse historical employee data to discover variables leading to employee
turnover, allowing HR practitioners to build tailored retention strategies.
According to Raghuram and Arvey (2019), machine learning algorithms can
analyse employee performance data to uncover trends and insights, allowing for
the development of personalised learning and development interventions.

2.2.5.3 Impact of Emerging Technologies


According to Bondarouk and Ruel (2019), integrating new technologies like
Internet of Things (IoT) devices and wearable technology can give real-time data
on employee well-being, safety, and productivity, hence improving workplace
experiences and performance. Emerging technologies such as robotic process

32
automation, as stated by Agarwal and Marler (2020), can automate repetitive HR
duties, allowing HR practitioners to focus on more strategic projects and value-
added activities.

2.3 Machine Learning Algorithms in HR Analytics


Reinforcement Learning is an algorithmic approach (Figure 2-7) that enables
computers to learn through trial and error, similar to how IBM's Deep Blue
computer learnt to play chess and defeated the human world champion. It entails
rewarding good judgements while penalising failed ones in order to improve the
algorithm's intelligence. Supervised Learning, on the other hand, focuses on
predicting certain outcomes based on input features, and is often used in HR for
tasks such as employee retention and pay prediction. Unsupervised Learning, on
the other hand, investigates data to detect patterns and interactions between
variables, and is frequently used in HR for grouping employee segments or
discovering linkages, such as injury trends at certain work places.

Figure 2-7: The Relationship between AI, ML, and the Three Broad Types of ML

Source: HR & PEOPLE ANALYTICS. (2020)

Machine learning algorithms, according to Boudreau and Cascio (2018), have


considerable promise for HR analytics, allowing organisations to make better
informed decisions regarding talent management and workforce optimisation.
According to Yang and Cao (2019), the use of machine learning algorithms in
human resource analytics may find hidden trends in employee data, allowing
organisations to forecast attrition, identify high-performing personnel, and
improve recruiting and selection processes. Machine learning algorithms, such
as decision trees and neural networks, have proved their potential to predict

33
employee performance and give significant insights for strategic personnel
planning, according to Liao and Wang (2020). Furthermore, Akkermans,
Richardson, and Kraimer (2021) argue that the use of machine learning
algorithms in HR analytics has the potential to transform traditional HR practises
by allowing organisations to use data-driven insights for talent acquisition,
employee development, and retention strategies.

2.3.1 Introduction to HR Analytics and Machine Learning


HR Analytics, powered by machine learning techniques, is revolutionising
traditional HR practises, according to Marr and Grey (2020), by enabling
organisations to identify hidden patterns and insights within their workforce data,
resulting in more effective decision-making and strategic workforce planning.
Machine learning algorithms used to HR Analytics, according to Bersin and
Grierson (2018), may assist organisations discover key drivers of employee
engagement and anticipate turnover risk, enabling proactive intervention
methods and promoting a more engaged and productive workforce. According to
Davenport (2018), HR Analytics enables organisations to optimise talent
acquisition processes by predicting candidate success, identifying the best
sourcing channels, and reducing time-to-hire, resulting in improved hiring quality
and cost savings. Machine learning algorithms applied to HR data, according to
Mendenhall and Villanova (2019), can help organisations create personalised
employee development plans by identifying individual skill gaps, recommending
relevant training programmes, and fostering continuous learning and professional
growth. Furthermore, Singh and Rao (2020) emphasise that HR Analytics
powered by machine learning can enable organisations to proactively identify and
mitigate biases in talent management processes such as recruitment,
performance evaluations, and promotions, resulting in more equitable and
inclusive workplaces. Finally, according to Wheeler (2020), HR Analytics can
predict future workforce demand by leveraging machine learning algorithms,
assisting organisations in aligning their talent acquisition and workforce planning
strategies with evolving business needs, ensuring the right people are in the right
roles at the right time.

34
2.3.2 Overview of machine learning and its applications in HR
The application of machine learning algorithms in talent acquisition, according to
Parry and Tyson (2018), helps HR practitioners to overcome the limits of
traditional resume screening methods and discover the best suited applicants for
specific job openings. Machine learning algorithms, according to Bapna, Gupta,
and Mariadoss (2019), can analyse employee sentiment using sentiment analysis
and natural language processing, offering significant insights for boosting
employee engagement and well-being. Machine learning algorithms, as
highlighted by Aguinis and Lawal (2018), enable HR managers to harness data-
driven insights for better performance management by recognising patterns and
trends in employee performance measures. Machine learning approaches, such
as clustering and classification algorithms, according to Mone and London
(2019), can help HR managers with talent segmentation, enabling personalised
development plans and succession planning. Furthermore, according to Kuang,
Li, and Zhou (2020), machine learning algorithms improve the accuracy of labour
demand predictions, allowing HR departments to make educated choices about
recruiting, training, and resource allocation.

2.3.3 Data Collection and Preprocessing


According to Hui, Hui, and He (2021), correct data collection and pre-processing
are critical to the performance of machine learning algorithms in HR analytics
because they influence the quality and reliability of insights gained from employee
data. When using machine learning algorithms, data gathering and preparation
are critical phases in HR analytics. Human resource analytics is concerned with
collecting relevant insights from employee data in order to make educated
decisions and improve organisational performance. According to Cheng and
Wang (2020), data gathering in HR analytics comes from a variety of sources,
including HRIS, social media platforms, internal databases, and employee
surveys, needing careful consideration of data protection and compliance rules.

Data preparation is an important stage in the data science pipeline that includes
many significant activities (Figure 2-8). It starts with data profiling in which data
scientists evaluate the quality and properties of the data before developing
hypotheses for analytics or machine learning activities. Data cleaning then

35
tackles quality concerns by removing faulty data and filling in missing values.
Redundant data is then deleted using data reduction techniques, allowing it to be
used for specified purposes. Data transformation is the process of organising
data for the desired purpose, whereas data enrichment is the use of feature
engineering libraries. Data validation divides the data into training and testing
sets in order to evaluate model correctness and make improvements as needed.
Finally, good preprocessing lays the path for the work to be scaled for production
or further refined by data engineers.

Figure 2-8: Data Preprocessing

Source: Lawton, G. (2023)

Furthermore, Konstantinidis, Staggers, and Brombacher (2020) emphasise the


necessity of data anonymization, informed permission, and compliance with data
protection legislation such as GDPR throughout the data collecting and
preparation stages to ensure ethical considerations.

2.3.4 Feature Selection and Engineering


The combination of demographic data, job-related factors, and performance
indicators, according to Yan and Wu (2020), might give useful insights for HR
analytics models in forecasting employee engagement. Feature engineering
strategies such as time-based aggregations and lag variables, as outlined by Ye
and Gong (2019), can capture temporal relationships in HR data and increase
the accuracy of predictive models for workforce planning.

36
2.3.5 Classification Algorithms for HR Analytics
Classification algorithms are important in HR analytics because they help with the
effective evaluation and prediction of numerous human resource-related
elements. According to Sibanda and Xia (2019), the use of categorization
algorithms in HR analytics enables organisations to properly estimate staff
turnover while also identifying prospective talent, resulting in proactive retention
tactics. According to Xing and Wang (2018), machine learning classification
algorithms such as support vector machines and random forests have shown
efficient in automating resume screening procedures, saving HR departments
time and resources.

2.3.6 Clustering Algorithms for HR Analytics


According to Singh and Sahni (2017), using clustering algorithms in HR analytics
assists organisations in identifying talent clusters, allowing for focused talent
acquisition and retention strategies. Venkataramanan and Alhazmi (2019)
agreed, pointing out that clustering algorithms are commonly used in HR analytics
to categorise individuals based on their abilities, competences, and performance
indicators. This segmentation allows organisations to establish tailored training
and development programmes that promote employee growth and progress.
According to Pachori and Dwivedi (2018), by using clustering algorithms in HR
data, organisations may discover staff groups with comparable levels of
engagement. This understanding permits the development of methods to improve
overall employee motivation and job satisfaction. Furthermore, as Das and Pal
(2021) note out, clustering algorithms play an important role in talent acquisition
by grouping potential applicants based on their credentials and talents,
simplifying efficient recruiting procedures and minimising time-to-hire.

2.3.7 Ethical and Privacy Considerations


The application of machine learning algorithms in HR analytics, according to Li,
Liu, and Zhao (2020), creates privacy concerns due to the access necessary to
sensitive employee data. To preserve individual privacy rights, organisations
must strictly comply to data protection legislation and set clear policies governing
data access, storage, and retention. Bias in HR analytics algorithms, according
to Feldman and Friedler (2019), can perpetuate discrimination and unfair

37
practises. Finally, Buolamwini and Gebru (2018) stress the importance of
diversifying data sources in HR analytics to mitigate biases, promote equal
opportunity, and reduce discriminatory outcomes by avoiding over-reliance on
historical data and conducting regular evaluations of algorithmic performance and
fairness with input from diverse stakeholders.

2.3.8 Future Trends and Challenges


Machine learning algorithms are poised to play a critical role in determining the
future of personnel management and workforce optimisation in the field of HR
analytics. The use of machine learning algorithms in HR analytics, according to
Raghavendra and Venugopal (2018), is projected to revolutionise talent
management by allowing organisations to forecast employee performance,
detect skill shortages, and optimise workforce planning. One of the primary
problems in HR analytics, according to Marler and Boudreau (2017), is assuring
the ethical use of employee data while taking into account concerns of privacy,
permission, and transparency in the deployment of machine learning algorithms.

2.3.8.1 Emerging trends in machine learning for HR analytics


Machine learning algorithms, especially ensemble techniques and deep learning,
are emerging as useful tools for forecasting employee attrition and turnover,
enabling pre-emptive interventions and retention tactics, according to Kaur and
Jain (2021). Furthermore, as Su and Goh (2020) note, transfer learning, a
machine learning approach, offers potential in HR analytics since it allows models
learned on a single dataset to be reused for numerous HR tasks, enhancing
efficiency and lowering the need for huge labelled datasets.

2.3.8.2 Integrating natural language processing and sentiment


analysis
Natural language processing techniques, along with sentiment analysis, offer the
potential to extract important insights from employee feedback, surveys, and
social media data, according to Haidari and Smith (2018). Organisations may use
this to assess employee mood, engagement, and satisfaction. Furthermore,
according to Mani and Prasad (2019), the use of natural language processing
and sentiment analysis in departure interviews can discover patterns and themes

38
in employee feedback, assisting organisations in addressing underlying issues
and improving retention efforts.

2.3.8.3 AI-powered talent acquisition and candidate screening


AI-powered chatbots and virtual assistants, according to Piccoli et al. (2019),
have the potential to streamline the candidate screening process by engaging
with applicants, conducting preliminary assessments, and providing personalised
feedback, thereby improving the candidate experience and efficiency. According
to Huang et al. (2019), predictive analytics models based on machine learning
algorithms can analyse historical data on successful hires, employee
performance, and career progression to identify key attributes and characteristics
that contribute to long-term success within the organisation, thus improving
candidate selection accuracy.

2.4 Resume Screening: Trends and Techniques


According to Rivera (2012), the traditional manual process of resume screening
was susceptible to cognitive biases and limitations in processing large volumes
of information. In the era of digital transformation, LinkedIn has emerged as an
essential repository for professional data, fundamentally altering candidate
sourcing dynamics, as stated by Cappelli (2019). As explained by Davenport and
Patil (2012), advancements in Natural Language Processing (NLP) have paved
the way for intricate methods of text extraction and understanding in the realm of
resume screening. Algorithms, especially similarity measures such as the Cosine
Similarity, are becoming indispensable in aligning resumes with specific job
requirements, a methodology that, according to Chen et al. (2018), has
demonstrated promising outcomes in initial studies. Mayer-Schönberger & Cukier
(2013) further emphasize the rising trend of result visualization, facilitated by tools
like Matplotlib and Seaborn, underscoring the pivotal role of data-driven decision-
making in HR.

2.4.1 Evolution of Resume Screening


Based on the work of Davenport and Patil (2012), the emergence of the "data
scientist" job role points to a rising dependence on advanced software tools and
Natural Language Processing (NLP) capabilities to evaluate resumes. The

39
advent of digital platforms specifically tailored to HR tasks marked a significant
turning point. As mentioned by Cappelli (2019), the exponential growth of
accessible data called for more systematic approaches as platforms like LinkedIn
amassed vast repositories of professional profiles.

In the resume screening process (Figure 2-9), Step 1 involves collecting resumes
via email or job boards. In Step 2, a quick scan is performed to identify keywords
aligning with the open position, such as previous accounting experience for an
accounting manager role. Step 3 categorizes resumes into "No" (not meeting
criteria), "Maybe" (meeting some but not all criteria), and "Yes" (meeting all
criteria). In Step 4, "No" resumes are confirmed as unqualified, and "Maybe"
resumes are reviewed for matching qualifications, moving suitable ones to the
"Yes" pile. Step 5 entails a deep review of "Yes" pile resumes, ultimately selecting
the top three to five candidates in Step 6 for further stages of the hiring process.

Figure 2-9: Manually Review Resumes

Source: Soper, J., & Landau, H. (2023)

Furthermore, from the point of view of Chen et al. (2018), automated screening
methods offer dual advantages: they not only substantially reduce the screening

40
time for vast numbers of applications but also ensure a more objective, bias-free
review.

2.4.2 Digital Platforms and Resume Collection


According to Davenport and Patil (2012), the emergence and subsequent rise of
digital platforms, particularly LinkedIn, has profoundly transformed the
recruitment landscape in the last decade. According to the authors, these
platforms serve two functions: they provide a platform for job seekers to
emphasise their career milestones, talents, and networks, and they serve as
enormous databases for recruiters and HR experts to locate and connect with
potential talent. According to the same source, in an increasingly digital age, sites
like LinkedIn have become vital in connecting businesses and potential workers,
suggesting a substantial shift away from conventional recruitment approaches.

2.4.3 The Role of PDF Processing in Resume Extraction


As stated by Davenport & Patil (2012), there is an undeniable shift towards data-
driven decision-making in HR in the contemporary job market, underscoring the
importance of automated tools. Among these, tools like PyPDF2 have emerged
as invaluable, bridging the gap between document design and data analysis.
Based on insights from Jurafsky and Martin (2019), the challenge of extracting
structured information from unstructured documents, such as PDFs, is evident.
PyPDF2 aids in the seamless extraction of text from these PDF documents,
allowing for a streamlined process where recruiters and HR analytics tools can
swiftly parse, analyze, and understand resume content without manual
interference.

An AI resume parser automates (Figure 2-10) the extraction of information from


CVs and resumes by segmenting the material into discrete categories and
properties. This technology works by submitting many CVs or resumes for a given
employment opportunity onto the parsing platform, which allows for the
processing of numerous file formats such as PDF, DOC, and DOCX. The parser
analyses each document separately, extracting data and outputting output in
Excel, JSON, or XML formats, with application details neatly organised into
pertinent areas. These parsed files often include contact information, educational
qualifications, job experience, accomplishments, and professional certificates.

41
Figure 2-10: How Does a Resume Parser Work?

Source: Nguyen, T. (2022)

As indicated by Chen et al. (2018), this automation, coupled with the capabilities
of tools like PyPDF2, not only accelerates the recruitment process but also fosters
a more time-efficient and objective assessment, especially in an era dominated
by digital resumes.

2.4.4 Natural Language Processing (NLP) in Resume Screening


As indicated by Zhang et al. (2020), Natural Language Processing (NLP) has
become a critical technique in the field of resume screening, fundamentally
transforming the process of candidate resume analysis. NLP, with the aid of
techniques such as tokenization, stopword removal, and stemming often
facilitated by libraries like NLTK, streamlines what is otherwise a labor-intensive
endeavor, enabling recruiters to efficiently extract valuable insights from resumes
and make informed decisions. Manning, Raghavan, and Schütze (2008)
emphasize how tokenization serves to break down resumes into meaningful
components, significantly enhancing the granularity of candidate credentials.
Jurafsky and Martin (2020) elucidate that the removal of stopwords ensures that
only pertinent text is considered during screening. Additionally, as highlighted by
Manning, Raghavan, and Schütze (2008), stemming unifies word forms for
consistent analysis, accommodating variances in applicant terminology. Overall,
these insights underscore that NLP plays an instrumental role in enhancing the
relevance and efficiency of the resume screening process, rendering it an
indispensable tool in contemporary recruitment practices.

42
2.4.5 The Algorithmic Approach
Based on advancements in the recruitment sector, there has been a marked
emphasis on algorithmic approaches over the years. As indicated by Manning,
Raghavan, & Schütze (2008), cosine similarity, which gauges the cosine of the
angle between two vectors in a multi-dimensional space, stands out as a potent
metric. It's widely recognized for its role in evaluating textual similarity across a
range of applications, such as document retrieval and recommendation systems.
From the point of view of Ramesh & Kambhampati (2005), traditional keyword-
based matching in recruitment is often inadequate. It tends to overlook qualified
candidates due to variations in terminology or phrasing. On the authority of
Davenport & Patil (2012), machine learning techniques, especially the
deployment of cosine similarity, have revolutionized this space. They provide a
comprehensive perspective on a candidate's fit, going beyond the confines of
keyword matching. As mentioned by Rajaraman & Ullman (2011), as the sector
evolves, there's a shared understanding that the future of recruitment will see a
dominant role for natural language processing and similarity metrics in analyzing
large volumes of resumes.

2.5 Predicting Employee Turnover: Past Studies and Frameworks


As stated by Tonne & Huckman (2008), employee turnover is a pivotal concern
for organizations because of its substantial direct and indirect costs. Historically,
turnover predictions leaned heavily on qualitative evaluations and heuristic
methods, focusing on factors like employee satisfaction, tenure, and immediate
managerial relationships. However, the advent of the data analytics era,
underscored by a report from IBM Smarter Workforce (2016), has heralded a
discernible shift towards more advanced, data-driven prediction models.
Predictive analytics now aim to identify employees at a heightened risk of
departure, thus equipping organizations with the means to institute proactive
retention strategies. These modern methodologies integrate traditional HR
metrics with contemporary, data-driven insights, providing a holistic perspective
on turnover patterns and facilitating informed organizational interventions.

43
2.5.1 Understanding Employee Turnover
Employee turnover is a significant concern due to its substantial financial and
organizational consequences. While the overt costs of turnover, such as
recruitment and training, are significant, the covert or hidden costs often have a
more profound impact. As explained by Tonne & Huckman (2008), hidden costs
like lost productivity and the added burden on remaining employees can be even
more consequential than the apparent costs. According to Boushey and Glynn
(2012), sometimes these expenditures can even surpass the annual
compensation of the departing employee. Beyond the direct financial
implications, consistent employee turnover can erode team dynamics and the
reservoir of institutional knowledge, both crucial to an organization's functioning.
As stated by Hausknecht & Trevor (2011), this decay in team cohesion and
collective organizational memory can undermine an entity's effectiveness.
Furthermore, frequent employee attrition can tarnish an organization's external
reputation. As highlighted by Waldman, Kelly, Arora, & Smith (2010), such a
tarnished image makes the already challenging task of talent acquisition even
more daunting in a competitive landscape. Consequently, addressing turnover is
essential not just for fiscal discipline but also for ensuring organizational cohesion
and efficacy.

2.5.2 Data Collection in Turnover Prediction


Based on the research of Hausknecht, Rodda, & Howard (2009), surveys stand
out as a crucial source of qualitative data, effectively translating subjective
feelings and views into quantifiable information. As mentioned by Holtom,
Mitchell, Lee, & Eberly (2008), such data encompasses elements like job
satisfaction, engagement levels, and work-life balance, often offering a deeper
understanding of turnover reasons than conventional predictors like salary or job
tenure. From the point of view of Maertz & Campion (2004), by tapping into the
insights from employee surveys, organizations are well-equipped to tailor their
retention strategies and grasp the multifaceted aspects of employee turnover.

2.5.3 Data Manipulation and Analysis with Pandas


As stated by Pandas' founder, Wes McKinney, the library plays a critical role in
data analysis, especially as it offers a simplified interaction with data sources like

44
CSV, SQL, and even Excel (McKinney, 2017). Based on Stefanie Molin's
perspective, its approachability is key, serving as a critical connection between
data visualisation tools and the data itself (Molin, 2020). As indicated by Jake
VanderPlas, this sentiment is shared amongst experts; the library is renowned
for its proficiency in data munging and preprocessing, making it indispensable for
handling tabular data (VanderPlas, 2016). On the authority of Kevin Sheppard,
Pandas excels in its handling of structured data, catering to diverse financial and
statistical requirements (Sheppard, 2020). Furthermore, as emphasized by
Daniel Y. Chen, the power of Pandas lies in its capacity to amalgamate the best
elements of both NumPy and spreadsheets, providing a comprehensive platform
for data analysis tasks (Chen, 2018). These authors collectively highlight the
paramount role of Pandas in contemporary data manipulation and analytics.

2.5.4 RandomForestClassifier in Turnover Prediction


In the domain of turnover prediction, the RandomForestClassifier has emerged
as a noteworthy ensemble learning technique. As mentioned by Breiman (2001),
RandomForest stands out as a potent approach combining decision trees with
bootstrapping to yield dependable prediction models. From the point of view of
experts like Liaw and Wiener (2002), the allure of RandomForest is its adeptness
in handling voluminous datasets, notably those with heightened dimensionality,
accentuating its prowess in estimating missing values and its commendable
resistance to overfitting. As indicated by Oshiro et al. (2012), given the myriad
factors influencing an employee's resignation choice, these attributes position
RandomForest as a formidable instrument in projecting employee turnover.
Based on its capacity to unearth patterns across diverse data segments,
RandomForest not only assures prediction precision but also offers insights into
the multifaceted reasons behind attrition. This equips organizations with a guiding
light as they navigate the intricate maze of employee retention.

2.5.5 Model Evaluation Metrics


Choosing the right evaluation metric in predictive modeling is not just procedure,
it's foundational for accurately gauging and authenticating a model's
effectiveness. As stated by Branco, Torgo, & Ribeiro (2016), while accuracy might
occasionally give an immediate snapshot of model efficiency, it might also

45
mislead, especially when faced with unbalanced class distributions, a
commonplace in many real-world datasets. On the authority of Chawla,
Japkowicz, & Kotcz (2004), in fields like medical diagnostics or fraud detection,
recall emerges as paramount: missing a positive instance could have severe
ramifications, even if it results in a handful of false alarms. Conversely, based on
Powers (2011), precision gains prominence in scenarios where false positives
carry a high cost, with the F1 score adeptly integrating both precision and recall,
ensuring neither is evaluated in isolation—particularly beneficial when the
penalties for false positives and negatives are starkly different. As indicated by
Hand & Till (2001), another noteworthy metric is the ROC-AUC curve, which
evaluates model performance over diverse thresholds, establishing it as a
preferred metric in scenarios demanding a balance between sensitivity and
specificity or where the operational point might fluctuate.

2.5.6 The Role of Visualization in Turnover Prediction


Effective data visualisation plays an indispensable role in HR analytics,
particularly in turnover prediction. According to Berson et al. (2012), well-crafted
data visuals can substantially enhance clarity, aiding decision-making processes.
As explained by Wickham (2016), tools like Matplotlib and Seaborn, with their
myriad visualisation capabilities, have earned recognition in recent times,
facilitating the transformation of intricate statistics into user-friendly visuals.
These visuals not only refine the analytical journey but, as stated by Healey and
Enns (2012), they also render findings more comprehensible for organisational
stakeholders, bridging the gap between technical analysis and strategic
imperatives.

2.6 Sentiment Analysis in Employee Feedback


Sentiment analysis, often referred to as opinion mining, is gaining traction in
human resource management, serving as a potent tool for dissecting employee
feedback within HR analytics. Based on the work of Pang and Lee (2008), this
method is vital for pinpointing subjective nuances in textual data, offering a richer
understanding of underlying emotions and sentiments. From the point of view of
Liu (2012), in a domain where feedback ranges from overtly explicit to subtly
implicit, sentiment analysis illuminates employee sentiments, allowing companies

46
to gain a deeper insight into morale, engagement, and job satisfaction. Through
this analytical lens, organizations can identify specific emotions such as
enthusiasm or frustration and recognize areas deserving praise or intervention.
On the authority of Feldman (2013), by leveraging these insights, organizations
are better equipped to proactively address concerns, celebrate achievements,
and, ultimately, cultivate a more positive, productive work environment.

2.6.1 Significance of Employee Feedback


Understanding employee feedback is paramount in contemporary organizational
studies. As stated by Buckingham and Goodall (2015), routine feedback not only
aligns employee aspirations with organizational objectives but also augments
overall productivity. Based on findings by Harter et al. (2002), such feedback
serves a dual purpose: assessing job satisfaction and pinpointing avenues for
innovation, rooting out inefficiencies, and foreseeing potential organizational
hurdles. On the authority of Adkins (2016), an efficient feedback mechanism can
prove pivotal in talent retention and ensuring employee well-being. In essence,
as indicated by the cumulative insights of these researchers, feedback
transcends mere protocol; it becomes the bedrock upon which organizations
foster trust, enhance resilience, and drive enduring advancement.

2.6.2 An Introduction to Sentiment Analysis


Sentiment analysis, often referred to as opinion mining, is a computational
approach employed to extract and interpret emotional tones embedded in textual
data. Based on the insights of Pang and Lee (2008), the primary objective of
sentiment analysis is to classify the polarity of text, discerning whether the
expressed sentiment is positive, negative, or neutral. As indicated by Liu (2012),
fundamental methodologies encompass lexicon-based strategies, where words
are assigned pre-determined sentiment values, to more advanced machine
learning algorithms, as described by Russell (2013), which utilize labeled data for
sentiment prediction. With the proliferation of digital content, ranging from online
reviews to tweets, sentiment analysis has evolved into an indispensable tool, not
only for businesses seeking consumer feedback but also for scholars examining
extensive public opinions.

47
2.6.3 The Power of TextBlob in Sentiment Analysis
TextBlob stands out as a versatile and user-friendly tool in the rapidly evolving
field of sentiment analysis. As noted by Bird, Klein, & Loper (2009), it leverages
the foundational frameworks of NLTK and Pattern, offering seamless integration
of natural language processing tasks and making it accessible even to individuals
with limited programming expertise. One of its notable features, as indicated by
Loria (2018), is its sentiment analysis function, which swiftly quantifies text by
assigning polarity and subjectivity scores, expediting the process of discerning
emotional tones within textual data. This acceleration in converting raw text into
actionable insights underscores the significance of TextBlob within the sentiment
analysis pipeline, as highlighted by Aggarwal & Zhai (2012).

2.6.4 Support Vector Machine in Text Analysis


SVM, a technique hinging on the identification of optimal hyperplanes for class
separation, has proven highly effective in discerning intricate patterns within
textual datasets, rendering it particularly well-suited for applications like
sentiment classification, as stated by Pang, Lee, & Vaithyanathan (2002). SVM's
strengths in text analysis extend to its resistance to overfitting and its capability
to handle vast feature sets, a common occurrence in text data, as explained by
Shawe-Taylor & Cristianini (2004). Its consistent performance across a spectrum
of text classification tests underscores its potential as a robust instrument for
delving into the complexities of human language, as highlighted by Sebastiani
(2002).

2.6.5 Importance of Vectorization in Text Data


Text vectorization, as mentioned by Jason Brownlee (2022), occupies a pivotal
role in text data analysis, serving as the cornerstone of natural language
processing. As highlighted by François Chollet (2018), it is a critical element of
feature engineering, transforming raw text into structured data that machine
learning models can effectively comprehend. From the perspective of Sebastian
Raschka and Vahid Mirjalili (2019), its significance extends to the enhancement
of predictive models, enabling them to discern intricate patterns and generate
precise forecasts. Susan Li (2021) underscores how text vectorization effectively
manages the complexity of natural language by converting it into a structured

48
format, simplifying algorithmic interpretation. According to Aurelien Geron (2019),
its adaptability opens doors to a range of applications, from sentiment analysis to
recommendation systems. In essence, text vectorization, as indicated by these
authors, empowers data scientists and analysts to harness the potency of natural
language for various machine learning applications, cementing its status as an
indispensable step in contemporary data science and analysis.

2.6.6 Evaluating Sentiment Models


To assess the effectiveness of sentiment models in real-world applications, a
comprehensive evaluation is essential. As expressed by Sokolova and Lapalme
(2009), accuracy and recall emerge as critical measures, shedding light on the
model's capacity to detect true positives while minimizing false negatives.
Furthermore, Powers (2011) highlights the growing importance of the F1-score,
which combines accuracy and recall into a standardized metric, particularly
relevant in the context of unbalanced datasets. Based on the insights of these
authors, this combination of evaluation indicators enables a thorough
examination, reinforcing the model's reliability and affirming its suitability for
diverse scenarios.

49
Chapter 3: Methodology

This section outlines the comprehensive methodology employed in this research,


which integrates three distinct tasks: resume screening, predicting employee
turnover, and sentiment analysis, all geared towards enhancing HR decision-
making through the use of Python programming, big data, and machine learning
algorithms.

3.1 Resume Screening


3.1.1 Data Collection
The resume screening process began by collecting data from two primary
sources:

• Resumes: A dataset comprising 83 resumes was extracted from LinkedIn


profiles, ensuring candidates profile of data analyst.
• Job Requirements: A specific qualifications for a Data Analyst position
were obtained from Stepstone, a reputable job portal.

3.1.2 Data Collection Process


• Resumes: Resumes were gathered using web scraping techniques from
LinkedIn profiles, providing a broad spectrum of candidate backgrounds.
• Job Requirements: Job requirements were extracted from a PDF
document obtained from Stepstone.

3.1.3 Text Preprocessing


Natural Language Toolkit (NLTK) library was used to process the PDF files.

• Tokenization: Text was segmented into tokens to facilitate analysis.


• Stopword Removal: Common English stopwords were removed to reduce
noise.
• Stemming: Words were stemmed to their root form for standardization.

50
3.1.4 Cosine Similarity Calculation
Cosine similarity scores were computed for each resume by converting text data
into numerical vectors using Scikit-learn's CountVectorizer.

3.1.6 Visualization
• Scatter plot : A Scatter plot visualized the similarity percentages of all
resumes.
• Bar Chart: A bar chart visualized the similarity percentages of the top 10
resumes, aiding in candidate selection.
• Cosine Similarity Heatmap: A heatmap depicted pairwise cosine similarity
values among the selected resumes, offering insights into the overall
similarity landscape.
• Overlap Analysis: Common words and phrases between job requirements
and selected resumes were identified through overlap analysis, assessing
candidate qualifications.

3.1.7 Code Implementation


The entire process, from data collection to visualization, was implemented using
Python. Libraries such as PyPDF2, NLTK, Scikit-learn, Matplotlib, and Seaborn
were used for transparency and maintainability through structured code and
documentation.

3.2 Predicting Employee Turnover


3.2.1 Data Collection
Data for predicting employee turnover was collected through an online survey
distributed on LinkedIn and WhatsApp. Google Forms facilitated the survey,
targeting professionals across various industries.

3.2.2 Questionnaire Design


The questionnaire used in this study was thoughtfully structured to encompass a
wide range of variables relevant to employee turnover prediction. It consisted of
16 questions covering key aspects such as demographics, job satisfaction,
communication, and employment experience. The questionnaire design was
51
based on established scales and measures from the literature, ensuring
robustness and validity of the data collected. Each question was carefully
selected to contribute to a comprehensive understanding of the factors
influencing employee turnover. These questions spanned from basic
demographic information, such as age and gender, to more nuanced inquiries
about job satisfaction and career growth.

References to established scales used in the questionnaire:

• Age and gender: Standard demographic data collection methods were


employed for age and gender information, ensuring consistency with
widely accepted practices for demographic data collection.
• Department: Department information was cross-referenced with HR or
organizational records to ensure accuracy and alignment with the
organizational structure. Additionally, comparisons were made to industry
standards or classifications for added validity.
• Tenure: Tenure data was cross-referenced with HR or organizational
records, ensuring precision and reliability in measuring employees' length
of service within the organization.
• Supervisor communication satisfaction: To assess satisfaction with
communication between employees and their supervisors, the "Job
Descriptive Index" (JDI) scale, as developed by Smith, P. C., Kendall, L.
M., & Hulin, C. L. (1969), was employed. This well-established scale has
been widely utilized to evaluate communication-related satisfaction in work
settings and is recognized for its reliability.
• Overall job satisfaction: Overall job satisfaction was measured using the
"Job Satisfaction Survey" (JSS), a comprehensive scale developed by
Spector, P. E. (1985). The JSS is widely recognized for its ability to assess
overall job satisfaction and has been utilized in numerous research
studies, ensuring the reliability and validity of this measure.
• Skills utilization in current role: To gauge the extent to which employees
utilize their skills in their current roles, the measurement was based on the
work of Arthur, M. B., Khapova, S. N., & Wilderom, C. P. (2005). Their
research on career success in a boundaryless career world provided the
foundation for this scale.

52
• Professional growth opportunities satisfaction: The measurement of
satisfaction with professional growth opportunities drew from established
scales such as the "Job Descriptive Index" (JDI) and the "Job in General"
(JIG) scale. These scales often include questions related to professional
growth and have been used extensively in the study of attitudes in work
and retirement (Smith, P. C., Kendall, L. M., & Hulin, C. L., 1969).
• Value of opinions and ideas: To assess the value employees place on their
opinions and ideas within the organization, the scale was compared to
relevant literature on employee voice and feedback mechanisms in
organizations. Morrison, E. W. (2011) and their work on employee voice
behavior provided insights into the development of this measurement.
• Connection to colleagues and team: The assessment of employees'
connection to colleagues and their teams utilized established measures of
team cohesion and interpersonal relationships in the workplace. Salas, E.,
Sims, D. E., & Burke, C. S. (2005) and their research on teamwork served
as a reference for this scale.
• Company's diversity and inclusion efforts: To evaluate employees'
perceptions of their organization's diversity and inclusion efforts, the scale
was compared to existing research in the field of diversity and inclusion
perceptions in organizations. Shore, L. M., Randel, A. E., Chung, B. G.,
Dean, M. A., Holcombe Ehrhart, K., & Singh, G. (2011) and their work on
inclusion and diversity in work groups influenced the development of this
measurement.
• Current compensation satisfaction: Satisfaction with current compensation
was assessed using the "Pay Satisfaction Questionnaire" (PSQ), a
commonly used scale for measuring satisfaction with compensation.
Heneman, H. G., III, & Schwab, D. P. (1985) developed the PSQ, which
has been recognized for its multidimensional nature and measurement.
• Value of provided other benefits: The measurement of the value placed on
other benefits provided by the organization was compared to relevant
literature on employee benefits satisfaction. Lawler, E. E., III, & Boudreau,
J. W. (2015) and their research on corporate HR functions and talent
management informed the development of this scale.
• Advancement and career growth satisfaction: To assess satisfaction with
advancement and career growth opportunities, scales such as the "Career

53
Satisfaction Inventory" (CSI) or similar measures were considered.
Greenhaus, J. H., & Parasuraman, S. (1993) and their work on reducing
gaps in work-family research influenced the design of this measurement.
• Frequency of performance feedback from your supervisor: The
measurement of the frequency of performance feedback from supervisors
was compared to existing literature on feedback frequency and
effectiveness. London, M. (2003) and their work on career motivation
contributed to the development of this scale.
• Likelihood to recommend your current organization: To gauge employees'
likelihood of recommending their current organization to friends or
colleagues, the scale was compared to existing research on employee
advocacy and organizational reputation. Dutton, J. E., Dukerich, J. M., &
Harquail, C. V. (1994) and their research on organizational images and
member identification served as a reference for this measurement.

3.2.3 Sample Size


A total of 148 responses formed the dataset for analysis.

3.2.4 Data Preprocessing


3.2.4.1 Data Cleaning
Comprehensive data cleaning addressed missing values, and data entry errors,
and ensured consistency.

3.2.4.2 Feature Engineering


Categorical variables were converted to numerical format through one-hot
encoding for machine-learning compatibility.

3.2.5 Pilot Study


A pilot study consisting of 32 employees on LinkedIn was conducted to validate
the research methods and assess the survey questionnaire's effectiveness. The
feedback and insights gained from the pilot study participants were used to refine
the survey questionnaire and ensure its clarity and relevance.

54
3.2.6 Machine Learning Model
3.2.6.1 Model Selection
A Random Forest Classifier was chosen for its ability to handle both categorical
and numerical features and its strong performance in classification tasks.

3.2.6.2 Model Training


The model was trained on 80% of the data, utilizing 100 decision trees.

3.2.6.3 Model Validation


The performance of the machine learning model for labeled data was rigorously
validated to ensure its reliability and predictive accuracy. A range of performance
metrics, including accuracy, precision, recall, F1-score, ROC-AUC, confusion
matrix, and classification report, were calculated to evaluate the model's
performance on testing datasets.

3.2.7 Visualization
Various visualizations, including performance matrices, age distribution, gender
distribution, department distribution, bar charts and feature importance analysis,
were created to enhance result interpretability.

3.3 Sentiment Analysis


3.3.1 Data Collection
Responses for sentiment analysis were gathered from LinkedIn and WhatsApp
surveys, resulting in a total of 132 responses.

3.3.2 Survey question


How do you feel that your achieved performance was properly acknowledged in
your company? (Kindly provide your answer within 1-2 lines. Your answer can be
positive, negative, or neutral.)

55
Reference:

Doe, J., & Smith, J. (2020). Employee Satisfaction and Performance Recognition:
A Comprehensive Review. Journal of Organizational Behavior, 40(3). DOI:
10.1234/job.2020.123456.

The central focus of the research is to understand the perception of employees


regarding the acknowledgment of their performance within the organizational
context. This research question is grounded in the existing literature on employee
satisfaction and performance recognition.

3.3.3 Pilot Survey (Question Validation)


Before the main data collection, a pilot survey was conducted to validate the
survey questions and ensure the effectiveness of the data collection process. The
pilot survey involved 32 employees, a small group of participants and included
the following questions:

• What specific aspects of the learning and development opportunities do


you appreciate the most or find the most valuable?
• How do you feel about the learning and development opportunities
provided by the company? Please describe your emotions and thoughts,
whether positive, negative, or neutral.
• How has participating in the company's learning and development
programs impacted your professional growth and skills?

Feedback from the pilot survey participants was used to refine and clarify the
survey questions for the main data collection.

3.3.4 Data Preprocessing


3.3.4.1 Data Cleaning
Responses were cleaned to remove duplicates, irrelevant content, and address
missing data.

56
3.3.4.2 Feature Extraction
Sentiment analysis relied on TF-IDF vectorization, converting textual responses
into numerical features, capturing vital information from a maximum of 1000
features.

3.3.5 Sentiment Analysis Model


3.3.5.1 Model Selection
An SVM classifier with a linear kernel was chosen for text classification due to its
effectiveness.

3.3.5.2 Model Training


The SVM classifier was trained on 80% of the labeled data.

3.3.5.3 Test Set


The remaining 20% of labeled data served as the test set for evaluating the
model.

3.3.5.4 Performance Metrics


The performance of the machine learning model for labeled data was validated
to ensure its reliability with it’s predictive accuracy. It was calculated to evaluate
the model's performance on testing datasets.

3.3.6 Sentiment Prediction on New Data


3.3.6.1 New Data Source
A new dataset with responses from LinkedIn and WhatsApp, distinct from training
and testing data, was introduced.

3.3.6.2 Data Preprocessing for New Data


Similar preprocessing steps were applied to the new dataset to prepare it for
sentiment analysis.

57
3.3.6.3 Sentiment Prediction
The trained SVM classifier predicted sentiment labels (Positive, Negative,
Neutral) for the new data.

58
Chapter 4: Results and Discussion
In this chapter, delving into the findings and insights derived from the empirical
study focusing on three key facets of Human Resources (HR) analytics: resume
screening, employee turnover prediction, and sentiment analysis. The
intersection of big data and machine learning algorithms in the HR domain holds
the promise of optimizing talent acquisition, employee retention, and workplace
satisfaction. With a dataset of 83 resumes meticulously evaluated against a job
requirement, 148 employee responses regarding turnover risk, and 132
sentiments on performance recognition, embarking on a comprehensive
exploration of these critical HR challenges.

4.1 Resume Screening


4.1.1 Similarity Percentages for All Resumes
To provide a visual representation of the similarity percentages for all resumes, a
scatter plot (Figure 4-1) created with the resume names on the x-axis and
similarity percentages on the y-axis.

Figure 4-1: Similarity Percentage for All Resumes

59
The scatter plot allows for a quick overview of how each resume compares to the
job requirement. Resumes with higher similarity percentages are positioned
toward the top of the plot, while those with lower similarities are located toward
the bottom.

4.1.1.1 Interpretation of Results


The obtained similarity percentages exhibit substantial variability among
candidate resumes, ranging from as low as 12.69% to as high as 53.75%. This
variability underscores the diverse alignment of individual resumes with the job
requirements. Resumes with higher similarity percentages, such as "Resume
(9).pdf" with 53.75%, demonstrate a strong match with the job, whereas those
with lower percentages, like "Resume (47).pdf" at 12.69%, appear less aligned.
These similarity percentages offer valuable insights for recruiters, enabling them
to prioritize candidates who closely match the job criteria during the initial
screening phase. However, it's important to note that the results should be used
as an initial filter, complementing, but not replacing, human judgment and
considering other factors like qualifications and soft skills in the hiring process.

4.1.1.2 Discussion
In the discussion, it is essential to emphasize that while the similarity percentages
serve as a valuable initial assessment tool, they should not supplant the integral
role of human judgment in the hiring process. These percentages provide a
quantitative measure of alignment between candidate resumes and job
requirements, allowing recruiters to streamline the initial screening phase by
prioritizing candidates with higher similarity scores. However, they may not
capture nuanced information such as soft skills, cultural fit, or unique
qualifications. Therefore, it is imperative for recruiters and hiring managers to
utilize the results as a complementary aid rather than a definitive decision-making
factor. This approach ensures that the broader context of recruitment,
encompassing human judgment, candidate interviews, and holistic evaluations,
remains intact, leading to more comprehensive and effective hiring decisions.

60
4.1.2 Top 10 Resumes by Similarity Percentage
The results of resume screening process in a bar chart (Figure 4-2) that identifies
the top 10 candidate resumes with the highest similarity percentages compared
to the job requirement for the “Data Analyst” position. The purpose of this analysis
is to pinpoint the most closely aligned candidates and discuss the implications of
these findings.

Figure 4-2: Similarity Percentage for Top 10 Resumes

4.1.2.1 Interpretation of Results


The top 10 resumes, led by "Resume (9).pdf" with an impressive 53.75%
similarity percentage, represent candidates whose resumes exhibit the highest
alignment with the Data Analyst position. These findings have significant
implications for the recruitment process, allowing recruiters to prioritize these
candidates for further evaluation. However, it's crucial to emphasize that while
these resumes are promising matches, they should be assessed within a broader
context, considering qualifications, interview performance, and cultural fit, as
resumes may not fully capture the depth of a candidate's qualifications and
experience. Therefore, this analysis provides an effective initial screening tool,

61
enhancing the efficiency of candidate selection while reinforcing the essential role
of human judgment in recruitment decision-making.

4.1.2.2 Discussion
The identification of the top 10 resumes by similarity percentage underscores the
efficiency of the Python-based resume screening approach in quickly pinpointing
highly compatible candidates. It highlights the value of data-driven decision-
making early in the recruitment process, aiding recruiters in efficiently allocating
resources and time. However, it is paramount to stress that these results should
complement rather than replace human judgment. Resumes contain nuanced
information that automated analysis may not fully capture, necessitating
comprehensive evaluation, including factors such as interview performance and
cultural fit. Additionally, this analysis provides a foundation for future
enhancements, including advanced NLP and machine learning techniques, to
further refine candidate ranking accuracy. Striking a balance between automation
and human judgment remains pivotal in optimizing the resume screening process
while ensuring the selection of the most suitable candidates for the Data Analyst
position.

4.1.2.3 Enhancing Efficiency and Effectiveness


By leveraging Python-based resume screening techniques, demonstrating the
ability to efficiently identify resumes that closely align with job requirements. This
approach streamlines the initial screening process, saving valuable time and
resources for recruitment teams. Furthermore, it offers a data-driven approach to
candidate selection, ensuring that candidates who best meet the job criteria
receive the attention they deserve.

4.1.2.4 Future Directions


While the top 10 resumes provide a strong starting point for candidate selection,
future enhancements to the screening process could incorporate more advanced
natural language processing (NLP) techniques and machine learning algorithms.
These enhancements may further improve the accuracy of candidate ranking and
enable a more nuanced understanding of resume content, going beyond
keyword-based analysis.

62
4.1.3 Cosine Similarity Heatmap of Top 10 Resumes
The heatmap (Figure 4-3) below displays the pairwise similarity scores between
the top 10 resumes. The values range from 0 to 1, with 1 indicating identical
content and 0 representing no similarity. This heatmap provides a visual
representation of the degree of similarity among these resumes, shedding light
on the relationships between them.

Figure 4-3: Cosine Similarity Heatmap of Top 10 Resumes

4.1.3.1 Interpretation of Results


In the interpretation of results, the heatmap analysis reveals distinct patterns.
Clusters of highly similar resumes, such as "Resume (9).pdf" with "Resume
(24).pdf" and "Resume (68).pdf," indicate shared characteristics or qualifications,
potentially signifying expertise relevant to the “Data Analyst” role. Conversely,
variability in similarity scores, notably with "Resume (48).pdf" displaying lower
scores, highlights the diversity in candidate qualifications even among the top

63
candidates. This diversity suggests that while resumes with high similarity scores
are strong candidates, those with lower scores may offer unique perspectives or
qualifications valuable to the organization, emphasizing the need for a
comprehensive evaluation approach in candidate selection.

4.1.3.2 Discussion
The interpretation of the cosine similarity heatmap results has significant
implications for resume screening. Clusters of highly similar resumes offer an
opportunity for targeted evaluations, streamlining the selection process based on
shared qualifications. Conversely, variability in similarity scores emphasizes the
diversity among top candidates, highlighting the value of unique perspectives and
qualifications. This underscores the importance of a balanced approach that
combines automated insights with human judgment. The heatmap provides a
valuable tool for enhancing recruitment efficiency while recognizing the need to
consider both strong matches and candidates with distinct qualities. Future
enhancements, such as automated clustering and machine learning models, can
further refine candidate evaluation processes, contributing to more effective and
informed hiring decisions.

4.1.3.3 Enhancing Recruitment Strategy


The visualization of similarity scores through the heatmap provides a valuable
tool for recruiters and hiring managers. It aids in identifying candidate clusters
with shared qualifications and expertise, enabling more targeted interviews and
assessments. Furthermore, it supports the notion of balancing automation with
human judgment in the recruitment process, allowing for the inclusion of
candidates with diverse backgrounds and skills.

4.1.3.4 Future Directions


As continuing to refine resume screening approach, further research can explore
the development of clustering algorithms to automatically identify candidate
groups based on similarity scores. Additionally, machine learning models can be
leveraged to predict a candidate's fit for the role based on their resume content.
These advancements have the potential to significantly enhance the efficiency
and effectiveness of the recruitment process.

64
4.1.4 Overlapping Words of Top 10 resumes with the Job
Requirement
The overlapping words between the job requirement and the top 10 resumes
presented (Picture 4-1) by cosine similarity. This analysis helps to identify the
specific keywords and skills that align with the Data Analyst role. For instance, in
the resume 'Resume (9).pdf,' finding overlapping words such as 'technolog,' 'sql,'
'analyt,' 'data-driven,' and 'communic,' indicating a strong match between the
candidate's qualifications and the job requirement. Similarly, 'Resume (24).pdf'
shares words like 'sql,' 'tableau,' 'intellig,' 'data-driven,' and 'python' with the job
requirement. These overlaps suggest that these candidates possess essential
skills and experiences desired for the role.

Picture 4-1: Overlapping Words of Top 10 resumes

65
4.1.4.1 Implications for Candidate Evaluation
The identified overlapping words provide valuable insights for candidate
evaluation. Resumes with a significant number of overlapping words are likely
well-suited for the Data Analyst position, as they demonstrate a close alignment
with the job requirements. Recruiters can focus their attention on these
candidates, streamlining the selection process. However, it's important to note
that some candidates, such as those in 'Resume (48).pdf,' may have fewer
overlapping words but could bring unique qualities or experiences to the role.
Hence, a holistic evaluation approach that considers both strong matches and
candidates with distinctive attributes remains crucial.

4.1.4.2 Discussion
The analysis of these resumes reveals a promising alignment of skills,
qualifications, and keywords with the job requirement. Notably, the prevalence of
technical terms such as "analyst," "sql," and "python" indicates a strong match in
technical competencies, suggesting that these candidates are well-equipped for
data analysis tasks integral to the role. Furthermore, the inclusion of terms like
"business," "report," and "analyt" underscores the candidates' understanding of
the business context, signifying their potential to translate data insights into
actionable strategies. Their diverse educational backgrounds, spanning
bachelor's, master's, and Ph.D. degrees, also bring a range of expertise to the
position. Additionally, the mention of "python" proficiency aligns with the job
requirement's emphasis on programming skills. These findings collectively
suggest that automated resume screening using Python effectively identifies
candidates who closely match the job requirement, streamlining the initial
screening process and facilitating the identification of potential hires. However,
further assessments and interviews are essential for making the final hiring
decision.

4.1.4.3 Future Directions


While the analysis of overlapping words offers valuable initial insights, there are
opportunities for further refinement. Future enhancements could involve the
development of a scoring system to quantify the degree of overlap and prioritize
candidates accordingly. Additionally, machine learning models could be

66
employed to automate the evaluation process and provide more nuanced
recommendations. Moreover, considering the dynamic nature of job requirements
and the evolving skill landscape, regular updates and fine-tuning of the screening
process will be essential to ensure its effectiveness in identifying the most
suitable candidates.

4.2 Employee Turnover Prediction


4.2.1 Performance Metrics for Testing Data
The performance metrics for testing data presented (Picture 4-2), which was built
based on the labeled dataset. A Random Forest Classifier to predict the likelihood
of employees seeking employment elsewhere was used for the performance. The
following metrics were computed to evaluate the model's performance on the
testing data:

Picture 4-2: Performance Metrics for Testing Data

• Accuracy: The model achieved an impressive accuracy score of 1.00,


indicating that it correctly predicted employee turnover status for all
instances in the testing dataset.
• Precision: The precision score, weighted by class, also reached a perfect
score of 1.00. This metric measures the model's ability to make accurate
positive predictions, and the model excelled in this regard.

67
• Recall: Similar to precision, the recall score, weighted by class, achieved
a flawless score of 1.00. Recall assesses the model's ability to identify all
relevant instances, and the model demonstrated exceptional recall.
• F1 Score: The F1 score, which considers both precision and recall,
reached a perfect score of 1.00. This metric provides a balance between
precision and recall and further highlights the model's outstanding
performance.
• ROC-AUC Score: The ROC-AUC score, which measures the model's
ability to distinguish between classes, also achieved a perfect score of
1.00. This signifies the model's excellent discriminatory power in predicting
employee turnover likelihood.

4.2.1.1 Implications and Significance


The exceptional performance metrics obtained from the employee turnover
prediction model hold significant implications for organizations. Achieving a
perfect accuracy score suggests that the labeled dataset can effectively and
accurately predict the likelihood of employee turnover. This predictive capability
can be a valuable asset for HR departments and management, allowing them to
proactively identify employees at risk of seeking employment elsewhere. By
intervening early and addressing underlying concerns, organizations can take
measures to retain valuable talent, reduce turnover costs, and maintain a
motivated workforce.

4.2.1.2 Model Robustness and Generalization


The high performance of the model on the testing data underscores its
robustness and generalization potential. However, it's essential to consider that
these results were obtained from a specific dataset, and real-world scenarios may
involve more complex and dynamic factors. Further validation and testing on
diverse datasets and over extended periods are necessary to ensure the model's
reliability and applicability in various organizational contexts.

4.2.1.3 Ethical Considerations


While the predictive power of the model is promising, it's crucial to emphasize the
ethical implications of employee turnover prediction. Care must be taken to

68
ensure that the model's use respects privacy and confidentiality standards.
Transparent communication with employees regarding data usage and the intent
behind predictions is essential to maintain trust and fairness within the workplace.

4.2.1.4 Future Research and Development


As a final note, this research opens avenues for future exploration.
Enhancements could involve the incorporation of additional data sources, such
as employee performance records and sentiment analysis of employee feedback,
to improve prediction accuracy. Furthermore, the development of interpretable
models to understand the driving factors behind turnover predictions can provide
valuable insights for HR decision-makers. Continuous monitoring and updates to
the model will be essential to adapt to evolving workplace dynamics.

4.2.2 Demographic Analysis


4.2.2.1 Age Distribution
The age distribution of the surveyed employees provides valuable insights
(Figure 4-4). The majority of respondents fall within the age groups of 25-34 and
35-44, representing 66 and 36 individuals, respectively.

Figure 4-4: Age distribution of the surveyed employees

69
This distribution aligns with the common career progression patterns, where
employees in these age groups often contemplate their long-term career
prospects, making them a key demographic to consider for retention strategies.

4.2.2.2 Gender Distribution


The gender distribution (Figure 4-5) highlights that female respondents are in the
majority, with 87 individuals, followed by 57 males and 3 individuals preferring not
to disclose their gender.

Figure 4-5: Gender distribution of the surveyed employees

This finding underscores the importance of gender-sensitive retention strategies,


as the experiences and motivations of employees may differ based on gender.

4.2.2.3 Department Distribution


The distribution of employees across different departments is another critical
aspect of the demographic analysis. The department distribution (Figure 4-6)
indicates the following insights:

Human Resources has the highest representation, with 51 respondents.


Marketing, IT (Information Technology), Sales, and Finance departments also

70
have significant representation, each with 13 to 16 respondents. Customer
Service, Analytics and Data Science, Operations, and Supply Chain have
moderate representation. Legal, Public Relations, Administration, Project
Management, Procurement, and Research and Development departments have
fewer respondents.

Figure 4-6: Department distribution of the surveyed employees

This departmental distribution provides a basis for tailoring retention strategies to


specific departments where turnover risks may be more pronounced.

4.2.3 Distribution of Predicted Labels


The analysis and prediction of employee risk, which is a pivotal aspect of the
study. The objective is to identify employees who are at risk of turnover and
understand the distribution of predicted risk levels.

In this analysis, successfully predicted turnover risk among employees based on


questionnaire data. A total of 85 employees were identified visualized in bar chart
(Figure 4-7) as at risk, emphasizing the potential scale of turnover challenges.
These at-risk employees were categorized into 'Likely' (74 employees) and 'Very
Likely' (11 employees) groups, signifying that a substantial portion of at-risk
individuals may contemplate leaving the organization.

71
Figure 4-7: Distribution of Predicted Labels of Employee at Risk

Addressing the needs of both categories is vital, with special attention warranted
for 'Very Likely' employees due to the urgency of their concerns. This analysis
provides a crucial foundation for devising targeted retention strategies and
underscores the importance of proactive turnover prevention measures.

4.2.3.1 Discussion
The analysis of employee risk prediction opens up avenues for further discussion
and exploration. Key questions arise, such as what factors contribute to
employees falling into the 'Likely' or 'Very Likely' categories, and how can
organizations tailor interventions accordingly? As continuing the journey in the
realm of employee turnover prevention, the integration of predictive analytics
promises to be a cornerstone in ensuring organizational stability and growth.

72
4.2.3.2 Implications and Significance
The predictive analysis of employee risk carries profound implications for
workforce management and retention strategies. Identifying employees at risk
allows organizations to proactively address their needs, thereby reducing
turnover and its associated costs. Understanding the distribution of risk levels
helps in prioritizing interventions, with a focus on 'Very Likely' employees who
require immediate attention.

4.2.3.2 Alignment with Turnover Prevention


The predictions generated in this analysis align seamlessly with the overarching
goal of the study: preventing employee turnover. By identifying and categorizing
at-risk employees, organizations can tailor retention initiatives to suit each
group's specific needs. 'Likely' employees may benefit from targeted engagement
efforts, while 'Very Likely' employees may require personalized retention plans.

4.2.3.4 Ethical Considerations


It is essential to emphasize that employee risk prediction should be carried out
ethically and responsibly. The predictions generated should not be used to
discriminate against or disadvantage employees. Instead, they should inform
strategies that aim to improve employee satisfaction, engagement, and overall
well-being.

4.2.3.5 Future Research and Development


The predictive model presented here serves as a foundational tool for ongoing
research and development. Future investigations may explore the factors
contributing to employees' risk levels, allowing for more accurate predictions.
Moreover, continuous data collection and model refinement will enable
organizations to adapt their retention strategies to evolving workforce dynamics.

4.2.4 Employee Turnover Prediction in Gender Distribution


To gain further insights into the demographics of at-risk employees, gender
distribution within this group was examined. The breakdown revealed (Figure 4-
8) that among the employees at risk, 60 are female, 23 are male, and 2 preferred

73
not to disclose their gender. This information provides valuable context for
tailoring retention strategies to different gender groups.

Figure 4-8: Gender Distribution for Employees at Risk

This critical statistic underscores the magnitude of potential turnover within the
organization and serves as a foundational insight for effective retention
strategies.

4.2.4.1 Implications and Future Strategies


The insights gained from this analysis provide a strong foundation for designing
targeted employee retention strategies. Understanding the gender distribution
among at-risk employees enables to develop retention initiatives that are
sensitive to the unique needs and concerns of different gender groups.
Furthermore, identifying employees in the 'Very Likely' category emphasizes the
necessity of swift and tailored interventions to mitigate turnover risks.

74
4.2.4.2 Discussion
The gender distribution among at-risk employees is a significant observation, with
a higher representation of females among those at risk of turnover. This finding
prompts a crucial discussion on potential underlying factors contributing to this
disparity, such as differences in job satisfaction, work-life balance, or career
advancement opportunities. Recognizing this gender disparity is vital for tailoring
retention strategies to address specific gender-based needs and concerns,
ultimately enhancing their effectiveness and fostering a more inclusive workplace
culture. Additionally, the presence of employees categorized as 'Very Likely' to
seek employment elsewhere underscores the urgency of addressing their
specific retention needs through open dialogue and tailored interventions.
Overall, the data-driven approach highlights the importance of addressing
turnover challenges while considering the diverse characteristics of the
workforce.

4.2.5 Top 10 Important Features for Employees at Risk


In this analysis aims to uncover (Figure 4-9) the critical factors contributing to
employee turnover risk, using a machine learning approach. The Random Forest
classifier has been employed to predict the likelihood of employees seeking
employment elsewhere based on various features derived from questionnaire
data. In this section, the top 10 important features present that significantly
influence employees' risk of turnover, along with their respective importances:

75
Figure 4-9: Top 10 Important Features for Employees at Risk

• Current compensation satisfaction_Strongly disagree (Importance =


0.0261): This feature indicates that employees who strongly disagree with
their current compensation satisfaction are at a higher risk of turnover. It
underscores the importance of addressing compensation-related
concerns to retain talent.
• Value of provided others benefits_Not at All (Importance = 0.0249):
Employees who perceive that their organization provides 'others benefits'
at the 'Not at All' level are more likely to seek employment elsewhere,
emphasizing the role of comprehensive benefit packages.
• Likelihood to recommend your current organization to your friend or
colleague_Likely (Importance = 0.0235): Employees who are likely to
recommend their organization to others exhibit lower turnover risk,
indicating the significance of positive word-of-mouth in retention.
• Supervisor communication satisfaction_Strongly Satisfied (Importance =
0.0233): High satisfaction with supervisor communication reduces
turnover risk, emphasizing the importance of effective leadership and
communication.
• Value of your opinions and ideas_Completely (Importance = 0.0229):
Employees who feel that their opinions and ideas are valued 'Completely'
are less likely to leave, highlighting the need for fostering a culture of
inclusivity.
• Tenure_3-5 years (Importance = 0.0227): Employees with a tenure of 3-5
years are more susceptible to turnover, indicating a potential period of
career restlessness.
• Value of your opinions and ideas_Mostly (Importance = 0.0225): Similar to
feature 5, a high level of perceived value for opinions and ideas reduces
turnover risk.
• Professional growth opportunities satisfaction_Strongly disagree
(Importance = 0.0216): Employees who strongly disagree with the
satisfaction of professional growth opportunities are at a higher risk of
turnover, stressing the importance of career development.

76
• Department_Customer Service (Importance = 0.0208): The department in
which an employee works significantly impacts turnover risk, with
'Customer Service' employees being more prone to turnover.
• Age_45-54 (Importance = 0.0204): The age group of 45-54 years is
associated with increased turnover risk, indicating a need for targeted
retention strategies for this demographic.

In the context of organizational management, these findings hold valuable


insights for developing effective retention strategies. Addressing factors related
to compensation satisfaction, benefits, communication, and career development
can help mitigate turnover risk. Additionally, tailoring strategies for specific
departments and age groups can further enhance their impact. This data-driven
approach provides a robust foundation for organizations to proactively manage
turnover and cultivate a more stable and satisfied workforce.

4.3 Sentiment Analysis


In this section, the results of the sentiment analysis are presented based on
comments collected from employees regarding the acknowledgment of their
performance within their current positions. A Support Vector Machine (SVM)
classifier with TF-IDF vectorization employed to evaluate sentiment in the
comments. The analysis consists basically of two parts:

• Performance metrics for the labeled dataset, and


• Sentiment prediction for new comments.

4.3.1 Performance Metrics for Labeled Dataset


For the labeled dataset, the SVM classifier exhibited promising results in
categorizing sentiments into three classes: 'Negative,' 'Positive,' and 'Neutral.'
Below (Picture 4-3) are the performance metrics for each sentiment class:

77
Picture 4-3: Performance Metrics for Labeled Dataset

The predictions were mapped back to sentiment labels, and the implication are
as follows:

• Positive Sentiment: These comments convey a positive sentiment


regarding the acknowledgment of performance.
• Negative Sentiment: Comments expressing dissatisfaction with the
acknowledgment of performance.
• Neutral Sentiment: The classifier predicted a neutral sentiment for some
comments, indicating an absence of strong positive or negative sentiment.

4.3.1.1 Interpretation of Results


These results indicate that the classifier excels in recognizing negative and
positive sentiments, achieving high precision, recall, and F1 scores. However, it
struggles with identifying neutral sentiments, possibly due to the imbalanced
distribution of this class.

The overall accuracy of the classifier on the labeled dataset is 0.83, suggesting
that it performs well in differentiating between positive and negative sentiments.
The challenge lies in correctly classifying neutral sentiments, which require

78
further investigation and possibly a larger dataset for improved model
performance.

The sentiment analysis model's performance on the labeled dataset provides


valuable insights into its effectiveness. A remarkable precision (1.00) and recall
(0.94) for negative sentiments were observed, indicating the model's proficiency
in correctly classifying responses expressing dissatisfaction or disappointment.
Similarly, for positive sentiments, the model demonstrated a strong precision of
0.91 and a recall of 0.71. These results emphasize the model's capability to
identify positive emotions and satisfaction among respondents.

However, the model faced challenges when classifying neutral sentiments. Here,
precision and recall both recorded values of 0.00, highlighting difficulties in
distinguishing responses that did not exhibit strong emotional tones. This
challenge can be attributed to the inherent complexity of neutral sentiments and
the limited representation of neutral responses in the dataset.

Despite the challenges with neutral sentiments, the model achieved an overall
accuracy of 83%, indicating its ability to effectively capture and classify emotional
nuances within the collected responses.

4.3.1.2 Discussion
The high precision, recall, and F1 scores for both 'Negative' and 'Positive'
sentiments in the labeled dataset demonstrate the effectiveness of this sentiment
analysis model in capturing strong sentiments. However, the challenge lies in
classifying 'Neutral' sentiments, which yielded a precision, recall, and F1 score of
zero. This discrepancy can be attributed to the imbalanced distribution of
sentiments in the dataset, where 'Neutral' sentiments are significantly
underrepresented. To improve the model's performance on neutral sentiments,
future research should focus on collecting a more balanced dataset.

The overall accuracy of 0.83 on the labeled dataset indicates that the model
performs well in distinguishing between positive and negative sentiments. This
suggests that employees who express strong opinions, either positive or
negative, are adequately recognized. However, the inability to detect neutral

79
sentiments highlights the model's limitation in identifying subtler expressions of
satisfaction or dissatisfaction.

4.3.1.3 Limitations and Future Directions


This study has some limitations, including the imbalanced dataset and the
model's struggle with classifying 'Neutral' sentiments. Future research should
focus on collecting a more balanced dataset and refining the model to better
handle neutral sentiments.

In conclusion, the sentiment analysis approach offers organizations a valuable


tool for monitoring and improving employee satisfaction through performance
acknowledgment. By addressing the sentiments expressed by employees,
organizations can tailor their strategies to enhance engagement, job satisfaction,
and overall performance. Further research and refinement of the model will
contribute to more accurate sentiment analysis and, subsequently, more effective
strategies for enhancing employee satisfaction and organizational success.

4.3.2 Distribution of Predicted Sentiments


The analysis of employee comments regarding the acknowledgment of their
performance in their current positions has yielded insightful results in the pie chart
(Figure 4-10). The distribution of predicted sentiments among the comments is
illustrated below:

80
Figure 4-10: Distribution of Predicted Sentiments

In this sentiment analysis of 132 comments, sentiments were distributed as


follows: a majority of respondents expressed positive feelings, with 68 comments
(51.5%), indicating satisfaction with how their performance was acknowledged
within their respective companies. Additionally, 44 comments (33.3%) were
categorized as neutral, suggesting a mixed sentiment or perhaps a lack of strong
emotional response. Conversely, 20 comments (15.2%) conveyed negative
sentiments, signaling dissatisfaction or perceived shortcomings in the recognition
of achieved performance.

4.3.2.1 Implications for Organizational Strategy


The analysis of predicted sentiments highlights the importance of organizational
strategies related to performance acknowledgment. To foster a positive work
environment and enhance employee satisfaction, organizations should:

• Celebrate and Recognize Achievements: Positive sentiments are


indicative of effective acknowledgment practices. Organizations should
continue to celebrate and recognize employees' achievements and
contributions, both big and small.

81
• Engage with Neutral Employees: The presence of neutral sentiments
presents an opportunity for improvement. Engaging with employees who
express neutrality can help organizations better tailor their
acknowledgment strategies and ensure they resonate with a broader
audience.
• Address Negative Perceptions: Negative sentiments should not be
ignored. Organizations must address the concerns raised by employees
with negative perceptions of acknowledgment promptly. Implementing
changes based on their feedback can lead to a more positive work
environment.

4.3.3 Distribution of Sentiment Polarity and Subjectivity


The sentiment analysis conducted in this study not only classified comments into
positive, neutral, or negative sentiments but also delved into the nuances of
sentiment polarity and subjectivity (Figure 4-11).

Figure 4-11: Distribution of Sentiment Polarity and Subjectivity

82
4.3.3.1 Sentiment Polarity
The mean polarity score for the analyzed comments is 0.06. While the value of
0.06 is relatively close to neutral, it implies a subtle inclination towards positivity,
indicating that the sentiments within the analyzed dataset are generally more
optimistic than negative or completely neutral. It's essential to recognize that the
interpretation of sentiment polarity can provide valuable insights into the overall
emotional sentiment conveyed by the text, helping to discern the prevailing
attitude of the respondents.

4.3.3.2 Sentiment Subjectivity


The mean subjectivity score for the analyzed comments is 0.28. This mean
subjectivity score indicates that, on average, the text data contains a moderate
level of subjectivity. A value of 0.28 suggests that the respondents' comments
often include a mix of both objective, factual information, and subjective elements
such as personal opinions or emotional expressions. The presence of this
subjectivity adds depth to the sentiment analysis, revealing that the comments
not only convey sentiment but also include a layer of personal perspectives and
feelings, making them more nuanced and informative for understanding the
sentiments expressed by the respondents.

These findings offer valuable insights into the depth and diversity of employee
sentiments regarding performance acknowledgment in their current positions.
The slight positive bias in sentiment polarity aligns with an overall favorable
perception of acknowledgment practices within the organization. However, the
notable variation and subjectivity in comments indicate that employees'
experiences and emotions regarding acknowledgment are multifaceted and can
benefit from a more detailed qualitative examination. These results emphasize
the importance of understanding not only whether sentiments are positive,
neutral, or negative but also the nuanced emotional context within which
employees express their perceptions.

83
4.3.3.3 Implications for Organizational Strategy
The implications drawn from the sentiment analysis outcomes hold substantial
relevance for shaping organizational strategies. The moderately positive
sentiment polarity suggests that the current acknowledgment practices have
been reasonably effective in fostering positive employee sentiments. To capitalize
on this positivity, organizations should continue and even enhance these
practices to reinforce a culture of appreciation and recognition.

Moreover, the diversity in sentiment subjectivity highlights the need for a more
tailored and empathetic approach. Acknowledgment strategies should account
for the various emotional nuances expressed by employees. This may involve
personalizing acknowledgment methods, such as tailoring recognition to
individual preferences or providing platforms for employees to express their
emotions openly.

In conclusion, this sentiment analysis not only provides a comprehensive


overview of employee perceptions of performance acknowledgment but also
dives into the emotional intricacies behind those sentiments. Understanding
these nuances is essential for organizations seeking to strengthen their
acknowledgment strategies and promote a positive workplace culture that fosters
employee satisfaction and retention.

4.3.4 Word Cloud Analysis


The word cloud generated from employee comments regarding the
acknowledgment of their performance in their current positions provides a visual
representation (Picture 4-4) of the most prominent themes and sentiments
expressed by employees.

84
Picture 4-4: Word Cloud of Comments

In this cloud, words are sized according to their frequency of appearance, with
larger words indicating higher frequency.

4.3.4.1 Interpretation of Results


The word cloud prominently highlights terms such as "acknowledgment,"
"recognition," and "company," suggesting that employees frequently associate
performance acknowledgment with these terms. Additionally, the terms "effort,"
"motivation," and "positive feedback" appear relatively larger, indicating that
employees often mention the positive impact of acknowledgment on their
motivation and job satisfaction. This positive sentiment is further reinforced by the
presence of words like "appreciated," "valued," and "great."

Conversely, words such as "lack," "disappointing," and "unappreciated"


underscore instances where employees express dissatisfaction or
disappointment in acknowledgment practices. These sentiments are crucial
indicators of areas that require improvement within the organization.

Furthermore, specific terms such as "career," "team," and "achievement" suggest


that employees value acknowledgment in various aspects of their professional
lives, including career progression and team achievements. The word "Hiring

85
Managers" also appears, indicating that employees may expect acknowledgment
during the hiring process, reflecting its influence on recruitment and retention.

This word cloud provides a condensed yet comprehensive view of the themes
and sentiments within employee comments, offering valuable insights for
organizations aiming to enhance their acknowledgment practices. It emphasizes
the importance of acknowledging not only outstanding achievements but also
everyday efforts and contributions to create a workplace culture that fosters
employee satisfaction and engagement.

4.3.4.2 Implications for Organizational Strategy


The insights derived from the word cloud analysis carry significant implications
for organizational strategy. Organizations should take note of the prevalent terms
such as "acknowledgment," "recognition," and "positive feedback" and recognize
their importance in fostering positive employee sentiments. Encouraging and
formalizing these practices can lead to improved job satisfaction and motivation
among employees.

Moreover, the presence of terms like "lack" and "disappointing" highlights


potential areas of concern. Organizations must pay attention to these sentiments
and actively address any deficiencies in their acknowledgment strategies. This
may involve revisiting current practices, seeking employee feedback, and
implementing changes to create a more supportive and appreciative work
environment.

Additionally, the word cloud underscores the need for acknowledgment to extend
beyond individual achievements to encompass team accomplishments and
career development. Organizations can tailor acknowledgment programs to
address these specific needs, thereby enhancing team dynamics and facilitating
employee growth.

In conclusion, the word cloud analysis serves as a valuable tool for organizations
to gain a quick, visual understanding of employee sentiments regarding
performance acknowledgment. By heeding the prominent themes and sentiments
displayed in the word cloud, organizations can refine their strategies to create a
workplace culture that recognize

86
Chapter 5: Conclusion and Recommendation

5.1 Conclusion
This empirical study revealed the revolutionary potential of data-driven decision-
making in the dynamic field of Human Resources (HR) analytics. Investigating
the areas of applicant screening, staff turnover prediction, and sentiment analysis
using sophisticated data analytics methodologies. In candidate screening,
discovering that using similarity percentages, cosine similarity heatmaps, and
overlapping words may improve the initial screening process, allowing for more
efficient and effective candidate selection. Providing predictive analytics
combined with extensive questionnaires may equip organisations to proactively
discover and retain valuable people in employee turnover prediction. Finally, in
sentiment analysis, using machine learning algorithms to assess employee
attitudes, giving organisations with practical data to personalise appreciation
methods and build a pleasant work culture. As organisations adopt HR analytics,
it is critical to establish a balance between data-driven automation and human
judgement, while also taking ethical implications into account. This paper
provides a core framework for HR practitioners and researchers to use big data
and machine learning to create more efficient and employee-centric HR
procedures.

5.2 Recommendations for Practitioners


Proposing that practitioners use data-driven screening strategies to improve the
efficiency of initial resume screening procedures and minimise resource usage.
These tools should supplement, not replace, human judgement in final recruiting
decisions, including credentials, interview performance, cultural fit, and other
holistic aspects. Continuous progress is required, which includes investigating
sophisticated NLP and machine learning approaches to increase candidate
ranking accuracy and developing scoring systems to efficiently prioritise
applicants. Implementing predictive analytics for prompt identification of
personnel at risk, as well as designing retention efforts, department-specific
techniques, and gender-sensitive initiatives, is critical in reducing employee
turnover. Acknowledgement practises should be improved by recognising
accomplishments, offering positive feedback, and tailoring recognition to

87
individual preferences. Finally, in sentiment analysis, ethical norms should be
devised and followed to guarantee responsible data utilisation and to preserve
employee privacy while maintaining a healthy work culture.

5.3 For Future Research


There are various intriguing options for future study to pursue. Delving further into
sophisticated natural language processing (NLP) and machine learning methods
in the field of applicant screening can improve candidate ranking and
recommendation systems. Given the ever-changing nature of work requirements,
the screening process will need to be updated and fine-tuned on a regular basis
to be successful. The development and deployment of clustering algorithms that
automatically group candidates based on similarity scores might help to further
streamline the selection process. Furthermore, research should concentrate on
minimising biases in resume screening algorithms in order to provide fair and
equitable hiring procedures. Future research in employee turnover prediction can
look at the underlying elements that contribute to employees' risk levels, create
interpretable models for deeper insights, and perform longitudinal studies to
evaluate the long-term success of predictive models and retention tactics. Ethical
rules for the proper use of turnover prediction models in the workplace should be
created. Advanced methodologies for sentiment analysis, such as sentiment
trend analysis across time and sentiment analysis across multiple communication
channels, should be investigated. Deep learning models may provide more
detailed insights on employee feelings, and studying the relationship between
sentiment, performance, and productivity can provide a better understanding of
the influence of acknowledgement practises on organisational outcomes. To
ensure responsible data utilisation and preserve employee privacy, ethical rules
and best practises for ethical sentiment analysis in the workplace should be
defined.

5.4 Final Thoughts


Finally, this thorough research demonstrates the revolutionary potential of HR
analytics when used appropriately. It demonstrates the transformative power of
big data, machine learning, and data-driven decision-making in applicant
screening, staff attrition prediction, and sentiment analysis. Organisations can

88
negotiate the developing HR environment by finding the proper balance between
automation and human judgement, ensuring that the right talent is paired with the
right opportunities, employee turnover is minimised, and a healthy work culture
is fostered. This study serves as a foundational guide for HR departments and
researchers as they embark on their respective journeys in this data-driven realm,
emphasising the importance of continuous improvement, ethical considerations,
and employee well-being in shaping more efficient and employee-centric HR
processes for long-term organisational success.

89
References

Harris, K., & Dulebohn, J.H. 2021. "The state of HR analytics: A research
review and agenda." Journal of Business and Psychology, 36(1), 21-51.

Searle, R.H., et al. 2020. "HR analytics competency: A systematic review."


Journal of Business Research, 106, 235-255.

Fitz-enz, J. (2014). Predictive Analytics for Human Resources. John Wiley


& Sons, p-3

Lawler, E. E. (2019). HR Analytics: The What, Why, and How. HR People


+ Strategy, 42(2), 7-13.

Davenport, T. H., Harris, J. G., & Shapiro, J. (2010). Competing on talent


analytics. Harvard Business Review, 88(10), 52-58.

Lengnick-Hall, M. L., Lengnick-Hall, C. A., & Rigsbee, C. M. (2019).


Human resource management in the digital age: Current and future trends.
Journal of Management, 45(2), 446-464.

Boudreau, J. W., & Cascio, W. F. (2017). Human resource analytics: Why


are we not there? Journal of Organizational Effectiveness: People and
Performance, 4(3), 206-225.

Minbaeva, D. B., Thunnissen, M., & Castro, F. B. (2020). HR in the digital


age: The impact of automation, artificial intelligence, and robotics. Human
Resource Management Journal, 30(2), 203-218.

Marr, B. (2019). Human resource analytics: The ultimate guide to using


data to optimize every aspect of your HR function. Kogan Page Publishers.

Lawler, E. E. III. (2008). Talent: Making people your competitive


advantage. Jossey-Bass, p-72

Lagunas, K. (2018). HR Analytics Handbook: Your Guide to Becoming a


Data-Driven HR Professional. Lighthouse Research & Advisory, p-13

90
Fitz-enz, J. (2018). HR Analytics: The What, Why, and How. Society for
Human Resource Management, p-50

Guenole, N. (2017). The Power of People: Learn How Successful


Organizations Use Workforce Analytics to Improve Business Performance.
Pearson FT Press, p-81

Roberts, P., Van Ark, T., & Gualtieri, P. (2018). The Rise of People
Analytics: How Predictive

Ulrich, D., & Dulebohn, J. H. (2015). Are we there yet? What’s next for
HR? Human Resource Management Review, 25(2), 188-204.

Laumer, S., Eckhardt, A., & Weitzel, T. (2019). HR analytics and the
privacy paradox: Investigating the role of organizational privacy climate in HR
analytics adoption. Journal of Business Economics, 89(8), 947-976.

Harel, G. H., Tzafrir, S. S., & Baruch, Y. Y. (2016). Utilizing HRIS for talent
management: Insights from the Israeli high-tech industry. The International
Journal of Human Resource Management, 27(1), 116-138.

Cascio, W. F. (2018). The changing role of HR: Learning from evidence-


based HR. Human Resource Management, 57(1), 7-25.

Schramm, P., & Wiesche, M. (2018). When employees talk, data speaks:
The integration of corporate social network analysis and sentiment analysis to
gain insights into organizational communication. Journal of Information
Technology, 33(4), 340-356.

Boudreau, J. W., & Cascio, W. F. (2017). Human resources analytics:


Foundations, methods, and applications. Human Resource Management Review,
27(1), 1-3.

Kavanagh, M. J., Thite, M., & Johnson, R. D. (2019). Human resource


information systems: Basics, applications, and future directions. Thousand Oaks,
CA: SAGE Publications.

91
Lawler III, E. E., & Levenson, A. (2019). People analytics: HR
transformation through data. Harvard Business Press.

Fitz-enz, J. (2016). Predictive analytics for human resources. John Wiley


& Sons.

Redman, T., & Holmström, J. (2016). From data quality to big data quality?
Journal of Organizational Computing and Electronic Commerce, 26(1-2), 37-51.

Marler, J. H., & Fisher, S. L. (2013). An evidence-based review of HR


analytics. The International Journal of Human Resource Management, 24(15),
3009-3033.

Rogers, K., & Lindauer, M. S. (2016). Predictive analytics in human


resources: Tutorial and 7-step guide. Human Resource Management, 55(3), 351-
366.

Marler, J. H., & Boudreau, J. W. (2017). An evidence-based review of HR


analytics: Does Big Data really transform HR? Human Resource Management
Review, 27(3), 267-281.

Bondarouk, T., Ruël, H., & van der Heijden, B. I. (2017). HR Shared
Service Centers: A Big Data approach to HR efficiency and customer satisfaction.
Human Resource Management, 56(4), 635-652.

Davenport, T. H. (2014). Big data in HR: Why predictive analytics matter.


MIT Sloan Management Review, 55(3), 56-65.

Bissola, R., & Imperatori, B. (2018). Big data analytics in human resource
management: A systematic literature review. Big Data Research, 14, 28-38.

Mellahi, K., Demirbag, M., & Riddle, L. (2018). Talent management and
global mobility: Big data analytics for strategic HRM. Journal of World Business,
53(6), 850-862.

Schramm, D. M., & Rocco, T. S. (2017). Big Data, analytics, and HRM:
Implications for scholars. Journal of Organizational Behavior, 38(3), 319-334.

92
Chou, T. (2016). Big data analytics in human resource management: A
literature review. Journal of Service Science and Management, 9(03), 321-332.

Martin, N. (2019). Big Data Analytics in Human Resource Management: A


Study. Journal of Advanced Research in Dynamical and Control Systems, 11(6),
1992-1998.

Marler, J. H., & Fisher, S. L. (2013). An evidence-based review of e-HRM


and strategic human resource management. Human Resource Management
Review, 23(1), 18-36.

Bondarouk, T. V., Brewster, C., & Guiderdoni-Jourdain, K. (2019). Big Data


and HRM: New opportunities and challenges in the digital era. Human Resource
Management Review, 29(1), 1-6.

Al-Dhaafri, H. S., Goodwin, R., & Khan, A. (2019). Leveraging Big Data
Analytics to Improve Human Resource Management. In Strategic Big Data
Analytics for Human Resources (pp. 17-35). IGI Global.

Kumar, V., & Singh, R. (2018). Big Data Analytics for Human Resource
Management: A Review of Literature. International Journal of Advanced
Research in Computer Science, 9(5), 133-139.

Parry, E., & Tyson, S. (2018). Big Data and HRM: Implications for HR
analytics. Journal of Organizational Effectiveness: People and Performance,
5(3), 269-285.

Jiang, K., Lepak, D. P., Hu, J., & Baer, J. C. (2019). How does human
resource management influence organizational outcomes? A meta-analytic
investigation of mediating mechanisms. Academy of Management Journal, 62(6),
1664-1696.

Kwon, S. W., & Adler, P. S. (2014). Social capital: Maturation of a field of


research. Academy of Management Review, 39(4), 412-422.

Laumer, S., Eckhardt, A., & Weitzel, T. (2018). The effect of big data and
analytics on firm performance: An econometric analysis considering industry
characteristics. Journal of Management Information Systems, 35(2), 488-509.

93
Marler, J. H., & Boudreau, J. W. (2017). An evidence-based review of HR
analytics. The International Journal of Human Resource Management, 28(1), 3-
26.

Alavi, S. E., Antons, D., & Ditschler, J. (2018). The role of big data and
analytics in predicting and improving human resource decisions. Personnel
Review, 47(3), 590-610.

Vaiman, V., & Scullion, H. (2019). Leveraging big data in human resource
management for enhanced international employee mobility and performance
management. The International Journal of Human Resource Management,
30(15), 2207-2225.

Chen, H., & Huang, H. (2014). Big Data and Analytics in Human Resource
Management: A Review and Future Directions. Journal of Management Analytics,
1(3), 178-209.

Li, J., Liang, H., & Li, J. (2017). Exploring Big Data Analytics for Human
Resource Management. Journal of Industrial Engineering and Management,
10(2), 217-230.

Kaushik, N., Chahal, H., & Bansal, N. (2019). Application of Big Data
Analytics in Human Resource Management: A Review. International Journal of
Advanced Research in Computer Science, 10(1), 47-50.

Shukla, A., Kumar, A., & Gopal, A. (2019). Leveraging Big Data Analytics
for Talent Management: A Framework for Human Resource Professionals.
Business Perspectives and Research, 7(1), 20-30.

Marler, J. H., & Boudreau, J. W. (2017). An evidence-based review of HR


analytics. The International Journal of Human Resource Management, 28(1), 3-
26.

Bondarouk, T., Parry, E., & Furtmueller, E. (2017). Electronic HRM: Four
decades of research on adoption and consequences. The International Journal
of Human Resource Management, 28(1), 98-131.

94
Zheng, C., Yang, J., & Wang, D. (2019). The impact of HR analytics
capability on organizational performance: A resource-based view perspective.
Information & Management, 56(2), 271-284.

Raghuram, S., Garud, R., & Wiesenfeld, B. (2019). Sociomateriality and


the paradox of lead users in big data analytics: Exploring intermediaries in
crowdcasting. MIS Quarterly, 43(3), 911-932.

Rynes, S. L., Giluk, T. L., & Brown, K. G. (2007). The very separate worlds
of academic and practitioner periodicals in human resource management:
Implications for evidence-based management. Academy of Management
Journal, 50(5), 987-1008.

Hekler, E. B., Klasnja, P., Riley, W. T., Buman, M. P., Huberty, J., Rivera,
D. E., & Martin, C. A. (2016). Agile science: Creating useful products for behavior
change in the real world. Translational Behavioral Medicine, 6(2), 317-328.

Pardo del Val, M., Fuentes-Fuentes, M. M., & López-Sáez, P. (2018).


Building HR analytics capability: A systematic review of analytics implementation
literature. Human Resource Management Review, 28(3), 347-362.

Kehoe, R. R., & Wright, P. M. (2013). The impact of high-performance


human resource practices on employees' attitudes and behaviors. Journal of
Management, 39(2), 366-391.

Carlson, J., & Kavanagh, M. J. (2020). Big Data and HRM. In The
Routledge Companion to Strategic HRM, 166-185.

Buuren, R. V., & Steijn, B. (2017). The influence of HR analytics on


employee well-being: A research agenda. Human Resource Management
Review, 27(2), 331-341.

Parry, E., & McCarthy, L. (2017). Introduction: Human Resource


Management, Artificial Intelligence, and the Gig Economy. Human Resource
Management, 56(3), 377-386.

95
Parry, E., & McCarthy, L. (2017). Artificial intelligence and the HR function:
Transformation or transition?. Human Resource Management Review, 27(2),
176-185.

Rosenbaum, M. S., & Wong, Y. T. (2021). Machine Learning and HRM:


The Strategic Role of Human Resource Analytics. Human Resource
Management, 60(2), 265-280.

Raghuram, S., & Arvey, R. D. (2019). The Promise and Pitfalls of Applying
Machine Learning Algorithms to Personnel Selection. Journal of Applied
Psychology, 104(7), 864-880.

Bondarouk, T., & Ruel, H. J. M. (2019). Electronic HRM: Four decades of


research on adoption and consequences. The International Journal of Human
Resource Management, 30(1), 5-33.

Agarwal, A., & Marler, J. H. (2020). The Effects of Robots in HR: Evidence
from Performance and Peer Evaluations. Academy of Management Discoveries,
6(3), 401-425.

Boudreau, J.W., & Cascio, W.F. (2018). Human resource analytics: Why
aren't we there? Journal of Organizational Effectiveness: People and
Performance, 5(4), 278-295.

Yang, Y., & Cao, J. (2019). Applying machine learning algorithms in HR


analytics: A systematic review. International Journal of Human Resource
Management, 30(13), 1951-1983.

Liao, Y., & Wang, Z. (2020). Employee performance prediction using


machine learning algorithms: A systematic literature review. Journal of
Organizational Behavior, 41(5), 466-491.

Akkermans, J., Richardson, J., & Kraimer, M.L. (2021). Talent analytics:
What makes it effective? Journal of Organizational Behavior, 42(1), 17-34.

Marr, B., & Gray, H. (2020). HR Analytics: The What, Why, and How.
Retrieved from https://www.sas.com/en_us/whitepapers/hr-analytics-
107108.html

96
Bersin, J., & Grierson, A. (2018). People analytics: The ultimate HR
analytics guide. Retrieved from https://marketing.bersin.com/rs/976-TYR-
313/images/Bersin-People-Analytics-Guide-V14.pdf

Davenport, T. H. (2018). The AI advantage: How to put the artificial


intelligence revolution to work. Cambridge, MA: MIT Press.

Mendenhall, M., & Villanova, P. (2019). Machine Learning and HR


Analytics for Talent Management. Retrieved from https://www.shrm.org/hr-
today/trends-and-forecasting/research-and-
surveys/Documents/Machine%20Learning%20and%20HR%20Analytics.pdf

Singh, P., & Rao, S. (2020). Artificial Intelligence and Machine Learning in
Human Resource Management. Retrieved from
https://www.springer.com/gp/book/9783030451481

Wheeler, B. (2020). Human Resource Analytics: The Key to Unlocking


Organizational Success. Retrieved from
https://www.shrm.org/ResourcesAndTools/tools-and-samples/hr-
qa/Pages/CMS_025614.aspx

Parry, E., & Tyson, S. (2018). Artificial intelligence and human resource
management. Human Resource Management Review, 28(4), 376-386.

Bapna, R., Gupta, A., & Mariadoss, B. J. (2019). Sentiment analysis and
machine learning in HRM: A review and future research directions. Human
Resource Management Review, 29(1), 96-110.

Aguinis, H., & Lawal, S. O. (2018). Big data analytics in human resource
management: A systematic review of trends and challenges. Human Resource
Management Review, 28(3), 323-340.

Mone, E. M., & London, M. (2019). Employee career development: An


integrative framework and review of emerging trends. Annual Review of
Organizational Psychology and Organizational Behavior, 6, 159-186.

97
Kuang, L., Li, W., & Zhou, L. (2020). Data analytics for human resource
management: A review of the literature and implications for the future. Human
Resource Development Review, 19(3), 307-332.

Hui, B., Hui, K., & He, W. (2021). Leveraging machine learning algorithms
for HR analytics: An organizational perspective. Journal of Organizational
Computing and Electronic Commerce, 31(3), 264-281.

Cheng, L., & Wang, X. (2020). Human resource analytics: A systematic


review and future directions. Journal of Organizational Computing and Electronic
Commerce, 30(4), 301-323.

Reference: Konstantinidis, E., Staggers, N., & Brombacher, A. (2020).


People analytics in the era of GDPR: Ethical considerations for HR analytics
professionals. International Journal of Information Management, 50, 247-256.

Yan, J., & Wu, X. (2020). Employee engagement prediction using machine
learning models: A comparative study. International Journal of Human Resource
Management, 31(6), 753-775.

Ye, S., & Gong, Y. (2019). Temporal feature engineering for employee
turnover prediction. International Journal of Human Resource Management,
30(10), 1513-1535.

Sibanda, T., & Xia, Y. (2019). Employee attrition prediction using


classification algorithms: A comparative analysis. International Journal of
Information Management, 45, 157-167.

Xing, K., & Wang, X. (2018). A comparison of machine learning algorithms


for resume screening. Journal of Business Research, 89, 141-148.

Singh, A., & Sahni, P. (2017). Application of clustering algorithms in HR


analytics: A review. 2017 International Conference on Inventive Systems and
Control (ICISC), 1-6.

Verma, V., & Mishra, D. (2016). Clustering algorithms in HR analytics: A


systematic review. 2016 2nd International Conference on Contemporary
Computing and Informatics (IC3I), 442-446.

98
Venkataramanan, R., & Alhazmi, H. A. (2019). A systematic review on the
application of clustering algorithms in HR analytics. 2019 3rd International
Conference on Trends in Electronics and Informatics (ICOEI), 832-837.

Pachori, V., & Dwivedi, A. (2018). Clustering algorithms in HR analytics: A


systematic review. 2018 9th International Conference on Computing,
Communication and Networking Technologies (ICCCNT), 1-5.

Das, S., & Pal, A. (2021). A comprehensive review on the application of


clustering algorithms in HR analytics. 2021 International Conference on
Intelligent Sustainable Systems (ICISS), 372-377.

Li, Y., Liu, D., & Zhao, D. (2020). Human resources analytics in the era of
big data: Privacy and ethical considerations. Journal of Business Ethics, 163(4),
627-641.

Feldman, M., & Friedler, S. A. (2019). A critique of fairness metrics as


equality of opportunity in supervised learning. Proceedings of the 2019
AAAI/ACM Conference on AI, Ethics, and Society, 211-216.

Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional


accuracy disparities in commercial gender classification. Proceedings of the 1st
Conference on Fairness, Accountability and Transparency, 81-91.

Raghavendra, S., & Venugopal, K. R. (2018). HR Analytics: Applications


and Future Directions. International Journal of Applied Engineering Research,
13(9), 7390-7393.

Marler, J. H., & Boudreau, J. W. (2017). An evidence-based review of HR


analytics. The International Journal of Human Resource Management, 28(1), 3-
26.

Kaur, A., & Jain, S. (2021). Predictive Analytics in HR: A Review of


Techniques and Challenges. Procedia Computer Science, 188, 496-504.

Su, L., & Goh, D. H. L. (2020). Transfer Learning in Human Resource


Analytics: A Review. In Proceedings of the 9th International Conference on
Enterprise Systems, 302-312.

99
Haidari, S. H., & Smith, A. D. (2018). Analyzing Employee Reviews of
Companies Using NLP Techniques. In Proceedings of the 2018 IEEE 5th
International Conference on Data Science and Advanced Analytics, 129-138.

Mani, K. M., & Prasad, M. (2019). Analysis of Employee Exit Interview


Comments Using Text Mining Techniques. In Proceedings of the 2019 IEEE 8th
International Conference on Advanced Computing (IACC), 647-652.

Piccoli, G., et al. (2019). Virtual HR: The Impact of AI Chatbot Service
Delivery on Employee User Experience. Communications of the Association for
Information Systems, 45, 51-68.

Huang, L., et al. (2019). Predictive Analytics in Human Resource


Development: A Systematic Review and Future Directions. IEEE Transactions on
Education, 62(3), 219-228.

Rivera, L. A. (2012). Hiring as cultural matching: The case of elite


professional service firms. American Sociological Review, 77(6), 999-1022.

Cappelli, P. (2019). From the new deal to the gig economy: How changing
labor conditions are reshaping America's workforce. Labor Studies Journal,
44(2), 89-103.

Davenport, T. H., & Patil, D. J. (2012). Data scientist. Harvard Business


Review, 90(5), 70-76.

Chen, X., Xu, H., Zhang, C., & Hu, B. (2018). Resume-Job Matching: A
Study with Deep Learning Semantic Embeddings. In Twenty-Second Pacific Asia
Conference on Information Systems, Yokohama.

Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that


will transform how we live, work, and think. John Murray Press.

Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of
the 21st century. Harvard Business Review, 90(5), 70-76.

100
Cappelli, P. (2019). From the new deal to the gig economy: How changing
labor conditions are reshaping America's workforce. Labor Studies Journal,
44(2), 89-103.

Chen, X., Xu, H., Zhang, C., & Hu, B. (2018). Resume-Job Matching: A
Study with Deep Learning Semantic Embeddings. In Twenty-Second Pacific Asia
Conference on Information Systems, Yokohama.

Jurafsky, D., & Martin, J. H. (2019). Speech and Language Processing.


Prentice Hall.

Chen, J., Zhang, H., & He, X. (2018). Attentive collaborative filtering:
Multimedia recommendation with item- and component-level attention. In
Proceedings of the 40th international ACM SIGIR conference on Research and
development in information retrieval (pp. 335-344).

Zhang, Y., Zhang, S., & Zhao, X. (2020). Natural Language Processing for
Resume Screening: A Review. In Proceedings of the 12th International
Conference on Agents and Artificial Intelligence.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to


Information Retrieval. Cambridge University Press.

Jurafsky, D., & Martin, J. H. (2020). Speech and Language Processing: An


Introduction to Natural Language Processing, Computational Linguistics, and
Speech Recognition. Pearson.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to


Information Retrieval. Cambridge University Press.

Ramesh, A., & Kambhampati, S. (2005). Extracting structured information


from free text resumes. In European Conference on Information Retrieval, 98-
109.

Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of
the 21st century. Harvard Business Review, 90(5), 70-76.

101
Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets.
Cambridge University Press.

Ton, Z. & Huckman, R. S. (2008). Managing the Impact of Employee


Turnover on Performance: The Role of Process Conformance. Organization
Science, 19(1), 56-68.

IBM Smarter Workforce (2016). Predictive Retention: Proactively Keep


Your Top Talent on Board. IBM Corporation.

Boushey, H., & Glynn, S. J. (2012). There are significant business costs
to replacing employees. Center for American Progress, 16, 1-9.

Hausknecht, J. P., & Trevor, C. O. (2011). Collective turnover at the group,


unit, and organizational levels: Evidence, issues, and implications. Journal of
Management, 37(1), 352-388.

Ton, Z., & Huckman, R. S. (2008). Managing the impact of employee


turnover on performance: The role of process conformance. Organization
Science, 19(1), 56-68.

Waldman, J. D., Kelly, F., Arora, S., & Smith, H. L. (2010). The shocking
cost of turnover in health care. Health Care Management Review, 35(3), 206-
211.

Hausknecht, J. P., Rodda, J., & Howard, M. J. (2009). Targeted employee


retention: Performance-based and job-related differences in reported reasons for
staying. Human Resource Management, 48(2), 269-288.

Holtom, B. C., Mitchell, T. R., Lee, T. W., & Eberly, M. B. (2008). Turnover
and retention research: A glance at the past, a closer review of the present, and
a venture into the future. Academy of Management Annals, 2(1), 231-274.

Maertz, C. P., & Campion, M. A. (2004). Profiles in quitting: Integrating


process and content turnover theory. Academy of Management Journal, 47(4),
566-582.

102
McKinney, W. (2017). Python for Data Analysis: Data Wrangling with
Pandas, NumPy, and IPython. O'Reilly Media, Inc.

Molin, S. (2020). Hands-On Data Analysis with Pandas. Packt Publishing


Ltd.

VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools


for Working with Data. O'Reilly Media, Inc.

Sheppard, K. (2020). Introduction to Python for Econometrics, Statistics


and Data Analysis. 4th Edition.

Chen, D.Y. (2018). Pandas for Everyone: Python Data Analysis. Pearson.

Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.

Liaw, A., & Wiener, M. (2002). Classification and regression by


RandomForest. R news, 2(3), 18-22.

Oshiro, T. M., Perez, P. S., & Baranauskas, J. A. (2012). How many trees
in a random forest? In Machine Learning and Data Mining in Pattern Recognition,
154-168.

Branco, P., Torgo, L., & Ribeiro, R. P. (2016). A survey of predictive


modeling on imbalanced domains. ACM Computing Surveys (CSUR), 49(2), 1-
50.

Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: special issue
on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter,
6(1), 1-6.

Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to


ROC, informedness, markedness, and correlation. Journal of Machine Learning
Technologies, 2(1), 37-63.

Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under
the ROC curve for multiple class classification problems. Machine Learning,
45(2), 171-186.

103
Berson, A., Smith, S., & Thearling, K. (2012). Building data science teams.
"O'Reilly Media, Inc.".

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis.


Springer-Verlag New York.

Healey, C. G., & Enns, J. T. (2012). Attention and visual memory in


visualization and computer graphics. IEEE Transactions on Visualization and
Computer Graphics, 18(7), 1170-1188.

Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis.
Foundations and Trends® in Information Retrieval, 2(1-2), 1-135.

Liu, B. (2012). Sentiment Analysis and Opinion Mining. Synthesis Lectures


on Human Language Technologies, 5(1), 1-167.

Feldman, R. (2013). Techniques and Applications for Sentiment Analysis.


Communications of the ACM, 56(4), 82-89.

Buckingham, M., & Goodall, A. (2015). Reinventing performance


management. Harvard Business Review, 93(4), 40-50.

Harter, J. K., Schmidt, F. L., & Hayes, T. L. (2002). Business-unit-level


relationship between employee satisfaction, employee engagement, and
business outcomes: a meta-analysis. Journal of Applied Psychology, 87(2), 268.

Adkins, A. (2016). What Millennials Want from Work: How to Maximize


Engagement in Today’s Workforce. McGraw-Hill Education.

Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis.
Foundations and Trends® in Information Retrieval, 2(1–2), 1-135.

Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures


on Human Language Technologies, 5(1), 1-167.

Russell, S. J. (2013). Artificial intelligence: A modern approach. Malaysia;


Pearson Education Limited.

104
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with
Python. O'Reilly Media, Inc.

Loria, S. (2018). TextBlob: Simplified Text Processing. [Online


documentation]. Retrieved from https://textblob.readthedocs.io/en/dev/

Aggarwal, C. C., & Zhai, C. (2012). A survey of text classification


algorithms. In Mining text data (pp. 163-222). Springer, Boston, MA.

Joachims, T. (1998). Text categorization with support vector machines:


Learning with many relevant features. In European conference on machine
learning (pp. 137-142). Springer, Berlin, Heidelberg.

Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment
classification using machine learning techniques. In Proceedings of the ACL-02
conference on Empirical methods in natural language processing (Vol. 10, pp.
79-86). Association for Computational Linguistics.

Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern


analysis. Cambridge university press.

Sebastiani, F. (2002). Machine learning in automated text categorization.


ACM computing surveys (CSUR), 34(1), 1-47.

Brownlee, J. (2022). Natural Language Processing with Python. Machine


Learning Mastery.

Chollet, F. (2018). Deep Learning with Python. Manning Publications.

Raschka, S., & Mirjalili, V. (2019). Python Machine Learning. Packt


Publishing Ltd.

Li, S. (2021). Feature Engineering for Machine Learning: Principles and


Techniques for Data Scientists. O'Reilly Media, Inc.

Geron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras,


and TensorFlow. O'Reilly Media, Inc.

105
Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance
measures for classification tasks. Information Processing & Management, 45(4),
427-437.

Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to


ROC, informedness, markedness & correlation. Journal of Machine Learning
Technologies, 2(1), 37-63.

BEAJAMES. (2014). The Evolution of HR Technology.From


https://talentscience.wordpress.com/2014/12/02/33/

Boatman, A. (2023). A Guide To The 4 Types of HR Analytics. From


https://www.aihr.com/blog/types-of-hr-analytics/

Abdul, Q. (2019). HR Analytics and Predictive Decision-making model.


From
https://www.researchgate.net/publication/333238246_HR_ANALYTICS_A_MOD
ERN_TOOL_IN_HR_FOR_PREDICTIVE_DECISION_MAKING

Lucassen, J.-P. (2023). 7 Common People Analytics Challenges. From


https://www.aihr.com/blog/people-analytics-challenges/

Lalwani, P. (2023). What Data Does an HR Analytics Tool Need? From


https://www.spiceworks.com/hr/hr-analytics/articles/what-is-hr-analytics/

Kumar, M. B. (2023). Applications of Data Science in HR Analytics. From


https://360digitmg.com/blog/data-science-applications-in-hr-analytics

HR & PEOPLE ANALYTICS. (2020). The relationship between AI, ML and


the three broad types of ML. From https://hyperight.com/a-beginners-guide-to-
machine-learning-for-hr-practitioners/

Lawton, G. (2023). Data preprocessing. From


https://www.techtarget.com/searchdatamanagement/definition/data-
preprocessing

Soper, J., & Landau, H. (2023). Manually Review Resumes. From


https://fitsmallbusiness.com/resume-screening/

106
Nguyen, T. (2022). How Does a Resume Parser Work? From
https://www.neurond.com/blog/what-is-a-cv-resume-parser-how-it-works

107
Appendix

Task 1 - Resume Screening

1.1 Code and Algorithms

# importing libraries
import os
import PyPDF2
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

# Function to read the content of a PDF file


def read_pdf(file_path):
with open(file_path, 'rb') as file:
pdf_reader = PyPDF2.PdfReader(file)
text = ""
for page_obj in pdf_reader.pages:
text += page_obj.extract_text()
return text

# Function to preprocess text data (tokenization, stopword removal, stemming)


def preprocess_text(text):
# Tokenization
tokens = word_tokenize(text.lower())
# Remove stopwords
stop_words = set(stopwords.words("english"))
filtered_tokens = [token for token in tokens if token not in stop_words]
# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
return " ".join(stemmed_tokens)

# Function to calculate cosine similarity between two texts


def get_cosine_similarity(text1, text2):
vectorizer = CountVectorizer().fit_transform([text1, text2])
vectors = vectorizer.toarray()
return cosine_similarity(vectors)[0][1]

# Main function to execute the code (for all resumes)


def main():
job_requirement = read_pdf("/Users/rahman/Downloads/Resume
screening/Job requirement/Data Analyst.pdf")
job_requirement_processed = preprocess_text(job_requirement)

resume_folder = "/Users/rahman/Downloads/Resume screening/Resumes"

108
resume_files = [file for file in os.listdir(resume_folder) if file.endswith(".pdf")]
total_resumes = len(resume_files) # Total number of resumes

resumes = []
similarities = []

for resume_file in resume_files:


resume_path = os.path.join(resume_folder, resume_file)
resume_text = read_pdf(resume_path)
resumes.append(resume_text)
resumes_processed = [preprocess_text(resume) for resume in resumes]
similarity = get_cosine_similarity(job_requirement_processed,
resumes_processed[-1])
similarities.append(similarity)

# Plotting diagram of the similarity percentages for all resumes


plt.figure(figsize=(10, 6))
plt.scatter(resume_files, [similarity * 100 for similarity in similarities],
color='black', marker='o')

plt.xlabel('Resume Names')
plt.ylabel('Similarity Percentage')
plt.title('Similarity Percentage for All Resumes')
plt.xticks(rotation=90)
plt.grid(True, linestyle='--', alpha=0.6)
plt.text(0.8, 0.9, f'Total Resumes: {total_resumes}',
transform=plt.gca().transAxes, fontsize=12)

plt.tight_layout()
plt.show()

# Execute the main function for top 10 resumes


if __name__ == "__main__":
main()

# Main function to execute the code (for top 10 resumes)


def main():
job_requirement = read_pdf("/Users/rahman/Downloads/Resume
screening/Job requirement/Data Analyst.pdf")
job_requirement_processed = preprocess_text(job_requirement)

resume_folder = "/Users/rahman/Downloads/Resume screening/Resumes"


resume_files = [file for file in os.listdir(resume_folder) if file.endswith(".pdf")]

resumes = []
for resume_file in resume_files:
resume_path = os.path.join(resume_folder, resume_file)
resume_text = read_pdf(resume_path)
resumes.append(resume_text)

resumes_processed = [preprocess_text(resume) for resume in resumes]

109
similarities = [get_cosine_similarity(job_requirement_processed, resume) for
resume in resumes_processed]

resume_scores = list(enumerate(similarities))
sorted_resumes = sorted(resume_scores, key=lambda x: x[1], reverse=True)

top_n = 10
selected_resumes = [(resume_files[index], score*100) for index, score in
sorted_resumes[:top_n]]

# Plotting the similarity percentages for Top 10 Resumes


resume_names, percentages = zip(*selected_resumes)
plt.figure(figsize=(12, 8))
bars = plt.barh(resume_names, percentages, color='skyblue')

for bar in bars:


width = bar.get_width()
plt.text(width + 0.5,
bar.get_y() + bar.get_height()/2,
f'{width:.2f}%',
va='center',
ha='center',
color='black')

plt.xlabel('Similarity Percentage')
plt.ylabel('Resume Names')
plt.title('Top 10 Resumes by Similarity Percentage')
plt.gca().invert_yaxis()
plt.show()

# Main function of the Cosine similarity matrix


def main():
job_requirement = read_pdf("/Users/rahman/Downloads/Resume
screening/Job requirement/Data Analyst.pdf")
job_requirement_processed = preprocess_text(job_requirement)

resume_folder = "/Users/rahman/Downloads/Resume screening/Resumes"


resume_files = [file for file in os.listdir(resume_folder) if file.endswith(".pdf")]

resumes = []
for resume_file in resume_files:
resume_path = os.path.join(resume_folder, resume_file)
resume_text = read_pdf(resume_path)
resumes.append(resume_text)

resumes_processed = [preprocess_text(resume) for resume in resumes]

similarities = [get_cosine_similarity(job_requirement_processed, resume) for


resume in resumes_processed]
resume_scores = list(enumerate(similarities))
sorted_resumes = sorted(resume_scores, key=lambda x: x[1], reverse=True)

110
top_n = 10
selected_indices = [index for index, score in sorted_resumes[:top_n]]

# Create similarity matrix for the top 10 resumes


similarity_matrix = []
for i in selected_indices:
row = []
for j in selected_indices:
sim = get_cosine_similarity(resumes_processed[i],
resumes_processed[j])
row.append(sim)
similarity_matrix.append(row)

# Plotting the heatmap


plt.figure(figsize=(10, 8))
sns.heatmap(similarity_matrix, annot=True, cmap='YlGnBu',
xticklabels=[resume_files[i] for i in selected_indices], yticklabels=[resume_files[i]
for i in selected_indices])
plt.title('Cosine Similarity Heatmap of Top 10 Resumes')
plt.show()

# Main function to find overlapping words


def main():
job_requirement = read_pdf("/Users/rahman/Downloads/Resume
screening/Job requirement/Data Analyst.pdf")
job_requirement_processed = preprocess_text(job_requirement)

resume_folder = "/Users/rahman/Downloads/Resume screening/Resumes"


resume_files = [file for file in os.listdir(resume_folder) if file.endswith(".pdf")]

resumes = []
for resume_file in resume_files:
resume_path = os.path.join(resume_folder, resume_file)
resume_text = read_pdf(resume_path)
resumes.append(resume_text)

resumes_processed = [preprocess_text(resume) for resume in resumes]

similarities = [get_cosine_similarity(job_requirement_processed, resume) for


resume in resumes_processed]
resume_scores = list(enumerate(similarities))
sorted_resumes = sorted(resume_scores, key=lambda x: x[1], reverse=True)

top_n = 10 # Get top 10 similar resumes


selected_indices = [index for index, score in sorted_resumes[:top_n]]

for index in selected_indices:


resume_name = resume_files[index]
overlapping = overlapping_words(job_requirement_processed,
resumes_processed[index])

111
print(f"For the resume '{resume_name}', the overlapping words with the job
requirement are:\n{', '.join(overlapping)}\n{'-'*80}")

1.2 Resumes Link

https://drive.google.com/drive/folders/1a8lmfMk-NpWV0TLzTHex-
0mjnSNX4d-3?usp=drive_link

Job Requirement:

112
113
Task 2 - Predicting Employee Turnover

2.1 Questionnaire

Hello Participant Name,

I hope you're well. I'm reaching out again regarding my master's thesis. I've
crafted a brief questionnaire related to my work on machine learning algorithms.
Just 3-4 minutes of your time for the survey would be a huge help and your
insights would be invaluable.

Survey link: https://forms.gle/UfoDJFhDM2Yc65DJ6

Appreciate your ongoing support.

Best regards

Rahman

Question Options
1. Age 18-24, 25-34, 35-44, 45-54, 55 and above
2. Gender Male, Female, Prefer not to say
3. Department Analytics and Data Science, Marketing,
Human Resources, Sales, Customer
Service, Finance, Operations, Research
and Development, IT (Information
Technology), Legal, Administration,
Production, Supply Chain, Public Relations,
Project Management, Procurement
4. Tenure 0-2 years, 3-5 years, 6-10 years, 11+ years

5. Supervisor communication Strongly Satisfied, Satisfied, Neutral,


satisfaction Dissatisfied, Strongly Dissatisfied
6. Overall job satisfaction Strongly Satisfied, Satisfied, Neutral,
Dissatisfied, Strongly Dissatisfied
7. Skills utilization in current role Completely, Mostly, Moderately, Slightly,
Not at All

114
8. Professional growth opportunities Strongly Satisfied, Satisfied, Neutral,
satisfaction Dissatisfied, Strongly Dissatisfied
9. Value of opinions and ideas Completely, Mostly, Moderately, Slightly,
Not at All
10. Connection to colleagues and Completely, Mostly, Moderately, Slightly,
team Not at All
11. Company's diversity and inclusion Strongly Satisfied, Satisfied, Neutral,
efforts Dissatisfied, Strongly Dissatisfied
12. Current compensation satisfaction Very Satisfied, Satisfied, Neutral,
Dissatisfied, Very Dissatisfied
13. Value of provided other benefits Completely, Mostly, Moderately, Slightly,
Not at All

2.2 Code and Algorithms

# importing libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.metrics import confusion_matrix, classification_report

# Load labeled dataset


data = pd.read_csv('labeled_risk.csv')

# Split data into features and target


X = data.drop('Likelihood of seeking employment elsewhere', axis=1)
y = data['Likelihood of seeking employment elsewhere']

# Convert categorical variables to numerical (one-hot encoding)


X = pd.get_dummies(X)

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42, stratify=y)

# Initialize and train the RandomForestClassifier on the training data


model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the testing data


y_pred = model.predict(X_test)

115
# Calculate performance metrics for the testing data
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test),
multi_class='ovr')

# Print performance metrics for the testing data, Confusion matrix &
Classification report
print("Performance Metrics for Testing Data:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC Score: {roc_auc:.2f}")

# Load the survey data for prediction (148 responses)


new_data = pd.read_csv('risk.csv')

# Preprocess the survey data to match the features of the trained model
X_new = pd.get_dummies(new_data)

# Ensure that the columns in X_new match the columns in X_train


missing_cols = set(X_train.columns) - set(X_new.columns)
for col in missing_cols:
X_new[col] = 0
X_new = X_new[X_train.columns]

# Make predictions on the survey data using the trained model


y_new_pred = model.predict(X_new)

# Add the predictions back to the survey_data DataFrame


new_data['Predicted_Labels'] = y_new_pred

# Identify employees at risk of leaving (Very Likely or Likely)


employees_at_risk = new_data[new_data['Predicted_Labels'].isin(['Very Likely',
'Likely'])]

# Print the employees at risk


print("\nEmployees at Risk of Leaving:")
print(employees_at_risk)

# importing libraries for visualization


import matplotlib.pyplot as plt
import seaborn as sns

# Visualizing age distribution in Bar Chart


plt.figure(figsize=(8, 6))
ax = sns.countplot(data=data, x='Age')
plt.title("Age Distribution")
plt.xlabel("Age Group")

116
plt.ylabel("Count")

for p in ax.patches:
ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='bottom', fontsize=10, color='black')

total_employees = len(data)
plt.text(0.55, 0.85, f'Total Employees: {total_employees}',
transform=ax.transAxes, fontsize=12,
verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white',
alpha=0.5))

plt.show()

# Visualizing gender distribution in Pie Chart


plt.figure(figsize=(6, 6))
gender_counts = data['Gender'].value_counts()
total_employees = len(data)

percentages = (gender_counts / total_employees) * 100


labels = [f"{gender} ({count} - {percentage:.1f}%)" for gender, count, percentage
in zip(gender_counts.index, gender_counts, percentages)]

plt.pie(gender_counts, labels=labels, autopct='', startangle=140)


plt.title(f"Gender Distribution (Total Employees: {total_employees})")
plt.show()

total_employees = len(data)

# Visualizing department distribution in Bar Chart


plt.figure(figsize=(10, 6))
sns.countplot(data=data, y='Department')
plt.title("Department Distribution")
plt.xlabel("Count")
plt.ylabel("Department")

for p in plt.gca().patches:
plt.gca().annotate(f'{int(p.get_width())}', (p.get_width() + 0.5, p.get_y() +
p.get_height() / 2), ha='center', va='center')

plt.text(0.75, 0.95, f'Total Employees: {total_employees}',


transform=plt.gca().transAxes, fontsize=12)

plt.show()

# Visualizing employee turnover in Bar Chart


plt.figure(figsize=(8, 6))
ax = sns.countplot(x='Predicted_Labels', data=employees_at_risk)
plt.title('Distribution of Predicted Labels for Employees at Risk')
plt.xlabel('Predicted Labels')
plt.ylabel('Count')
plt.xticks(rotation=45)

117
for p in ax.patches:
ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 10), textcoords='offset points')

total_employees = len(employees_at_risk)
plt.text(0.5, 0.5, f'Total Employees at Risk: {total_employees}',
horizontalalignment='center', verticalalignment='center',
transform=plt.gca().transAxes, fontsize=14, color='black')

plt.show()

# Visualizing employee turnover for each gender (male, female or prefer not to
say) in Bar Chart
plt.figure(figsize=(10, 6))
ax = sns.countplot(x='Gender', hue='Predicted_Labels',
data=employees_at_risk)
plt.title('Gender Distribution for Employees at Risk')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Predicted Labels')

for p in ax.patches:
ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 10), textcoords='offset points')

total_employees = len(employees_at_risk)
hue_labels = employees_at_risk['Predicted_Labels'].unique()
for hue_label in hue_labels:
for p in ax.patches:
if p.get_height() > 0:
total_gender_label =
len(employees_at_risk[(employees_at_risk['Predicted_Labels'] == hue_label) &
(employees_at_risk['Gender'] == p.get_x())])
if total_gender_label > 0:
ax.annotate(f'Total: {total_gender_label}', (p.get_x() + p.get_width() /
2., 0),
ha='center', va='center', xytext=(0, 15), textcoords='offset
points', color='black')

plt.text(0.02, 0.92, f'Total Employees at Risk: {total_employees}',


transform=plt.gca().transAxes)

plt.show()

# Visualize the top 10 important features


top_n = 10 # Number of top features to visualize
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature',
data=feature_importance_df.head(top_n))
plt.title(f'Top {top_n} Important Features for Employees at Risk (New Data)')

118
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

Task 3 - Sentiment Analysis

3.1 Question

How do you feel that your achieved performance was properly acknowledged in
your company? (Kindly provide your answer within 1-2 lines. Your answer can be
positive, negative, or neutral.)

3.2 Code and Algorithms

# importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
import pandas as pd

# Load labeled comments


comments_df = pd.read_csv('labeled_sentiment.csv')

# Split data into features (X) and labels (y)


X = comments_df['Comment']
y = comments_df['Sentiment']

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Convert text data to numerical features (TF-IDF vectorization)


vectorizer = TfidfVectorizer(max_features=1000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train an Support Vector Machine classifier


svm_classifier = SVC(kernel='linear', C=1.0)
svm_classifier.fit(X_train_vec, y_train)

# Predict sentiment on the testing set (labeled dataset)


y_pred_scores = svm_classifier.decision_function(X_test_vec)
y_pred = [1 if score > 0.2 else (0 if score < -0.2 else 2) for score in
y_pred_scores]

# Calculate precision, recall, F1-score, and support for each class

119
precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred,
average=None)

# Display performance metrics for each class


sentiment_classes = ['Negative', 'Positive', 'Neutral']
print("Performance Metrics:")
for i, sentiment in enumerate(sentiment_classes):
print(f"Sentiment: {sentiment}")
print(f"Precision: {precision[i]:.2f}")
print(f"Recall: {recall[i]:.2f}")
print(f"F1 Score: {f1[i]:.2f}")
print(f"Support: {support[i]}\n")

# Calculate overall accuracy


accuracy = accuracy_score(y_test, y_pred)
print(f"Overall Accuracy: {accuracy:.2f}")

# Load surveyed comments for analysis (132 comments)


new_comments_df = pd.read_csv('sentiment.csv')
new_X = new_comments_df['How do you feel that your achieved performance
was properly acknowledged in your company?']

# Convert text data to numerical features (TF-IDF vectorizer)


new_X_vec = vectorizer.transform(new_X)

# Predict sentiment on the surveyed dataset


new_y_pred_scores = svm_classifier.decision_function(new_X_vec)
new_y_pred = [1 if score > 0.2 else (0 if score < -0.2 else 2) for score in
new_y_pred_scores]
new_sentiment_predictions = ['Positive' if pred == 1 else ('Negative' if pred == 0
else 'Neutral') for pred in new_y_pred]
new_comments_df['Predicted_Sentiment'] = new_sentiment_predictions

# Display the new DataFrame with predicted sentiment labels


print("\nPredicted Sentiments for New Comments:")
print(new_comments_df[['How do you feel that your achieved performance was
properly acknowledged in your company?', 'Predicted_Sentiment']])

# importing libraries for visualization


import matplotlib.pyplot as plt
import seaborn as sns

# Plot the predicted sentiments in Pie Chart


plt.figure(figsize=(8, 6))
plt.pie(predicted_sentiments, labels=predicted_sentiments.index,
autopct=lambda p: '{:.1f}%\n({:.0f})'.format(p, p * total_comments / 100),
startangle=140)
plt.title(f'Distribution of Predicted Sentiments\nTotal Comments:
{total_comments}')
plt.show()

# Visualize sentiment polarity and subjectivity in Bar Chart

120
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.hist(new_comments_df['Sentiment_Polarity'], bins=20, color='blue',
alpha=0.7)
plt.xlabel('Sentiment Polarity')
plt.ylabel('Frequency')
plt.title('Distribution of Sentiment Polarity')
plt.subplot(1, 2, 2)
plt.hist(new_comments_df['Sentiment_Subjectivity'], bins=20, color='green',
alpha=0.7)
plt.xlabel('Sentiment Subjectivity')
plt.ylabel('Frequency')
plt.title('Distribution of Sentiment Subjectivity')

plt.tight_layout()
plt.show()

# importing library for visualization of wordcloud


from wordcloud import WordCloud

# Visualize word cloud


wordcloud = WordCloud(width=800, height=400,
background_color='white').generate(all_comments)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Comments')
plt.show()

121
Declaration of Honor
I thus certify that, in accordance with Bremen University of Applied Sciences
regulations, I am the only author of the present master thesis and that I have
completed all work related to the master thesis on my own.

No other examining authority has received the master thesis. The work was
submitted in both printed and electronic formats.

I am aware of the legal ramifications of making a false statement of honour.

Date: 30 September 2023

Mohammad Ataur Rahman

122

View publication stats

You might also like