
Banking Academy of Vietnam

International School of Business

A report

on

EMPLOYEE ATTRITION CLASSIFICATION

Lecturer: Vu Trong Sinh

Student: Nguyen Dang Hieu _ CA9-200

Nguyen Anh Tuan _ CA9-175

Tran Xuan Bach _ CA10-093

Class: CityU9D

Course: Fundamentals of AI
Hanoi, 27/12/2023

Table of Contents
Contribution of each member
Abstract
I. Introduction
II. Report
1. Understand the task
2. Describe the "experience" to perform the task
3. Implement one or several machine learning algorithms to solve the task on the dataset
4. Evaluate the performance of the trained model
III. Conclusion
Reference
Contribution of each member

All of us ran the model together.

Name                  Mission                                            Contribution Percentage
1. Trần Xuân Bách     - Abstract                                         33.33%
                      - Introduction section
                      - Understand the task
2. Nguyễn Anh Tuấn    - Describe the "experience" to perform the task    33.33%
                      - Conclusion section
3. Nguyễn Đăng Hiếu   - Implement one or several machine learning        33.33%
                        algorithms to solve the task on the dataset
                      - Evaluate the performance of the trained model

Abstract

Employee attrition, or turnover, is a critical concern for organizations because it affects
costs, productivity, and the work environment. This report addresses the problem by developing a
machine-learning model that predicts whether an employee is likely to quit. The dataset comprises
37 variables, of which 33 are retained for analysis. The study uses 949 training samples and 527
test samples, with attributes such as age, job satisfaction, and work-life balance.
Preprocessing includes handling missing values and removing irrelevant features. Three
machine-learning algorithms are employed: Logistic Regression, RandomForestClassifier, and
KNeighborsClassifier. The evaluation metrics are Accuracy and F1-Score.
Results indicate that RandomForestClassifier and KNeighborsClassifier outperform Logistic
Regression, with accuracy ranging from about 70% to 88%. We rate the model as mid-strength;
it provides useful insights for organizations seeking to improve retention strategies and
maintain a stable workforce.

I. Introduction
Employee attrition, or turnover, is a critical concern for organizations, referring to the rate at
which employees leave a company within a certain period. This phenomenon has significant
implications for an organization's performance, as it can lead to increased costs, loss of
productivity, and a negative impact on the work environment. Understanding the factors that
contribute to employee churn is essential for developing effective retention strategies and
maintaining a stable and productive workforce. Research on employee turnover has
highlighted various contributing factors, including job satisfaction, work environment,
compensation, and career development opportunities. For instance, Mobley (1982)
emphasized the importance of recruitment, career planning, working conditions, and
organizational communication in improving employee retention.
Additionally, the negative impacts of high turnover rates have been linked to decreased
productivity, increased training costs, and customer dissatisfaction. In summary, the
employee attrition model plays a crucial role in helping organizations understand, predict,
and address employee turnover. By leveraging data and analytics, organizations can gain
valuable insights into the factors contributing to churn and develop strategies to foster a more
stable and engaged workforce. This, in turn, can lead to improved performance, reduced
costs, and a more positive work environment. The main benefit is that understanding why
and when employees are most likely to leave can lead to actions that improve employee
retention and can support planning for new hiring. The main goal of this project is to approach
the problem systematically and build a machine-learning model that uses certain employee
characteristics to predict which employees will quit their jobs.

II. Report
1. Understand the task

● Source of dataset:
https://drive.google.com/drive/u/0/folders/1mtvD1NgEM8HnN0QwDp-vKxmwrz2z-BVR
● Learning task: Use this dataset to train a model that predicts whether an
employee will quit.
● Type of problem: Classification
● Purpose: Review the provided data files and create a model that estimates the
likelihood of an employee leaving the company.
● Input: The input dataset consists of 37 variables, from which the attributes used for
analysis are selected.
● Output: A prediction of whether an employee is going to quit.
● Reason: This is a standard supervised classification problem in which the label is a
binary variable: No (active employee) or Yes (employee quits). In this study, our target
variable Y indicates whether an employee leaves the company, and the model estimates the
probability of that outcome.
● Possible classes:
- No (employee does not quit)
- Yes (employee quits)
2. Describe the "experience" to perform the task

● How many data samples are in the dataset you collected?
- Number of train samples: 949
- Number of test samples: 527
● How many attributes does each data sample have?
We remove attributes that have only one unique value, or as many unique values as there are
samples, because such attributes carry no useful information for prediction.
So we removed the following attributes: 'EmpID', 'EmployeeCount', 'EmployeeNumber',
'Over18', and 'StandardHours'.
- Number of train features: 33
- Number of test features: 32
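The removal rule above can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the code used in the project; the values are invented, and only the column names come from the report.

```python
import pandas as pd

# Toy frame standing in for train.csv; all values are made up for illustration.
df = pd.DataFrame({
    "EmpID": ["E1", "E2", "E3", "E4"],      # unique per row: an identifier
    "Over18": ["Y", "Y", "Y", "Y"],         # a single value: constant
    "StandardHours": [80, 80, 80, 80],      # a single value: constant
    "Age": [29, 41, 35, 29],
    "Attrition": ["Yes", "No", "No", "Yes"],
})

# Count the unique values of every column.
n_unique = df.nunique()

# Flag columns with one unique value, or as many unique values as rows.
to_drop = n_unique[(n_unique == 1) | (n_unique == len(df))].index.tolist()
reduced = df.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

On real data, the second condition can also flag a continuous numeric column whose values all happen to differ, so the flagged columns are worth inspecting before they are dropped.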

Train:
1. Age: the employee's age in years (numerical).
2. Age Group: the age bracket the employee falls into (categorical).
3. Attrition: whether the employee has left the organization; this is the label (categorical).
4. Business Travel: how often the employee travels to other locations or cities on behalf of
the company for work-related activities such as meetings, conferences, or visits to clients
(categorical).
5. Daily Rate: the specified amount of wages for the job, paid by the day (numerical).
6. Department: the part of the company where the employee works, for example
accounting, customer service, or legal (categorical).
7. Distance From Home: the straight-line distance, in kilometers, between the
employee's residence and usual place of work (numerical).
8. Education: a code representing the employee's level of education (numerical).
9. Education Field: the field of study of the employee's degree or qualification
(categorical).
10. Environment Satisfaction: the employee's level of satisfaction with the work
environment (numerical).
11. Gender: the employee's gender (categorical).
12. Hourly Rate: the amount of money the employee earns for each hour worked
(numerical).
13. Job Involvement: the degree to which the employee identifies with the work, actively
participates in it, and derives a sense of self-worth from it (numerical).
14. Job Level: the employee's level of responsibility and leadership within the organization,
from staff to executive management (numerical).
15. Job Role: the set of specific responsibilities and tasks assigned to the employee within
the organization (categorical).
16. Job Satisfaction: the employee's level of satisfaction with the job (numerical).
17. Marital Status: the employee's marital status (categorical).
18. Monthly Income: the employee's monthly income (numerical).
19. Salary Slab: the salary range the employee falls into, which allows salary data to be
organized and analyzed without revealing individual salaries (categorical).
20. Monthly Rate: the monthly payment or remuneration rate recorded for the employee
(numerical).
21. NumCompanies Worked: the number of companies the employee has worked for
(numerical).
22. Over Time: whether the employee works beyond regularly scheduled work hours
(categorical).
23. Percent Salary Hike: the percentage increase in salary the employee received
(numerical).
24. Performance Rating: a value recording the employee's performance relative to the
evaluator's concept of standard performance (numerical).
25. Relationship Satisfaction: the employee's level of satisfaction with relationships at
work (numerical).
26. Stock Option Level: the level of stock options offered to the employee (numerical).
27. Total Working Years: the total number of years the employee has worked
(numerical).
28. Training Times Last Year: the number of times the employee took part in training or
learning activities during the previous year (numerical).
29. Work-Life Balance: a rating of how well the employee balances work and personal
life (numerical).
30. Years At Company: the number of years the employee has worked at the company
(numerical).
31. Years In Current Role: the number of years the employee has held the current role
(numerical).
32. Years Since Last Promotion: the number of years since the employee's last promotion
(numerical).
33. Years With CurrManager: the number of years the employee has worked under the
current manager (numerical).
Test:
1. Age: the employee's age in years (numerical).
2. Age Group: the age bracket the employee falls into (categorical).
3. Business Travel: how often the employee travels to other locations or cities on behalf of
the company for work-related activities such as meetings, conferences, or visits to clients
(categorical).
4. Daily Rate: the specified amount of wages for the job, paid by the day (numerical).
5. Department: the part of the company where the employee works, for example
accounting, customer service, or legal (categorical).
6. Distance From Home: the straight-line distance, in kilometers, between the
employee's residence and usual place of work (numerical).
7. Education: a code representing the employee's level of education (numerical).
8. Education Field: the field of study of the employee's degree or qualification
(categorical).
9. Environment Satisfaction: the employee's level of satisfaction with the work
environment (numerical).
10. Gender: the employee's gender (categorical).
11. Hourly Rate: the amount of money the employee earns for each hour worked
(numerical).
12. Job Involvement: the degree to which the employee identifies with the work, actively
participates in it, and derives a sense of self-worth from it (numerical).
13. Job Level: the employee's level of responsibility and leadership within the organization,
from staff to executive management (numerical).
14. Job Role: the set of specific responsibilities and tasks assigned to the employee within
the organization (categorical).
15. Job Satisfaction: the employee's level of satisfaction with the job (numerical).
16. Marital Status: the employee's marital status (categorical).
17. Monthly Income: the employee's monthly income (numerical).
18. Salary Slab: the salary range the employee falls into, which allows salary data to be
organized and analyzed without revealing individual salaries (categorical).
19. Monthly Rate: the monthly payment or remuneration rate recorded for the employee
(numerical).
20. NumCompanies Worked: the number of companies the employee has worked for
(numerical).
21. Over Time: whether the employee works beyond regularly scheduled work hours
(categorical).
22. Percent Salary Hike: the percentage increase in salary the employee received
(numerical).
23. Performance Rating: a value recording the employee's performance relative to the
evaluator's concept of standard performance (numerical).
24. Relationship Satisfaction: the employee's level of satisfaction with relationships at
work (numerical).
25. Stock Option Level: the level of stock options offered to the employee (numerical).
26. Total Working Years: the total number of years the employee has worked
(numerical).
27. Training Times Last Year: the number of times the employee took part in training or
learning activities during the previous year (numerical).
28. Work-Life Balance: a rating of how well the employee balances work and personal
life (numerical).
29. Years At Company: the number of years the employee has worked at the company
(numerical).
30. Years In Current Role: the number of years the employee has held the current role
(numerical).
31. Years Since Last Promotion: the number of years since the employee's last promotion
(numerical).
32. Years With CurrManager: the number of years the employee has worked under the
current manager (numerical).
● Is there a label for each data sample?
- Is it a supervised or unsupervised problem: Supervised, because the data is labeled.
● How many missing values are in the dataset?
The dataset has missing data in one attribute: "YearsWithCurrManager".

We use the "fillna" function to fill the missing values in the "YearsWithCurrManager" column
of both the train dataset and the test dataset with the value 0.
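A minimal sketch of this filling step, assuming the two datasets are held in pandas DataFrames. The frame names `train` and `test` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test datasets (values are made up).
train = pd.DataFrame({"YearsWithCurrManager": [2.0, np.nan, 7.0]})
test = pd.DataFrame({"YearsWithCurrManager": [np.nan, 4.0]})

# Count missing values per column before filling.
print(train.isna().sum())

# Fill the missing entries with 0 in both datasets, as described above.
for frame in (train, test):
    frame["YearsWithCurrManager"] = frame["YearsWithCurrManager"].fillna(0)

print(train["YearsWithCurrManager"].tolist())
```

Filling with 0 treats a missing entry as "no time with the current manager"; filling with the column median would be an alternative if that reading does not hold.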
● How much "noise" data is there?
The "noise" in the data consists of:
- Missing data: "YearsWithCurrManager"
- Irrelevant features: 'EmpID', 'EmployeeCount', 'EmployeeNumber', 'Over18', and
'StandardHours'.
Therefore, these attributes were either filled in or removed before training.
3. Implement one or several machine learning algorithms to solve the task on the
dataset.

3.1. Preprocessing
In this step, we use two main methods to process the data: "onehot_encode", which creates
dummy variables, and "replace", which assigns numeric values to attributes whose data type is
"object".
We use the "replace" method for the following attributes:
● 'Gender' and 'OverTime': each of these attributes has only 2 unique values, so
creating dummy variables would produce two columns that fully determine each other.
For example, the column "Male" would be the exact complement of "Female", and the
column "Yes" would be the exact complement of "No". We therefore assign a single
numeric value instead.
● 'BusinessTravel', 'SalarySlab', and 'AgeGroup': the values of these attributes
represent levels or magnitudes, so they have a natural order that dummy variables would
discard. We therefore assign ordered numeric values.

We apply onehot_encode to the remaining attributes: 'Department', 'EducationField',
'JobRole', and 'MaritalStatus'.
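The two encoding methods can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the category names and the integer codes are hypothetical, `Series.map` plays the role of "replace" because it returns integer values directly, and `pandas.get_dummies` stands in for the "onehot_encode" helper.

```python
import pandas as pd

# Toy rows; the category names below are assumed for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "OverTime": ["Yes", "No", "No"],
    "BusinessTravel": ["Travel_Rarely", "Non-Travel", "Travel_Frequently"],
    "Department": ["Sales", "HR", "Sales"],
})

# Binary attributes: one 0/1 column instead of two complementary dummies.
df["Gender"] = df["Gender"].map({"Female": 0, "Male": 1})
df["OverTime"] = df["OverTime"].map({"No": 0, "Yes": 1})

# Ordinal attribute: integer codes that preserve the order of the levels.
travel_levels = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}
df["BusinessTravel"] = df["BusinessTravel"].map(travel_levels)

# Nominal attribute: one dummy column per category.
encoded = pd.get_dummies(df, columns=["Department"], dtype=int)
print(encoded)
```

After this step every column is numeric, which the scaler and the three classifiers require.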

3.2. Several machine learning algorithms were used in the report


In the project, we used several machine learning algorithms to solve the task on the dataset
(namely: Logistic Regression, RandomForestClassifier, and KNeighborsClassifier). The main
reasons for choosing these algorithms are listed below:
3.2.1. Logistic Regression
Definition: Logistic regression is a statistical analysis method to predict a binary outcome,
such as yes or no, based on prior observations of a data set.
Reasons:
● Effective for binary classification problems, where each prediction must fall into
one of two classes, Yes/No (1/0).
● Can be used effectively when there are many independent variables. It does not
require assumptions about the distribution of the independent variables and handles
a large number of them well.
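As a minimal sketch of how Logistic Regression yields a Yes/No (1/0) prediction, the toy example below fits scikit-learn's `LogisticRegression` on a single made-up feature. The data is invented for illustration and is not taken from the project dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One made-up feature; low values belong to class 0, high values to class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# predict_proba returns one column per class; column 1 is P(class = 1).
new_points = np.array([[1.0], [9.0]])
proba = model.predict_proba(new_points)[:, 1]
labels = model.predict(new_points)

print(proba)
print(labels)
```

The predicted label is 1 when the estimated probability exceeds 0.5, which is how the probability of leaving is turned into the two classes.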
3.2.2. RandomForestClassifier
Definition: Random forest is a commonly used machine learning algorithm that combines the
output of multiple decision trees to reach a single result. Its ease of use and flexibility have
fueled its adoption, as it handles both classification and regression problems.
Reasons:
● RandomForest can be used for both regression and classification problems.
● It can evaluate the importance of each feature during the training process, which helps
determine which features contribute most to the prediction. This makes it suitable
for datasets with many attributes, like this one.
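The feature-importance property can be illustrated with a small sketch. The data here is synthetic: one feature fully determines the label and the other is random noise, an assumption chosen so that the expected ranking is obvious.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: column 0 determines the label, column 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ holds one non-negative score per feature, summing to 1.
importances = forest.feature_importances_
for name, score in zip(["informative", "noise"], importances):
    print(f"{name}: {score:.3f}")
```

On the attrition data, the same attribute could be used to rank the employee attributes by their contribution to the prediction.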

3.2.3. KNeighborsClassifier
Definition: KNeighborsClassifier is an algorithm that classifies a data point
according to the classes of its nearest data points, or neighbors.
Reasons:
● Usually works well on data with complex and non-linear structures. It can capture
class boundaries that linear models have difficulty with.
● Can be applied directly to multidimensional data once the features are scaled to
comparable ranges, although distance-based methods tend to degrade when the number
of dimensions becomes very large.
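A minimal sketch of the non-linear case, using a made-up XOR-style layout that no single straight line can separate. The points and the choice of 3 neighbors are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# XOR-style toy data: class 0 sits in two opposite corners, class 1 in the others.
X = np.array([
    [0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.9],   # class 0
    [0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1],   # class 1
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Each query point takes the majority class of its 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict(np.array([[0.05, 0.05], [0.95, 0.05], [0.95, 0.95]]))
print(pred)
```

Because the prediction depends on distances, features on very different scales would distort the neighborhoods, which is one reason for the scaling step in section 3.3.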
3.3. Calculate Accuracy and F1 score
We use sklearn's MinMaxScaler to normalize the range of the independent variables in "X_train"
and "X_test".
Feature scaling with MinMaxScaler rescales each feature so that its values fall within a fixed
range (0 to 1 by default). Machine learning algorithms generally perform better when numerical
input variables are on a similar scale. In this case, we scale to the range 0 to 10.
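A minimal sketch of this scaling step with the 0 to 10 range stated above. The toy values are made up, and fitting the scaler on the training data only and then reusing it on the test data is an assumption about the workflow, not something the report confirms.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for X_train and X_test: two features on very different scales.
X_train = np.array([[18.0, 1000.0], [30.0, 5000.0], [60.0, 20000.0]])
X_test = np.array([[39.0, 10500.0]])

# x' = (x - min) / (max - min) * (b - a) + a, with (a, b) = (0, 10)
scaler = MinMaxScaler(feature_range=(0, 10))
X_train_scaled = scaler.fit_transform(X_train)   # learn min/max from train
X_test_scaled = scaler.transform(X_test)         # reuse the same min/max

print(X_train_scaled)
print(X_test_scaled)
```

Test values outside the training minimum or maximum would fall outside the 0 to 10 range, which is expected with this approach.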

We use a loop to calculate the accuracy and F1 score of each algorithm mentioned above.
The true labels 'y_test' and the predicted labels 'y_pred' must have the same length
(both have a length of 248).

Next, we calculate and print the accuracy and F1 score of each algorithm listed in the
'classifiers' set.
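The loop can be sketched as follows. This is an illustration rather than the project's code: the data is synthetic (generated with `make_classification`), the hyperparameters are scikit-learn defaults apart from the assumed `max_iter` and `random_state` values, and `f1_score` is used with its default binary averaging, which may differ from the setting behind the reported numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the preprocessed attrition features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # The metrics compare labels position by position, so lengths must match.
    if len(y_pred) != len(y_test):
        raise ValueError("y_pred and y_test must have the same length")
    results[name] = (accuracy_score(y_test, y_pred), f1_score(y_test, y_pred))
    print(f"{name}: accuracy={results[name][0]:.3f}, f1={results[name][1]:.3f}")
```

Keeping the models in one dictionary means every algorithm is trained and scored on exactly the same split, so the resulting numbers are directly comparable.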

4. Evaluate the performance of the trained model

In this project, we chose Logistic Regression, RandomForestClassifier, and
KNeighborsClassifier as the classification models, so we use Accuracy, F1-score, and the ROC
curve for evaluation.
Accuracy is the proportion of correct predictions: the number of correct predictions divided
by the total number of predictions. It is one of the most common classification metrics
because it is intuitive and easy to understand and implement. It ranges from 0 to 1 (or 0 to
100 percent) and is suitable for evaluating models that are not too complicated. Accuracy
helps to evaluate the predictive performance of a model on a set of data: the higher the
accuracy, the more of the model's predictions are correct.
The F1 Score balances precision and recall by taking their harmonic mean. Its highest
possible value is 1, which indicates perfect precision and recall, and
the higher this value, the better the result.
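As a worked example of the two metrics, the sketch below computes them by hand from the counts of true and false positives and negatives and checks the result against scikit-learn. The eight labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up true and predicted labels (1 = employee quits, 0 = employee stays).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

# Count the four outcomes of a binary prediction.
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)                    # correct / total
precision = tp / (tp + fp)                            # predicted Yes that are right
recall = tp / (tp + fn)                               # actual Yes that are found
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean

print(accuracy, precision, recall, f1)
print(accuracy_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Here accuracy is 5/8 = 0.625, precision is 3/5, recall is 3/4, and the F1 score is 2/3, so the two metrics can disagree when false positives and false negatives are unevenly distributed.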

            Logistic Regression   Random Forest   KNN
Accuracy    0.701                 0.887           0.875
F1-Score    0.755                 0.852           0.861

Overall, of the 3 models we ran, Random Forest Classifier and KNeighborsClassifier have
similar results, with accuracy around 0.88 and F1 score around 0.85, which are higher than
those of Logistic Regression (0.701 and 0.755 respectively).
Summarizing the results from the metrics (Accuracy, F1 score), we evaluate our model as
having mid-strength power. The scores obtained range from about 70% to 89%; this is
relatively high, but it still leaves a sizeable error rate of more than 10% even for the
best model. So we conclude that the model has mid-strength power.

III. Conclusion
To sum up, the main goal of this project is to build a machine-learning model based on given
attributes that can predict which employees will quit. The instructor provided us with 5 data
files on Kaggle: train.csv, public_test.csv, private_test.csv,
public_test_with_labels.csv, and sample_submission.csv.
The steps are described in the following table:

Step  Description
1     Read data
      ● Train_data = train.csv
      ● Test_data = public_test.csv + private_test.csv
      ● y_test = public_test_with_labels.csv
2     Count the number of unique values of each attribute, then eliminate noisy
      attributes and fill in the missing values.
3     Use the replace function for some special attributes and use onehot_encode
      to create dummy variables for the remaining attributes, converting "object"
      columns to "int64".
4     Use MinMaxScaler to normalize the values.
5     Initialize the models and calculate Accuracy and F1-score.

To fit the dataset, we use Logistic Regression, RandomForestClassifier, and
KNeighborsClassifier, and we evaluate them with Accuracy and F1-score. After running all 3
models, Logistic Regression gives the worst results of the three.
Overall, the scores obtained range from about 70% to 89%,
which indicates that this is a mid-strength model.

Reference

Educative answers - trusted answers to developer questions. Educative. (n.d.). https://www.educative.io/answers/kneighborsclassifier-in-scikit-learn

F1 score in Machine Learning: Intro & Calculation. V7. (n.d.). https://www.v7labs.com/blog/f1-score-guide

Laura Lancaster, P. (2023, September 4). Effects of high employee turnover. Group 19644. https://stratus.hr/resources/effects-of-high-employee-turnover

Lawton, G., Burns, E., & Rosencrance, L. (2022, January 20). What is logistic regression? - definition from Searchbusinessanalytics. Business Analytics. https://www.techtarget.com/searchbusinessanalytics/definition/logistic-regression

Learning, M. (2023, September 14). How can you use accuracy as an evaluation metric? How to Use Accuracy as an Evaluation Metric for Machine Learning. https://www.linkedin.com/advice/0/how-can-you-use-accuracy-evaluation-metric-skills-machine-learning

What is Random Forest? IBM. (n.d.). https://www.ibm.com/topics/random-forest
