Employee Attrition Classification
A report
on
Class: CityU9D
Course: Fundamentals of AI
Hanoi, 27/12/2023
Table of Contents
Contribution of each member
Abstract
I. Introduction
II. Report
1. Understand the task
2. Describe the "experience" to perform the task
3. Implement one or several machine learning algorithms to solve the task on the dataset
4. Evaluate the performance of the trained model
III. Conclusion
Reference
Name Mission Contribution Percentage
- Conclusion section
Abstract
I. Introduction
Employee attrition, or turnover, is a critical concern for organizations, referring to the rate at
which employees leave a company within a certain period. This phenomenon has significant
implications for an organization's performance, as it can lead to increased costs, loss of
productivity, and a negative impact on the work environment. Understanding the factors that
contribute to employee churn is essential for developing effective retention strategies and
maintaining a stable and productive workforce. Research on employee turnover has
highlighted various contributing factors, including job satisfaction, work environment,
compensation, and career development opportunities. For instance, Mobley (1982)
emphasized the importance of recruitment, career planning, working conditions, and
organizational communication in improving employee retention.
Additionally, the negative impacts of high turnover rates have been linked to decreased
productivity, increased training costs, and customer dissatisfaction. In summary, the
employee attrition model plays a crucial role in helping organizations understand, predict,
and address employee turnover. By leveraging data and analytics, organizations can gain
valuable insights into the factors contributing to churn and develop strategies to foster a more
stable and engaged workforce. This, in turn, can lead to improved performance, reduced
costs, and a more positive work environment. The key benefit is that understanding why
and when employees are most likely to leave can guide actions to improve employee
retention and help plan new hiring. The main goal of this project is to approach
the problem systematically and build a machine-learning model that uses employee
characteristics to predict which employees will quit.
II. Report
1. Understand the task
● Source of dataset:
https://drive.google.com/drive/u/0/folders/1mtvD1NgEM8HnN0QwDp-vKxmwrz2z-BVR
● Prediction task: Use this dataset to train a model that predicts whether an
employee will quit.
● Type of problem: Classification
● Purpose: Review the data and create a model to estimate the likelihood of employees
leaving the company from the provided data files.
● Input: The input dataset consists of 37 variables, from which we select the input data
for analysis.
● Output: A prediction of whether each employee is going to quit.
● Reason: This is a standard supervised classification problem where the label is a
binary variable: No (active employee) or Yes (employee quits). In this study, our target
variable Y is the probability of an employee leaving the company.
● Possible classes:
- No (employee does not quit)
- Yes (employee quits)
2. Describe the "experience" to perform the task
Train:
1. Age: the employee's age (numerical).
2. Age Group: the age group the employee falls into (categorical).
3. Attrition: whether the employee has left the organization; this is the label
(categorical).
4. Business Travel: how often the employee travels, often to other locations or cities, on
behalf of the company for work-related activities such as meetings, conferences, or
visits to clients (categorical).
5. Daily Rate: the amount of wages paid per day for the job (numerical).
6. Department: the part of the company where the employee works, for example
accounting, customer service, or legal (categorical).
7. Distance From Home: the straight-line distance, in kilometers, between the
employee's residence and their usual place of work (numerical).
8. Education: a numeric code representing the employee's level of education
(numerical).
9. Education Field: the field of study of the employee's degree or qualification
(categorical).
10. Environment Satisfaction: a rating of the employee's satisfaction with their work
environment (numerical).
11. Gender: the gender of the employee (categorical).
12. Hourly Rate: the amount of money the employee earns for each hour worked; hourly
employees are paid for all hours worked (numerical).
13. Job Involvement: the degree to which employees identify with their work, actively
participate in it, and derive a sense of self-worth from it (numerical).
14. Job Level: the employee's level of responsibility and leadership within the
organization, from regular employees to executive management (numerical).
15. Job Role: the set of specific responsibilities and tasks assigned to the employee within
the organization (categorical).
16. Job Satisfaction: the employee's level of satisfaction with their job (numerical).
17. Marital Status: the employee's marital status (categorical).
18. Monthly Income: the employee's monthly income (numerical).
19. Salary Slab: the salary range the employee falls into. Grouping salaries into ranges
helps organize and analyze salary data without revealing individual salaries
(categorical).
20. Monthly Rate: the monthly payment or remuneration rate associated with the job
(numerical).
21. NumCompanies Worked: the number of companies the employee has worked for
(numerical).
22. Over Time: whether the employee works beyond their regularly scheduled work
hours (categorical).
23. Percent Salary Hike: the percentage increase in salary the employee received
(numerical).
24. Performance Rating: a value recording the employee's observed performance relative
to the assessor's concept of standard performance (numerical).
25. Relationship Satisfaction: the employee's level of satisfaction with their relationships
at work (numerical).
26. Stock Option Level: the level of stock options offered to the employee (numerical).
27. Total Working Years: the total number of years the employee has worked
(numerical).
28. Training Times Last Year: the number of times the employee underwent training or
participated in learning activities during the previous year (numerical).
29. Work-Life Balance: a rating of how well the employee balances work and personal
life (numerical).
30. Years At Company: the number of years the employee has worked at the company
(numerical).
31. Years In Current Role: the number of years the employee has worked in their current
role (numerical).
32. Years Since Last Promotion: the number of years since the employee's last promotion
(numerical).
33. Years With CurrManager: the number of years the employee has worked under their
current manager (numerical).
Test:
The test set contains the same attributes as the training set, with the same definitions,
except that the label Attrition is not included:
1. Age (numerical)
2. Age Group (categorical)
3. Business Travel (categorical)
4. Daily Rate (numerical)
5. Department (categorical)
6. Distance From Home (numerical)
7. Education (numerical)
8. Education Field (categorical)
9. Environment Satisfaction (numerical)
10. Gender (categorical)
11. Hourly Rate (numerical)
12. Job Involvement (numerical)
13. Job Level (numerical)
14. Job Role (categorical)
15. Job Satisfaction (numerical)
16. Marital Status (categorical)
17. Monthly Income (numerical)
18. Salary Slab (categorical)
19. Monthly Rate (numerical)
20. NumCompanies Worked (numerical)
21. Over Time (categorical)
22. Percent Salary Hike (numerical)
23. Performance Rating (numerical)
24. Relationship Satisfaction (numerical)
25. Stock Option Level (numerical)
26. Total Working Years (numerical)
27. Training Times Last Year (numerical)
28. Work-Life Balance (numerical)
29. Years At Company (numerical)
30. Years In Current Role (numerical)
31. Years Since Last Promotion (numerical)
32. Years With CurrManager (numerical)
● Is there a label for each data sample?
- Is it a supervised or unsupervised problem: Supervised, because the data is labeled.
● How many missing values are in the dataset?
One attribute has missing values: "YearsWithCurrManager".
We use the "fillna" function to fill the missing values in the "YearsWithCurrManager" column
of both the train dataset and the test dataset with the value 0.
● How much "noise" data is there?
"Noise" data attributes include:
- Missing data: "YearsWithCurrManager"
- Irrelevant features: 'EmpID', 'EmployeeCount', 'EmployeeNumber', 'Over18', and
'StandardHours'.
Therefore, these attributes were processed or removed before training the model.
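The cleaning steps above can be sketched as follows. This is a minimal illustration in plain Python on made-up rows, not the code used in the project; with pandas, the same steps would typically be written with "fillna" and "drop". The helper name clean_rows and the sample values are assumptions.

```python
# Columns the report treats as irrelevant "noise" features.
IRRELEVANT = {"EmpID", "EmployeeCount", "EmployeeNumber", "Over18", "StandardHours"}


def clean_rows(rows):
    """Return copies of the rows with a missing YearsWithCurrManager set to 0
    and the irrelevant columns removed."""
    cleaned = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in IRRELEVANT}
        # Treat an absent key, None, or an empty string as a missing value.
        if new_row.get("YearsWithCurrManager") in (None, ""):
            new_row["YearsWithCurrManager"] = 0
        cleaned.append(new_row)
    return cleaned


# Made-up sample rows, for illustration only.
sample = [
    {"EmpID": "E1", "Age": 30, "Over18": "Y", "YearsWithCurrManager": 2},
    {"EmpID": "E2", "Age": 41, "Over18": "Y", "YearsWithCurrManager": None},
]
cleaned_sample = clean_rows(sample)
print(cleaned_sample)
```

The same function would be applied to both the train rows and the test rows, so that the two sets keep identical columns.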
3. Implement one or several machine learning algorithms to solve the task on the
dataset.
3.1. Preprocessing
In this step, we use two main methods to process the data: onehot_encode to create
dummy variables, and "replace" to assign numeric values to attributes whose data type is
"object".
We use the "replace" method for the following attributes:
● 'Gender' and 'OverTime': we assign numeric values instead of creating dummy
variables because each of these attributes has only 2 unique values, so the two dummy
columns would be fully determined by each other. For example, the column "male"
would be determined by "female", and the column "Yes" would be determined by
"No".
● 'BusinessTravel', 'SalarySlab' and 'AgeGroup': we assign numeric values
because the categories of these attributes represent level and magnitude, so applying
the dummy variable method is not reasonable.
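A minimal sketch of the value assignment described above, in plain Python. The category labels and numeric codes shown here are assumptions made for illustration (the report does not list them); in pandas the same idea is usually expressed with "replace" and a dictionary.

```python
# Hypothetical mappings: the exact category labels and codes are assumptions.
BINARY_MAPS = {
    "Gender": {"Female": 0, "Male": 1},
    "OverTime": {"No": 0, "Yes": 1},
}
ORDINAL_MAPS = {
    # Codes increase with travel frequency, so the order is preserved.
    "BusinessTravel": {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2},
}


def assign_values(row, maps):
    """Return a copy of the row with mapped columns replaced by numeric codes."""
    encoded = dict(row)
    for column, mapping in maps.items():
        if column in encoded:
            encoded[column] = mapping[encoded[column]]
    return encoded


row = {"Gender": "Male", "OverTime": "Yes", "BusinessTravel": "Travel_Rarely"}
encoded_row = assign_values(row, {**BINARY_MAPS, **ORDINAL_MAPS})
print(encoded_row)
```

A single integer column keeps the ordering of the categories, which is the reason given above for not using dummy variables on these attributes.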
We then apply onehot_encode to the remaining attributes: 'Department', 'EducationField',
'JobRole', and 'MaritalStatus'.
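One-hot encoding of a nominal attribute can be sketched as below. The function one_hot is a hypothetical stand-in written for this illustration, not the project's onehot_encode helper, and the department names are made-up examples.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up rows, for illustration only.
rows = [
    {"Age": 30, "Department": "Sales"},
    {"Age": 41, "Department": "HR"},
]
one_hot_rows = one_hot(rows, "Department")
print(one_hot_rows)
```

Because these attributes have no natural order, each category gets its own indicator column instead of a single numeric code.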
3.2.3. KNeighborsClassifier
Definition: KNeighborsClassifier is an algorithm that classifies a data point according to the
classes of its nearest data points, or neighbors.
Reasons:
● It usually works well on data with complex, non-linear structure, and can fit complex
decision boundaries where linear models may have difficulty.
● It can be applied directly to multidimensional data, provided the features are on
comparable scales, since it relies on distances between points.
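The idea behind the nearest-neighbor vote can be illustrated with a short plain-Python sketch. It is not scikit-learn's KNeighborsClassifier implementation; the function name knn_predict, the Euclidean distance, k = 3, and the data points are all assumptions chosen for the example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-feature points, for illustration only.
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.95)]
labels = ["No", "No", "Yes", "Yes", "Yes"]
prediction = knn_predict(points, labels, query=(0.75, 0.8), k=3)
print(prediction)
```

Since the prediction depends only on distances, features with large numeric ranges would dominate the vote unless the data is scaled first.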
3.3. Calculate Accuracy and F1 score
We use sklearn's MinMaxScaler to normalize the range of the independent variables in "X_train"
and "X_test".
Feature scaling with MinMaxScaler rescales each feature so that its range lies between 0
and n. Machine learning algorithms perform better when numerical input variables fall
within a similar scale. In this case, we are scaling between 0 and 10.
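The rescaling can be written as x' = low + (x - min) / (max - min) * (high - low), where min and max are taken from the training column. The sketch below applies this formula in plain Python as an illustration of what min-max scaling computes; the helper names and the sample ages are assumptions, and the range 0 to 10 follows the text above.

```python
def fit_min_max(column):
    """Learn the minimum and maximum of a training column."""
    return min(column), max(column)


def min_max_scale(column, col_min, col_max, low=0.0, high=10.0):
    """Rescale values linearly so that col_min maps to `low` and col_max to `high`."""
    span = col_max - col_min
    if span == 0:
        # A constant column carries no information; map it to the lower bound.
        return [low for _ in column]
    return [low + (x - col_min) / span * (high - low) for x in column]


# Made-up ages, for illustration only.
train_age = [20, 30, 40, 60]
test_age = [25, 50]
age_min, age_max = fit_min_max(train_age)
scaled_train = min_max_scale(train_age, age_min, age_max)
scaled_test = min_max_scale(test_age, age_min, age_max)
print(scaled_train, scaled_test)
```

The minimum and maximum are learned from the training column only and then reused for the test column, so both sets are mapped with the same transformation.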
We use a loop to calculate the accuracy and F1 score for each algorithm mentioned, first
ensuring that the test labels 'y_test' and the predictions 'y_pred' have the same length
(both have a length of 248).
Next, we calculate and print the accuracy and F1 score for each algorithm listed in the
'classifiers' set.
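The evaluation loop can be sketched as follows. This is a plain-Python illustration of the two metrics, not the project's code; scikit-learn provides accuracy_score and f1_score for the same purpose. The labels, predictions, and model names are made up, and the F1 score here is computed for the positive class only, which is an assumption about the averaging used in the report.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predictions, for illustration only (1 = quits, 0 = stays).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 1],
    "model_b": [0, 0, 1, 0, 0, 1, 1, 0],
}

scores = {}
for name, y_pred in predictions.items():
    # The metrics only make sense when both sequences have the same length.
    assert len(y_pred) == len(y_test)
    scores[name] = (accuracy(y_test, y_pred), f1(y_test, y_pred))
    print(name, scores[name])
```

Reporting both metrics is useful because accuracy alone can look high when one class is much more frequent than the other, while the F1 score reflects how well the positive class is identified.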
4. Evaluate the performance of the trained model
Overall, of the 3 models we ran, 2 models (Random Forest Classifier and
KNeighborsClassifier) have similar accuracy and F1 results, around 0.88 and 0.85
respectively, which are higher than those of Logistic Regression, 0.701 and 0.75 respectively.
Summarizing the results from the metrics (accuracy, F1 score), we rate our
model as having moderate predictive power. The numbers obtained
from the measurements range from 70% to 85%, which is relatively high but still leaves a
fairly large error of up to 15%.
III. Conclusion
To sum up, the main goal of this project was to build a machine-learning model, based on the
given attributes, that can predict which employees will quit. The instructor provided us
with 5 data files on Kaggle: train.csv, public_test.csv, private_test.csv,
public_test_with_labels.csv, and sample_submission.csv.
The steps are described in the following table:
Step Description
1 Read data
● Train_data = train.csv
● Test_data = public_test.csv + private_test.csv
● y_test = public_test_with_labels.csv
Reference
https://www.educative.io/answers/kneighborsclassifier-in-scikit-learn
https://www.v7labs.com/blog/f1-score-guide
Laura Lancaster, P. (2023, September 4). Effects of high employee turnover. Group 19644.
https://stratus.hr/resources/effects-of-high-employee-turnover
Lawton, G., Burns, E., & Rosencrance, L. (2022, January 20). What is logistic regression? -
https://www.techtarget.com/searchbusinessanalytics/definition/logistic-regression
Learning, M. (2023, September 14). How can you use accuracy as an evaluation metric?.
https://www.linkedin.com/advice/0/how-can-you-use-accuracy-evaluation-metric-
skills-machine-learning
-forest