A main project report submitted in partial fulfilment of the requirements for the award
of the degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
CERTIFICATE
submitted by
T. VYSHALI Regd.No.18131A05I3
U. GOWTHAMI DEVI Regd.No.18131A05I9
V. VAISHNAVI Regd.No.18131A05K2
N. L. AVANTHIKA Regd.No.18131A05N3
in their VIII semester, in partial fulfilment of the requirements for the award of the
degree of Bachelor of Technology in Computer Science and Engineering,
during the academic year 2021-2022
DECLARATION
We hereby declare that this main project entitled “HUMAN RESOURCE ANALYTICS” is a
bonafide work done by us and submitted to the Department of Computer Science and
Engineering, Gayatri Vidya Parishad College of Engineering (Autonomous), Visakhapatnam,
in partial fulfilment of the requirements for the award of the degree of B.Tech. It is our own
work and has not been submitted to any other university or published at any time before.
PLACE:VISAKHAPATNAM T. VYSHALI(18131A05I3)
U. GOWTHAMI DEVI(18131A05I9)
V. VAISHNAVI(18131A05K2)
N.L. AVANTHIKA(18131A05N3)
ACKNOWLEDGEMENT
We consider it our privilege to express our deepest gratitude to Dr. D.N.D. HARINI, Associate
Professor and Head of the Department of Computer Science and Engineering, for her valuable
suggestions and constant motivation, which greatly helped us complete the project work
successfully.
We are extremely thankful to Mrs. P. AKHILA, Assistant Professor, Computer Science and
Engineering for giving us an opportunity to do this project and providing us support and guidance
which helped us to complete the project on time.
We also thank our coordinator, Dr. CH. SITA KUMARI, Sr. Assistant Professor, Department
of Computer Science and Engineering, for the kind suggestions and guidance for the successful
completion of our project work.
We also thank all the members of the staff of Computer Science and Engineering for their sustained
help in our pursuits. We thank all those who contributed directly or indirectly to successfully
carrying out this work.
ABSTRACT
Nowadays, employee attrition became a serious issue regarding a company’s competitive advantage. It’s
very expensive to find, hire and train new talents. Few years back it was done manually but it is an era of
machine learning and data analytics. Now, company’s HR department uses some data analytics tool to
identify which areas to be modified to make most of its employees to stay. In any industry, attrition is a
big problem, whether it is about employee attrition of an organization or customer attrition of an e-
commerce site. If we can accurately predict which customer or employee will leave their current company
or organization, then it will save much time, effort, and cost of the employer and help them to hire or
acquire substitutes in advance, and it would not create a problem in the ongoing progress of an
organization. Here comparative analysis between various machine learning approaches such as Naive
Bayes, decision tree, random forest, and logistic regression is presented. The presented result will help us
in identifying the behavior of employees who can be attired over the next time.
KEYWORDS:
Attrition, Logistic regression, Gaussian Naïve Bayes, Random Forest Classifier, Gradient
Boosting Classifier.
TABLE OF CONTENTS
1. INTRODUCTION 1
2. SOFTWARE REQUIREMENT ANALYSIS 2
2.1 SOFTWARE DESCRIPTION 2
2.2 PANDAS 3
2.2.1 INTRODUCTION 3
2.2.2 OPERATIONS USING PANDAS 3
2.2.3 PANDAS OBJECT 4
2.3 NUMPY 4
2.3.1 INTRODUCTION 4
2.3.2 OPERATIONS USING NUMPY 5
2.4 SEABORN 5
2.4.1 INTRODUCTION 5
2.4.2 OPERATIONS USING SEABORN 5
2.5 FOLIUM 6
2.5.1 INTRODUCTION 6
2.5.2 OPERATIONS USING FOLIUM 6
2.6 MATPLOTLIB 6
2.6.1 INTRODUCTION 6
2.7 SCIKIT LEARN 7
2.7.1 INTRODUCTION 7
2.7.2 OPERATIONS USING SCIKIT LEARN 7
2.8 MACHINE LEARNING 8
2.8.1 INTRODUCTION 8
2.8.2 RANDOM FOREST CLASSIFIER 8
2.8.3 GUASSIAN NAÏVE BAYES 9
2.8.4 GRADIENT BOOSTING CLASSIFIER 11
3. SOFTWARE SYSTEM DESIGN 12
3.1 PROCESS FLOW DIAGRAM 13
3.2 CLASS DIAGRAM 14
3.3 INTERACTION DIAGRAM 15
3.3.1 SEQUENCE DIAGRAM 16
3.3.2 COLLABORATION DIAGRAM 17
3.4 ACTIVITY DIAGRAM 18
3.5 USE CASE DIAGRAM 20
4. SRS DOCUMENT 21
4.1 FUNCTIONAL REQUIREMENTS 22
4.2 NON FUNCTIONAL REQUIREMENTS 22
4.3 MINIMUM HARDWARE REQUIREMENTS 23
4.4 MINIMUM SOFTWARE REQUIREMENTS 23
5. TESTING 24
5.1 TESTING STRATEGIES 24
6. OUTPUT 26
6.1 SYSTEM IMPLEMENTATION 26
6.2 SOURCE CODE 28
6.3 OUTPUT SCREENS 36
7. CONCLUSION 58
8. REFERENCES 59
1. INTRODUCTION
HR teams make constant efforts to improve their hiring process and bring the best talent into the
organization. Even when hiring managers focus on the behavioral and cultural-fit aspects of a candidate
along with impressive experience and skill sets, HR teams are often unable to evaluate the
long-term success of a future candidate, which leads to high voluntary attrition.
The key to success in an organization is the ability to attract and retain top talent. It is vital for the
Human Resource (HR) Department to identify the factors that keep employees and those which prompt
them to leave.
Organizations could do more to prevent the loss of good people. Organizations invest significant
resources in hiring and training new employees, along with running training programs for their existing
employees.
All of this is done with the presumption of improving employee productivity, with a significant gestation
period. High voluntary attrition can be detrimental to both the organization’s growth and the
existing employees’ morale and business continuity, and it has a significant impact on the bottom
line.
2. SOFTWARE REQUIREMENT ANALYSIS
2.2 PANDAS
2.2.1 INTRODUCTION
Pandas is an open-source Python library providing high-performance data
manipulation and analysis tools built on its powerful data structures. Prior to Pandas,
Python was mainly used for data munging and preparation; it had very little to offer
for data analysis. Pandas solved this problem. Using Pandas, five
typical steps can be accomplished in the processing and analysis of data, regardless
of the origin of the data: load, prepare, manipulate, model, and analyse.
Python with Pandas is used in a wide range of fields.
2.2.3 PANDAS OBJECT
The most important pandas function reads CSV files and performs operations
on them; read_csv is used for this task.
Syntax: pd.read_csv("filename")
It reads the comma-separated file with the given filename.
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure
with labelled axes (rows and columns). That is, data is aligned in a tabular fashion in rows and
columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the
columns.
Syntax: obj = pd.DataFrame(list)
It creates a DataFrame from the given list.
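A minimal sketch of both calls, using a hypothetical in-memory CSV (io.StringIO stands in for a file on disk, so the example runs without any external file):

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk; pd.read_csv accepts
# any file-like object, so io.StringIO lets the example run self-contained.
csv_text = "Name,Age\nAsha,30\nRavi,25\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2): 2 rows, 2 columns

# A DataFrame can also be built directly from a Python list.
obj = pd.DataFrame([10, 20, 30], columns=["value"])
print(obj["value"].sum())  # 60
```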
2.3 NUMPY
2.3.1 INTRODUCTION
NumPy is a Python package. It stands for 'Numerical Python'. It is a library consisting of
multidimensional array objects and a collection of routines for processing arrays.
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray, was
also developed, with some additional functionalities. In 2005, Travis Oliphant created the NumPy
package by incorporating the features of Numarray into the Numeric package. There are many
contributors to this open-source project.
2.3.2 OPERATIONS USING NUMPY
Using NumPy, a developer can perform the following operations –
Mathematical and logical operations on arrays.
Fourier transforms and routines for shape manipulation.
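The operations above can be sketched as follows (a minimal example on a small array):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# Mathematical and logical operations apply element-wise.
print(a + 10)   # adds 10 to every element
print(a > 2)    # boolean array marking elements greater than 2

# Shape manipulation: reshape the 2x2 array into a flat vector.
flat = a.reshape(4)
print(flat)     # [1 2 3 4]

# A one-dimensional Fourier transform of a constant signal
# concentrates all the energy in the first (DC) coefficient.
spectrum = np.fft.fft(np.ones(4))
print(spectrum[0].real)  # 4.0
```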
2.4 SEABORN
2.4.1 INTRODUCTION
Seaborn is a visualization library for statistical graphics plotting in
Python. It provides beautiful default styles and color palettes to make statistical
plots more attractive. It is built on top of the matplotlib library and is also closely
integrated with the data structures from pandas.
Seaborn aims to make visualization a central part of exploring and understanding data.
It provides dataset-oriented APIs, so that one can switch between different visual
representations of the same variables to better understand the dataset.
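A minimal sketch of the dataset-oriented API, using a small hypothetical DataFrame in place of the HR data (the Agg backend is used so the example runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import pandas as pd
import seaborn as sns

# Small hypothetical dataset standing in for the HR attrition data.
df = pd.DataFrame({"Attrition": ["Yes", "No", "No", "No", "Yes", "No"]})

# countplot is dataset-oriented: pass the DataFrame and a column name,
# and seaborn counts the categories itself.
ax = sns.countplot(x="Attrition", data=df)
print(len(ax.patches))  # one bar per category
```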
2.5 FOLIUM
2.5.1 INTRODUCTION
Folium builds on the data-wrangling strengths of the Python ecosystem and the mapping
strengths of the Leaflet.js (JavaScript) library: the data is manipulated in Python,
then visualized on a Leaflet map via Folium. Folium makes it easy to visualize data
that has been manipulated in Python on an interactive Leaflet map. The library has a
number of built-in tilesets from OpenStreetMap, Mapbox, etc.
2.6 MATPLOTLIB
2.6.1 INTRODUCTION
Matplotlib is one of the most popular Python packages used for data visualization. It is a
cross-platform library for making 2D plots from data in arrays. Matplotlib is written in
Python and makes use of NumPy, the numerical mathematics extension of
Python. Matplotlib along with NumPy can be considered an open-source equivalent of
MATLAB.
2.7 SCIKIT LEARN
2.7.1 INTRODUCTION
Scikit-learn (sklearn) is the most useful and robust library for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical
modelling, including classification, regression, clustering and dimensionality reduction,
via a consistent interface in Python.
Scikit-learn is an indispensable part of the Python machine learning toolkit at JPMorgan.
It is very widely used across all parts of the bank for classification, predictive analytics,
and very many other machine learning tasks. Its straightforward API, its breadth of
algorithms, and the quality of its documentation combine to make scikit-learn
simultaneously very approachable and very powerful.
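A minimal sketch of that consistent fit/predict interface, using a synthetic dataset generated with make_classification in place of real data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a labelled dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The consistent interface: every estimator exposes fit and predict.
clf = LogisticRegression()
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(round(acc, 2))
```

Any other estimator (RandomForestClassifier, GaussianNB, and so on) can be dropped into the same three lines unchanged, which is what makes model comparison straightforward.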
2.8 MACHINE LEARNING ALGORITHM
2.8.1 INTRODUCTION
Machine learning algorithms build a mathematical model based on sample data, known as "training
data", in order to make predictions or decisions without being explicitly programmed to do so.
Machine learning is closely related to computational statistics, which focuses on making predictions
using computers.
Fig.2.8.2.1 Random Forest Classifier
Fig.2.8.3.1 Likelihood of Features
2.8.4 GRADIENT BOOSTING CLASSIFIER:
In gradient boosting, each predictor tries to improve on its predecessor by reducing the errors. The
fascinating idea behind gradient boosting is that, instead of fitting a predictor on the data at each
iteration, it fits a new predictor to the residual errors made by the previous predictor.
For every instance in the training set, it calculates the residual for that instance, that is, the
observed value minus the predicted value.
Once it has done this, it builds a new decision tree that tries to predict the residuals that were
previously calculated. However, this is where it gets slightly tricky in comparison with gradient
boosting regression.
When building a decision tree, there is a set number of leaves allowed. This can be set as a parameter
by the user, and is usually between 8 and 32. This leads to two possible outcomes:
• Multiple instances fall into the same leaf
• A single instance has its own leaf
Unlike gradient boosting for regression, where we could simply average the instance values to get an
output value and leave the single instance as a leaf of its own, we have to transform these values using
a formula:
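The behaviour described above can be sketched with scikit-learn's GradientBoostingClassifier on synthetic data; max_leaf_nodes caps the leaves per tree, and staged_predict exposes the prediction after each boosting iteration. The dataset and parameter values here are illustrative, not the project's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a labelled training set.
X, y = make_classification(n_samples=300, random_state=0)

# max_leaf_nodes bounds the number of leaves per tree, as described above;
# 8 is at the low end of the 8-32 range the text mentions.
gbc = GradientBoostingClassifier(n_estimators=50, max_leaf_nodes=8, random_state=0)
gbc.fit(X, y)

# staged_predict yields the ensemble's prediction after each iteration,
# showing successive trees correcting their predecessors' residual errors.
errors = [1 - (pred == y).mean() for pred in gbc.staged_predict(X)]
print(errors[-1] <= errors[0])  # later stages fit the training data at least as well
```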
3. SOFTWARE SYSTEM DESIGN
System design is the process of designing the elements of a system such as the architecture, modules and
components, the different interfaces of those components and the data that goes through that system.
System Analysis is the process that decomposes a system into its component pieces for the purpose of
defining how well those components interact to accomplish the set requirements.
The purpose of the System Design process is to provide sufficient detailed data and information about
the system. The purpose of the design phase is to plan a solution of the problem specified by the
requirement document. This phase is the first step in moving from problem domain to the solution
domain. The design of a system is perhaps the most critical factor affecting the quality of the software,
and has a major impact on the later phases, particularly testing and maintenance.
The design activity is often divided into two separate phases: system design and detailed design. System
design, which is sometimes also called top-level design, aims to identify the modules that should be in
the system, the specifications of these modules, and how they interact with each other to produce the
desired results.
A design methodology is a systematic approach to creating a design by applying a set of techniques
and guidelines. Most methodologies focus on system design. The two basic principles used in any design
methodology are problem partitioning and abstraction. Abstraction is a concept closely related to
problem partitioning.
3.1 PROCESS FLOW DIAGRAM:
3.2 CLASS DIAGRAM:
The class diagram can be used to show the classes, relationships, interfaces, associations, and
collaborations. Class diagrams are standardized in UML.
The main purposes of using class diagrams are:
• It is the only UML diagram which can appropriately depict the various aspects of the OOP concept.
• Proper design and analysis of an application can be faster and more efficient.
• Each class is represented by a rectangle with a subdivision of three compartments: name, attributes
and operations.
• There are three types of modifiers which are used to decide the visibility of attributes and operations:
• + is used for public visibility (for everyone)
• # is used for protected visibility (for friends and derived classes)
• – is used for private visibility (for only me)
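The three visibility markers can be illustrated with Python's naming conventions, using a hypothetical Employee class (Python does not enforce UML visibility; the underscore conventions only signal intent):

```python
# A hypothetical Employee class mapping the UML visibility markers
# onto Python's naming conventions.
class Employee:
    def __init__(self, name, salary):
        self.name = name            # +  public: visible to everyone
        self._department = "CSE"    # #  protected: internal / subclass use
        self.__salary = salary      # -  private: name-mangled by Python

    def get_salary(self):           # +  public operation exposing private data
        return self.__salary

e = Employee("Asha", 50000)
print(e.name)          # public attribute, freely accessible
print(e.get_salary())  # private attribute reached via a public operation
```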
3.3 INTERACTION DIAGRAMS:
From the term interaction, it is clear that these diagrams are used to describe some type of interaction
among the different elements in the model. This interaction is a part of the dynamic behaviour of the
system. This interactive behaviour is represented in UML by two diagrams, known as the sequence
diagram and the collaboration diagram. The basic purpose of both diagrams is similar: the sequence
diagram emphasizes the time sequence of messages, and the collaboration diagram emphasizes the
structural organization of the objects that send and receive messages. The purpose of interaction
diagrams is to visualize the interactive behaviour of the system. Visualizing the interaction is a
difficult task, so the solution is to use different types of models to capture the different aspects of the
interaction. Sequence and collaboration diagrams capture the dynamic nature, but from different angles.
The purposes of interaction diagrams are:
• To capture the dynamic behaviour of a system.
• To describe the message flow in the system.
• To describe the structural organization of the objects.
• To describe the interaction among objects.
Sequence diagrams are used to capture the order of messages flowing from one object to another.
Collaboration diagrams are used to describe the structural organization of the objects taking part in
the interaction. A single diagram is not sufficient to describe the dynamic aspect of an entire system,
so a set of diagrams is used to capture it as a whole. Interaction diagrams are used when we want to
understand the message flow and the structural organization. Message flow means the sequence of
control flow from one object to another.
3.3.1 SEQUENCE DIAGRAM:
The sequence diagram has four objects (Customer, Order, SpecialOrder and NormalOrder). The
diagram shows the message sequence for the SpecialOrder object, and the same can be used for the
NormalOrder object. It is important to understand the time sequence of the message flows. The
message flow is nothing but a method call on an object. The first call is sendOrder(), which is a method
of the Order object. The next call is confirm(), which is a method of the SpecialOrder object, and the
last call is Dispatch(), which is also a method of the SpecialOrder object. The diagram mainly describes
the method calls from one object to another, and this is also the actual scenario when the system is
running.
3.3.2 COLLABORATION DIAGRAM:
The second interaction diagram is the collaboration diagram. It shows the object organization as seen
in the following diagram. In the collaboration diagram, the method call sequence is indicated by a
numbering technique: the number indicates how the methods are called one after another. We have
taken the same order management system to describe the collaboration diagram. Method calls are
similar to those of a sequence diagram. However, the difference is that the sequence diagram does not
describe the object organization, whereas the collaboration diagram shows the object organization. To
choose between these two diagrams, emphasis is placed on the type of requirement: if the time
sequence is important, the sequence diagram is used; if the organization is required, the collaboration
diagram is used.
3.4 ACTIVITY DIAGRAM:
An activity diagram is defined as a UML diagram that focuses on the execution and flow of the
behaviour of a system instead of its implementation. It is also called an object-oriented flowchart.
Activity diagrams consist of activities that are made up of actions which apply to behavioural
modelling technology. Activity diagrams are used to model processes and workflows. The essence of a
useful activity diagram is focused on communicating a specific aspect of a system's dynamic behaviour.
Activity diagrams capture the dynamic elements of a system. An activity diagram is similar to a
flowchart that visualizes the flow from one activity to another, but it is not a flowchart: the flow of
activity can be controlled using various control elements in the UML diagram. In simple words, an
activity diagram describes the flow of execution between multiple activities.
Activity Diagram Notations: activity diagram symbols can be drawn using the following notations:
• Initial state: the starting stage before an activity takes place is depicted as the initial state.
• Final state: the state which the system reaches when a specific process ends is known as the final
state.
• State or activity box: a rounded rectangle representing an activity.
• Decision box: a diamond-shaped box which represents a decision with alternate paths; it represents
the flow of control.
THE FLOW OF OUR ACTIVITY DIAGRAM IS: -
• Initially, the HR provides the employee dataset.
• Then data preprocessing and data analysis are done by the actor.
• The actor performs model training, validation and prediction.
• Based on the accuracy, employee attrition prediction and prevention strategies take place.
• HR takes the required measures based on the result.
Fig.3.4 Activity Diagram
3.5 USE CASE DIAGRAM
Use case diagram is used to represent the dynamic behavior of a system. It encapsulates
the system's functionality by incorporating use cases, actors, and their relationships. It
models the tasks, services, and functions required by a system/subsystem of an
application. It depicts the high-level functionality of a system and also tells how the user
handles a system.
4. SRS DOCUMENT
An SRS is a document created by a system analyst after the requirements are collected from
various stakeholders. The SRS defines how the intended software will interact with
hardware and external interfaces, its speed of operation, the response time of the system, the
portability of the software across various platforms, maintainability, speed of recovery after
crashing, security, quality, limitations, etc.
The requirements received from clients are written in natural language. It is the
responsibility of the system analyst to document the requirements in technical language
so that they can be comprehended and used by the software development team. The
introduction of the software requirement specification states the goals and objectives of
the software, describing it in the context of the computer-based system. The SRS includes
an information description, a functional description, a behavioural description, and
validation criteria.
The purpose of this document is to present the software requirements in a precise and
easily understood manner. This document provides the functional, performance, design
and verification requirements of the software to be developed.
After requirement specifications are developed, the requirements mentioned in this
document are validated. Users might ask for illegal or impractical solutions, or experts may
interpret the requirements incorrectly. This results in a huge increase in cost if not nipped
in the bud.
4.1 FUNCTIONAL REQUIREMENTS
4.3 MINIMUM HARDWARE REQUIREMENTS
RAM : 8GB.
Processor : Intel core i3 or Above.
Hard disk space : 1TB.
5. TESTING
Testing is the process of detecting errors. Testing plays a very critical role in quality
assurance and in ensuring the reliability of software. The results of testing are also used later on,
during maintenance.
Purpose of Testing: The aim of testing is often taken to be demonstrating that a
program works by showing that it has no errors. However, the basic purpose of the testing phase is to
detect the errors that may be present in the program. Hence one should not start testing with the intent
of showing that a program works; the intent should be to show that a program doesn’t work.
Testing Objectives: The main objective of testing is to uncover a host of errors, systematically
and with minimum effort and time.
Unit Testing
It focuses on the smallest unit of software design. Here, an individual unit or a group of interrelated
units is tested. It is often done by the programmer using sample inputs and observing the
corresponding outputs. A unit may be an individual function, method, procedure, module, or
object. It is a white-box testing technique that is usually performed by the developer.
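A minimal unit-test sketch in Python's unittest framework, using a hypothetical attrition_rate helper (not part of the project code) with sample inputs and expected outputs:

```python
import unittest

def attrition_rate(yes_count, total):
    """Hypothetical helper: fraction of employees who left."""
    if total == 0:
        raise ValueError("total must be positive")
    return yes_count / total

class TestAttritionRate(unittest.TestCase):
    def test_rate(self):
        # Sample input with a known expected output.
        self.assertAlmostEqual(attrition_rate(237, 1470), 237 / 1470)

    def test_zero_total(self):
        # The unit's error handling is part of its contract.
        with self.assertRaises(ValueError):
            attrition_rate(1, 0)

suite = unittest.TestLoader().loadTestsFromTestCase(TestAttritionRate)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```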
Integration Testing
The testing of combined parts of an application to determine whether they function correctly together is
integration testing. Here, the program is constructed and tested in small increments, where errors
are easier to isolate and correct, interfaces are more likely to be tested completely, and a systematic
test approach may be applied. This testing can be done using two different methods:
1. Top Down Integration Testing
2. Bottom Up Integration Testing
System Testing
System testing is a type of software testing that is performed on a complete, integrated system to
evaluate the compliance of the system with the corresponding requirements. System testing detects
defects within both the integrated units and the whole system.
The result of system testing is the observed behaviour of a component or a system when it is tested.
System testing is black-box testing.
Acceptance Testing
Acceptance testing is a method of software testing where a system is tested for acceptability. It is
formal testing, according to user needs, requirements and business processes, conducted to
determine whether a system satisfies the acceptance criteria and to enable the users,
customers or other authorized entities to decide whether to accept the system.
6. OUTPUT
6.1.1 INTRODUCTION:
HR teams make constant efforts to improve their hiring process and bring the best talent into
the organization.
Even when hiring managers focus on the behavioral and cultural-fit aspects of a candidate
along with impressive experience and skill sets, HR teams are often unable to
evaluate the long-term success of a future candidate, which leads to high voluntary attrition.
The key to success in an organization is the ability to attract and retain top talent.
It is vital for the Human Resource (HR) Department to identify the factors that keep
employees and those which prompt them to leave. Organizations could do more to prevent
the loss of good people.
6.2 SOURCE CODE:
# file_name = (r"C:\Users\AVV8E5744\HR-Employee-Attrition.csv")
emp_df = pd.read_csv("HR-Employee-Attrition.csv")
print('Dataset dimension: {} rows, {} columns'.format(emp_df.shape[0], emp_df.shape[1]))

# Let's add 2 features for Exploratory Data Analysis: employee left and not left
emp_df['Attrition_Yes'] = emp_df['Attrition'].map({'Yes': 1, 'No': 0})  # 1 means employee left
emp_df['Attrition_No'] = emp_df['Attrition'].map({'Yes': 0, 'No': 1})   # 1 means employee did not leave

# Let's look into the new dataset and identify categorical features for which plots need to be built
emp_df.head()

def generate_frequency_graph(col_name):
    # Plot employee attrition against the given feature (col_name)
    temp_grp = emp_df.groupby(col_name).agg('sum')[['Attrition_Yes', 'Attrition_No']]
    temp_grp['Percentage Attrition'] = temp_grp['Attrition_Yes'] / (temp_grp['Attrition_Yes'] + temp_grp['Attrition_No']) * 100
    print(temp_grp)
    temp_grp[['Attrition_Yes', 'Attrition_No']].plot(kind='bar', stacked=False, color=['red', 'green'])
    plt.xlabel(col_name)
    plt.ylabel('Attrition')

# Features to remove
feat_to_remove = ['EmployeeNumber', 'EmployeeCount', 'Over18', 'StandardHours']
emp_proc_df.drop(feat_to_remove, axis=1, inplace=True)
print('Dataset dimension: {} rows, {} columns'.format(emp_proc_df.shape[0], emp_proc_df.shape[1]))

full_col_names = emp_proc_df.columns.tolist()
num_col_names = emp_proc_df.select_dtypes(include=[np.int64, np.float64]).columns.tolist()  # Get numerical feature names
num_col_names = list(set(num_col_names) - set(num_cat_col_names))  # Numerical features w/o ordered categorical features
cat_col_names = list(set(full_col_names) - set(num_col_names) - set(target))  # Categorical & ordered categorical features
print('Total number of numerical features: ', len(num_col_names))
print('Total number of categorical & ordered categorical features: ', len(cat_col_names))

cat_emp_df = emp_proc_df[cat_col_names]
num_emp_df = emp_proc_df[num_col_names]

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Settings
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['figure.figsize'] = (16, 4)
pd.options.display.max_columns = 500

# Log-transform skewed numerical features before training our classification model
for col in num_col_names:
    if num_emp_df[col].skew() > 0.80:
        num_emp_df[col] = np.log1p(num_emp_df[col])
num_emp_df.head()

# Let's create dummy variables for each categorical attribute for training our classification model
for col in cat_col_names:
    col_dummies = pd.get_dummies(cat_emp_df[col], prefix=col)
    cat_emp_df = pd.concat([cat_emp_df, col_dummies], axis=1)
# Use the pandas map method to numerically encode our attrition target variable
attrition_target = emp_proc_df['Attrition'].map({'Yes': 1, 'No': 0})

# Drop categorical features for which dummy variables have been created
cat_emp_df.drop(cat_col_names, axis=1, inplace=True)
cat_emp_df.head()

num_corr_df = num_emp_df[['MonthlyIncome', 'CompRatioOverall', 'YearWithoutChange1', 'DistanceFromHome']]
corr_df = pd.concat([num_corr_df, attrition_target], axis=1)
corr = corr_df.corr()

plt.figure(figsize=(10, 8))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.axes_style("white")
# sns.heatmap(data=corr, annot=True, mask=mask, square=True, linewidths=.5, vmin=-1, vmax=1, cmap="YlGnBu")
sns.heatmap(data=corr, annot=True, square=True, linewidths=.5, vmin=-1, vmax=1, cmap="YlGnBu")
plt.show()
# Split data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(final_emp_df, attrition_target,
                                                  test_size=0.30, random_state=42)
print("Stratified Sampling: ", len(X_train), "train set +", len(X_val), "validation set")

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, f1_score
def evaluate_model_score(X, y):
    # Fit each candidate model and collect its performance metrics
    for model in models:
        model.fit(X, y)
        y_pred = model.predict(X)
        sensitivity, specificity, accuracy, precision, roc_score = gen_model_performance(y, y_pred)
        scores = cross_val_score(model, X, y, cv=5)
        model_results = model_results.append({"Model": model.__class__.__name__,
                                              "Accuracy": accuracy, "Precision": precision,
                                              "CV Score": scores.mean() * 100.0,
                                              "Sensitivity": sensitivity, "Specificity": specificity,
                                              "ROC Score": roc_score}, ignore_index=True)
    return model_results

model_results = evaluate_model_score(X_train, y_train)
model_results

rfc_model = RandomForestClassifier()
refclasscol = X_train.columns
mid_imp_df = importances[importances.importance <= 0.015]
mid_imp_df = mid_imp_df[mid_imp_df.importance >= 0.0050]
mid_imp_df.plot.bar();
del mid_imp_df

selection = SelectFromModel(rfc_model, threshold=0.002, prefit=True)
X_train_select = selection.transform(X_train)
X_val_select = selection.transform(X_val)
print('Train dataset dimension before Feature Selection: {} rows, {} columns'.format(X_train.shape[0], X_train.shape[1]))
print('Train dataset dimension after Feature Selection: {} rows, {} columns'.format(X_train_select.shape[0], X_train_select.shape[1]))

model_results = evaluate_model_score(X_train_select, y_train)
model_results

final_rfc_model = RandomForestClassifier()
final_rf_scores = cross_val_score(final_rfc_model, X_train_select, y_train, cv=5)
final_rfc_model.fit(X_train_select, y_train)

y_trn_pred = final_rfc_model.predict(X_train_select)
sensitivity, specificity, accuracy, precision, roc_score = gen_model_performance(y_train, y_trn_pred)
print("Train Accuracy: %.2f%%, Precision: %.2f%%, CV Mean Score=%.2f%%, Sensitivity=%.2f%%, Specificity=%.2f%%" %
      (accuracy, precision, final_rf_scores.mean() * 100.0, sensitivity, specificity))
print('*****************************\n')

y_val_pred = final_rfc_model.predict(X_val_select)
sensitivity, specificity, accuracy, precision, roc_score = gen_model_performance(y_val, y_val_pred)
print("Validation Accuracy: %.2f%%, Precision: %.2f%%, Sensitivity=%.2f%%, Specificity=%.2f%%" %
      (accuracy, precision, sensitivity, specificity))
print('*****************************\n')
6.3 OUTPUT SCREENS:
Dataset Info:
This contains metadata of the IBM HR Employee Attrition dataset, which was downloaded from
Kaggle.
The dataset contains:
• 35 attributes
• 1470 entries
The target attribute of the dataset is “Attrition”, which determines whether the employee will leave
the company or stay, in a Yes or No format.
Data frame:
The CSV file “HR-Employee-Attrition.csv” is loaded into a DataFrame for ease of access to
attributes and other operations.
Attrition Target Variable Distribution:
The above snapshot depicts the percentage of samples that were classified as “Yes” and the
percentage that were classified as “No”. Here, the value_counts() function is applied to count
the number of “Yes” and “No” values in the given data. 1233 samples are classified as “No”,
which makes up 83.88% of the total, and 237 samples (16.12%) are classified as “Yes”.
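The computation can be sketched with value_counts on a small hypothetical series (the real dataset has 1233 “No” and 237 “Yes” samples):

```python
import pandas as pd

# Hypothetical miniature of the Attrition column.
attrition = pd.Series(["No"] * 5 + ["Yes"] * 1)

counts = attrition.value_counts()                   # raw counts per label
percent = attrition.value_counts(normalize=True) * 100  # percentage per label
print(counts["No"], counts["Yes"])  # 5 1
print(round(percent["No"], 2))      # 83.33
```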
Attrition Distribution Bar Plot:
The above frequency percentages of “Yes” and “No” are represented graphically in the form of a bar
plot. The matplotlib library is used, and the graph is drawn with the “plot” function. The
title of the graph, “Attrition Distribution”, is set using the “set_title” function.
Adding categorical variable:
Two features have been added to the dataframe: “Attrition_Yes” and “Attrition_No”. If the Attrition is “Yes”
for a particular row, then Attrition_Yes will be 1 and Attrition_No will be 0; similarly,
“Attrition_No” will be 1 and “Attrition_Yes” will be 0 when the Attrition is “No”.
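The mapping described above can be sketched as:

```python
import pandas as pd

# Hypothetical miniature of the Attrition column.
df = pd.DataFrame({"Attrition": ["Yes", "No", "No"]})

# map turns the Yes/No label into two complementary 0/1 indicator columns.
df["Attrition_Yes"] = df["Attrition"].map({"Yes": 1, "No": 0})
df["Attrition_No"] = df["Attrition"].map({"Yes": 0, "No": 1})
print(df["Attrition_Yes"].tolist())  # [1, 0, 0]
print(df["Attrition_No"].tolist())   # [0, 1, 1]
```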
BusinessTravel Vs Attrition:
Visualization is performed to compare attrition against other attributes. This is the snapshot for
one such attribute, "BusinessTravel". It contains three classes: "Non-Travel",
"Travel_Frequently" and "Travel_Rarely". "Travel_Rarely" is the class with the highest number
of "No" values for the "Attrition" attribute.
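One way such a comparison plot can be produced is with a cross-tabulation followed by a grouped bar plot; the per-class counts below are made up for illustration and are not the real data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import pandas as pd

# Hypothetical rows, for illustration only.
df = pd.DataFrame({
    "BusinessTravel": ["Non-Travel"] * 3 + ["Travel_Frequently"] * 4
                      + ["Travel_Rarely"] * 8,
    "Attrition": ["No", "No", "Yes",
                  "No", "Yes", "Yes", "No",
                  "No", "No", "No", "No", "No", "No", "Yes", "No"],
})

# Cross-tabulate travel class against attrition, then draw grouped bars.
table = pd.crosstab(df["BusinessTravel"], df["Attrition"])
ax = table.plot(kind="bar")
ax.set_title("BusinessTravel Vs Attrition")
```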
MaritalStatus Vs Attrition:
The above snapshot plots Employee Attrition against Marital Status. Marital Status has the
classes "Divorced", "Married" and "Single". As the snapshot shows, employees with marital
status "Married" mostly have attrition "No", while employees with "Single" marital status have a
higher percentage of attrition "Yes".
JobRole Vs Attrition:
The above snapshot plots JobRole against Attrition. JobRole has 9 different classes. The class
with the highest percentage of Attrition_No is "Sales Executive", and the class with the highest
percentage of Attrition_Yes is "Laboratory Technician".
JobSatisfaction Vs Attrition:
The above snapshot plots "JobSatisfaction" against "Attrition". JobSatisfaction has 4 levels.
Employees with JobSatisfaction level 4 have the highest percentage of Attrition_No, while those
at level 3 have the highest percentage of Attrition_Yes.
WorkLifeBalance Vs Attrition:
The above snapshot plots "WorkLifeBalance" against "Attrition". WorkLifeBalance has 4 levels
of ranking. Level 3 has the highest percentage of both Attrition_No and Attrition_Yes.
EnvironmentSatisfaction Vs Attrition:
Addition of new features:
Tenure per job: People who have worked for many companies, but only for short periods at each
organization, tend to leave early, as they always need a change of organization to keep them
going.
Years without change: For any person, a change in role, job level or responsibility is needed to
keep the work exciting. We create a variable that measures how many years an employee has gone
without any sort of change, using promotion, role and job changes as metrics to cover different
variants of change.
Compensation ratio: the ratio of an employee's actual pay to the midpoint of a salary range. The
salary range can be that of the employee's department, organization or role; the benchmark can
be the organization's pay scale or the industry average.
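The three features could be derived roughly as below. The input column names come from the IBM dataset, but the exact formulas used in the report are not shown, so the proxies here (average stay per company with a zero-guard, the minimum of the no-promotion/no-role-change spans, and pay relative to the department median) are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "TotalWorkingYears": [10, 8],
    "NumCompaniesWorked": [5, 0],
    "YearsSinceLastPromotion": [2, 4],
    "YearsInCurrentRole": [3, 4],
    "MonthlyIncome": [5000, 7000],
    "Department": ["Sales", "Sales"],
})

# Tenure per job: average stay per company (treat 0 companies as 1).
df["TenurePerJob"] = df["TotalWorkingYears"] / df["NumCompaniesWorked"].replace(0, 1)

# Years without change: shortest "no promotion / no role change" span.
df["YearsWithoutChange"] = df[["YearsSinceLastPromotion",
                               "YearsInCurrentRole"]].min(axis=1)

# Compensation ratio: pay relative to the department's median income.
df["CompRatio"] = df["MonthlyIncome"] / \
    df.groupby("Department")["MonthlyIncome"].transform("median")

print(df["TenurePerJob"].tolist())        # [2.0, 8.0]
print(df["YearsWithoutChange"].tolist())  # [2, 4]
```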
Removing features:
Creating dummy variables:
The above snapshot shows the creation of dummy variables for each categorical variable with
more than 2 classes: one fewer dummy column than the number of categories is created for each
attribute. Machine learning models work only on numerical data, so categorical features need to
be transformed into numerical ones. One of the best strategies is to convert each category value
into a new column and assign a 1 or 0 (True/False) value to it.
This avoids weighting a value improperly, at the cost of adding more columns to the dataset.
This approach is also called "One-Hot Encoding". The Pandas function get_dummies achieves
this transformation.
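A sketch of this encoding on the three BusinessTravel classes; drop_first=True implements the "one column fewer than the number of categories" scheme described above:

```python
import pandas as pd

df = pd.DataFrame({"BusinessTravel": ["Non-Travel", "Travel_Rarely",
                                      "Travel_Frequently"]})

# drop_first=True drops the first category, leaving one column fewer
# than the number of categories for this attribute.
dummies = pd.get_dummies(df, columns=["BusinessTravel"], drop_first=True)

print(sorted(dummies.columns))
# ['BusinessTravel_Travel_Frequently', 'BusinessTravel_Travel_Rarely']
```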
Correlation matrix:
A correlation matrix is a table showing correlation coefficients between variables; each cell in the
table shows the correlation between two variables. A correlation matrix is used to summarize
data, as an input to more advanced analysis, and as a diagnostic for advanced analyses.
The values range between -1.0 and 1.0; a calculated number greater than 1.0 or less than -1.0
indicates an error in the correlation measurement. A correlation of -1.0 shows a perfect negative
correlation, a correlation of 1.0 shows a perfect positive correlation, and a correlation of 0.0
shows no linear relationship between the two variables.
Here, the matrix is used to identify the relationships between attrition and the other important
features. The above snapshot shows that MonthlyIncome and YearsWithoutChange are positively
correlated, while YearsWithoutChange and Attrition are negatively correlated.
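A small sketch of computing such a matrix; the two columns are synthetic stand-ins constructed so that the second is driven by the first, giving a clear positive correlation like the MonthlyIncome / years-without-change pair described above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(5000, 1000, size=200)
df = pd.DataFrame({
    "MonthlyIncome": income,
    # Driven by income plus noise -> positively correlated with it.
    "YearsWithoutChange": income * 0.001 + rng.normal(0, 1, size=200),
})

corr = df.corr()  # pairwise Pearson correlation coefficients
print(corr.loc["MonthlyIncome", "YearsWithoutChange"])  # positive
```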
Function for generating the train and test data:
The above snapshot shows the splitting of the data for training and testing; for all of the models,
both training and validation phases are carried out. The train_test_split() method is used to split
the data: 70% is used for training and 30% for testing. The model is trained on 70% of the data
(1029 entries) and validated on the remaining 30% (441 entries), which is unseen data for the
model.
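The split can be sketched as follows, with placeholder features of the same shape as the report's data; stratify=y keeps the Yes/No ratio identical in both splits, matching the holdout approach described in the conclusion:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same shape as the report's data: 1470 rows, 1233 "No" (0) / 237 "Yes" (1).
X = np.arange(1470 * 2).reshape(1470, 2)
y = np.array([0] * 1233 + [1] * 237)

# 70/30 split; stratify=y preserves the target distribution in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # 1029 441
```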
Model performance evaluation:
The performance of the classification models is evaluated using metrics such as accuracy,
precision, sensitivity, specificity and CV score.
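The gen_model_performance helper used in the code earlier is not shown in this section; the version below is a plausible reconstruction from the metrics it returns and the percentage formatting in the print statements. Computing roc_score from hard class predictions is an assumption:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def gen_model_performance(y_true, y_pred):
    """Hypothetical reconstruction of the report's metrics helper."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = 100.0 * tp / (tp + fn)              # recall on "Yes"
    specificity = 100.0 * tn / (tn + fp)              # recall on "No"
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    precision = 100.0 * tp / (tp + fp)
    roc_score = roc_auc_score(y_true, y_pred)
    return sensitivity, specificity, accuracy, precision, roc_score

# Toy check: one false positive out of four samples.
sens, spec, acc, prec, roc = gen_model_performance([0, 0, 1, 1], [0, 1, 1, 1])
print(acc)  # 75.0
```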
Performance Evaluation Of Algorithms:
After computing these metrics, we evaluated the training data for the chosen algorithms:
Random Forest Classifier, Logistic Regression, Gradient Boosting Classifier and Gaussian NB,
and compared their performance.
Feature Importance:
Feature importance refers to techniques that calculate a score for each input feature of a given
model. The scores represent the "importance" of each feature: a higher score means the feature
has a larger effect on the model used to predict the target variable. Like a correlation matrix,
feature importance helps you understand the relationship between the features and the target
variable. It can also be used to reduce the dimensionality of the model: features with higher
scores are usually kept, while those with lower scores are dropped as unimportant for the model.
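With a Random Forest (one of the models used in this project), the scores come from the feature_importances_ attribute; the synthetic data below stands in for the engineered HR features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the engineered HR features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One non-negative score per feature; the scores sum to 1.
importances = rf.feature_importances_
ranked = sorted(enumerate(importances), key=lambda t: t[1], reverse=True)
print(ranked[0])  # index and score of the most important feature
```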
Validating the performance of algorithms on test data:
Model with high Accuracy
The final model chosen is the logistic regression model; its accuracy is 87.53 percent.
Predicting the attrition for an employee:
The above snapshot shows a test case performed on the final model.
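Such a test case boils down to passing one employee's feature vector to the trained model's predict method. The sketch below uses a tiny hand-made training set as a stand-in for the real fitted logistic regression model, so the numbers are illustrative only; the prediction call has the same shape:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in for the trained final model; the real one is fit on the
# engineered HR features.
X_train = np.array([[1.0, 0.0], [0.8, 0.1], [0.2, 0.9], [0.1, 1.0]])
y_train = np.array([0, 0, 1, 1])  # 1 = employee leaves
model = LogisticRegression().fit(X_train, y_train)

employee = np.array([[0.15, 0.95]])         # one employee's feature vector
print(model.predict(employee))              # predicted class
print(model.predict_proba(employee)[0, 1])  # probability of leaving
```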
7.CONCLUSION
We applied machine learning techniques to identify the factors that may contribute to an
employee leaving the company and, above all, to predict the likelihood of individual employees
leaving. First, we assessed the data statistically and then classified it. The dataset was split into a
training phase and a test phase, guaranteeing the same distribution of the target variable (through
the holdout technique).
We selected various classification algorithms and, for each of them, carried out the training and
validation phases. To evaluate the algorithms' performance, the predicted results were collected
and fed into the respective confusion matrices. From these it was possible to calculate the basic
metrics necessary for an overall evaluation (precision, recall, accuracy, F1 score, ROC curve,
AUC, etc.) and to identify the most suitable classifier for predicting whether an employee was
likely to leave the company.
The algorithm that produced the best results for the available dataset was the Gaussian Naïve
Bayes classifier: it showed the best recall rate (0.54), a metric that measures the ability of a
classifier to find all the positive instances, and achieved an overall false negative rate equal to
4.5% of the total observations. The results obtained by the proposed automatic predictor indicate
that the main attrition variables are monthly income, age, overtime and distance from home.
The results obtained from the data analysis represent a starting point for the development of
increasingly efficient employee attrition classifiers. Using larger datasets (or simply updating the
dataset periodically), applying feature engineering to identify new significant characteristics, and
gathering additional information on employees would improve the overall understanding of the
reasons why employees leave their companies and, consequently, increase the time available to
personnel departments to assess and plan the tasks required to mitigate this risk.
8.REFERENCES
1. Dataset: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
2. Confusion matrix terminology: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology