
HUMAN RESOURCE ANALYTICS

A main project report submitted in partial fulfilment of the requirements for the award
of the degree of

BACHELOR OF TECHNOLOGY
in

COMPUTER SCIENCE AND ENGINEERING


by
T. VYSHALI Regd.No.18131A05I3
U. GOWTHAMI DEVI Regd.No.18131A05I9
V. VAISHNAVI Regd.No.18131A05K2
N. L. AVANTHIKA Regd.No.18131A05N3

Under the esteemed guidance of


Mrs. P. AKHILA
(Assistant Professor)
Department of Computer Science and Engineering

GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING


(AUTONOMOUS)
(Affiliated to JNTU, Kakinada, AP)
VISAKHAPATNAM-530048
2021-2022
i
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING
(AUTONOMOUS)
VISAKHAPATNAM-530048

CERTIFICATE

This is to certify that the main project report entitled

“HUMAN RESOURCE ANALYTICS” being

submitted by

T. VYSHALI Regd.No.18131A05I3
U. GOWTHAMI DEVI Regd.No.18131A05I9
V. VAISHNAVI Regd.No.18131A05K2
N. L. AVANTHIKA Regd.No.18131A05N3

in their VIII semester, in partial fulfilment of the requirements for the award of the
Degree of Bachelor of Technology in Computer Science and Engineering,
during the academic year 2021-2022

Mrs. P. Akhila Dr. D.N.D. Harini


Assistant Professor Head of the Department
Project Guide Computer Science and Engineering

ii
DECLARATION

We hereby declare that this main project entitled “HUMAN RESOURCE ANALYTICS” is a
bonafide work done by us and submitted to the Department of Computer Science and
Engineering, Gayatri Vidya Parishad College of Engineering (Autonomous), Visakhapatnam,
in partial fulfilment of the requirements for the award of the degree of B.Tech. It is our own work
and has not been submitted to any other university or published at any time before.

PLACE:VISAKHAPATNAM T. VYSHALI(18131A05I3)
U. GOWTHAMI DEVI(18131A05I9)
V. VAISHNAVI(18131A05K2)
N.L. AVANTHIKA(18131A05N3)

iii
ACKNOWLEDGEMENT

We thank Dr. A. B. KOTESWARA RAO, Principal, Gayatri Vidya Parishad College of
Engineering (Autonomous), for extending his utmost support and cooperation in providing all the
provisions for the successful completion of the project.

We consider it our privilege to express our deepest gratitude to Dr. D.N.D. HARINI, Associate
professor and Head of the Department of Computer Science and Engineering, for her valuable
suggestions and constant motivation that greatly helped the project work to get successfully
completed.

We are extremely thankful to Mrs. P. AKHILA, Assistant Professor, Computer Science and
Engineering for giving us an opportunity to do this project and providing us support and guidance
which helped us to complete the project on time.

We also thank our coordinator, Dr. CH. SITA KUMARI, Sr. Assistant Professor, Department
of Computer Science and Engineering, for the kind suggestions and guidance for the successful
completion of our project work.

We also thank all the members of the staff of Computer Science and Engineering for their sustained
help in our pursuits. We thank all those who contributed, directly or indirectly, to successfully
carrying out this work.

TONTEPU VYSHALI (Regd.No: 18131A05I3)


UDAMALA GOWTHAMI DEVI (Regd.No: 18131A05I9)
VEEERAMALLA VAISHNAVI (Regd.No: 18131A05K2)
NIMISHAKAVI AVANTHIKA (Regd.No: 18131A05N3)

iv
ABSTRACT

HUMAN RESOURCE ANALYTICS

Nowadays, employee attrition has become a serious threat to a company’s competitive advantage, since it is
very expensive to find, hire and train new talent. A few years ago this analysis was done manually, but in the
era of machine learning and data analytics a company’s HR department can use analytics tools to identify
which areas need to change so that more of its employees stay. Attrition is a big problem in any industry,
whether it is employee attrition in an organization or customer attrition on an e-commerce site. If we can
accurately predict which customer or employee will leave their current company or organization, it saves the
employer considerable time, effort, and cost, and helps them hire or acquire substitutes in advance, so that the
ongoing progress of the organization is not disrupted. Here a comparative analysis between various machine
learning approaches, namely Naïve Bayes, random forest, logistic regression, and gradient boosting, is
presented. The presented results help in identifying the behaviour of employees who are likely to leave in the
near future.

KEYWORDS:
 Attrition, Logistic regression, Gaussian Naïve Bayes, Random Forest Classifier, Gradient
Boosting Classifier.

v
TABLE OF CONTENTS

1. INTRODUCTION 1
2. SOFTWARE REQUIREMENT ANALYSIS 2
2.1 SOFTWARE DESCRIPTION 2
2.2 PANDAS 3
2.2.1 INTRODUCTION 3
2.2.2 OPERATIONS USING PANDAS 3
2.2.3 PANDAS OBJECT 4
2.3 NUMPY 4
2.3.1 INTRODUCTION 4
2.3.2 OPERATIONS USING NUMPY 5
2.4 SEABORN 5
2.4.1 INTRODUCTION 5
2.4.2 OPERATIONS USING SEABORN 5
2.5 FOLIUM 6
2.5.1 INTRODUCTION 6
2.5.2 OPERATIONS USING FOLIUM 6
2.6 MATPLOTLIB 6
2.6.1 INTRODUCTION 6
2.7 SCIKIT LEARN 7
2.7.1 INTRODUCTION 7
2.7.2 OPERATIONS USING SCIKIT LEARN 7
2.8 MACHINE LEARNING 8
2.8.1 INTRODUCTION 8
2.8.2 RANDOM FOREST CLASSIFIER 8
2.8.3 GAUSSIAN NAÏVE BAYES 9
2.8.4 GRADIENT BOOSTING CLASSIFIER 11

vi
3. SOFTWARE SYSTEM DESIGN 12
3.1 PROCESS FLOW DIAGRAM 13
3.2 CLASS DIAGRAM 14
3.3 INTERACTION DIAGRAM 15
3.3.1 SEQUENCE DIAGRAM 16
3.3.2 COLLABORATION DIAGRAM 17
3.4 ACTIVITY DIAGRAM 18
3.5 USE CASE DIAGRAM 20
4. SRS DOCUMENT 21
4.1 FUNCTIONAL REQUIREMENTS 22
4.2 NON FUNCTIONAL REQUIREMENTS 22
4.3 MINIMUM HARDWARE REQUIREMENTS 23
4.4 MINIMUM SOFTWARE REQUIREMENTS 23
5. TESTING 24
5.1 TESTING STRATEGIES 24
6. OUTPUT 26
6.1 SYSTEM IMPLEMENTATION 26
6.2 SOURCE CODE 28
6.3 OUTPUT SCREENS 36
7. CONCLUSION 58
8. REFERENCES 59

vi
1.INTRODUCTION

HR teams put in constant effort to improve their hiring process and bring the best talent into the
organization. Even when hiring managers focus on the behavioural and cultural-fit aspects of a candidate
along with impressive experience and skill sets, HR teams are often unable to evaluate the
long-term success of a future candidate, leading to high voluntary attrition.
The key to success in an organization is the ability to attract and retain top talent. It is vital for the
Human Resource (HR) Department to identify the factors that keep employees and those that prompt
them to leave.
Organizations could do more to prevent the loss of good people. They invest significant
resources in hiring and training new employees, along with running training programs for their existing
employees.
All of this is done with the presumption of improving employee productivity, with a significant gestation
period. High voluntary attrition can be detrimental to both the organization’s growth and the
existing employees’ morale and business continuity, and it has a significant impact on the bottom
line.

1
2.SOFTWARE REQUIREMENT ANALYSIS

2.1 SOFTWARE DESCRIPTION


Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows
anybody to write and execute arbitrary Python code through the browser, and is especially
well suited to machine learning, data analysis and education. More technically, Colab is
a hosted Jupyter notebook service that requires no setup to use, while providing free
access to computing resources including GPUs. Colab works with most major browsers,
and is most thoroughly tested with the latest versions of Chrome, Firefox and Safari.
The amount of memory available in Colab virtual machines varies over time (but is stable
for the lifetime of the VM); adjusting memory over time allows Google to continue to offer
Colab for free. You may sometimes be automatically assigned a VM with extra memory
when Colab detects that you are likely to need it. Users interested in having more memory
available to them in Colab, and more reliably, may be interested in Colab Pro and Pro+
or Colab GCP Marketplace VMs.
Google Colab provides many of the exciting features that any modern IDE offers, and much
more. Some of the most notable features are listed below.
 Interactive tutorials to learn machine learning and neural networks.
 Write and execute Python 3 code without having a local setup.
 Execute terminal commands from the Notebook.
 Import datasets from external sources such as Kaggle.
 Save your Notebooks to Google Drive.
 Import Notebooks from Google Drive.

2
2.2 PANDAS

2.2.1 INTRODUCTION
Pandas is an open-source Python library providing high-performance data
manipulation and analysis tools built on its powerful data structures. Prior to Pandas,
Python was mainly used for data munging and preparation and contributed very little
towards data analysis; Pandas solved this problem. Using Pandas, the five typical steps
in the processing and analysis of data can be accomplished regardless of the origin of
the data: load, prepare, manipulate, model, and analyse.
Python with Pandas is used in a wide range of fields, including academic and commercial
domains such as finance, economics, statistics, and analytics.

2.2.2 OPERATIONS USING PANDAS


Using PANDAS, the following functions can be performed.
 Loading data, for example by reading a CSV file.
 Column insertion and deletion.
 Data selection and sorting.
 Column and row renaming.
 Handling missing values and duplicated data.
 Data exploration and visualisation.

3
2.2.3 PANDAS OBJECT
The most commonly used pandas function for reading CSV files and operating on them is read_csv.
Syntax: pd.read_csv(“filename”)
It reads the comma-separated file with the given filename.
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure
with labelled axes (rows and columns); that is, data is aligned in a tabular fashion in rows and columns.
A DataFrame consists of three principal components: the data, the rows, and the columns.
Syntax
obj = pd.DataFrame(list)
It creates a DataFrame from the given list.
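A minimal sketch of the two calls described above (the CSV file name matches the project's dataset and is assumed to be present in the working directory):

import pandas as pd

# Read a comma-separated file into a DataFrame
emp = pd.read_csv("HR-Employee-Attrition.csv")
print(emp.shape)    # (number of rows, number of columns)
print(emp.head())   # first five rows

# Build a DataFrame directly from a Python list
obj = pd.DataFrame([["Research & Development", 2], ["Sales", 5]],
                   columns=["Department", "YearsAtCompany"])
print(obj)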

2.3 NUMPY

2.3.1 INTRODUCTION
NumPy is a Python package; its name stands for 'Numerical Python'. It is a library consisting of
multidimensional array objects and a collection of routines for processing arrays.
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package Numarray was
also developed, having some additional functionalities. In 2005, Travis Oliphant created NumPy
package by incorporating the features of Numarray into Numeric package. There are many
contributors to this open source project.

4
2.3.2 OPERATIONS USING NUMPY
Using NumPy, a developer can perform the following operations; a short example follows the list.
 Mathematical and logical operations on arrays.
 Fourier transforms and routines for shape manipulation.
 Operations related to linear algebra. NumPy has built-in functions for linear algebra and
random number generation.
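A short, self-contained example of these operations (illustrative values only):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

print(a + 10)                 # mathematical operation applied element-wise
print(a > 2)                  # logical (element-wise) comparison
print(a.reshape(4))           # shape manipulation
print(np.fft.fft(a.ravel()))  # a one-dimensional Fourier transform
print(np.linalg.inv(a))       # linear algebra: matrix inverse
print(np.random.rand(2, 2))   # random number generation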

2.4 SEABORN: Data exploration and Visualization

2.4.1 INTRODUCTION
Seaborn is an amazing visualization library for statistical graphics plotting in
Python. It provides beautiful default styles and color palettes to make statistical
plots more attractive. It is built on top of the matplotlib library and is also closely
integrated with the data structures from pandas.

Seaborn aims to make visualization a central part of exploring and understanding data.
It provides dataset-oriented APIs, so that switching between different visual
representations of the same variables is easy and helps to better understand the dataset.

2.4.2 OPERATIONS USING SEABORN


Seaborn is used for data visualization and for discovering patterns in the data, as in the sketch below.
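A minimal sketch of such exploration, using one of seaborn's bundled example datasets (downloading it requires an internet connection on first use) rather than the project data:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                  # small example dataset shipped with seaborn

sns.countplot(x="day", data=tips)                # frequency of a categorical column
plt.show()

sns.boxplot(x="day", y="total_bill", data=tips)  # distribution of a numeric column per category
plt.show()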

5
2.5 FOLIUM

2.5.1 INTRODUCTION
Folium is built on the data wrangling strengths of the Python ecosystem and the mapping
strengths of the Leaflet.js (JavaScript) library: you manipulate the data in Python and
then visualize it on a Leaflet map via Folium. Folium makes it easy to visualize data
that has been manipulated in Python on an interactive Leaflet map. The library has a
number of built-in tilesets from OpenStreetMap, Mapbox, etc.

Command to install the folium module:

pip install folium

2.5.2 OPERATIONS USING FOLIUM

Plotting maps with Folium is easier than one might think. Folium provides the folium.Map()
class, which takes a location parameter in terms of latitude and longitude and generates a
map around it.
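A minimal sketch; the coordinates below are approximate values for Visakhapatnam and are used only as an example:

import folium

vizag = [17.6868, 83.2185]                             # approximate latitude and longitude

m = folium.Map(location=vizag, zoom_start=12)          # map centred on the given location
folium.Marker(vizag, popup="Visakhapatnam").add_to(m)  # add a marker at the same point
m.save("map.html")                                     # open the saved HTML file in a browser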

2.6 MATPLOTLIB

2.6.1 INTRODUCTION
Matplotlib is one of the most popular Python packages used for data visualization. It is a
cross-platform library for making 2D plots from data in arrays. Matplotlib is written in
Python and makes use of NumPy, the numerical mathematics extension of Python.
Matplotlib along with NumPy can be considered as the open-source equivalent of MATLAB.

Syntax: import matplotlib.pyplot as plt

Pyplot is a state-based interface to Matplotlib that provides a MATLAB-like way of plotting.
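A minimal pyplot sketch of a 2D plot built from array data:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")   # 2D line plot from array data
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("A simple pyplot figure")
plt.legend()
plt.show()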

6
2.7 SCIKIT LEARN

2.7.1 INTRODUCTION
Scikit-learn (sklearn) is the most useful and robust library for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical
modelling, including classification, regression, clustering and dimensionality reduction,
via a consistent interface in Python.
Scikit-learn is an indispensable part of the Python machine learning toolkit at JPMorgan.
It is very widely used across all parts of the bank for classification, predictive analytics,
and very many other machine learning tasks. Its straightforward API, its breadth of
algorithms, and the quality of its documentation combine to make scikit-learn
simultaneously very approachable and very powerful.

2.7.2 OPERATIONS USING SCIKIT LEARN:

Scikit-learn is used here for algorithms such as Gaussian NB, Random Forest Classifier, Logistic Regression,
and Gradient Boosting Classifier; a minimal workflow sketch follows this list.
Important features of scikit-learn:
 Simple and efficient tools for predictive data analysis.
 It features various classification, regression, and clustering algorithms, including support vector
machines, random forests, gradient boosting, k-means, etc.
 Accessible to everybody, and reusable in various contexts. Built on NumPy, SciPy, and matplotlib.
 Open source, commercially usable (BSD license).
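A minimal sketch of the usual scikit-learn fit/predict workflow, using a small synthetic dataset rather than the project's employee data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Synthetic data standing in for the employee dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # fit on the training split
y_pred = clf.predict(X_te)                                # predict on the held-out split
print("accuracy:", accuracy_score(y_te, y_pred))
print(classification_report(y_te, y_pred))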

7
2.8 MACHINE LEARNING ALGORITHM

2.8.1 INTRODUCTION
Machine learning algorithms build a mathematical model based on sample data, known as "training
data", in order to make predictions or decisions without being explicitly programmed to do so.
Machine learning is closely related to computational statistics, which focuses on making predictions
using computers.

2.8.2 RANDOM FOREST CLASSIFIER


The random forest classifier is a supervised learning algorithm which can be used for regression and
classification problems. It is among the most popular machine learning algorithms due to its high
flexibility and ease of implementation.
It consists of multiple decision trees, just as a forest has many trees. On top of that, it uses
randomness to enhance its accuracy and combat overfitting, which can be a huge issue for such a
sophisticated algorithm. The algorithm builds decision trees on random selections of data
samples, gets a prediction from every tree, and then selects the best viable solution through voting.
HOW DOES IT WORK:
Assuming the dataset has “m” features, the random forest will randomly choose “k” features, where k <
m. The algorithm then calculates the root node among the k features by picking the node that has
the highest information gain.
After that, the algorithm splits the node into child nodes and repeats this process “n” times, which
yields a forest with n trees. Each tree is trained on a bootstrap sample of the data, and finally the
results of all the decision trees in the forest are aggregated. It is certainly one of the most
sophisticated algorithms, as it builds on the functionality of decision trees.
Technically, it is an ensemble algorithm. The algorithm generates the individual decision trees
using an attribute selection measure, such as information gain. Every tree relies on an independent
random sample. In a classification problem, every tree votes and the most popular class is the end
result. On the other hand, in a regression problem, the average of all the tree outputs is computed
and that is the end result.
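A minimal sketch of the procedure described above on tiny synthetic data (not the project's dataset); n_estimators plays the role of the number of trees "n" and max_features the number of features "k" considered at each split:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic data purely for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))              # 200 samples, m = 6 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # a simple rule for the trees to learn

# Each tree is grown on a bootstrap sample (bootstrap=True) and considers a
# random subset of features at every split (max_features="sqrt").
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=42)
rf.fit(X, y)
print(rf.predict(X[:5]))          # majority vote of the 100 trees
print(rf.predict_proba(X[:5]))    # averaged per-class vote fractions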

8
Fig.2.8.2.1 Random Forest Classifier

2.8.3 GAUSSIAN NAÏVE BAYES:


When working with continuous data, an assumption often made is that the continuous values
associated with each class are distributed according to a normal (or Gaussian) distribution. The
likelihood of the features is assumed to be

P(x_i | y) = (1 / sqrt(2 * pi * sigma_y^2)) * exp( -(x_i - mu_y)^2 / (2 * sigma_y^2) )

where mu_y and sigma_y are the mean and standard deviation of feature x_i for class y (see
Fig.2.8.3.1). Sometimes the variance is assumed to be independent of Y (i.e., sigma_i), independent
of X_i (i.e., sigma_k), or both (i.e., sigma).
Gaussian Naive Bayes supports continuous valued features and models each as conforming to a Gaussian
(normal) distribution.
An approach to creating a simple model is to assume that the data is described by a Gaussian distribution with no
covariance (independent dimensions) between dimensions. This model can be fit by simply finding the mean
and standard deviation of the points within each label, which is all that is needed to define such a distribution.
The illustration indicates how a Gaussian Naive Bayes (GNB) classifier works. At every data point, the z-
score distance between that point and each class-mean is calculated, namely the distance from the class mean
divided by the standard deviation of that class.
Thus, we see that the Gaussian Naive Bayes has a slightly different approach and can be used efficiently.
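A minimal sketch on synthetic data (values are illustrative, not from the project's dataset), showing that fitting a Gaussian Naive Bayes model amounts to a per-class mean and standard deviation:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),     # class 0 centred around 0
               rng.normal(3.0, 1.0, size=(50, 2))])    # class 1 centred around 3
y = np.array([0] * 50 + [1] * 50)

# The "fit" is just a mean and standard deviation per class and per feature
for c in (0, 1):
    print("class", c, "mean:", X[y == c].mean(axis=0), "std:", X[y == c].std(axis=0))

gnb = GaussianNB().fit(X, y)
print(gnb.theta_)          # per-class feature means learned by scikit-learn
print(gnb.predict(X[:3]))  # class labels chosen by the highest posterior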

9
Fig.2.8.3.1 Likelihood of Features

Fig.2.8.3.2 Gaussian Naïve Bayes

10
2.8.4 GRADIENT BOOSTING CLASSIFIER:
In Gradient Boosting, each predictor tries to improve on its predecessor by reducing the errors. But the
fascinating idea behind Gradient Boosting is that instead of fitting a predictor on the data at each
iteration, it actually fits a new predictor to the residual errors made by the previous predictor.
For every instance in the training set, it calculates the residuals for that instance, or, in other words, the
observed value minus the predicted value.
Once it has done this, it builds a new decision tree that tries to predict the residuals that were
previously calculated. However, this is where it gets slightly tricky in comparison with gradient
boosting regression.
When building a decision tree, there is a set number of leaves allowed. This can be set as a parameter
by the user, and is usually between 8 and 32. This leads to two possible outcomes:
Multiple instances fall into the same leaf
A single instance has its own leaf
Unlike gradient boosting for regression, where we could simply average the instance values to get an
output value and leave the single instance as a leaf of its own, for classification we have to transform
these values using a formula:

Fig.2.8.4 Regression Average
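For reference, a sketch of the standard textbook transformation for binary log-loss gradient boosting (assumed to be the quantity the figure above refers to): each leaf's output is the sum of the residuals in the leaf divided by the sum of previous_probability × (1 − previous_probability) over the same instances.

import numpy as np

def leaf_output(residuals, previous_probs):
    # Standard leaf value for binary log-loss gradient boosting:
    # sum(residuals) / sum(p * (1 - p)) over the instances in the leaf
    residuals = np.asarray(residuals, dtype=float)
    previous_probs = np.asarray(previous_probs, dtype=float)
    return residuals.sum() / (previous_probs * (1.0 - previous_probs)).sum()

# Two instances falling into the same leaf, both previously predicted with p = 0.7
print(leaf_output([0.3, -0.7], [0.7, 0.7]))   # approximately -0.95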

11
3.SOFTWARE SYSTEM DESIGN

System design is the process of designing the elements of a system such as the architecture, modules and
components, the different interfaces of those components and the data that goes through that system.
System Analysis is the process that decomposes a system into its component pieces for the purpose of
defining how well those components interact to accomplish the set requirements.
The purpose of the System Design process is to provide sufficient detailed data and information about
the system. The purpose of the design phase is to plan a solution of the problem specified by the
requirement document. This phase is the first step in moving from problem domain to the solution
domain. The design of a system is perhaps the most critical factor affecting the quality of the software,
and has a major impact on the later phases, particularly testing and maintenance.
The design activity is often divided into two separate phases: system design and detailed design. System
design, which is sometimes also called top-level design, aims to identify the modules that should be in
the system, the specifications of these modules, and how they interact with each other to produce the
desired results.
A design methodology is a systematic approach to creating a design by the application of a set of techniques
and guidelines. Most methodologies focus on system design. The two basic principles used in any design
methodology are problem partitioning and abstraction, where abstraction is a concept closely related to
problem partitioning.

12
3.1 PROCESS FLOW DIAGRAM:

A process flowchart is a graphical representation of a business process. It is used as a means of
getting a top-down understanding of how a process works, what steps it consists of, what events
change outcomes, and so on.

Fig.3.1 Process Flow Diagram

13
3.2 CLASS DIAGRAM:

The class diagram can be used to show the classes, relationships, interfaces, associations, and
collaborations. Class diagrams are standardized in UML.
The main purposes of using class diagrams are:
• It is the only UML diagram which can appropriately depict the various aspects of the OOP concept.
• Proper design and analysis of the application can be faster and more efficient.
• Each class is represented by a rectangle subdivided into three compartments: name, attributes
and operations.
• There are three types of modifiers which are used to decide the visibility of attributes and operations:
• + is used for public visibility (visible to everyone).
• # is used for protected visibility (visible to friend and derived classes).
• – is used for private visibility (visible only within the class).

Fig.3.2 Class Diagram

14
3.3 INTERACTION DIAGRAMS:

From the term Interaction, it is clear that the diagram is used to describe some type of interactions
among the different elements in the model. This interaction is a part of dynamic behavior of the system.
This interactive behavior is represented in UML by two diagrams known as Sequence diagram and
Collaboration diagram. The basic purpose of both diagrams is similar: a sequence diagram emphasizes
the time sequence of messages, while a collaboration diagram emphasizes the structural organization
of the objects that send and receive messages. The purpose of interaction diagrams is to
visualize the interactive behavior of the system. Visualizing the interaction is a difficult task. Hence,
the solution is to use different types of models to capture the different aspects of the interaction.
Sequence and collaboration diagrams are used to capture the dynamic nature but from a different angle.
The purposes of interaction diagrams are:
• To capture the dynamic behaviour of a system.
• To describe the message flow in the system.
• To describe the structural organization of the objects.
• To describe the interaction among objects.
The main purpose of both diagrams is similar, as they are used to capture the dynamic behaviour of a system.
However, the specific purpose is more important to clarify and understand. Sequence diagrams are
used to capture the order of messages flowing from one object to another. Collaboration diagrams are
used to describe the structural organization of the objects taking part in the interaction. A single
diagram is not sufficient to describe the dynamic aspect of an entire system, so a set of diagrams are
used to capture it as a whole. Interaction diagrams are used when we want to understand the message
flow and the structural organization. Message flow means the sequence of control flow from one object
to another.

15
3.3.1 SEQUENCE DIAGRAM:

The sequence diagram has four objects (Customer, Order, SpecialOrder and NormalOrder).The
following diagram shows the message sequence for SpecialOrder object and the same can be used in
case of NormalOrder object. It is important to understand the time sequence of message flows. The
message flow is nothing but a method call of an object. The first call is sendOrder () which is a method
of Order object. The next call is confirm () which is a method of SpecialOrder object and the last call
is Dispatch () which is a method of SpecialOrder object. The following diagram mainly describes the
method calls from one object to another, and this is also the actual scenario when the system is running

Fig.3.3.1 Sequence Diagram

16
3.3.2 COLLABORATION DIAGRAM:

The second interaction diagram is the collaboration diagram. It shows the object organization as seen
in the following diagram. In the collaboration diagram, the method call sequence is indicated by some
numbering technique. The number indicates how the methods are called one after another. We have
taken the same order management system to describe the collaboration diagram. Method calls are
similar to that of a sequence diagram. However, difference being the sequence diagram does not
describe the object organization, whereas the collaboration diagram shows the object organization. To
choose between these two diagrams, emphasis is placed on the type of requirement. If the time
sequence is important, then the sequence diagram is used. If organization is required, then collaboration
diagram is used

Fig.3.3.2 Collaboration Diagram

17
3.4 ACTIVITY DIAGRAM:

Activity diagram is defined as a UML diagram that focuses on the execution and flow of the behavior
of a system instead of implementation. It is also called object-oriented flowchart. Activity diagrams
consist of activities that are made up of actions which apply to behavioral modeling technology.
Activity diagrams are used to model processes and workflows. The essence of a useful activity diagram
is focused on communicating a specific aspect of a system's dynamic behavior. Activity diagrams
capture the dynamic elements of a system. An activity diagram is similar to a flowchart in that it
visualizes the flow from one activity to another; however, it is not a flowchart, since the flow of
activity can be controlled using various control elements of the UML notation. In simple words,
activity diagrams describe the flow of execution between multiple activities.
Activity Diagram Notations: Activity diagram symbols can be drawn using the following notations:
• Initial state: the starting stage before an activity takes place.
• Final state: the state which the system reaches when a specific process ends.
• State or activity box: a rounded rectangle representing an activity or state of the system.
• Decision box: a diamond-shaped box which represents a decision with alternate paths and thus the
flow of control.
FLOW OF OUR ACTIVITY DIAGRAM:
 Initially, the HR provides the employee dataset.
 Data preprocessing and data analysis are then done by the actor.
 The actor performs model training, validation and prediction.
 Based on the accuracy, employee attrition is predicted and prevention strategies are drawn up.
 HR takes the required measures based on the result.

18
Fig.3.4 Activity Diagram

19
3.5 USECASE DIAGRAM

Use case diagram is used to represent the dynamic behavior of a system. It encapsulates
the system's functionality by incorporating use cases, actors, and their relationships. It
models the tasks, services, and functions required by a system/subsystem of an
application. It depicts the high-level functionality of a system and also tells how the user
handles a system.

Fig.3.5 Use Case Diagram

20
4.SRS DOCUMENT

The SRS is a document created by a system analyst after the requirements are collected from
various stakeholders. The SRS defines how the intended software will interact with
hardware and external interfaces, its speed of operation, the response time of the system, the
portability of the software across various platforms, maintainability, speed of recovery after
crashing, security, quality, limitations, etc.
The requirements received from clients are written in natural language. It is the
responsibility of the system analyst to document the requirements in technical language
so that they can be comprehended by, and are useful to, the software development team. The
introduction of the software requirement specification states the goals and objectives of
the software, describing it in the context of the computer-based system. The SRS includes
an information description, a functional description, a behavioural description and validation
criteria.
The purpose of this document is to present the software requirements in a precise and
easily understood manner. This document provides the functional, performance, design
and verification requirements of the software to be developed.
After the requirement specifications are developed, the requirements mentioned in this
document are validated. Users might ask for illegal or impractical solutions, or experts may
interpret the requirements incorrectly. This results in a huge increase in cost if not nipped
in the bud.

21
4.1 FUNCTIONAL REQUIREMENTS

A functional requirement defines a function of a system or of one of its components. A function


can be described as a set of inputs, the behaviour, and outputs. It also depends upon the type
of software, the expected users and the type of system where the software is used.

Functional requirements of our project are:


 User should be able to upload the people preference dataset.
 User should be able to give the latitude and longitude of the preferred location.

4.2 NON FUNCTIONAL REQUIREMENTS

A NON-FUNCTIONAL REQUIREMENT (NFR) specifies a quality attribute of a software system.


Non-functional requirements judge the software system based on
responsiveness, usability, security and portability.
Non-functional requirements are called the qualities of a system; they are as follows:
 Performance - The average response time of the system is low.
 Reliability - The system is highly reliable.
 Operability - The interface of the system will be consistent.
 Efficiency - Once the user has learned about the system through interaction, he or she can perform
tasks easily.
 Understandability - Because of the user-friendly interfaces, the system is easy for users to understand.

22
4.3 MINIMUM HARDWARE REQUIREMENTS

RAM : 8GB.
Processor : Intel core i3 or Above.
Hard disk space : 1TB.

4.4 MINIMUM SOFTWARE REQUIREMENTS

Tool: Anaconda (Jupyter Notebook)/ Google Colab


Programming Language: Python

23
5.TESTING

Testing is the process of detecting errors. Testing plays a very critical role in quality
assurance and in ensuring the reliability of software. The results of testing are also used later on
during maintenance. Purpose of Testing: The aim of testing is often taken to be demonstrating that a
program works by showing that it has no errors. However, the basic purpose of the testing phase is to
detect the errors that may be present in the program. Hence one should not start testing with the intent
of showing that a program works; the intent should be to show that a program doesn’t work.
Testing Objectives: The main objective of testing is to uncover a host of errors, systematically
and with minimum effort and time.

5.1 Testing Strategies


In order to make sure that the system does not have any errors, the different levels of testing strategies
that are applied at different phases of software development are unit testing, integration testing,
system testing and acceptance testing.

Unit Testing
Unit testing focuses on the smallest unit of software design. In this level, an individual unit or a group
of interrelated units is tested. It is often done by the programmer, using sample inputs and observing
the corresponding outputs. A unit may be an individual function, method, procedure, module, or
object. It is a white-box testing technique that is usually performed by the developer.
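A minimal pytest-style sketch of such a unit test for this project; it assumes, hypothetically, that the gen_model_performance helper from the source code below has been placed in a module named hr_metrics:

import numpy as np
from hr_metrics import gen_model_performance  # hypothetical module name

def test_perfect_predictions_score_100_percent():
    # When predictions equal the true labels, every rate should be 100%
    y = np.array([0, 1, 1, 0, 1])
    sensitivity, specificity, accuracy, precision, roc = gen_model_performance(y, y)
    assert accuracy == 100.0
    assert sensitivity == 100.0 and specificity == 100.0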

Integration Testing
The testing of combined parts of an application to determine whether they function correctly together is
integration testing. In this approach, the program is constructed and tested in small increments, where
errors are easier to isolate and correct, interfaces are more likely to be tested completely, and a
systematic test approach may be applied. This testing can be done using two different methods:
1. Top-Down Integration Testing
2. Bottom-Up Integration Testing

24
System Testing
System Testing is a type of software testing that is performed on a complete, integrated system to
evaluate the compliance of the system with the corresponding requirements. System testing detects
defects within both the integrated units and the whole system.
The result of system testing is the observed behaviour of a component or a system when it is tested.
System Testing is black-box testing.

Acceptance Testing
Acceptance Testing is a method of software testing where a system is tested for acceptability. It is
formal testing, conducted according to user needs, requirements and business processes, to
determine whether a system satisfies the acceptance criteria and to enable the users,
customers or other authorized entities to decide whether or not to accept the system.

White Box Test


White-box testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose. It is used to
test areas that cannot be reached from a black-box level.

Black Box Test


Black-box testing is testing the software without any knowledge of the inner workings, structure
or language of the module being tested. Black-box tests, like most other kinds of tests, must be written
from a definitive source document, such as a specification or requirements document.

25
6.OUTPUT

6.1 SYSTEM IMPLEMENTATION

6.1.1 INTRODUCTION:
 HR teams put in constant effort to improve their hiring process and bring the best talent into
the organization.
 Even when hiring managers focus on the behavioural and cultural-fit aspects of a candidate
along with impressive experience and skill sets, HR teams are often unable to
evaluate the long-term success of a future candidate, leading to high voluntary attrition.
 The key to success in an organization is the ability to attract and retain top talent.
 It is vital for the Human Resource (HR) Department to identify the factors that keep
employees and those that prompt them to leave. Organizations could do more to prevent
the loss of good people.

6.1.2 PROJECT MODULES:


 Data Preprocessing: First we will look into the dataset for preprocessing and identify
missing values and single-valued variables. If there are any missing or useless variables, we will
eliminate them from the dataset. For our convenience we will aggregate some fields into a single field.
 Data Analysis: In the data analysis part, we will analyze the correlation between the independent
variables and the target variable, attrition. We will create visualizations using Tableau as follows:
o Attrition vs Business Travel
o Attrition vs Marital Status
o Attrition vs Overtime
o Attrition vs JobSatisfaction
o Attrition vs JobInvolvement
 Model Training: For model training, we will split the dataset into training and test sets using the
required split ratio (x/y): x% of the data will be used to train the model and the remaining y% to
test its accuracy. If there is any class imbalance, we will balance the data.
After partitioning and balancing, our data is finally ready to be the input of the machine
learning models. We will train 4 different models: Naïve Bayes, Random Forest, Logistic
Regression and Gradient Boosting.
26
 Model Validation: Finally, after testing our models with the test dataset, we will choose the
best model based on the following scores:
o Accuracy
o Precision
o Recall
o F1 Score
o AUC (Area Under the Curve)
 Model Predictions: Once we have chosen the best model, we apply it to the current
employees. It then predicts which employees are at risk of leaving the company, that is,
which employees are likely to remain and which are likely to leave.
 Visualization Results: Now we have our dataset with our current employees and their
probability of leaving the company. If we were the HR manager of the company, we would
require a dashboard in which we could see what to expect regarding future attrition and,
hence, adopt the correct strategy to retain the most talented employees.

27
6.2 SOURCE CODE:

# Importing all required libraries


%matplotlib inline
import os
import pandas as pd
from pandas import ExcelFile
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

#file_name = (r"C:\Users\AVV8E5744\HR-Employee-Attrition.csv")
emp_data = pd.read_csv("HR-Employee-Attrition.csv")
print('Dataset dimension: {} rows, {} columns'.format(emp_data.shape[0], emp_data.shape[1]))

# Metadata of IBM HR Employee Attrition dataset


emp_data.info()
attrition_freq = emp_data[['Attrition']].apply(lambda x: x.value_counts())
attrition_freq['frequency_percent'] = round((100 * attrition_freq / attrition_freq.sum()), 2)
print(attrition_freq)
plot = attrition_freq[['frequency_percent']].plot(kind="bar");
plot.set_title("Attrition Distribution", fontsize=20)
plot.grid(color='lightgray', alpha=0.5)
null_features = pd.DataFrame()
null_features['Null Count'] = emp_data.isnull().sum().sort_values(ascending=False)
null_features['Null Counts'] = null_features['Null Count'] / float(len(emp_data))
null_features = null_features[null_features['Null Counts'] > 0]
total_null_features = null_features.shape[0]
null_features_names = null_features.index
print('Total number of features having null values: ', total_null_features)
del null_features
28
emp_df = emp_data.copy() #copy cleaned dataset for Exploratory Data Analysis & feature changes

# Let's add 2 features for Exploratory Data Analysis: Employee left and not left
emp_df['Attrition_Yes'] = emp_df['Attrition'].map({'Yes':1, 'No':0}) # 1 means Employee Left
emp_df['Attrition_No'] = emp_df['Attrition'].map({'Yes':0, 'No':1}) # 1 means Employee didnt leave

# Let's look into the new dataset and identify the categorical features for which plots need to be built
emp_df.head()

## Get Categorical feature names


cat_col = emp_df.select_dtypes(include=['object']).columns.tolist()
print(cat_col)

def generate_frequency_graph(col_name):
    # Plot Employee Attrition against the given feature (col_name)
    temp_grp = emp_df.groupby(col_name).agg('sum')[['Attrition_Yes', 'Attrition_No']]
    temp_grp['Percentage Attrition'] = temp_grp['Attrition_Yes'] / (temp_grp['Attrition_Yes'] + temp_grp['Attrition_No']) * 100
    print(temp_grp)
    temp_grp[['Attrition_Yes', 'Attrition_No']].plot(kind='bar', stacked=False, color=['red', 'green'])
    plt.xlabel(col_name)
    plt.ylabel('Attrition')

# Plotting of Employee Attrition against Business Travel feature


generate_frequency_graph('BusinessTravel')

# Plotting of Employee Attrition against MaritalStatus feature


generate_frequency_graph('MaritalStatus')
29
generate_frequency_graph('JobRole')
emp_proc_df = emp_data.copy() # Copy cleaned dataset for feature engineering
emp_proc_df['TenurePerJob'] = 0
for i in range(0, len(emp_proc_df)):
    if emp_proc_df.loc[i, 'NumCompaniesWorked'] > 0:
        emp_proc_df.loc[i, 'TenurePerJob'] = emp_proc_df.loc[i, 'TotalWorkingYears'] / emp_proc_df.loc[i, 'NumCompaniesWorked']
emp_proc_df['YearWithoutChange1'] = emp_proc_df['YearsInCurrentRole'] - emp_proc_df['YearsSinceLastPromotion']
emp_proc_df['YearWithoutChange2'] = emp_proc_df['TotalWorkingYears'] - emp_proc_df['YearsSinceLastPromotion']
monthly_income_median = np.median(emp_proc_df['MonthlyIncome'])
emp_proc_df['CompRatioOverall'] = emp_proc_df['MonthlyIncome'] / monthly_income_median
print('Dataset dimension: {} rows, {} columns'.format(emp_proc_df.shape[0], emp_proc_df.shape[1]))

# Features to remove
feat_to_remove = ['EmployeeNumber', 'EmployeeCount', 'Over18', 'StandardHours']
emp_proc_df.drop(feat_to_remove, axis=1, inplace=True)
print('Dataset dimension: {} rows, {} columns'.format(emp_proc_df.shape[0], emp_proc_df.shape[1]))
full_col_names = emp_proc_df.columns.tolist()
num_col_names = emp_proc_df.select_dtypes(include=[np.int64, np.float64]).columns.tolist()  # Get numerical feature names

# Preparing list of ordered categorical features


num_cat_col_names = ['Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel',
'JobSatisfaction',
'PerformanceRating', 'RelationshipSatisfaction', 'WorkLifeBalance', 'StockOptionLevel']
target = ['Attrition']

30
num_col_names = list(set(num_col_names) - set(num_cat_col_names))  # Numerical features w/o ordered categorical features
cat_col_names = list(set(full_col_names) - set(num_col_names) - set(target))  # Categorical & ordered categorical features
print('Total number of numerical features: ', len(num_col_names))
print('Total number of categorical & ordered categorical features: ', len(cat_col_names))
cat_emp_df = emp_proc_df[cat_col_names]
num_emp_df = emp_proc_df[num_col_names]

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Settings
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['figure.figsize'] = (16, 4)
pd.options.display.max_columns = 500

# Log-transform skewed numerical features to reduce skewness before training our classification model
for col in num_col_names:
    if num_emp_df[col].skew() > 0.80:
        num_emp_df[col] = np.log1p(num_emp_df[col])
num_emp_df.head()

# Let's create dummy variables for each categorical attribute for training our classification model
for col in cat_col_names:
    col_dummies = pd.get_dummies(cat_emp_df[col], prefix=col)
    cat_emp_df = pd.concat([cat_emp_df, col_dummies], axis=1)

31
# Use the pandas apply method to numerically encode our attrition target variable
attrition_target = emp_proc_df['Attrition'].map({'Yes':1, 'No':0})

# Drop categorical feature for which dummy variables have been created
cat_emp_df.drop(cat_col_names, axis=1, inplace=True)
cat_emp_df.head()
num_corr_df = num_emp_df[['MonthlyIncome', 'CompRatioOverall', 'YearWithoutChange1',
'DistanceFromHome']]
corr_df = pd.concat([num_corr_df, attrition_target], axis=1)
corr = corr_df.corr()
plt.figure(figsize = (10, 8))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.axes_style("white")
# sns.heatmap(data=corr, annot=True, mask=mask, square=True, linewidths=.5, vmin=-1, vmax=1, cmap="YlGnBu")
sns.heatmap(data=corr, annot=True, square=True, linewidths=.5, vmin=-1, vmax=1, cmap="YlGnBu")
plt.show()

# Concat the two dataframes together columnwise


final_emp_df = pd.concat([num_emp_df, cat_emp_df], axis=1)
print('Dataset dimension after treating categorical features with dummy variables: {} rows, {} columns'.format(final_emp_df.shape[0], final_emp_df.shape[1]))
final_emp_df.head()

# Import the train_test_split method


from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

32
# Split data into train and test sets as well as for validation and testing
X_train, X_val, y_train, y_val = train_test_split(final_emp_df, attrition_target,
test_size= 0.30, random_state=42)
print("Stratified Sampling: ", len(X_train), "train set +", len(X_val), "validation set")
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, f1_score

def gen_model_performance(actual_target, pred_target):
    model_conf_matrix = confusion_matrix(actual_target, pred_target)
    model_roc_score = roc_auc_score(actual_target, pred_target)
    model_accuracy = accuracy_score(actual_target, pred_target) * 100.0
    TP = model_conf_matrix[1][1]; TN = model_conf_matrix[0][0]
    FP = model_conf_matrix[0][1]; FN = model_conf_matrix[1][0]
    sensitivity = TP / float(TP + FN) * 100.0; specificity = TN / float(TN + FP) * 100.0
    precision = TP / float(TP + FP) * 100.0
    return sensitivity, specificity, model_accuracy, precision, model_roc_score

def evaluate_model_score(X, y, scoring='accuracy'):
    logreg_model = LogisticRegression(random_state=0)
    rfc_model = RandomForestClassifier()
    gboost_model = GradientBoostingClassifier()
    gnb_model = GaussianNB()
    models = [logreg_model, rfc_model, gboost_model, gnb_model]
    model_results = pd.DataFrame(columns=["Model", "Accuracy", "Precision", "CV Score",
                                          "Sensitivity", "Specificity", "ROC Score"])

33
    for model in models:
        model.fit(X, y)
        y_pred = model.predict(X)
        sensitivity, specificity, accuracy, precision, roc_score = gen_model_performance(y, y_pred)
        scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
        model_results = model_results.append({"Model": model.__class__.__name__,
                                              "Accuracy": accuracy, "Precision": precision,
                                              "CV Score": scores.mean() * 100.0,
                                              "Sensitivity": sensitivity, "Specificity": specificity,
                                              "ROC Score": roc_score}, ignore_index=True)
    return model_results
model_results = evaluate_model_score(X_train, y_train)
model_results
rfc_model = RandomForestClassifier()
refclasscol = X_train.columns

# fit random forest classifier on the training set


rfc_model.fit(X_train, y_train);

# extract important features


score = np.round(rfc_model.feature_importances_, 3)
importances = pd.DataFrame({'feature':refclasscol, 'importance':score})
importances = importances.sort_values('importance', ascending=False).set_index('feature')

# random forest classifier parameters used for feature importances


print(rfc_model)
high_imp_df = importances[importances.importance>=0.015]
high_imp_df.plot.bar();
del high_imp_df

34
mid_imp_df = importances[importances.importance<=0.015]
mid_imp_df = mid_imp_df[mid_imp_df.importance>=0.0050]
mid_imp_df.plot.bar();
del mid_imp_df
selection = SelectFromModel(rfc_model, threshold = 0.002, prefit=True)
X_train_select = selection.transform(X_train)
X_val_select = selection.transform(X_val)
print('Train dataset dimension before Feature Selection: {} rows, {} columns'.format(X_train.shape[0], X_train.shape[1]))
print('Train dataset dimension after Feature Selection: {} rows, {} columns'.format(X_train_select.shape[0], X_train_select.shape[1]))
model_results = evaluate_model_score(X_train_select, y_train)
model_results
final_rfc_model = RandomForestClassifier()
final_rf_scores = cross_val_score(final_rfc_model, X_train_select, y_train, cv=5)
final_rfc_model.fit(X_train_select, y_train)
y_trn_pred = final_rfc_model.predict(X_train_select)
sensitivity, specificity, accuracy, precision, roc_score = gen_model_performance(y_train,
y_trn_pred)
print("Train Accuracy: %.2f%%, Precision: %.2f%%, CV Mean Score=%.2f%%,
Sensitivity=%.2f%%, Specificity=%.2f%%" %
(accuracy, precision, final_rf_scores.mean()*100.0, sensitivity, specificity))
print('*****************************\n')
y_val_pred = final_rfc_model.predict(X_val_select)
sensitivity, specificity, accuracy, precision, roc_score = gen_model_performance(y_val, y_val_pred)
print("Validation Accuracy: %.2f%%, Precision: %.2f%%, Sensitivity=%.2f%%,
Specificity=%.2f%%" % (accuracy, precision, sensitivity, specificity))
print('*****************************\n')

35
6.3 OUTPUT SCREENS:

Dataset Info:

Fig.6.3.1 Dataset Info

This contains the metadata of the IBM HR Employee Attrition dataset, which was downloaded from
Kaggle.
The dataset contains
 35 attributes
 1470 entries
The target attribute of the dataset is “Attrition” which determines whether the employee will leave
the company or stay in the company in the Yes or No format.

36
Data frame:

Fig.6.3.2 Data frame

The CSV file “HR-Employee-Attrition.csv” is loaded into a dataframe for ease of access to its
attributes and other operations.

37
Attrition Target Variable Distribution:

Fig.6.3.3 Attrition Target Variable Distribution

The above snapshot depicts the percentage of samples that were classified as “Yes” and the
percentage of samples that were classified as “No”. Here, the value_counts() function is applied to
count the number of “Yes” and “No” values in the given data. The number of samples classified as
“No” is 1233, which makes up 83.88% of the total samples, and “Yes” makes up 16.12% with 237 samples.

38
Attrition Distribution Bar Plot:

Fig.6.3.4 Attrition Distribution Bar Plot

The above frequency percentage of “Yes” and “No” is represented graphically in the form of a bar
plot. Matplotlib library is used and the function which is used to plot the graph is “plot” function. The
title of the graph “Attrition Distribution” is set using the function “set_title”.

39
Adding categorical variable:

Fig.6.3.5 Adding categorical variable

Two features have been added to the dataframe: “Attrition_Yes” and “Attrition_No”. If Attrition is “Yes”
for a particular row, then Attrition_Yes will be 1 and Attrition_No will be 0; similarly, Attrition_No will
be 1 and Attrition_Yes will be 0 when the Attrition is “No”.

40
BusinessTravel Vs Attrition:

Fig.6.3.6 BusinessTravel Vs Attrition

The visualization is performed to compare attrition with other attributes. This is the snapshot for
one such attribute, “BusinessTravel”. It contains three classes: “Non-Travel”, “Travel_Frequently”
and “Travel_Rarely”. “Travel_Rarely” is the class that has the highest number of “No” values for the
attribute “Attrition”.

41
MaritalStatus Vs Attrition:

Fig.6.3.7 MaritalStatus Vs Attrition

The above snapshot is the plot of Employee Attrition against Marital Status. Marital Status has
the classes “Divorced”, “Married” and “Single”. As the snapshot shows, the employees with marital
status “Married” mostly have attrition “No”, while the employees with “Single” marital status have
a higher percentage of attrition “Yes”.

42
JobRole Vs Attrition:

Fig.6.3.8 JobRole Vs Attrition

The above snapshot represents the plot of JobRole against Attrition. There are 9 different
classes in JobRole. The class that has the highest percentage of “Attrition_No” is
“Sales Executive”, and the class that has the highest percentage of “Attrition_Yes” is “Laboratory
Technician”.

43
JobSatisfaction Vs Attrition:

Fig.6.3.9 JobSatisfaction Vs Attrition

The above snapshot represents the plot of “JobSatisfaction” against “Attrition”. JobSatisfaction
has 4 levels of satisfaction. The employees with JobSatisfaction level “4” have the highest percentage
of “Attrition_No”, and those with level “3” have the highest percentage of “Attrition_Yes”.

44
WorkLifeBalance Vs Attrition:

Fig.6.3.10 WorkLifeBalance Vs Attrition

The above snapshot represents the plotting of “WorkLifeBalance” and “Attrition”. The
WorkLifeBalance has 4 levels of ranking. The WorklifeBalance of level “3” has the highest
percentile of “Attrition_No” and the level “3” also has the highest percentile of “Attrition_Yes”.

45
EnvironmentSatisfaction Vs Attrition:

Fig.6.3.11 EnvironmentSatisfaction Vs Attrition

The above snapshot represents the plotting “EnvironmentSatisfaction” and “Attrition” .


EnvironmentSatisfaction has 4 classes i.e “1”, “2”, “3”, “4”. Here, the level 3 has the highest
“Attrition_No” and the level “1” has the highest “Attrition_Yes” for the EnvironmentSatisfaction
Attribute.

46
Addition of new features:

Fig.6.3.12 Addition of new features

Tenure per job : Usually, people who have worked with many companies but for small periods at
every organization tend to leave early as they always need a change of Organization to keep them
going.
Years without Change : For any person, a change either in role or job level or responsibility is needed
to keep the work exciting to continue. We create a variable to see how many years it has been for an
employee without any sort of change using Promotion, Role and Job Change as a metric to cover
different variants of change.
Compensation Ratio : The compensation ratio is the ratio of the actual pay of an employee to the
midpoint of a salary range. The salary range can be that of his/her department, organization or role.
The benchmark numbers can be an organization’s pay or the industry average.

47
Removing features:

Fig.6.3.13 Removing features

Here certain columns like “EmployeeNumber”, “EmployeeCount”, “Over18” and “StandardHours”,


which have no effect on the target attribute, have been removed from the dataframe. The features are
removed using the “drop” function.

48
Creating dummy variables:

Fig.6.3.14 Creating dummy variables

The above snapshot represents creating dummy variables for each categorical variable having more
than 2 classes. Here a new 0/1 column is created for each category of that particular attribute. Machine
learning models work only on numerical data, hence categorical features need to be transformed into
numerical features. One of the best strategies is to convert each category value into a new column and
assign a 1 or 0 (True/False) value to that column.
This has the benefit of not weighting a value improperly, but does have the downside of adding more
columns to the dataset. This approach is also called “One Hot Encoding”. We can use the Pandas
function get_dummies to achieve this transformation, as in the small sketch below.
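A tiny sketch of get_dummies on one of the dataset's categorical columns (the values shown are illustrative):

import pandas as pd

df = pd.DataFrame({"BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"]})
print(pd.get_dummies(df["BusinessTravel"], prefix="BusinessTravel"))
# Each category becomes its own 0/1 column, mirroring the loop in the source code.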

49
Correlation matrix:

Fig.6.3.15 Correlation matrix

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the above table
shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a
more advanced analysis, and as a diagnostic for advanced analyses.
The values range between -1.0 and 1.0. A calculated number greater than 1.0 or less than -1.0 means that there
was an error in the correlation measurement. A correlation of -1.0 shows a perfect negative correlation, while a
correlation of 1.0 shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship
between the movement of the two variables.
Here the relationship between attrition and the other important features is identified.
The above snapshot shows that MonthlyIncome and YearWithoutChange1 are positively correlated, while
YearWithoutChange1 and Attrition are negatively correlated.
50
Function for generating the train and test data:

Fig.6.3.16 Function for generating the train and test data

The above snapshot represents the splitting of the data that is to be further used for training and testing.
For all of the models, both the training and validation phases are carried out. The
train_test_split() method is used here to split the data: 70% of the data is used for training and 30%
is used for testing. The model is trained on 70% of the data, which is 1029 entries, and is validated on
the testing data, which is unseen data for the model; the testing data is the remaining 30%, which is
441 entries of our data.

51
Model performance evaluation:

Fig.6.3.17 Model performance evaluation

The classification models performance will be evaluated based on certain metrics such as accuracy,
precision, sensitivity, specificity, CV score etc.

52
Performance Evaluation Of Algorithms:

Fig.6.3.18 Performance Evaluation Of Algorithms

After evaluating the models based on the calculated metrics, we have evaluated the training data for
the chosen algorithms: Random Forest Classifier, Logistic Regression, Gradient Boosting
Classifier and Gaussian NB.

53
Feature Importance:

Fig.6.3.19 Feature Importance

Feature Importance refers to techniques that calculate a score for all the input features for a given
model. The scores simply represent the “importance” of each feature. A higher score means that the
specific feature will have a larger effect on the model that is being used to predict a certain variable.
Like a correlation matrix, feature importance allows you to understand the relationship between the
features and the target variable. Feature importance can be used to reduce the dimensionality of the
model. The higher scores are usually kept and the lower scores are deleted as they are not important for
the model.

54
Validating the performance of Algorithms on test data :

Fig.6.3.20 Validating the performance of Algorithms on Testing data


Four algorithms are chosen to build the model: Logistic Regression, Random Forest Classifier,
Gaussian Naïve Bayes and Gradient Boosting. All four models are validated on the testing data and
their performance is calculated using the following metrics (a short computational sketch follows
these definitions):
Accuracy : It is defined as the ratio of correctly predicted samples to the total samples in the predicted results.
Accuracy = (TP+TN)/(TP+FP+TN+FN)
Precision : It is defined as the ratio of True Positives to the Total positives in the predicted results.
Precision = TP/(TP+FP)
Sensitivity : It is defined as the ratio of True positives to the actual positives in the predicted results.
Sensitivity = TP/(TP+FN)
Specificity : It is defined as the ratio of true Negatives to the actual negatives in the predicted results.
Specificity = TN/(TN+FP)
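A short computational sketch of these formulas using illustrative labels (not the project's actual validation results):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + fp + tn + fn)
precision   = tp / (tp + fp)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(accuracy, precision, sensitivity, specificity)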

55
Model with high Accuracy

Fig 6.3.21 Model with high Accuracy

The final model chosen is the logistic regression model. The accuracy of the model is 87.53 percent.

56
Predicting the attrition for an employee:

Fig.6.3.22 Predicting the attrition for an employee

The above snapshot represents the test case performed on the final model.

57
7.CONCLUSION

We applied some machine learning techniques in order to identify the factors that may contribute to an
employee leaving the company and, above all, to predict the likelihood of individual employees leaving
the company. First, we assessed the data statistically and then classified it. The dataset was
processed by dividing it into a training set and a test set, guaranteeing the same distribution of
the target variable (through the holdout technique).
We selected various classification algorithms and, for each of them, we carried out the training and
validation phases. To evaluate the algorithm’s performance, the predicted results were collected and
fed into the respective confusion matrices. From these it was possible to calculate the basic metrics
necessary for an overall evaluation (precision, recall, accuracy, f1 score, ROC curve, AUC, etc.) and
to identify the most suitable classifier to predict whether an employee was likely to leave the company.
The algorithm that produced the best results for the available dataset was the Gaussian Naïve Bayes
classifier: it revealed the best recall rate (0.54), a metric that measures the ability of a classifier to find
all the positive instances, and achieved an overall false negative rate equal to 4.5% of the total
observations. Results obtained by the proposed automatic predictor demonstrate that the main attrition
variables are monthly income, age, overtime, and distance from home.
The results obtained from the data analysis represent a starting point in the development of increasingly
efficient employee attrition classifiers. The use of larger datasets (or simply updating the dataset
periodically), the application of feature engineering to identify new significant characteristics in the
dataset, and the availability of additional information on employees would improve the overall
knowledge of the reasons why employees leave their companies and, consequently, increase the time
available to personnel departments to assess and plan the tasks required to mitigate this risk.

58
8.REFERENCES

1.Data Set :

https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

2.Web Page References :

 ROI-based review of HR analytics: practical implementation tools — H. C. Ben-Gal, Personnel Review, 2019.
 Raging debates in HR analytics — L. Bassi, People and Strategy, 2011.
 An evidence-based review of HR Analytics — J. H. Marler and J. W. Boudreau, The International Journal of
Human Resource Management, 2017, Taylor & Francis.
 https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology

59
