

School of Computer Science and Engineering


J Component report

Programme: MCA
Course Title: Data Mining and Business Intelligence
Course Code: ITA5007
Slot: A1+TA1

TITLE: HR RANKING SYSTEM FOR CANDIDATE ANALYSIS
Team Members: HARSH KUMAR GUPTA (21MCA1005)
AKANKSHA (21MCA1062)

Faculty: Dr Punitha K
Sign:
Date:

School of Computer Science and Engineering


J Component report

Programme: MCA
Course Title: Data Mining and Business Intelligence
Course Code: ITA5007
Slot: A1+TA1

TITLE: HR RANKING SYSTEM FOR CANDIDATE ANALYSIS
Team Members:
HARSH KUMAR GUPTA (21MCA1005)
AKANKSHA (21MCA1062)

DECLARATION

We, Harsh Kumar Gupta and Akanksha, hereby declare that the report entitled "HR
RANKING SYSTEM FOR CANDIDATE ANALYSIS", submitted by us for the
completion of the course Data Mining and Business Intelligence (ITA5007), is a record of
bonafide work carried out by us under the supervision of Dr. Punitha K, our course
instructor. We further declare that the work reported in this document has not been
submitted and will not be submitted, either in part or in full, for any other course in
this institute or in any other institute or university.

Place: Chennai
Date: Signature of the Candidate

School of Computer Science and Engineering

CERTIFICATE

This is to certify that the report entitled "HR RANKING SYSTEM FOR
CANDIDATE ANALYSIS" is prepared and submitted by HARSH KUMAR GUPTA
(21MCA1005) and AKANKSHA (21MCA1062) to Vellore Institute of Technology,
Chennai, in partial fulfillment of the requirements for the course Data Mining and
Business Intelligence (ITA5007), and is a bonafide record of work carried out under my
guidance. The project fulfills the requirements as per the regulations of this University
and in my opinion meets the necessary standards for submission. The contents of this
report have not been submitted and will not be submitted, either in part or in full, for
any other course, and the same is certified.

NAME: Prof. PUNITHA K


SIGNATURE:

ACKNOWLEDGEMENT

We wish to express our sincere thanks and deep sense of gratitude to our
project guide, Prof. PUNITHA K for her consistent encouragement and
valuable guidance offered to us in a pleasant manner throughout the course
of the project work.

We are extremely grateful to Dr. Ganeshan R, Dean of the School of
Computer Science and Engineering, VIT Chennai, for extending the
facilities of the school towards our project and for the unstinting support.

We also take this opportunity to thank all the faculty of the school for their
support and their wisdom imparted to us throughout the course. We thank
our parents, family, and friends for bearing with us throughout the course
of our project and for the opportunity they provided us in undergoing this
course in such a prestigious institution.

HARSH KUMAR GUPTA

AKANKSHA

ABSTRACT

A Human Resource candidate ranking system supports the people who select the
expert workforce of an organization or business sector. Human Resources has to
play a variety of roles in order to pick the skilled candidate for a specific
designation, and every HR person should possess a system that can rank the
expertise needed for a specific job position at a specific location. This system ranks
locations, the levels of the most demanded jobs, and the percentage hike offered
while switching jobs, based on the experience and the other key skills that are
required for a specific job profile. The project uses Python and machine learning
algorithms such as the Apriori algorithm, Decision Tree, Random Forest, SVM,
KNN, and Naïve Bayes, and a comparative study is carried out on a dataset
collected from Kaggle. In this project we identify the area with the highest number
of employees based on location and also study the job band on the basis of location
in the dataset.

CONTENTS

INDEX.................................................................................................................................... vii
LIST OF FIGURES ................................................................................................................ x
LIST OF TABLES ............................................................................................................... xvi
LIST OF ACRONYMS ...................................................................................................... xxx
CHAPTER 1
1.1 INTRODUCTION ………………………………………………………………….1
1.2 OVERVIEW OF HR RANKING SYSTEM………………………………………1
1.3 CHALLENGES IN HR SYSTEM…………………………………………………2
1.4 PROJECT STATEMENT………………………………………………………….5
1.5 OBJECTIVES………………………………………………………………….........9
1.6 SCOPE OF THE PROJECT……………………………………………………….9
CHAPTER 2
2.1 DATA SET……………………………………………………………………………10
2.2 MODEL DESCRIPTION……………………………………………………………10
2.3 METHODOLOGY…………………………………………………………………...12
2.4 CODE………………………………………………………………………………….13
2.5 OUTPUT………………………………………………………………………………23
2.6 RISK ANALYSIS OF EVERY ALGORITHM………………………………….....24
2.7 COMPARATIVE STUDY OF VARIOUS ALGORITHM………………………..26
2.8 CONCLUSION…………………………………………………………………….....41

REFERENCES…………………………………………………………….35

LIST OF FIGURES

1.1 No. of Employees location vs count of employee---------------------------------------------11


1.2 Count plot for the categorical variable----------------------------------------------------------11
1.3 Count Plot of Target Variable--------------------------------------------------------------------12
1.4 Applying Decision Tree---------------------------------------------------------------------------12
1.5 Correlation Plot-------------------------------------------------------------------------------------13
1.6 Heat Map of Random Forest----------------------------------------------------------------------27
1.9 Heat Map of SVM----------------------------------------------------------------------------------27
1.10 Heat Map of Gaussian Naïve Bayes------------------------------------------------------------28
1.11 Heat Map of Decision Tree---------------------------------------------------------------------28
1.12 Combined Prediction Model--------------------------------------------------------------------29

LIST OF TABLES

2.1) Data Set for HR Ranking System-----------------------------------------------------------------10


2.2) Table for Checking for NaN values of attributes------------------------------------------------16
2.4) Data Type of every given attribute in the Table-----------------------------------------------16
2.5) Display Columns-----------------------------------------------------------------------------------17
2.6) Renaming the column names-----------------------------------------------------------------------17
2.7) Rearranging the column names--------------------------------------------------------------------17
2.8) Statistical Info---------------------------------------------------------------------------------------18
2.9) Head of Data Set -----------------------------------------------------------------------------------18
2.10) Dependent variables------------------------------------------------------------------------------19
2.11) Apriori Algorithm Table-------------------------------------------------------------------------23

LIST OF ACRONYMS

 SLNO ----------------------------------- serial number


 Candidate.Ref ----------------------------- candidate reference
 DOJ.Extended -------------------------- Date of joining extended
 Duration.to.accept.offer
 Notice.period
 Offered.band
 Pecent.hike.expected.in.CTC
 Percent.hike.offered.in.CTC
 Percent.difference.CTC
 Joining.Bonus
 Candidate.relocate.actual
 Gender
 Candidate.Source
 Rex.in.Yrs
 LOB -------------------------------------Line of business
 Location
 Age
 Status
 np ----------------------------------------numpy
 pd ----------------------------------------- pandas
 sns -------------------------------------------- seaborn

CHAPTER 1
INTRODUCTION

A Human Resource candidate ranking system supports the people who select the expert
workforce of an organization or business sector. Human Resources has to play a variety of
roles in order to pick the skilled candidate for a specific designation.

Every HR person should possess a system that can rank the expertise needed for a specific job
position at a specific location. This system ranks locations, the levels of the most demanded
jobs, and the percentage hike offered while switching jobs, based on the experience and the
other key skills that are required for a specific job profile. The project uses Python and
machine learning algorithms such as the Apriori algorithm, Decision Tree, Random Forest, SVM,
KNN, and Naïve Bayes, and a comparative study is carried out on a dataset collected from
Kaggle. In this project we identify the area with the highest number of employees based on
location and also study the job band on the basis of location in the dataset.

1.2 OVERVIEW OF HR CANDIDATE RANKING SYSTEM:

In the present system the candidate has to fill in every piece of information for the resume
manually, which takes a large amount of time, and even then the candidates are often not
satisfied by the job which the present system prefers according to their skills. A ratio of 5:1
means that if 5 people get a job, only one of them will be satisfied by his or her job. For
example: if I am a good Python developer and a company hires me and makes me work on
Java, my Python skills are of little use. On the other hand, if there is a vacant place in a
company, the owner of the company will prefer the best possible candidate for that vacancy.
Our system will act as a handshake between these two entities: the company, which prefers
the best possible candidate, and the candidate, who prefers the best possible job according to
his or her skills and ability.

PROBLEMS AND SOLUTION:

The problem is that the present system is not very flexible, efficient, or time saving. It requires
the candidate to fill in forms online, and even then you might not get genuine information
about the candidate. Our system saves the candidate's time by allowing the resume to be
uploaded in any format the candidate prefers; besides the information in the resume, our
system gathers activity from the candidate's social profile, which helps identify the best
candidate for that particular job. The candidate will also be satisfied, because he or she will
get a job in a company that really appreciates the candidate's skills and ability. On the other
hand, we provide the same kind of flexibility to the client company.

MOTIVATION

The current recruitment process is tedious and time consuming: it forces candidates to fill in
all their skills and information manually, and the HR team needs more manpower to scrutinize
the candidates' resumes. This motivated us to build a solution that is more flexible and
automated.

1.3 CHALLENGES IN HR SYSTEM:

1) Challenges in finding Data Set:

With so much data available, it’s difficult to dig down and access the insights that are
needed most. When employees are overwhelmed, they may not fully analyse data or only
focus on the measures that are easiest to collect instead of those that truly add value. In
addition, if an employee has to manually sift through data, it can be impossible to gain
real-time insights on what is currently happening. Outdated data can have significant
negative impacts on decision-making.
A data system that collects, organizes and automatically alerts users of trends will help
solve this issue. Employees can input their goals and easily create a report that provides
the answers to their most important questions. With real-time reports and alerts,
decision-makers can be confident they are basing any choices on complete and accurate
information.

2) Visual Representation of data:

To be understood and impactful, data often needs to be visually presented in graphs or
charts. While these tools are incredibly useful, it's difficult to build them manually.
Taking the time to pull information from multiple areas and put it into a reporting tool is
frustrating and time-consuming.
Strong data systems enable report building at the click of a button. Employees and
decision-makers will have access to the real-time information they need in an appealing
and educational format.

3) Data from multiple sources:

The next issue is trying to analyze data across multiple, disjointed sources. Different
pieces of data are often housed in different systems. Employees may not always realize
this, leading to incomplete or inaccurate analysis. Manually combining data is
time-consuming and can limit insights to what is easily viewed.
With a comprehensive and centralized system, employees will have access to all types of
information in one location. Not only does this free up time spent accessing multiple
sources, it allows cross-comparisons and ensures data is complete.

4) Inaccessible data:

Moving data into one centralized system has little impact if it is not easily accessible to
the people that need it. Decision-makers and risk managers need access to all of an
organization’s data for insights on what is happening at any given moment, even if they
are working off-site. Accessing information should be the easiest part of data analytics.
An effective database will eliminate any accessibility issues. Authorized employees will
be able to securely view or edit data from anywhere, illustrating organizational changes
and enabling high-speed decision making.

5) Poor quality data:

Nothing is more harmful to data analytics than inaccurate data. Without good input,
output will be unreliable. A key cause of inaccurate data is manual errors made during
data entry. This can lead to significant negative consequences if the analysis is used to
influence decisions. Another issue is asymmetrical data: when information in one system
does not reflect the changes made in another system, leaving it outdated.
A centralized system eliminates these issues. Data can be input automatically with
mandatory or drop-down fields, leaving little room for human error. System integrations
ensure that a change in one area is instantly reflected across the board.

6) Pressure from the top:

As risk management becomes more popular in organizations, CFOs and other executives
demand more results from risk managers. They expect higher returns and a large number
of reports on all kinds of data.
With a comprehensive analysis system, risk managers can go above and beyond
expectations and easily deliver any desired analysis. They’ll also have more time to act
on insights and further the value of the department to the organization.

7) Confusion in Attribute:

Users may feel confused or anxious about switching from traditional data analysis
methods, even if they understand the benefits of automation. Nobody likes change,
especially when they are comfortable and familiar with the way things are done.
To overcome this HR problem, it’s important to illustrate how changes to analytics will
actually streamline the role and make it more meaningful and fulfilling. With
comprehensive data analytics, employees can eliminate redundant tasks like data
collection and report building and spend time acting on insights instead.

8) Shortage of skill:

Some organizations struggle with analysis due to a lack of talent. This is especially true
in those without formal risk departments. Employees may not have the knowledge or
capability to run in-depth data analysis.
This challenge is mitigated in two ways: by addressing analytical competency in the
hiring process and having an analysis system that is easy to use. The first solution ensures
skills are on hand, while the second will simplify the analysis process for everyone.
Everyone can utilize this type of system, regardless of skill level.
9) Scaling data analysis:
Finally, analytics can be hard to scale as an organization and the amount of data it collects
grows. Collecting information and creating reports becomes increasingly complex. A
system that can grow with the organization is crucial to manage this issue.
While overcoming these challenges may take some time, the benefits of data analysis are
well worth the effort. Improve your organization today and consider investing in a data
analytics system.

1.4) PROJECT STATEMENT:

1) PROBLEM STATEMENT:

The problem is that the present system is not very flexible, efficient, or time saving. It requires
the candidate to fill in forms online, and even then you might not get genuine information
about the candidate. Our system saves the candidate's time by allowing the resume to be
uploaded in any format the candidate prefers; besides the information in the resume, our
system gathers activity from the candidate's social profile, which helps identify the best
candidate for that particular job. The candidate will also be satisfied, because he or she will
get a job in a company that really appreciates the candidate's skills and ability. On the other
hand, we provide the same kind of flexibility to the client company.

2) REQUIREMENT ANALYSIS:
 Language: Python
 Library:

 Pandas:
Pandas is a Python library for data analysis. Started by Wes McKinney in 2008 out of a need
for a powerful and flexible quantitative analysis tool, pandas has grown into one of the most
popular Python libraries. It has an extremely active community of contributors.
Pandas is built on top of two core Python libraries—matplotlib for data visualization
and NumPy for mathematical operations. Pandas acts as a wrapper over these libraries,
allowing you to access many of matplotlib's and NumPy's methods with less code. For instance,
pandas' .plot() combines multiple matplotlib methods into a single method, enabling you to
plot a chart in a few lines.
Before pandas, most analysts used Python for data munging and preparation, and then switched
to a more domain specific language like R for the rest of their workflow. Pandas introduced
two new types of objects for storing data that make analytical tasks easier and eliminate the
need to switch tools: Series, which have a list-like structure, and DataFrames, which have a
tabular structure.
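As a brief illustration (not taken from the project notebook, and with made-up values), the two pandas objects mentioned above could be created as follows:

# Hypothetical illustration of a Series (list-like) and a DataFrame (tabular).
import pandas as pd

ages = pd.Series([24, 29, 31], name="Age")            # list-like Series
df = pd.DataFrame({                                   # tabular DataFrame
    "Candidate.Ref": [101, 102, 103],                 # made-up reference numbers
    "Location": ["Chennai", "Noida", "Bangalore"],
    "Age": ages,
})
print(df.head())          # first rows of the table
print(df["Age"].mean())   # column-wise statistics with one call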

 Numpy :

NumPy is the fundamental package for scientific computing in Python. It is a Python library
that provides a multidimensional array object, various derived objects (such as masked arrays
and matrices), and an assortment of routines for fast operations on arrays, including
mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms,
basic linear algebra, basic statistical operations, random simulation and much more.
At the core of the NumPy package is the ndarray object. This encapsulates n-dimensional
arrays of homogeneous data types, with many operations being performed in compiled code
for performance. There are several important differences between NumPy arrays and the
standard Python sequences: NumPy arrays have a fixed size at creation, all of their elements
share one data type, and operations on them are executed far more efficiently.
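A minimal sketch of the ndarray object, using hypothetical experience and hike values purely for illustration:

# Small ndarray sketch on made-up experience / CTC-hike columns.
import numpy as np

rex_in_yrs = np.array([2.0, 4.5, 7.0, 3.0])      # homogeneous, fixed dtype
hike_pct   = np.array([20.0, 35.0, 15.0, 42.0])

print(rex_in_yrs.shape, rex_in_yrs.dtype)        # (4,) float64
print(hike_pct.mean(), hike_pct.std())           # vectorised statistics
print(hike_pct[rex_in_yrs > 3])                  # boolean-mask selection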

 Matplotlib.pyplot:

Matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB.


Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting
area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of
things like the current figure and plotting area, and the plotting functions are directed to the
current axes (please note that "axes" here and in most places in the documentation refers to
the axes part of a figure and not the strict mathematical term for more than one axis).
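A short, hypothetical pyplot sketch of this stateful style, where each call updates the current figure and axes:

# Each pyplot call below mutates the "current" figure/axes (illustrative values only).
import matplotlib.pyplot as plt

experience = [1, 2, 4, 6, 8]          # hypothetical x values
offers     = [3, 5, 9, 7, 4]          # hypothetical y values

plt.figure()                          # create the current figure
plt.plot(experience, offers, marker="o")
plt.xlabel("Relevant experience (years)")
plt.ylabel("Number of offers")
plt.title("Hypothetical example plot")
plt.show()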

 Seaborn :

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level


interface for drawing attractive and informative statistical graphics.
For a brief introduction to the ideas behind the library, you can read the introductory notes or
the paper. Visit the installation page to see how you can download the package and get started
with it. You can browse the example gallery to see some of the things that you can do with
seaborn, and then check out the tutorial or API reference to find out how.
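As a hedged sketch of how the count plots listed among the figures (1.2 and 1.3) could be produced, assuming the Hr.csv file from Section 2.1 and its "Status" column:

# Count plot of the target variable, assuming the Kaggle HR dataset is available locally.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("Hr.csv")             # dataset name taken from Section 2.1
sns.countplot(x="Status", data=df)     # count plot of the target variable
plt.title("Count plot of target variable")
plt.show()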

 Apriori, association rule :

Apriori is an algorithm for frequent item set mining and association rule
learning over relational databases. It proceeds by identifying the frequent individual items in
the database and extending them to larger and larger item sets as long as those item sets appear
sufficiently often in the database. The frequent item sets determined by Apriori can be used to
determine association rules which highlight general trends in the database: this has applications
in domains such as market basket analysis.

 Itertools:

Itertools is a module in python, it is used to iterate over data structures that can be stepped over
using a for-loop. Such data structures are also known as iterables.
This module incorporates functions that utilize computational resources efficiently. Using this
module also tends to enhance the readability and maintainability of the code. The itertools
module needs to be imported prior to using it in the code.
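A small illustration of the module; combinations() is the kind of helper an Apriori-style search can use to enumerate candidate item sets (the attribute names below come from the acronym list, the grouping itself is hypothetical):

# Enumerate all 2-item candidate sets from a small list of attributes.
from itertools import combinations

items = ["Joining.Bonus", "Notice.period", "Offered.band"]
for pair in combinations(items, 2):    # every unordered pair
    print(pair)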

 Warning:

Warnings are provided to warn the developer of situations that aren't necessarily exceptions.
Usually, a warning occurs when some programming element, such as a keyword, function or
class, is obsolete. A warning in a program is distinct from an error: a Python program
terminates immediately if an error occurs, whereas a warning is not critical; it shows a
message, but the program keeps running. The warn() function defined in the 'warnings'
module is used to show warning messages. The Warning class itself is a subclass of
Exception, which is a built-in class in Python.
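A minimal sketch of emitting a warning without stopping the program; the hike_ratio() helper is purely illustrative:

# Emit a warning and keep running (no exception is raised).
import warnings

def hike_ratio(expected, offered):
    if expected == 0:
        warnings.warn("Expected hike is zero; ratio may be meaningless", UserWarning)
        return 0.0
    return offered / expected

print(hike_ratio(0, 30))   # prints the warning, then 0.0 -- execution continues
# warnings.filterwarnings("ignore")  # often used in notebooks to silence warnings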

 Sklearn.model_selection:

Before discussing train_test_split, you should know about Sklearn (or Scikit-learn). It is
a Python library that offers various features for data processing that can be used for
classification, clustering, and model selection.
Model_selection is a method for setting a blueprint to analyze data and then using it to measure
new data. Selecting a proper model allows you to generate accurate results when making a
prediction. To do that, you need to train your model by using a specific dataset. Then, you test
the model against another dataset.
If you have one dataset, you'll need to split it by using the Sklearn train_test_split function first.
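A hedged sketch of such a split on the HR data; the feature/target construction and the split ratio below are assumptions, not the project's exact settings:

# Split the HR dataset into training and test parts (illustrative settings).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Hr.csv")                 # dataset name from Section 2.1
X = df.drop(columns=["Status"])            # feature columns
y = df["Status"]                           # target: Joined / Not Joined
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)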

 Sklearn.preprocessing:

The sklearn.preprocessing package provides several common utility functions and transformer
classes to change raw feature vectors into a representation that is more suitable for the
downstream estimators.
In general, learning algorithms benefit from standardization of the data set. If some outliers are
present in the set, robust scalers or transformers are more appropriate. The behaviors of the
different scalers, transformers, and normalizers on a dataset containing marginal outliers is
highlighted in Compare the effect of different scalers on data with outliers.
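A small sketch of two commonly used utilities from this package, applied to hypothetical Gender and Age values:

# LabelEncoder for a categorical column, StandardScaler for a numeric one.
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

gender = ["Male", "Female", "Female", "Male"]
encoded = LabelEncoder().fit_transform(gender)        # e.g. [1, 0, 0, 1]
print(encoded)

age = np.array([[22.0], [28.0], [35.0], [41.0]])      # 2-D: (samples, features)
scaled = StandardScaler().fit_transform(age)          # zero mean, unit variance
print(scaled.ravel())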

 Sklearn.neighbors:

sklearn.neighbors.NearestNeighbors is the module used to implement unsupervised nearest
neighbor learning. It uses specific nearest neighbor algorithms named BallTree, KDTree or
Brute Force; in other words, it acts as a uniform interface to these three algorithms.
The module sklearn.neighbors, which implements the k-nearest neighbors algorithm, provides
the functionality for unsupervised as well as supervised neighbors-based learning methods.
The unsupervised nearest neighbors implement different algorithms (BallTree, KDTree or
Brute Force) to find the nearest neighbor(s) for each sample. This unsupervised version is
basically only the neighbor-search step, and it is the foundation of many algorithms (KNN
and K-means being the famous ones) which require a neighbor search. In simple words, it is
an unsupervised learner for implementing neighbor searches.

 Sklearn.svm:

Support vector machines (SVMs) are a set of supervised learning methods used
for classification, regression and outliers detection.
The advantages of support vector machines are:
 Effective in high dimensional spaces.
 Still effective in cases where number of dimensions is greater than the number of
samples.
 Uses a subset of training points in the decision function (called support vectors), so it
is also memory efficient.
 Versatile: different Kernel functions can be specified for the decision function.
Common kernels are provided, but it is also possible to specify custom kernels.

 Sklearn.ensemble:

The sklearn.ensemble module includes two averaging algorithms based on randomized
decision trees: the RandomForest algorithm and the Extra-Trees method. Both algorithms
are perturb-and-combine techniques specifically designed for trees. This means a diverse set
of classifiers is created by introducing randomness in the classifier construction. The
prediction of the ensemble is given as the averaged prediction of the individual classifiers.

 Environment:
 Jupyter notebook:

The Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Its uses include
data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.

1.5 OBJECTIVE:
The major objective of our system is to take the current resume ranking system for candidate
ranking to another level and make it more flexible for both entities:

1) Candidates who are being hired.
2) The client company which is hiring the candidates.
3) Different lines of business.
4) The candidate's most preferred location.
5) Status of joining.
6) Years of experience.
7) Percentage hike when switching from one job to another.

CHAPTER 2
2.1) DATASET
The dataset is the Hr.csv file, downloaded from Kaggle.

2.2) MODEL DESCRIPTION


KDD stands for Knowledge Discovery in Databases. It denotes the broad process of
discovering knowledge in data and emphasizes the high-level application of particular data
mining techniques. It is an area of interest to researchers in several fields, such as artificial
intelligence, machine learning, pattern recognition, databases, statistics, knowledge acquisition
for expert systems, and data visualization.
The knowledge discovery process is iterative and interactive and includes nine steps. The
process is iterative at every stage, implying that going back to previous steps may be
required. The process also has an imaginative aspect, in the sense that one cannot present a
single formula or a complete scientific categorization of the correct decisions for each step
and application type. Therefore, it is necessary to understand the process and the multiple
requirements and possibilities at each stage.

 Developing an understanding − This is the basic preliminary step. It sets the scene
for deciding what should be done with the several decisions like transformation,
algorithms, representation, etc. The individuals in charge of a KDD venture need to
learn and characterize the goals of the end user and the environment in which the
knowledge discovery process will take place (including relevant prior knowledge).

 Creating a target data set − This involves choosing a data set, or targeting a subset of
variables or data samples, on which discovery is to be performed. This step is essential
because data mining learns and discovers from the accessible data; it is the evidence
base for building the models. If important attributes are missing, the whole study can
be unsuccessful; from this respect, the more attributes that are considered, the better.

 Data cleaning and pre-processing − Data cleaning means cleaning the data by filling
in missing values, smoothing noisy data, identifying and eliminating outliers, and
removing inconsistencies in the data.

 Exploratory analysis and model and hypothesis selection − This involves selecting
the data mining algorithm(s) and the method(s) to be used for searching for data
patterns. This step includes deciding which models and parameters may be
appropriate and matching a particular data mining method with the overall criteria
of the KDD process.

 Data mining − This step searches for patterns of interest in a particular
representational form, or a set of such representations, including classification rules
or trees, regression, and clustering. The user can significantly help the data mining
method by correctly performing the preceding steps.

 Acting on the discovered knowledge − This means using the knowledge directly,
incorporating the knowledge into another system for further action, or simply
documenting it and reporting it to interested parties. This step also includes checking
for and resolving potential conflicts with previously accepted (or extracted)
knowledge.

KDD, Knowledge Discovery in Databases, is defined as a method of finding, transforming,
and refining meaningful data and patterns from a raw database so that they can be used in
different domains or applications. The above statement is an overview or gist of KDD, but it
is a lengthy and complex process which involves many steps and iterations. Before we delve
into the nitty-gritty of KDD, let us try to set the tone through an example.
Suppose there is a small river flowing nearby and you happen to be a craft enthusiast, a stone
collector or a random explorer. You have prior knowledge that a river bed is full of stones,
shells and other random objects. This premise is of the utmost importance, without which
one cannot reach the source. Next, depending on who you happen to be, the needs and
requirements may vary. This is the second most important thing to understand. So you go
ahead and collect stones, shells, coins or any artefacts that might be lying on the river bed.
But that brings along dirt and other unwanted objects as well, which you will need to get rid
of in order to have the objects ready for further use.

At this stage, you might need to go back and collect more items as per your needs, and this
process will repeat a few times or be completely skipped as per the conditions.

2.3) METHODOLOGY:

Knowledge discovery in databases (KDD) is an iterative multi-stage process for extracting
useful, non-trivial information from large databases. Each stage of the process presents
numerous choices to the user that can significantly change the outcome of the project. This
methodology, presented in the form of a roadmap, emphasizes the importance of the early
stages of the KDD process and shows how careful planning can lead to a successful and well-
managed project. The content is the result of expertise acquired through research and a wide
range of practical experiences; the work is of value to KDD experts and novices alike. Each
stage, from specification to exploitation, is described in detail with suggested approaches,
resources and questions that should be considered. The final section describes how the
methodology has been successfully used in the design of a commercial KDD toolkit.

The KDD process can be divided into eight major stages which are described in detail in the
following sections.
• Problem specification.
• Resourcing.
• Data cleansing.
• Pre-processing.
• Data mining.
• Evaluation of results.
• Interpretation of results.
• Exploitation of results.

We present the KDD process in the form of a roadmap. The process has one major route which
is to get from the starting data to discovered knowledge about the data. Like most long and
difficult journeys, there will be a number of stops along the way. A stop represents one of the
eight KDD stages named above, each of which is made up of a number of smaller tasks. Each
stage, and task, is optional although in most circumstances at least one task from each stage
will be necessary. When applying the roadmap, it is unlikely the project will run directly from
start to end; stages will usually have to be repeated using a different set of decisions and
parameters. For this reason, the process is deemed iterative, which is denoted by the inner
feedback route on the figure. The iterations performed during the KDD process are vital to the
success of the KDD project. In addition to reviewing and repeating a major stage, some or all
of the smaller tasks within the stage can be repeated.

2.4) CODE:

1) Data Analysis and Visualization:


Training on the dataset consists of the following steps (a minimal code sketch is given after the list):
 Installing required files

 Checking shape

 Checking for NaN values

 Checking the dtype of each attribute

 Display columns
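The original notebook output for these steps appeared here as screenshots; a minimal sketch of the same inspection calls, assuming the Hr.csv file from Section 2.1, might look like this:

# Load the Kaggle HR dataset and run the basic inspection steps listed above.
import pandas as pd

df = pd.read_csv("Hr.csv")     # dataset name from Section 2.1
print(df.shape)                # checking shape: (rows, columns)
print(df.isna().sum())         # checking for NaN values per attribute
print(df.dtypes)               # checking the dtype of each attribute
print(df.columns.tolist())     # display columns
print(df.describe())           # statistical info (Table 2.8)
print(df.head())               # head of the dataset (Table 2.9)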

From all the above observations we can see that the target variable, "Status", is imbalanced
in nature: the number of candidates who joined is much larger than the number who did not
join. Even though the model accuracy may look good enough, we know the predictions made
on such data can be misleading. So, in order to make the dataset balanced, we use the
Synthetic Minority Oversampling Technique (SMOTE).
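A hedged sketch of applying SMOTE, assuming the imbalanced-learn package; the synthetic data below merely stands in for the encoded HR features and the "Status" target:

# Oversample the minority class with SMOTE (toy data used as a stand-in).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=42)
print("before:", Counter(y))                        # imbalanced classes
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_bal))                    # minority class oversampled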

Apply Apriori Algorithm
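The original Apriori output appeared as a screenshot (Table 2.11). One possible way to run it, assuming the mlxtend implementation and an illustrative subset of one-hot encoded columns, is sketched below:

# Frequent itemsets and association rules on one-hot encoded HR attributes.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("Hr.csv")
onehot = pd.get_dummies(df[["Offered.band", "Joining.Bonus", "Status"]]).astype(bool)
frequent = apriori(onehot, min_support=0.1, use_colnames=True)   # thresholds are assumptions
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])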



Importing Library

Random Forest

SVM - Support Vector Machine

Gaussian Naive Bayes



Decision Tree Classifier



2.5) OUTPUT:
COMBINED ACCURACY OF ALL 5 ALGORITHMS
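The combined accuracy figure itself was a screenshot. A hedged sketch of how such a comparison could be produced, training the five classifiers on one common split (the toy data stands in for the preprocessed HR features), is given below; the accuracies it prints will not match the project's reported numbers:

# Train the five classifiers on the same split and report test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=42)   # stand-in for the HR data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVC": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {acc:.2%}")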

2.6) RISK ANALYSIS OF ALGORITHM:

I) KNN ALGORITHM:

1. DOES NOT WORK WELL WITH LARGE DATASET: In large datasets, the cost
of calculating the distance between the new point and each existing point is huge, which
degrades the performance of the algorithm.

2. DOES NOT WORK WELL WITH HIGH DIMENSIONS: The KNN algorithm
doesn't work well with high dimensional data because with large number of dimensions,
it becomes difficult for the algorithm to calculate the distance in each dimension.

3. NEED FEATURE SCALING: We need to do feature scaling (standardization and


normalization) before applying KNN algorithm to any dataset. If we don't do so, KNN
may generate wrong predictions.

4. SENSITIVE TO NOISY DATA, MISSING VALUES AND OUTLIERS: KNN is


sensitive to noise in the dataset. We need to manually impute missing values and remove
outliers.

II) DECISION TREE:

The decision tree also has its disadvantages and limitations.


1. The information in the decision tree relies on precise input to provide the user with a
reliable outcome. A little change in the data can result in a massive change in the
outcome. Getting reliable data can be hard for the project manager; for example, how
would you set the probability of a repair being a success or a failure? The estimated cost
could be way off if several events in a row have been estimated 10% wrong.
2. Another fundamental flaw is that the decision tree is based on expectations of what will
happen for each decision taken. The project manager's skill at making predictions will,
however, always be limited. There can always be unforeseen events following a
decision, which could change the outcome of the situation.
3. While the decision tree is easy to use, it can also become very complex and time
consuming. This is seen when using it on large problems: there will be many branches
and decisions, which take a long time to create. With extra information added or removed,
the manager would probably have to re-draw the decision tree.
4. A large project can easily make the tool unwieldy, as it can be hard to present to
colleagues who have not been on the project from the start.
5. Even though the decision tree seems easy, it requires skill and expertise to master.
Without this it could easily go wrong and be a high expense for the company if the
outcome is not as expected. To ensure the expertise, the company would have to
maintain its project managers' skills, which could be expensive.
6. Having to make a decision based on valuable information is good; however, having too
much information can cut both ways. The project manager can hit "paralysis of
analysis", where processing all the information becomes a massive challenge and slows
down the decision-making capacity. Too much information could therefore be a burden
in both the cost and the time of analysis.

III) RANDOM FOREST:



1. COMPLEXITY: Random Forest creates a lot of trees (unlike only one tree in the case of
a decision tree) and combines their outputs. By default, it creates 100 trees in the Python
sklearn library. To do so, this algorithm requires much more computational power and
resources. On the other hand, a decision tree is simple and does not require so many
computational resources.
2. LONGER TRAINING PERIOD: Random Forest requires much more time to train than
decision trees, as it generates a lot of trees (instead of one tree in the case of a decision
tree) and makes decisions based on the majority of votes.

IV) APRIORI ALGORITHM:

One of the biggest limitations of the Apriori algorithm is that it is slow. This is mainly because of:
1. the large number of itemsets in the Apriori algorithm dataset;
2. low minimum support in the data set for the Apriori algorithm;
3. the time needed to hold a large number of candidate sets with many frequent itemsets.
Thus it is inefficient when used with large volumes of data.

V) GAUSSIAN NAÏVE BAYES:

The assumption that all features are independent is not usually the case in real life, which
makes the Naïve Bayes algorithm less accurate than more complicated algorithms; its speed
comes at a cost.

VI) Support Vector Machine(SVM):


1. It doesn't perform well when we have a large data set, because the required training time
is higher.
2. It also doesn't perform very well when the data set has more noise, i.e. the target classes
are overlapping.
3. SVM doesn't directly provide probability estimates; these are calculated using an
expensive five-fold cross-validation, which is included in the related SVC method of the
Python scikit-learn library.

2.7) COMPARATIVE STUDY OF ALGORITHMS:



1. APRIORI ALGORITHM:

HR ranking for candidate analysis in data mining includes extracting usage behaviour from
the Human Resource data (HR_data). The obtained data can be utilized in many ways, such
as checking for fraudulent elements, improving the application, etc. The presented work
focuses on HR_data for analysing candidate requirements and mainly consists of three steps:
pre-processing of the information, knowledge exploration, and finally pattern analysis; the
tasks involve association rule mining, statistical analysis, clustering, classification and
identification of sequential patterns. Apriori is the most commonly used algorithm in data
mining for extracting usage behaviour.

Three components comprise the Apriori algorithm:

1. Support
2. Confidence
3. Lift

SupportIt refers to the default popularity of any product. You find the support as a quotient
of the division of the number of transactions comprising that product by the total number of
transactions.

Confidence It refers to the possibility that the customers bought both biscuits and chocolates
together. So, you need to divide the number of transactions that comprise both biscuits and
chocolates by the total number of transactions to get the confidence.
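Lift, the third component listed above, is the confidence divided by the support of the consequent item; a value above 1 indicates the items occur together more often than chance. A small worked example with hypothetical transaction counts:

# Worked example of the three Apriori measures (made-up counts).
total_transactions = 200
biscuits           = 60     # transactions containing biscuits
chocolates         = 50     # transactions containing chocolates
both               = 30     # transactions containing both

support_both = both / total_transactions           # 0.15
confidence   = both / biscuits                     # P(chocolates | biscuits) = 0.50
support_choc = chocolates / total_transactions     # 0.25
lift         = confidence / support_choc           # 2.0 -> bought together more than by chance
print(support_both, confidence, lift)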

2. DECISION TREE:

Decision trees are outstanding tools to help anyone select the best course of action.
They generate a highly valuable arrangement in which one can place options and
study the possible outcomes of those options. They also help users form a fair
idea of the pros and cons related to each possible action. A decision tree is used to
represent graphically the decisions, the events, and the outcomes related to those
decisions and events. Events are probabilistic and determined for each outcome.
The aim of this project, HR Ranking System for Candidate Analysis, is to carry out
a detailed analysis of the decision tree and its variants for determining the best
appropriate decision.

o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.

Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.

In a decision tree, to predict the class of a record from the HR_data, the algorithm starts from
the root node of the tree. It compares the value of the root attribute with the corresponding
attribute of the record (from the real dataset) and, based on the comparison, follows the
branch and jumps to the next node.
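A hedged sketch of building and printing such a CART tree with scikit-learn; the synthetic data is only a stand-in for the encoded HR_data:

# Fit a small CART decision tree and print its splits from root to leaves.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)   # CART implementation
tree.fit(X, y)
print(export_text(tree))    # root node at the top, splits, then leaf nodes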

3. KNN ALGORITHM:

Data mining is the process of extracting information from a database that is not directly
visible. Data mining is predicted to become a highly revolutionary branch of science over
the next decade. One of the data mining techniques is classification, and the most popular
classification technique is K-Nearest Neighbor (KNN).

o K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the
supervised learning technique.
o The K-NN algorithm assumes the similarity between the new case/data and the available
cases and puts the new case into the category that is most similar to the available
categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears it can be easily classified into a
well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as classification, but mostly it is
used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and, at the time of classification, performs an
action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new
data it classifies that data into the category most similar to the new data.

Suppose there are two categories, Category A and Category B, and we have a new data point x1:
in which of these categories will this data point lie? To solve this type of problem, we need a
K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point.
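A tiny sketch of this Category A / Category B situation, with made-up two-dimensional points:

# Assign new points to the category of their k nearest stored neighbours.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],      # Category A points
     [6, 6], [6, 7], [7, 6]]      # Category B points
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                      # "lazy" learning: mostly just stores the data
print(knn.predict([[2, 2]]))       # ['A'] -- closest neighbours are Category A
print(knn.predict([[6, 5]]))       # ['B']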

ACCURACY OF IMPLEMENTATION:

4. RANDOM FOREST:

Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based
on the concept of ensemble learning, which is a process of combining multiple classifiers to
solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority votes of the predictions, predicts the
final output.

The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.

ACCURACY OF RANDOM FOREST:

5. GAUSSIAN NAÏVE BAYES:

Gaussian Naive Bayes is a variant of Naive Bayes that follows Gaussian normal distribution
and supports continuous data. We have explored the idea behind Gaussian Naive Bayes along
with an example.
Naive Bayes is a group of supervised machine learning classification algorithms based on
the Bayes theorem. It is a simple classification technique, but it has high functionality. These
classifiers find use when the dimensionality of the inputs is high. Complex classification
problems can also be handled by using a Naive Bayes classifier.

Bayes Theorem can be used to calculate conditional probability. Being a powerful tool in the
study of probability, it is also applied in Machine Learning.
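A worked example of the theorem, P(A|B) = P(B|A) * P(A) / P(B), using hypothetical numbers for the joining scenario:

# Bayes' theorem with made-up probabilities: A = "candidate joins",
# B = "notice period is short".
p_a         = 0.7    # prior probability of joining
p_b_given_a = 0.6    # P(short notice period | joined)
p_b         = 0.5    # overall probability of a short notice period

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.84 -- posterior probability of joining given a short notice period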

ACCURACY OF NAÏVE BAYES CLASSIFIER MACHINE ALGORITHM:

6. SUPPORT VECTOR CLUSTERING(SVC):

The objective of clustering is to partition a data set into groups according to some criterion in
an attempt to organize data into a more meaningful form. There are many ways of achieving
this goal. Clustering may proceed according to some parametric model or by grouping points
according to some distance or similarity measure as in hierarchical clustering. A natural way
to put cluster boundaries is in regions in data space where there is little data, i.e. in "valleys" in
the probability distribution of the data. This is the path taken in support vector
clustering (SVC), which is based on the support vector approach.

In SVC, data points are mapped from data space to a high dimensional feature space using a kernel
function. In the kernel's feature space the algorithm searches for the smallest sphere that encloses the
image of the data, using the Support Vector Domain Description algorithm. This sphere, when mapped
back to data space, forms a set of contours which enclose the data points. Those contours are then
interpreted as cluster boundaries, and points enclosed by each contour are associated by SVC to the
same cluster.

OUTPUT OF SVC ALGORITHM IS:

2.8 CONCLUSION
A Human Resource candidate ranking system supports the people who select the
expert workforce of an organization or business sector. Human Resources has to
play a variety of roles in order to pick the skilled candidate for a specific
designation, and every HR person should possess a system that can rank the
expertise needed for a specific job position at a specific location. There is a need
for a system which can study the different candidates and classify them based upon
different factors.

Upon the comparative study of the various algorithms, the accuracies found for
predicting and studying the HR ranking system for candidate analysis are:

The KNN algorithm scored an accuracy of 91%.
Random Forest scored an accuracy of 96%.
The SVC classifier scored an accuracy of 98.9%.
The Naïve Bayes classifier scored an accuracy of 91%.
The Decision Tree classifier scored an accuracy of 81%.

ALGORITHM USED              PERCENTAGE ACCURACY

KNN ALGORITHM               91%
RANDOM FOREST               96%
SVC CLASSIFIER              98.9%
NAÏVE BAYES CLASSIFIER      91%
DECISION TREE CLASSIFIER    81%

REFERENCES

[1] Kaggle: Your Home for Data Science

[2] Data Science - DataScienceCentral.com

[3] Towards Data Science

[4] K-Nearest Neighbor (KNN) Algorithm for Machine Learning - Javatpoint

[5] Naive Bayes Classifier in Machine Learning - Javatpoint


