Programme: MCA
Course Title: Datamining and Business Intelligence
Course Code: ITA5007
Slot: A1+TA1
Faculty: Dr Punitha K
Sign:
Date:
DECLARATION
We, Harsh Kumar Gupta and Akanksha, hereby declare that the report entitled "HR RANKING SYSTEM FOR CANDIDATE ANALYSIS", submitted by us for the completion of the course Data Mining and Business Intelligence (ITA5007), is a record of bonafide work carried out by us under the supervision of Dr Punitha K, our course instructor. We further declare that the work reported in this document has not been submitted and will not be submitted, either in part or in full, for any other course in this institute or any other institute or university.
Place: Chennai
Date: Signature of the Candidate
CERTIFICATE
This is to certify that the report entitled "HR RANKING SYSTEM FOR CANDIDATE ANALYSIS", prepared and submitted by HARSH KUMAR GUPTA (21MCA1005) and AKANKSHA (21MCA1062) to Vellore Institute of Technology, Chennai, in partial fulfillment of the requirements for the course Data Mining and Business Intelligence (ITA5007), is a bonafide record of work carried out under my guidance. The project fulfills the requirements as per the regulations of this University and, in my opinion, meets the necessary standards for submission. The contents of this report have not been submitted and will not be submitted, either in part or in full, for any other course, and the same is certified.
ACKNOWLEDGEMENT
We wish to express our sincere thanks and deep sense of gratitude to our
project guide, Prof. PUNITHA K for her consistent encouragement and
valuable guidance offered to us in a pleasant manner throughout the course
of the project work.
We also take this opportunity to thank all the faculty of the school for their
support and their wisdom imparted to us throughout the course. We thank
our parents, family, and friends for bearing with us throughout the course
of our project and for the opportunity they provided us in undergoing this
course in such a prestigious institution.
HARSH KUMAR GUPTA
AKANKSHA
ABSTRACT
A Human Resource Candidate Ranking System supports the people who select the expert workforce of an organization or business sector. Human Resources has to play a variety of roles in order to pick a skilled candidate for a specific designation. Every HR person should possess a system that can rank the expertise needed for a specific job position at a specific location. This system can rank locations, the level of the most demanded jobs, and the percentage hike available while switching jobs, based on the expertise and the key skills that are needed for a specific job profile. The project involved Python and machine learning algorithms such as the Apriori algorithm, Decision Tree, Random Forest, SVM, KNN, and Naïve Bayes, and a comparative study was carried out on a dataset collected from Kaggle. In this project we identified the area having the highest number of employees based on location, and also studied the job band on the basis of location in the dataset.
CONTENTS
INDEX
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
CHAPTER 1
1.1 INTRODUCTION
1.2 OVERVIEW OF HR RANKING SYSTEM
1.3 CHALLENGES IN HR SYSTEM
1.4 PROBLEM STATEMENT
1.5 OBJECTIVES
1.6 SCOPE OF THE PROJECT
CHAPTER 2
2.1 DATA SET
2.2 MODEL DESCRIPTION
2.3 METHODOLOGY
2.4 CODE
2.5 OUTPUT
2.6 RISK ANALYSIS OF EVERY ALGORITHM
2.7 COMPARATIVE STUDY OF VARIOUS ALGORITHMS
2.8 CONCLUSION
REFERENCES
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
CHAPTER 1
INTRODUCTION
A Human Resource Candidate Ranking System supports the people who select the expert workforce of an organization or business sector. Human Resources has to play a variety of roles in order to pick a skilled candidate for a specific designation.
Every HR person should possess a system that can rank the expertise needed for a specific job position at a specific location. This system can rank locations, the level of the most demanded jobs, and the percentage hike available while switching jobs, based on the expertise and the key skills that are needed for a specific job profile. The project involved Python and machine learning algorithms such as the Apriori algorithm, Decision Tree, Random Forest, SVM, KNN, and Naïve Bayes, and a comparative study was carried out on a dataset collected from Kaggle. In this project we identified the area having the highest number of employees based on location, and also studied the job band on the basis of location in the dataset.
In the present system the candidate has to fill in every piece of information from their resume in a manual form, which takes a large amount of time, and even then the candidates are often not satisfied with the job that the present system prefers for them according to their skills. Consider a ratio of 5:1: if 5 people get a job, only one of those 5 will be satisfied with his or her job. For example, if I am a good Python developer and a company hires me but makes me work on Java, my Python skills are of little use. On the other hand, if there is a vacancy in a company, the owner of the company will prefer the best possible candidate for that vacancy. So our system will act as a handshake between these two entities: the company, which prefers the best possible candidate, and the candidate, who prefers the best possible job according to his or her skills and ability.
The problem is that the present systems are not very flexible, efficient, or time-saving. They require the candidate to fill in forms online, and even then you might not get genuine information about the candidate. In contrast, our system saves the candidate's time by allowing them to upload their resume in any format they prefer. Besides the information in the resume, our system will detect the candidate's activity from their social profiles, which helps identify the best candidate for a particular job. The candidate will also be satisfied, because they will get a job in a company that really appreciates their skills and ability. On the other hand, we provide the same kind of flexibility to the client company.
MOTIVATION
The current recruitment process is tedious and time-consuming, as it forces candidates to fill in all their skills and information manually, and the HR team requires more manpower to scrutinize the candidates' resumes. This motivated us to build a solution that is more flexible and automated.
1.3 CHALLENGES IN HR SYSTEM:
1) Too much data:
With so much data available, it's difficult to dig down and access the insights that are needed most. When employees are overwhelmed, they may not fully analyse the data, or may only focus on the measures that are easiest to collect instead of those that truly add value. In addition, if an employee has to manually sift through data, it can be impossible to gain real-time insights into what is currently happening. Outdated data can have significant negative impacts on decision-making.
A data system that collects, organizes and automatically alerts users of trends will help
solve this issue. Employees can input their goals and easily create a report that provides
the answers to their most important questions. With real-time reports and alerts, decision-makers can be confident they are basing any choices on complete and accurate information.
The next issue is trying to analyze data across multiple, disjointed sources. Different pieces of data are often housed in different systems. Employees may not always realize this, leading to incomplete or inaccurate analysis. Manually combining data is time-consuming and can limit insights to what is easily viewed.
With a comprehensive and centralized system, employees will have access to all types of information in one location. Not only does this free up time spent accessing multiple sources, it allows cross-comparisons and ensures data is complete.
4) Inaccessible data:
Moving data into one centralized system has little impact if it is not easily accessible to
the people that need it. Decision-makers and risk managers need access to all of an
organization’s data for insights on what is happening at any given moment, even if they
are working off-site. Accessing information should be the easiest part of data analytics.
An effective database will eliminate any accessibility issues. Authorized employees will
be able to securely view or edit data from anywhere, illustrating organizational changes
and enabling high-speed decision making.
5) Inaccurate data:
Nothing is more harmful to data analytics than inaccurate data. Without good input, output will be unreliable. A key cause of inaccurate data is manual errors made during data entry. This can lead to significant negative consequences if the analysis is used to influence decisions. Another issue is asymmetrical data: when information in one system does not reflect the changes made in another system, leaving it outdated.
A centralized system eliminates these issues. Data can be input automatically with mandatory or drop-down fields, leaving little room for human error. System integrations ensure that a change in one area is instantly reflected across the board.
6) Demand for more results:
As risk management becomes more popular in organizations, CFOs and other executives demand more results from risk managers. They expect higher returns and a large number of reports on all kinds of data.
With a comprehensive analysis system, risk managers can go above and beyond
expectations and easily deliver any desired analysis. They’ll also have more time to act
on insights and further the value of the department to the organization.
7) Resistance to change:
Users may feel confused or anxious about switching from traditional data analysis
methods, even if they understand the benefits of automation. Nobody likes change,
especially when they are comfortable and familiar with the way things are done.
To overcome this HR problem, it’s important to illustrate how changes to analytics will
actually streamline the role and make it more meaningful and fulfilling. With
comprehensive data analytics, employees can eliminate redundant tasks like data
collection and report building and spend time acting on insights instead.
8) Shortage of skill:
Some organizations struggle with analysis due to a lack of talent. This is especially true
in those without formal risk departments. Employees may not have the knowledge or
capability to run in-depth data analysis.
This challenge is mitigated in two ways: by addressing analytical competency in the
hiring process and having an analysis system that is easy to use. The first solution ensures
skills are on hand, while the second will simplify the analysis process for everyone.
Everyone can utilize this type of system, regardless of skill level.
9) Scaling data analysis:
Finally, analytics can be hard to scale as an organization and the amount of data it collects
grows. Collecting information and creating reports becomes increasingly complex. A
system that can grow with the organization is crucial to manage this issue.
While overcoming these challenges may take some time, the benefits of data analysis are
well worth the effort. Improve your organization today and consider investing in a data
analytics system.
1.4 PROBLEM STATEMENT:
The problem is that the present systems are not very flexible, efficient, or time-saving. They require the candidate to fill in forms online, and even then you might not get genuine information about the candidate. In contrast, our system saves the candidate's time by allowing them to upload their resume in any format they prefer. Besides the information in the resume, our system will detect the candidate's activity from their social profiles, which helps identify the best candidate for a particular job. The candidate will also be satisfied, because they will get a job in a company that really appreciates their skills and ability. On the other hand, we provide the same kind of flexibility to the client company.
REQUIREMENT ANALYSIS:
Language: Python
Library:
Pandas:
Pandas is a Python library for data analysis. Started by Wes McKinney in 2008 out of a need
for a powerful and flexible quantitative analysis tool, pandas has grown into one of the most
popular Python libraries. It has an extremely active community of contributors.
Pandas is built on top of two core Python libraries—matplotlib for data visualization
and NumPy for mathematical operations. Pandas acts as a wrapper over these libraries,
allowing you to access many of matplotlib's and NumPy's methods with less code. For instance,
pandas' .plot() combines multiple matplotlib methods into a single method, enabling you to
plot a chart in a few lines.
Before pandas, most analysts used Python for data munging and preparation, and then switched
to a more domain specific language like R for the rest of their workflow. Pandas introduced
two new types of objects for storing data that make analytical tasks easier and eliminate the
need to switch tools: Series, which have a list-like structure, and DataFrames, which have a
tabular structure.
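As a minimal sketch of these two objects and the .plot() wrapper (the column names here are illustrative, not taken from the project's dataset):

# Sketch: the two pandas data structures described above.
import pandas as pd

# Series: a one-dimensional, list-like labelled array
ages = pd.Series([25, 32, 41], name="Age")

# DataFrame: a two-dimensional, tabular structure
df = pd.DataFrame({"Candidate": ["A", "B", "C"], "Age": [25, 32, 41]})

# .plot() wraps several matplotlib calls into a single method
df["Age"].plot(kind="bar", title="Candidate ages")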
NumPy:
NumPy is the fundamental package for scientific computing in Python. It is a Python library
that provides a multidimensional array object, various derived objects (such as masked arrays
and matrices), and an assortment of routines for fast operations on arrays, including
mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms,
basic linear algebra, basic statistical operations, random simulation and much more.
At the core of the NumPy package is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. There are several important differences between NumPy arrays and the standard Python sequences: NumPy arrays have a fixed size at creation, unlike Python lists; all elements are required to be of the same data type; and operations on large amounts of data can be executed far more efficiently than with built-in sequences.
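A short sketch of the ndarray behaviour described above:

# Sketch: a homogeneous, fixed-size ndarray with vectorized operations.
import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.int64)  # fixed size, single dtype
print(a.shape, a.dtype)                      # (4,) int64
print(a * 2)                                 # vectorized: [2 4 6 8]
print(a.mean(), a.std())                     # basic statistics run in compiled code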
Matplotlib.pyplot:
In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of
things like the current figure and plotting area, and the plotting functions are directed to the
current axes (please note that "axes" here and in most places in the documentation refers to
the axes part of a figure and not the strict mathematical term for more than one axis).
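A minimal sketch of this stateful interface, where each call operates on the current figure and axes:

# Sketch: pyplot tracks the "current" figure and axes across calls.
import matplotlib.pyplot as plt

plt.figure()                      # becomes the current figure
plt.plot([1, 2, 3], [2, 4, 1])    # plots onto the current axes
plt.title("Current axes example")
plt.xlabel("x")
plt.ylabel("y")
plt.show()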
Seaborn:
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Apriori:
Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
Itertools:
itertools is a module in Python that is used to iterate over data structures that can be stepped over using a for-loop. Such data structures are also known as iterables.
This module incorporates functions that utilize computational resources efficiently. Using this
module also tends to enhance the readability and maintainability of the code. The itertools
module needs to be imported prior to using it in the code.
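A small sketch of two itertools functions (the skill names are illustrative):

# Sketch: lazy iteration utilities from itertools.
import itertools

# chain: iterate over several iterables as if they were one
for x in itertools.chain([1, 2], ["a", "b"]):
    print(x)

# combinations: all 2-item pairs, useful when building candidate itemsets
print(list(itertools.combinations(["skill_python", "skill_sql", "skill_ml"], 2)))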
Warnings:
Warnings are provided to warn the developer of situations that aren't necessarily exceptions. Usually, a warning occurs when some programming element, such as a keyword, function, or class, is obsolete. A warning in a program is distinct from an error: a Python program terminates immediately if an error occurs, whereas a warning is not critical; it shows a message, but the program keeps running. The warn() function defined in the warnings module is used to show warning messages. The Warning class is actually a subclass of Exception, which is a built-in class in Python.
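A minimal sketch of this behaviour; old_ranker() is a hypothetical function used only for illustration:

# Sketch: a warning is reported, but execution continues.
import warnings

def old_ranker():
    # hypothetical deprecated function, for illustration only
    warnings.warn("old_ranker() is deprecated", DeprecationWarning)
    return 0

warnings.simplefilter("always")  # make sure the warning is displayed
old_ranker()                      # prints a warning, does not terminate
print("still running")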
Sklearn.model_selection:
Before discussing train_test_split, you should know about Sklearn (or Scikit-learn). It is
a Python library that offers various features for data processing that can be used for
classification, clustering, and model selection.
Model_selection is a method for setting a blueprint to analyze data and then using it to measure
new data. Selecting a proper model allows you to generate accurate results when making a
prediction. To do that, you need to train your model by using a specific dataset. Then, you test
the model against another dataset.
If you have one dataset, you'll need to split it by using the Sklearn train_test_split function first.
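A minimal sketch of such a split, using placeholder features and a placeholder binary target:

# Sketch: split one dataset into training and test parts.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(100, 4)          # placeholder features
y = np.random.randint(0, 2, 100)    # placeholder binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)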
Sklearn.preprocessing:
The sklearn.preprocessing package provides several common utility functions and transformer
classes to change raw feature vectors into a representation that is more suitable for the
downstream estimators.
In general, learning algorithms benefit from standardization of the data set. If some outliers are
present in the set, robust scalers or transformers are more appropriate. The behaviors of the
different scalers, transformers, and normalizers on a dataset containing marginal outliers is
highlighted in Compare the effect of different scalers on data with outliers.
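A small sketch of two common utilities from this package (the column values are illustrative):

# Sketch: standardize numeric features and encode categorical labels.
from sklearn.preprocessing import StandardScaler, LabelEncoder
import numpy as np

X = np.array([[25, 30000.0], [41, 90000.0], [32, 55000.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # zero mean, unit variance per column

le = LabelEncoder()
status = le.fit_transform(["Joined", "Not Joined", "Joined"])  # -> [0, 1, 0]
print(X_scaled)
print(status, le.classes_)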
Sklearn.neighbors:
The module sklearn.neighbors implements the k-nearest neighbors algorithm and provides the functionality for unsupervised as well as supervised neighbors-based learning methods. The unsupervised nearest neighbors implementation uses different algorithms (BallTree, KDTree, or Brute Force) to find the nearest neighbor(s) for each sample; in other words, it acts as a uniform interface to these three algorithms. This unsupervised version is basically only the neighbor-search step, and it is the foundation of many algorithms (KNN and K-means being the famous ones) which require a neighbor search. In simple words, it is an unsupervised learner for implementing neighbor searches.
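A sketch of the unsupervised neighbor search described above, selecting the BallTree algorithm explicitly:

# Sketch: find each sample's nearest neighbors without any labels.
from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
nn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)
distances, indices = nn.kneighbors(X)  # the neighbor-search step
print(indices)  # each row lists the sample itself and its nearest neighbor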
Sklearn.svm:
Support vector machines (SVMs) are a set of supervised learning methods used
for classification, regression and outliers detection.
The advantages of support vector machines are:
• Effective in high-dimensional spaces.
• Still effective in cases where the number of dimensions is greater than the number of samples.
• Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
• Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
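A minimal sketch of the kernel parameter mentioned in the last point; both estimators expose the same fit/predict interface:

# Sketch: swapping the kernel changes the decision function, not the API.
from sklearn.svm import SVC

clf_linear = SVC(kernel="linear")
clf_rbf = SVC(kernel="rbf", gamma="scale")  # a common non-linear kernel
# Both are used the same way:
# clf_rbf.fit(X_train, y_train); clf_rbf.predict(X_test)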
Sklearn.ensemble:
The sklearn.ensemble module includes ensemble-based methods for classification and regression, such as the Random Forest classifier used in this project. Ensemble methods combine the predictions of several base estimators in order to improve generalizability and robustness over a single estimator.
Environment:
Jupyter notebook:
The Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Its uses include
data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.
1.5 OBJECTIVE:
The major objective of our system is to take the current resume ranking system for candidate ranking to the next level and make it more flexible for both entities.
CHAPTER 2
2.1) DATASET
The dataset is named Hr.csv and was downloaded from Kaggle.
2.2) MODEL DESCRIPTION:
The model follows the Knowledge Discovery in Databases (KDD) process. The outcome of each stage depends on the data and the application type; therefore, it is required to understand the process and the multiple requirements and possibilities in each stage.
Developing an understanding − This is the basic preliminary step. It sets the scene for deciding what should be done about the several decisions like transformation, algorithms, representation, etc. The individuals who are in charge of a KDD venture are required to learn and characterize the goals of the end-user and the environment in which the knowledge discovery process will take place (this involves relevant prior knowledge).
Creating a target data set − This can be choosing a data set or targeting a subset of variables or data samples on which discovery is to be implemented. This process is essential because Data Mining learns and finds from the accessible data; this is the evidence foundation for building the models. If some important attributes are missing, the whole study can be unsuccessful; from this respect, the more attributes that are considered, the better.
Data cleaning and pre-processing − Data cleaning means cleaning the data by filling in the missing values, smoothing noisy data, identifying and eliminating outliers, and removing inconsistencies in the data.
Exploratory analysis and model and hypothesis selection − This can be selecting the data mining algorithm(s) and selecting the method(s) to be used for searching for data patterns. This process involves deciding which models and parameters are appropriate and matching a particular data-mining method with the overall criteria of the KDD process.
Choosing the data-mining task − This is deciding on the purpose of the model, for example classification, regression, and clustering. The user can significantly help the data-mining method by correctly implementing the preceding steps.
At this stage, you might need to go back and collect more items as per your needs, and this
process will repeat a few times or be completely skipped as per the conditions.
2.3) METHODOLOGY:
The KDD process can be divided into eight major stages which are described in detail in the
following sections.
• Problem specification.
• Resourcing.
• Data cleansing.
• Pre-processing.
• Data mining.
• Evaluation of results.
• Interpretation of results.
• Exploitation of results.
We present the KDD process in the form of a roadmap. The process has one major route which
is to get from the starting data to discovered knowledge about the data. Like most long and
difficult journeys, there will be a number of stops along the way. A stop represents one of the
eight KDD stages named above, each of which is made up of a number of smaller tasks. Each
stage, and task, is optional although in most circumstances at least one task from each stage
will be necessary. When applying the roadmap, it is unlikely the project will run directly from
start to end; stages will usually have to be repeated using a different set of decisions and
parameters. For this reason, the process is deemed iterative which is denoted by the inner
feedback route on the figure. The iterations performed during the KDD process are vital to the
success of the KDD project. In addition to reviewing and repeating a major stage, some or all
of the smaller tasks within the stage can be repeated.
2.4) CODE:
Checking Shape
Displaying Columns
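A minimal sketch of these two steps, assuming the dataset file is named Hr.csv as stated in Section 2.1:

# Sketch: load the Kaggle HR dataset and inspect it.
import pandas as pd

df = pd.read_csv("Hr.csv")   # dataset file named in Section 2.1

print(df.shape)      # (rows, columns)
print(df.columns)    # column names
print(df.head())     # first five records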
From all the above observations we can see that the target variable, "Status", is imbalanced in nature: as the plot shows, the number of candidates who Joined is much higher than the number who did Not Join. Even though the model accuracy may look good on such data, we know many of the predictions are wrong. So, in order to make the dataset balanced in nature, we shall use the Synthetic Minority Oversampling Technique (SMOTE).
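A minimal sketch of this balancing step, assuming the imbalanced-learn package is installed and that X and y hold the encoded features and the Status labels:

# Sketch: balance the Status classes with SMOTE (imbalanced-learn).
from collections import Counter
from imblearn.over_sampling import SMOTE

print("before:", Counter(y))             # minority class under-represented
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)  # oversample the minority class
print("after:", Counter(y_res))          # classes now equal in size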
Importing Libraries
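A plausible set of imports for the models compared in this report (a sketch, not the exact original cell):

# Sketch: libraries used across the experiments in this report.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score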
Random Forest
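A sketch of this step, assuming X_train, X_test, y_train and y_test come from the earlier split:

# Sketch: train and evaluate a Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print("Random Forest accuracy:", accuracy_score(y_test, rf_pred))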
SVM (Support Vector Machine)
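A sketch of this step under the same assumptions:

# Sketch: train and evaluate a Support Vector Machine classifier.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm = SVC(kernel="rbf", gamma="scale", random_state=42)
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
print("SVM accuracy:", accuracy_score(y_test, svm_pred))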
2.5) OUTPUT:
COMBINED ACCURACY OF ALL 5 ALGORITHMS
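A sketch of how such a side-by-side comparison could be computed, under the same train/test assumptions as above:

# Sketch: compare the test accuracy of all five classifiers.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf", gamma="scale"),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.3f}")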
2.6) RISK ANALYSIS OF EVERY ALGORITHM:
I) KNN ALGORITHM:
1. DOES NOT WORK WELL WITH LARGE DATASETS: In large datasets, the cost of calculating the distance between the new point and each existing point is huge, which degrades the performance of the algorithm.
2. DOES NOT WORK WELL WITH HIGH DIMENSIONS: The KNN algorithm doesn't work well with high-dimensional data because, with a large number of dimensions, it becomes difficult for the algorithm to calculate the distance in each dimension.
3. SENSITIVE TO NOISY DATA, MISSING VALUES AND OUTLIERS: KNN is sensitive to noise in the dataset; we need to manually impute missing values and remove outliers.
II) RANDOM FOREST ALGORITHM:
1. COMPLEXITY: Random Forest creates a lot of trees (unlike the single tree in the case of a decision tree) and combines their outputs. By default, it creates 100 trees in the Python sklearn library. To do so, this algorithm requires much more computational power and resources. On the other hand, a decision tree is simple and does not require so many computational resources.
2. LONGER TRAINING PERIOD: Random Forest requires much more time to train compared to decision trees, as it generates a lot of trees (instead of one tree in the case of a decision tree) and makes decisions based on the majority of votes.
III) APRIORI ALGORITHM:
One of the biggest limitations of the Apriori algorithm is that it is slow. This is mainly because of:
1. A large number of itemsets in the Apriori algorithm dataset.
2. Low minimum support in the data set for the Apriori algorithm.
3. The time needed to hold a large number of candidate sets with many frequent itemsets.
Thus it is inefficient when used with large volumes of data.
IV) NAÏVE BAYES ALGORITHM:
The assumption that all features are independent is not usually the case in real life, so it makes the Naïve Bayes algorithm less accurate than more complicated algorithms; its speed comes at a cost.
2.7) COMPARATIVE STUDY OF VARIOUS ALGORITHMS:
1. APRIORI ALGORITHM:
The HR Ranking System for candidate analysis in mining includes the approach of extracting the usage behaviour of the customers in Human Resource data (HR_data). The obtained data can be utilized in many ways, such as checking for fraudulent elements, improvement of the application, etc. The presented work focuses on HR_data for analyzing candidate requirements, and it mainly consists of three steps: pre-processing of the information, knowledge exploration, and finally pattern analysis. The pattern analysis task involves Association Rule Mining, Statistical Analysis, Clustering, Classification and identification of Sequential Patterns. Apriori is the most commonly used algorithm in data mining for extracting usage behaviour.
1. Support
2. Confidence
3. Lift
Support: It refers to the default popularity of any product. You find the support as the quotient of the number of transactions comprising that product divided by the total number of transactions.
Confidence: It refers to the likelihood that customers who bought biscuits also bought chocolates. You divide the number of transactions that comprise both biscuits and chocolates by the number of transactions that comprise biscuits to get the confidence.
Lift: It refers to the increase in the ratio of the sale of chocolates when biscuits are sold; it is computed as the confidence of the rule divided by the support of chocolates. A lift greater than 1 indicates a positive association.
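A sketch of mining rules with these three measures; the mlxtend package and the skill columns are assumptions, since the report does not name the Apriori implementation used:

# Sketch: Apriori with support/confidence/lift via mlxtend (assumed library).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: each row is a candidate, each column a skill.
data = pd.DataFrame({
    "python": [1, 1, 0, 1],
    "sql":    [1, 1, 1, 0],
    "ml":     [0, 1, 0, 1],
}, dtype=bool)

frequent = apriori(data, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])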
2. DECISION TREE:
Decision trees are outstanding tools to help anyone to select the best course of action.
They generate a highly valuable arrangement in which one can place options and
study possible outcomes of those options. They also facilitate users to make a fair
idea of the pros and cons related to each possible action. A decision tree is used to
represent graphically the decisions, the events, and the outcomes related to decisions
and events. Events are probabilistic and determined for each outcome. The aim of
this project HR Ranking system for candidate analysis is to do detailed analysis of
decision tree and its variants for determining the best appropriate decision.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
In a decision tree, for predicting the class of a record in HR_data, the algorithm starts from the root node of the tree. The algorithm compares the values of the root attribute with the record's (real dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
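A sketch of such a CART tree in scikit-learn, assuming X_train, y_train and X_test hold the encoded HR_data split:

# Sketch: a CART decision tree as implemented by scikit-learn.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)
tree.fit(X_train, y_train)       # learn the root-to-leaf splits
print(tree.predict(X_test[:5]))  # follow branches from the root node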
3. KNN ALGORITHM:
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.
Suppose there are two categories, Category A and Category B, and we have a new data point x1; we need to determine which of these categories this data point belongs to. To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
ACCURACY OF IMPLEMENTATION:
4. RANDOM FOREST:
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based
on the concept of ensemble learning, which is a process of combining multiple classifiers to
solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority votes of predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting, as the comparison sketched below illustrates.
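To illustrate this claim, a small sketch comparing one decision tree against a forest of 100 trees, under the same split assumptions:

# Sketch: single decision tree vs. an ensemble of 100 trees.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

one_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("tree:  ", accuracy_score(y_test, one_tree.predict(X_test)))
print("forest:", accuracy_score(y_test, forest.predict(X_test)))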
5. NAÏVE BAYES ALGORITHM:
Gaussian Naive Bayes is a variant of Naive Bayes that follows the Gaussian normal distribution and supports continuous data. We explore the idea behind Gaussian Naive Bayes along with an example.
Naive Bayes classifiers are a group of supervised machine learning classification algorithms based on Bayes' theorem. It is a simple classification technique, but it has high functionality. They find use when the dimensionality of the inputs is high. Complex classification problems can also be solved by using a Naive Bayes classifier.
Bayes Theorem can be used to calculate conditional probability. Being a powerful tool in the
study of probability, it is also applied in Machine Learning.
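A minimal sketch of Gaussian Naive Bayes under the same split assumptions:

# Sketch: Gaussian Naive Bayes on continuous features.
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)             # fits one Gaussian per feature per class
print(gnb.predict(X_test[:5]))
print(gnb.predict_proba(X_test[:1]))  # class probabilities via Bayes' theorem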
6. SVM (SUPPORT VECTOR CLUSTERING):
The objective of clustering is to partition a data set into groups according to some criterion in
an attempt to organize data into a more meaningful form. There are many ways of achieving
this goal. Clustering may proceed according to some parametric model or by grouping points
according to some distance or similarity measure as in hierarchical clustering. A natural way
to put cluster boundaries is in regions in data space where there is little data, i.e. in "valleys" in
the probability distribution of the data. This is the path taken in support vector
clustering (SVC), which is based on the support vector approach.
In SVC, data points are mapped from data space to a high-dimensional feature space using a kernel function. In the kernel's feature space the algorithm searches for the smallest sphere that encloses the image of the data, using the Support Vector Domain Description algorithm. This sphere, when mapped back to data space, forms a set of contours which enclose the data points. Those contours are then interpreted as cluster boundaries, and points enclosed by each contour are associated by SVC to the same cluster.
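scikit-learn does not provide support vector clustering itself; as a hedged approximation, its OneClassSVM estimator implements the closely related step of finding an enclosing kernel boundary (the support vector domain description idea):

# Sketch: an enclosing kernel boundary via OneClassSVM (an approximation of
# the support vector domain description step; SVC clustering itself is not
# part of scikit-learn).
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.randn(100, 2)                 # placeholder data points
boundary = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)
inside = boundary.predict(X)                # +1 inside the contour, -1 outside
print((inside == 1).sum(), "points enclosed")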
2.8 CONCLUSION
A Human Resource Candidate Ranking System supports the people who select the expert workforce of an organization or business sector. Human Resources has to play a variety of roles in order to pick a skilled candidate for a specific designation. Every HR person should possess a system that can rank the expertise needed for a specific job position at a specific location. There is a need for a system which can study different candidates and classify them based upon different factors. Upon the comparative study of the various algorithms, the accuracy found for predicting and studying the HR ranking system for candidate analysis is reported in the output of Section 2.5.
REFERENCES