
DATA MINING

Presidency College (Autonomous)

5 SEMESTER BCA

By
Dr. J. Vijay Fidelis
Associate Professor, Dept of Computer Applications
Presidency College, Bangalore-24

Reaccredited by NAAC with A+
Presidency Group
UNIT V
CLASSIFICATION AND PREDICTION

There are two forms of data analysis that can be used for extracting
models describing important classes or to predict future data trends.
These two forms are as follows −
Classification
Prediction

Classification models predict categorical class labels;
prediction models predict continuous-valued functions.
For example, we can build a classification model to categorize bank
loan applications as either safe or risky, or a prediction model to
predict the expenditures in dollars of potential customers on
computer equipment given their income and occupation.
What is classification?
Classification identifies the category, or class label,
of a new observation.
The following are examples of cases where the data
analysis task is classification −
A bank loan officer wants to analyze the data in order to
know which customers (loan applicants) are risky and which
are safe.
A marketing manager at a company needs to analyze
customer data to determine whether a customer with a given
profile will buy a new computer.
In both of the above examples, a model or classifier is
constructed to predict the categorical labels. These labels
are risky or safe for the loan application data and yes or no
for the marketing data.
What is Prediction?
Prediction is used to find a numerical output. As in
classification, the training dataset contains the inputs and the
corresponding numerical output values.
The model should produce a numerical output when new
data is given.
Regression is generally used for prediction.
Predicting the value of a house from facts such
as the number of rooms and the total area is an example
of prediction.
For example, suppose the marketing manager needs to
predict how much a particular customer will spend at the
company during a sale.
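As a rough illustration of prediction by regression, the sketch below fits a straight line by ordinary least squares and uses it to predict a house value from its area. The function name `fit_line` and all the numbers are hypothetical and chosen only for the example; a real predictor would normally use several attributes.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: total area and the known house value.
areas = [50.0, 80.0, 100.0, 120.0]
values = [110.0, 170.0, 210.0, 250.0]

slope, intercept = fit_line(areas, values)

# Predict the numerical output for a new, unseen house.
predicted_value = slope * 90.0 + intercept
print(predicted_value)
```

The output here is a continuous number rather than a class label, which is what separates prediction from classification.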
CLASSIFICATION VS PREDICTION
Classification: the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known.
Prediction: the process of identifying the missing or unavailable numerical data for a new observation.

Classification: the accuracy depends on finding the class label correctly.
Prediction: the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.

Classification: the model is known as the classifier.
Prediction: the model is known as the predictor.

Classification: a model or classifier is constructed to find the categorical labels.
Prediction: a model or predictor is constructed that predicts a continuous-valued function or ordered value.

Classification: for example, the grouping of patients based on their medical records can be considered classification.
Prediction: for example, we can think of prediction as predicting the correct treatment for a particular disease for a person.
HOW DOES CLASSIFICATION WORK?
❖ With the help of the bank loan application that we have discussed
above, let us understand the working of classification.
The data classification process includes two steps −
➢ Building the Classifier or Model
➢ Using the Classifier for Classification

❖ Building the Classifier or Model
➢ This step is the learning step or the learning phase.
➢ In this step the classification algorithms build the classifier.
➢ The classifier is built from the training set made up of database
tuples and their associated class labels.
➢ Each tuple in the training set is assumed to belong to a predefined
category or class. These tuples can also be referred to as samples,
objects, or data points.
Using Classifier for Classification

In this step, the classifier is used for classification. Here the test data
is used to estimate the accuracy of classification rules. The
classification rules can be applied to the new data tuples if the
accuracy is considered acceptable.
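The two steps can be sketched in a few lines. This is only an illustration under simplifying assumptions: the "classifier" is a single income threshold, the tuples are hypothetical (income, label) pairs, and `build_classifier` and `estimate_accuracy` are names made up for this example.

```python
def build_classifier(training_set):
    """Learning step: choose the income threshold that fits the training set best."""
    candidates = sorted(income for income, _ in training_set)

    def correct_at(threshold):
        return sum(
            ("safe" if income >= threshold else "risky") == label
            for income, label in training_set
        )

    best = max(candidates, key=correct_at)
    return lambda income: "safe" if income >= best else "risky"


def estimate_accuracy(classifier, test_set):
    """Classification step: fraction of test tuples labeled correctly."""
    correct = sum(classifier(income) == label for income, label in test_set)
    return correct / len(test_set)


# Hypothetical loan data: (annual income in thousands, known class label).
training_set = [(20, "risky"), (25, "risky"), (30, "risky"),
                (45, "safe"), (60, "safe"), (80, "safe")]
test_set = [(22, "risky"), (50, "safe"), (70, "safe"), (28, "risky")]

classifier = build_classifier(training_set)
accuracy = estimate_accuracy(classifier, test_set)

# Apply the rule to a new tuple only if the accuracy is acceptable.
if accuracy >= 0.75:
    print(classifier(40))
```

The test set is kept separate from the training set so that the accuracy estimate is not overly optimistic.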

Classification and Prediction Issues
The major issue is preparing the data for classification
and prediction. Preparing the data involves the following
activities −
Data Cleaning − Data cleaning involves removing
noise and treating missing values. Noise is
removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a
missing value with the most commonly occurring value for
that attribute.
Relevance Analysis − The database may also contain
irrelevant attributes. Correlation analysis is used to determine
whether any two given attributes are related.
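The missing-value treatment described under data cleaning can be sketched as follows. The helper name `fill_missing_with_mode` and the sample attribute values are hypothetical, and missing entries are assumed to be represented by `None`.

```python
from collections import Counter


def fill_missing_with_mode(values, missing=None):
    """Replace missing entries with the most commonly occurring value."""
    observed = [v for v in values if v is not missing]
    most_common_value, _ = Counter(observed).most_common(1)[0]
    return [most_common_value if v is missing else v for v in values]


# Hypothetical "occupation" attribute with two missing values.
occupation = ["clerk", None, "engineer", "clerk", None, "teacher"]
cleaned = fill_missing_with_mode(occupation)
print(cleaned)
```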
Data Transformation and reduction
− The data can be transformed by any of the following methods.
Normalization is used to scale the data of an attribute so that it falls
in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally
useful for classification algorithms.
Need for Normalization −
Normalization is generally required when we are dealing with
attributes on different scales; otherwise, an equally important
attribute on a smaller scale may lose effectiveness
because another attribute has values on a larger scale.
In simple words, when there are multiple attributes whose
values are on different scales, this may lead to poor data models
while performing data mining operations. So they are normalized to
bring all the attributes onto the same scale.
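One common way to do this is min-max normalization, which maps the smallest value to the lower bound of the target range and the largest to the upper bound. The slides do not name a specific technique, so treat the sketch below as one possible choice; `min_max_normalize` and the sample incomes are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]


# Hypothetical income attribute on a much larger scale than, say, age.
incomes = [20000.0, 40000.0, 60000.0, 100000.0]
normalized = min_max_normalize(incomes)
rescaled = min_max_normalize(incomes, -1.0, 1.0)
print(normalized)
print(rescaled)
```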
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of classification and
prediction −
Accuracy − The accuracy of a classifier refers to its ability to
predict the class label correctly; the accuracy of a predictor
refers to how well it can guess the value of the predicted
attribute for new data.
Speed − This refers to the computational cost of generating and using
the classifier or predictor.
Robustness − This refers to the ability of the classifier or predictor to make
correct predictions from noisy data.
Scalability − This refers to the ability to construct the classifier
or predictor efficiently given a large amount of data.
Interpretability − This refers to the extent to which the classifier or predictor
can be understood.
Decision Tree Induction
A decision tree is a structure that includes a root node, branches, and
leaf nodes.
Each internal node denotes a test on an attribute,
each branch denotes the outcome of a test,
and each leaf node holds a class label.
The topmost node in the tree is the root node.
The benefits of having a decision tree are as follows −
❑ It does not require any domain knowledge.
❑ It is easy to comprehend.
❑ The learning and classification steps of a decision tree are simple
and fast.

The following decision tree is for the concept buy_computer, which
indicates whether a customer at a company is likely to buy a
computer or not.
Each internal node represents a test on an attribute. Each leaf node
represents a class.
[Figure: decision tree for the concept buy_computer]
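To make the node and leaf roles concrete, here is a small sketch of how a tuple is classified by walking such a tree from the root to a leaf. The tree below is hypothetical: its attributes (`age`, `student`, `credit_rating`) and branch values are assumed for illustration and are not taken from the slide's figure.

```python
# Hypothetical tree: internal nodes are dicts, leaves are class labels.
buy_computer_tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"no": "no", "yes": "yes"},
        },
        "middle_aged": "yes",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}


def classify(node, customer):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(node, dict):
        outcome = customer[node["attribute"]]
        node = node["branches"][outcome]
    return node


customer = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(buy_computer_tree, customer))
```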

Decision Tree Induction Algorithm
A machine learning researcher named J. Ross Quinlan developed,
in 1980, a decision tree algorithm known as ID3
(Iterative Dichotomiser). Later, he presented C4.5, which
was the successor of ID3. ID3 and C4.5 adopt a greedy
approach.
In these algorithms, there is no backtracking; the trees are
constructed in a top-down recursive divide-and-conquer
manner.
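The greedy, top-down, recursive construction can be sketched as below. This is a simplified illustration in the spirit of ID3, not Quinlan's full algorithm: it assumes categorical attributes, uses information gain to pick the splitting attribute (ID3's selection measure, which the slides do not spell out), and omits pruning. The function names and the tiny dataset are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_attribute(rows, attributes, target):
    """Greedy choice: the attribute with the highest information gain."""
    base = entropy([r[target] for r in rows])

    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)


def build_tree(rows, attributes, target):
    """Top-down recursive divide-and-conquer; a split is never revisited."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # all tuples in one class: make a leaf
        return labels[0]
    if not attributes:                 # nothing left to test: majority class
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    return {
        "attribute": attr,
        "branches": {
            value: build_tree([r for r in rows if r[attr] == value], remaining, target)
            for value in sorted({r[attr] for r in rows})
        },
    }


# Hypothetical training tuples.
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]
tree = build_tree(rows, ["student", "credit"], "buys")
print(tree)
```

In this toy data `student` separates the classes perfectly, so it is chosen at the root and each branch becomes a leaf immediately.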

Tree Pruning
Tree pruning is performed to remove branches that reflect anomalies
in the training data due to noise or outliers. The pruned
trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
Pre-pruning − The tree is pruned by halting its
construction early.
Post-pruning − This approach removes a sub-tree from a
fully grown tree.
Data Mining - Bayesian Classification
Bayesian classification is based on Bayes'
Theorem. Bayesian classifiers are statistical
classifiers. They can predict class
membership probabilities, such as the probability that a
given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are
two types of probabilities −
Posterior Probability [P(H|X)]
Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
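A short numeric sketch may help show how the prior is turned into a posterior. All probabilities below are hypothetical, and the evidence term P(X) is obtained with the law of total probability over H and not-H, a step the slide does not show.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' Theorem: posterior of H given X."""
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical numbers: H = "customer buys a computer", X = observed profile.
p_h = 0.25                 # prior probability of H
p_x_given_h = 0.5          # probability of the profile among buyers
p_x_given_not_h = 0.125    # probability of the profile among non-buyers

# Law of total probability gives the evidence P(X).
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(round(p_h_given_x, 4))
```

Observing the profile raises the probability of H from the prior 0.25 to about 0.57.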
Bayesian Belief Network
They are also known as Belief Networks, Bayesian
Networks, or Probabilistic Networks.
A Belief Network allows class conditional independencies
to be defined between subsets of variables.
It provides a graphical model of causal relationships on
which learning can be performed.
There are two components that define a Bayesian
Belief Network −
Directed acyclic graph
A set of conditional probability tables
Directed Acyclic Graph
• Each node in a directed acyclic graph
represents a random variable.
• These variables may be discrete or continuous
valued.
• These variables may correspond to the actual
attributes given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean
variables.
[Figure: directed acyclic graph for six Boolean variables]
The arcs in the diagram allow representation of causal knowledge.
For example, lung cancer is influenced by a person's family history
of lung cancer, as well as whether or not the person is a smoker.
It is worth noting that the variable PositiveXray is independent of
whether the patient has a family history of lung cancer or whether the
patient is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the
variable LungCancer (LC), showing each possible
combination of the values of its parent nodes,
FamilyHistory (FH) and Smoker (S), is as follows −
[Table: conditional probability table for LungCancer given FamilyHistory and Smoker]
CLASSIFICATION BY BACKPROPAGATION
• Backpropagation, or backward propagation of errors, is
an algorithm that is designed to test for errors working
back from output nodes to input nodes.
• It is an important mathematical tool for improving the
accuracy of predictions in data mining and machine
learning.
• Backpropagation is an iterative, recursive, and efficient
approach for computing the updated weights that
improve the network.
• Backpropagation is generally used in neural network
training and computes the gradient of the loss function with respect to the
weights of the network.
• It functions with a multi-layer neural network and
observes the internal representations of the input-output
mapping.
➢ A neural network is a set of connected input/output units
where each connection has a weight associated with it.
➢ Neural networks can help computers make intelligent
decisions with limited human assistance. This is because
they can learn and model relationships between input
and output data that are nonlinear and complex.
Neural Network as a Classifier
Weakness
o Long training time
o Requires a number of parameters that are typically best determined
empirically, e.g., the network topology or "structure"
o Poor interpretability: it is difficult to interpret the symbolic
meaning behind the learned weights and of the "hidden units" in
the network
Strength
o High tolerance to noisy data
o Ability to classify untrained patterns
o Well-suited for continuous-valued inputs and outputs
o Successful on a wide array of real-world data
o Algorithms are inherently parallel
o Techniques have recently been developed for the
extraction of rules from trained neural networks
PROCESS
Initialize the weights:
▪ The weights in the network are initialized to small random
numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5).
▪ Each training tuple, X, is processed by the following steps.
▪ Propagate the inputs forward: First, the training tuple is fed to the
input layer of the network.
▪ The inputs pass through the input units, unchanged. That is, for an
input unit j, its output, Oj, is equal to its input value, Ij.
▪ Next, the net input and output of each unit in the hidden and
output layers are computed.
▪ The net input to a unit in the hidden or output layers is
computed as a linear combination of its inputs.
▪ Each such unit has a number of inputs to it that are, in
fact, the outputs of the units connected to it in the
previous layer.
▪ Each connection has a weight.
▪ To compute the net input to the unit, each input
connected to the unit is multiplied by its
corresponding weight, and this is summed.
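The forward computation for a single unit can be sketched as follows. The weighted sum matches the description above; the bias term and the sigmoid activation used to turn the net input into an output are common choices assumed here for illustration, since the slides do not state them. The names and numbers are hypothetical.

```python
import math


def net_input(inputs, weights, bias=0.0):
    """Weighted sum of the outputs of the previous layer (bias is an assumption)."""
    return sum(w * o for w, o in zip(weights, inputs)) + bias


def sigmoid(x):
    """Logistic activation, assumed here as the unit's output function."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical outputs of three units in the previous layer and their weights.
previous_outputs = [1.0, 0.0, 1.0]
weights = [0.5, -0.25, 0.25]

net = net_input(previous_outputs, weights, bias=-0.5)
output = sigmoid(net)
print(net, output)
```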
k-Nearest-Neighbor Classifier:
▪ Nearest-neighbor classifiers are based on comparing
a given test tuple with training tuples that are similar
to it.
▪ The training tuples are described by n attributes.
▪ Each tuple represents a point in an n-dimensional
space.
▪ In this way, all of the training tuples are stored in an n-dimensional
pattern space.
▪ When given an unknown tuple, a k-nearest-neighbor
classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple.
▪ These k training tuples are the k nearest neighbors of
the unknown tuple.
KNN Algorithm
Closeness is defined in terms of a distance metric, such as Euclidean
distance. The Euclidean distance between two points or tuples, say,
X1 = (x11, x12, … , x1n) and X2 = (x21, x22, … , x2n), is

$$\mathrm{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

In other words, for each numeric attribute, we take the difference
between the corresponding values of that attribute in tuple X1 and in
tuple X2, square this difference, and accumulate it.
The square root is taken of the total accumulated distance
count.
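The computation described above translates almost word for word into code. The function name `euclidean_distance` is hypothetical; the length check is an added safeguard.

```python
import math


def euclidean_distance(x1, x2):
    """Square each attribute difference, accumulate, then take the square root."""
    if len(x1) != len(x2):
        raise ValueError("tuples must have the same number of attributes")
    accumulated = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.sqrt(accumulated)


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
```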

Working of KNN Algorithm
The K-nearest neighbors (KNN) algorithm uses ‘feature
similarity’ to predict the values of new data points, which
means that a new data point is assigned a
value based on how closely it matches the points in the
training set. We can understand its working with the help
of the following steps −
Step 1 − To implement any algorithm, we need a
dataset. So during the first step of KNN, we must load the
training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the
number of nearest data points to consider. K can be any positive integer.
Step 3 − For each point in the test data, do the following −
3.1 − Calculate the distance between the test point and each
row of the training data using a distance measure
such as Euclidean, Manhattan, or Hamming distance. The
most commonly used measure is
Euclidean distance.
3.2 − Sort the training rows by distance in
ascending order.
3.3 − Choose the top K rows from the sorted
array.
3.4 − Assign a class to the test point based on
the most frequent class among these rows.
Step 4 − End
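The steps above can be sketched as a short function. This is a minimal illustration, assuming numeric features, Euclidean distance (via `math.dist`, available in Python 3.8 and later), and a simple majority vote; the function name `knn_classify` and the training points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training_data, test_point, k):
    """Classify test_point by majority vote among its k nearest training rows."""
    # Step 3.1: distance from the test point to every training row.
    distances = [
        (math.dist(features, test_point), label)
        for features, label in training_data
    ]
    # Step 3.2: sort by distance in ascending order.
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the top K rows.
    top_k_labels = [label for _, label in distances[:k]]
    # Step 3.4: assign the most frequent class among them.
    return Counter(top_k_labels).most_common(1)[0][0]


# Hypothetical training rows: ((x, y), class label).
training_data = [
    ((58, 62), "red"), ((63, 58), "red"), ((61, 65), "blue"),
    ((20, 30), "blue"), ((25, 20), "blue"), ((90, 85), "red"),
]

predicted = knn_classify(training_data, (60, 60), k=3)
print(predicted)
```

With K = 3 the nearest rows to (60, 60) are two red points and one blue point, so the vote goes to red. An odd K is often chosen for two-class problems to avoid tied votes.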
Example
The following is an example to understand the concept of
K and the working of the KNN algorithm −
Suppose we have a dataset which can be plotted as
follows −
[Figure: scatter plot of the example dataset]
Now, we need to classify the new data point shown as a black dot (at point
60,60) into the blue or red class. We are assuming K = 3, i.e. it would find
the three nearest data points. It is shown in the next diagram −
[Figure: the three nearest neighbors of the new data point]
We can see in the above diagram the three nearest neighbors of
the data point with the black dot. Among those three, two lie
in the red class, hence the black dot will also be assigned to the red
class.
Pros and Cons of KNN
Pros
• It is a very simple algorithm to understand and
interpret.
• It is very useful for nonlinear data because the algorithm makes no
assumptions about the data.
• It is a versatile algorithm, as we can use it for
classification as well as regression.
• It has relatively high accuracy, but there are much
better supervised learning models than KNN.
Cons
• It is a computationally somewhat expensive algorithm
because it stores all the training data.
• It requires high memory storage compared
to other supervised learning algorithms.
• Prediction is slow when N (the number of training tuples) is large.
• It is very sensitive to the scale of the data as well
as to irrelevant features.
Applications of KNN
The following are some of the areas in which KNN can be applied
successfully −
Banking System
KNN can be used in a banking system to predict whether an individual is fit for
loan approval, i.e. whether that individual has characteristics similar to those
of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual’s credit rating by
comparing the individual with persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into
various classes like “Will Vote”, “Will not Vote”, “Will Vote for Party ‘Congress’”,
“Will Vote for Party ‘BJP’”.
Other areas in which the KNN algorithm can be used are Speech Recognition,
Handwriting Detection, Image Recognition and Video Recognition.
What Is the Genetic Algorithm?
➢ The genetic algorithm is a method for solving both
constrained and unconstrained optimization problems.
➢ The genetic algorithm repeatedly modifies a population of
individual solutions.
➢ At each step, the genetic algorithm selects individuals at random
from the current population to be parents and uses them to
produce the children for the next generation.
➢ Over successive generations, the population "evolves" toward an
optimal solution.
➢ The genetic algorithm is used to solve a variety of optimization
problems that are not well suited for standard optimization
algorithms, including problems in which the objective function is
discontinuous, nondifferentiable, or highly nonlinear.
➢ The genetic algorithm can address problems of mixed integer
programming, where some components are restricted to be
integer-valued.
The genetic algorithm uses three main types of rules at each step to
create the next generation from the current population:
• Selection rules select the individuals, called parents, that contribute
to the population at the next generation.
• Crossover rules combine two parents to form children for the next
generation.
• Mutation rules apply random changes to individual parents to form
children.
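The three rule types can be sketched on a toy problem. Everything specific here is an assumption made for illustration: individuals are bit lists, the objective is simply the number of 1 bits, selection is a size-2 tournament, crossover is single-point, mutation is applied to the crossover children, and the best individual is copied unchanged into the next generation (elitism).

```python
import random


def fitness(individual):
    """Toy objective (assumed): the number of 1 bits."""
    return sum(individual)


def select_parent(population, rng):
    """Selection rule: pick two individuals at random, keep the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def crossover(parent1, parent2, rng):
    """Crossover rule: combine two parents at a random cut point."""
    point = rng.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:]


def mutate(individual, rng, rate=0.05):
    """Mutation rule: flip each bit with a small probability."""
    return [1 - gene if rng.random() < rate else gene for gene in individual]


def next_generation(population, rng):
    """Create the next generation from the current population."""
    children = [max(population, key=fitness)]  # elitism (an assumption)
    while len(children) < len(population):
        child = crossover(select_parent(population, rng),
                          select_parent(population, rng), rng)
        children.append(mutate(child, rng))
    return children


rng = random.Random(0)  # seeded so that runs are repeatable
population = [[rng.randint(0, 1) for _ in range(12)] for _ in range(10)]
initial_best = max(fitness(ind) for ind in population)

for _ in range(30):
    population = next_generation(population, rng)

final_best = max(fitness(ind) for ind in population)
print(initial_best, final_best)
```

Because of the elitism step, the best fitness in the population can never decrease from one generation to the next.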

The genetic algorithm differs from a classical, derivative-based
optimization algorithm in two main ways, as summarized in the
following table.
[Table: classical algorithm vs. genetic algorithm]
Cluster Analysis
A cluster is a group of objects that belong to the same class. In other words,
similar objects are grouped in one cluster and dissimilar objects are grouped
in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of
similar objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups
based on data similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable
to changes and helps single out useful features that distinguish different
groups.
Applications of Cluster Analysis
• Clustering analysis is broadly used in many applications such as market
research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer
base. And they can characterize their customer groups based on the purchasing
patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
• Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a
city according to house type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information
discovery.
• Clustering is also used in outlier detection applications such as detection of
credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into
the distribution of data to observe characteristics of each cluster.
Requirements of Clustering in Data Mining
The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large
databases.
• Ability to deal with different kinds of attributes − Algorithms should be
capable of being applied to any kind of data, such as interval-based (numerical) data,
categorical, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should
be capable of detecting clusters of arbitrary shape. It should not be restricted
to only distance measures that tend to find spherical clusters of small size.
• High dimensionality − The clustering algorithm should not only be able to
handle low-dimensional data but also high-dimensional space.
• Ability to deal with noisy data − Databases contain noisy, missing or
erroneous data. Some algorithms are sensitive to such data and may lead to poor
quality clusters.
• Interpretability − The clustering results should be interpretable,
comprehensible, and usable.
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method
constructs ‘k’ partitions of the data. Each partition will represent a cluster and k ≤
n. It means that it will classify the data into k groups, which satisfy the
following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
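The two requirements can be checked mechanically. The sketch below is only a validity check for a proposed grouping, not a clustering algorithm; `is_valid_partition` and the sample objects are hypothetical.

```python
def is_valid_partition(objects, groups):
    """Check that groups form a partition of objects into non-empty, disjoint groups."""
    # Requirement 1: each group contains at least one object.
    if any(len(group) == 0 for group in groups):
        return False
    # Requirement 2: each object belongs to exactly one group.
    assigned = [obj for group in groups for obj in group]
    no_overlap = len(assigned) == len(set(assigned))
    covers_all = set(assigned) == set(objects)
    return no_overlap and covers_all


objects = ["a", "b", "c", "d", "e"]          # n = 5
good = [["a", "b"], ["c"], ["d", "e"]]       # k = 3 groups
overlapping = [["a", "b"], ["b", "c"], ["d", "e"]]

print(is_valid_partition(objects, good))
print(is_valid_partition(objects, overlapping))
```

When both requirements hold, k ≤ n follows automatically, because k non-empty disjoint groups need at least k distinct objects.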
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with
all of the objects in the same cluster. In each successive iteration, a cluster is
split up into smaller clusters. This is done until each object is in its own cluster or the
termination condition holds. This method is rigid, i.e., once a merging or
splitting is done, it can never be undone.