
Classification and Prediction

Lecture 22, 23, 24, 25, 26, 27, 28
Dr. Sudhir Sharma
Manipal University Jaipur
Topics Covered
• Basic Concepts
• Decision Tree Induction
• Bayesian Classification
• Rule-Based Classification
• Backpropagation
• Support Vector Machines
• Lazy Learners
• Prediction
Data Mining - Classification & Prediction
• There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows −
• Classification
• Prediction
• Classification models predict categorical class labels, while prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment, given their income and occupation.
Classification?
• The following are examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze the profile of a given customer and determine whether that customer will buy a new computer.
• In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.
Prediction?
• The following is an example of a case where the data analysis task is Prediction −
• Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example, we are asked to predict a numeric value. Therefore, the data analysis task is an example of numeric prediction. In this case, a model or predictor is constructed that predicts a continuous-valued function, or an ordered value.
• Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
How Does Classification Work?
• With the help of the bank loan application discussed above, let us understand the working of classification. The data classification process includes two steps −
• Building the Classifier or Model
• Using the Classifier for Classification
• Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step, the classification algorithm builds the classifier.
• The classifier is built from the training set, made up of database tuples and their associated class labels.
• Each tuple in the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.
Using the Classifier for Classification
• In this step, the classifier is used for classification. Here, the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
Classification and Prediction Issues
The major issue is preparing the data for Classification and Prediction.
Preparing the data involves the following activities −
• Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − A database may also have irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
• Data Transformation and Reduction − The data can be transformed by any of the following methods.
• Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. It is used when neural networks, or methods involving distance measurements, are used in the learning step (a sketch appears after this list).
• Generalization − The data can also be transformed by generalizing it to higher-level concepts. For this purpose, we can use concept hierarchies.
• Note − Data can also be reduced by some other methods such as
wavelet transformation, binning, histogram analysis, and clustering.
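• To make the normalization step concrete, here is a minimal sketch of min-max normalization, assuming numpy is available; the income values are invented for illustration:

import numpy as np

# Hypothetical attribute values (e.g., customer incomes in dollars).
income = np.array([12000.0, 35000.0, 58000.0, 91000.0])

# Min-max normalization: rescale every value of the attribute into [0, 1].
scaled = (income - income.min()) / (income.max() - income.min())
print(scaled)  # [0.  0.2911...  0.5822...  1.]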
Comparison of Classification and
Prediction Methods
• Here are the criteria for comparing the methods of Classification and Prediction −
• Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data (a small sketch follows this list).
• Speed − This refers to the computational cost of generating and using the classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
• Interpretability − This refers to the extent to which the classifier or predictor is understood.
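• As a concrete illustration of the accuracy criterion, a minimal sketch assuming scikit-learn is available; the labels are invented:

from sklearn.metrics import accuracy_score

y_true = ["safe", "risky", "safe", "safe", "risky"]  # actual class labels
y_pred = ["safe", "safe", "safe", "safe", "risky"]   # labels output by a classifier

# Accuracy = correctly classified tuples / total tuples
print(accuracy_score(y_true, y_pred))  # 0.8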
Decision Tree
• A decision tree is a structure that includes
a root node, branches, and leaf nodes.
Each internal node denotes a test on an
attribute, each branch denotes the
outcome of a test, and each leaf node
holds a class label. The topmost node in
the tree is the root node.
• A decision tree for the concept buys_computer indicates whether a customer at a company is likely to buy a computer or not (the tree itself appears in the slide figure, which is not reproduced here). Each internal node represents a test on an attribute. Each leaf node represents a class.
Conti…
• The benefits of having a decision
tree are as follows −
• It does not require any domain
knowledge.
• It is easy to comprehend.
• The learning and classification
steps of a decision tree are simple
and fast.
Decision Tree Induction Algorithm
• A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
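• ID3 chooses the splitting attribute by information gain (C4.5 uses the related gain ratio). A minimal sketch of the information-gain computation, with invented tuples, might look like this:

import math
from collections import Counter

def entropy(labels):
    # Expected information (in bits) needed to classify a tuple in this set.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # Gain(A) = Info(D) - Info_A(D): the reduction in entropy from splitting on A.
    total = len(labels)
    info_a = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        info_a += (len(subset) / total) * entropy(subset)
    return entropy(labels) - info_a

# Invented training tuples for the buys_computer concept.
rows = [{"age": "youth", "student": "no"}, {"age": "youth", "student": "yes"},
        {"age": "senior", "student": "no"}, {"age": "senior", "student": "yes"}]
labels = ["no", "yes", "no", "yes"]
print(information_gain(rows, labels, "student"))  # 1.0: 'student' separates the classes
print(information_gain(rows, labels, "age"))      # 0.0: 'age' tells us nothing here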
Generate_Decision_Tree
Generating a decision tree from training tuples of data partition D
Algorithm : Generate_decision_tree

Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion consists of a
splitting_attribute and either a splitting point or a splitting subset.

Output:
A Decision Tree
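• The slide lists only the inputs and output of Generate_decision_tree; the method body is not reproduced. As a stand-in, here is a hedged sketch using scikit-learn's DecisionTreeClassifier (a CART-style learner, not exactly ID3/C4.5) with criterion="entropy" so splits are chosen by information gain; the encoded tuples are invented:

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented encoding of buys_computer-style tuples:
# age: 0 = youth, 1 = middle_aged, 2 = senior; student: 0 = no, 1 = yes
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "yes", "yes", "yes", "no", "yes"]

# criterion="entropy" makes splits chosen by information gain, as in ID3/C4.5.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["age", "student"]))
print(tree.predict([[0, 1]]))  # a youth who is a student -> ['yes']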
Tree Pruning
• Tree pruning is performed in order to remove anomalies in the training
data due to noise or outliers. The pruned trees are smaller and less
complex.
• Tree Pruning Approaches
• There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
• Cost Complexity
• The cost complexity is measured by the following two parameters −
• Number of leaves in the tree, and
• Error rate of the tree.
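• scikit-learn exposes post-pruning through cost-complexity pruning, which trades the number of leaves against the error rate as described above. A minimal sketch; the ccp_alpha value is an arbitrary illustration:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The pruning path reports the effective alphas at which subtrees are pruned,
# trading the number of leaves against the error rate.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)

# A larger ccp_alpha means stronger post-pruning: a smaller, less complex tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(pruned.get_n_leaves())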
Data Mining - Bayesian Classification
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical classifiers. Bayesian
classifiers can predict class membership probabilities such as the probability that a given tuple belongs to a
particular class.

Baye's Theorem

Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −

Posterior Probability [P(H/X)]

Prior Probability [P(H)]

where X is data tuple and H is some hypothesis.

According to Bayes' Theorem,

P(H/X)= P(X/H)P(H) / P(X)
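A worked numeric example of Bayes' Theorem, with assumed probabilities chosen only for illustration:

# H: "the applicant is risky"; X: the observed attribute values of the tuple.
p_h       = 0.3   # prior probability P(H), assumed
p_x_h     = 0.6   # likelihood P(X|H), assumed
p_x_not_h = 0.2   # likelihood P(X|not H), assumed

# Total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H)
p_x = p_x_h * p_h + p_x_not_h * (1 - p_h)

# Posterior: P(H|X) = P(X|H)P(H) / P(X)
p_h_x = p_x_h * p_h / p_x
print(p_h_x)  # 0.18 / 0.32 = 0.5625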


Bayesian Belief Network
• Bayesian Belief Networks specify joint conditional probability
distributions. They are also known as Belief Networks, Bayesian
Networks, or Probabilistic Networks.
• A Belief Network allows class conditional independencies to be defined
between subsets of variables.
• It provides a graphical model of causal relationships on which learning can
be performed.
• We can use a trained Bayesian Network for classification.
• There are two components that define a Bayesian Belief Network −
• Directed acyclic graph
• A set of conditional probability tables
Conti…
Conditional Probability Table
• The conditional probability table for the variable LungCancer (LC) gives the conditional probability for each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S). (The table itself appears in the slide figure, which is not reproduced here.)
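• Since the table is not reproduced, here is a sketch of how such a CPT can be represented in code. The probability values below are illustrative placeholders, not the figures from the original slide:

# Keys: (FamilyHistory, Smoker) -> P(LC = yes | FH, S); values are assumed.
cpt_lc = {
    ("yes", "yes"): 0.8,
    ("yes", "no"):  0.5,
    ("no",  "yes"): 0.7,
    ("no",  "no"):  0.1,
}

def p_lung_cancer(fh, s, lc="yes"):
    p = cpt_lc[(fh, s)]
    return p if lc == "yes" else 1.0 - p  # each CPT row sums to 1

print(p_lung_cancer("yes", "yes"))         # 0.8
print(p_lung_cancer("no", "no", lc="no"))  # 0.9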
Data Mining - Rule Based
Classification
Rule-based classifiers
• Rule-based classifiers are another type of classifier that makes the class decision using various "if...else" rules. These rules are easily interpretable, and thus these classifiers are generally used to generate descriptive models. The condition used with "if" is called the antecedent, and the predicted class of each rule is called the consequent.
Properties of rule-based classifiers
• Coverage: The percentage of records that satisfy the antecedent conditions of a particular rule (see the sketch after this list).
• The rules generated by the rule-based classifiers are generally not
mutually exclusive, i.e. many rules can cover the same record.
• The rules generated by the rule-based classifiers may not be
exhaustive, i.e. there may be some records that are not covered by any
of the rules.
• The decision boundaries created by individual rules are linear, but the overall boundary can be much more complex than that of a decision tree, because many rules can be triggered for the same record.
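• A minimal sketch of the coverage computation, with invented records and a hypothetical antecedent:

records = [
    {"age": "youth",  "student": "yes"},
    {"age": "youth",  "student": "no"},
    {"age": "senior", "student": "yes"},
    {"age": "senior", "student": "no"},
]

def antecedent(r):
    # Antecedent of a hypothetical rule: (age = youth) AND (student = yes)
    return r["age"] == "youth" and r["student"] == "yes"

# Coverage = fraction of records satisfying the antecedent.
coverage = sum(antecedent(r) for r in records) / len(records)
print(coverage)  # 0.25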
Conti…
• There are two solutions to the above problem −
• Either the rules can be ordered, i.e. the class corresponding to the highest-priority triggered rule is taken as the final class.
• Or votes can be assigned to each class depending on the rules' weights, i.e. the rules remain unordered.
How to generate a rule
• Algorithm for generating the model incrementally:
• The algorithm given below generates a model with unordered rules and ordered classes, i.e. we can decide which class to give priority to while generating the rules.
Conti…
A <- Set of attributes
T <- Set of training records
Y <- Set of classes
Y' <- Y ordered according to relevance
R <- Set of rules generated, initially an empty list

for each class y in Y'
    while the majority of class y records are not covered
        generate a new rule for class y, using the methods given above
        add this rule to R
        remove the records covered by this rule from T
    end while
end for
add the rule {} -> y', where y' is the default class
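• A simplified, runnable rendering of this pseudocode. The inner rule learner is a placeholder (it picks the single attribute test covering the most remaining records of the target class), standing in for the "methods given above", which the slide does not reproduce:

from collections import Counter

def sequential_covering(records, labels, class_order):
    data = list(zip(records, labels))
    rules = []                                     # R, initially empty
    for y in class_order:                          # for each class y in Y'
        while any(lab == y for _, lab in data):    # class-y records remain uncovered
            # Placeholder rule learner: the test covering most class-y records.
            tests = Counter((a, v) for rec, lab in data if lab == y
                            for a, v in rec.items())
            (attr, val), _ = tests.most_common(1)[0]
            rules.append((attr, val, y))           # add this rule to R
            # Remove the records covered by this rule from T.
            data = [(rec, lab) for rec, lab in data if rec.get(attr) != val]
    rules.append((None, None, class_order[-1]))    # default rule {} -> y'
    return rules

records = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}]
print(sequential_covering(records, ["yes", "yes", "no"], ["yes", "no"]))
# [('outlook', 'sunny', 'yes'), ('outlook', 'rain', 'no'), (None, None, 'no')]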
Classifying a record:
• The classification algorithm described below assumes that the rules are unordered and the classes are weighted.

R <- Set of rules generated using the training set
T <- Test record
W <- Class name to weight mapping, predefined, given as input
F <- Class name to vote mapping, generated for each test record, to be calculated

for each rule r in R
    check if r covers T
    if so, then add W of predicted_class to F of predicted_class
end for
output the class with the highest calculated vote in F
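• A minimal runnable sketch of this weighted-vote classification step, reusing the (attribute, value, class) rule triples from the previous sketch; the weights are invented:

from collections import defaultdict

def classify(rules, W, test_record):
    F = defaultdict(float)                        # class name -> vote
    for attr, val, predicted_class in rules:      # for each rule r in R
        covers = attr is None or test_record.get(attr) == val
        if covers:                                # if r covers T
            F[predicted_class] += W[predicted_class]
    return max(F, key=F.get)                      # class with the highest vote

rules = [("outlook", "sunny", "yes"), ("outlook", "rain", "no"), (None, None, "no")]
print(classify(rules, {"yes": 2.0, "no": 1.0}, {"outlook": "sunny"}))  # yes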
IF-THEN Rules
• A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −
• IF condition THEN conclusion
• Let us consider a rule R1:
• R1: IF age = youth AND student = yes THEN buys_computer = yes
• Note − We can also write rule R1 as follows −
• R1: (age = youth) ^ (student = yes) => (buys_computer = yes)
• If the condition holds true for a given tuple, then the antecedent is satisfied.
• Points to remember −
 The IF part of the rule is called the rule antecedent or precondition.
 The THEN part of the rule is called the rule consequent.
 The antecedent part, the condition, consists of one or more attribute tests, and these tests are logically ANDed.
 The consequent part consists of the class prediction.
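• Rule R1 rendered as a small Python predicate, a minimal sketch of the antecedent/consequent split:

def r1(tuple_):
    # Antecedent: the ANDed attribute tests of rule R1.
    antecedent = tuple_["age"] == "youth" and tuple_["student"] == "yes"
    # Consequent: the class prediction, which fires only if the antecedent holds.
    return "buys_computer = yes" if antecedent else None

print(r1({"age": "youth", "student": "yes"}))   # buys_computer = yes
print(r1({"age": "senior", "student": "yes"}))  # None: R1 does not cover this tuple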


Backpropagation
• What are Artificial Neural Networks?
• A neural network is a group of connected input/output units where each connection has a weight associated with it. Neural networks help you to build predictive models from large databases. This model builds upon the human nervous system and helps with tasks such as image understanding, human learning, and computer speech.
• What is Backpropagation?
• Backpropagation is the essence of neural network training. It is the method of fine-
tuning the weights of a neural network based on the error rate obtained in the previous
epoch (i.e., iteration). Proper tuning of the weights allows you to reduce error rates and
make the model reliable by increasing its generalization.
• Backpropagation in neural networks is a short form for “backward propagation of errors.”
It is a standard method of training artificial neural networks. This method helps calculate
the gradient of a loss function with respect to all the weights in the network.
How Backpropagation Algorithm Works
• The backpropagation algorithm in a neural network computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
• Consider the following backpropagation example to understand the steps:
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs:
   Error = Actual Output − Desired Output
5. Travel back from the output layer to the hidden layers to adjust the weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
A runnable sketch of these steps follows.
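• A compact numpy sketch of these six steps on a toy XOR dataset: one hidden layer with sigmoid activations, randomly initialized weights, and plain gradient descent. The layer sizes, learning rate, and epoch count are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy XOR dataset: inputs X (step 1) and desired outputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 2: weights are randomly selected (biases start at zero).
W1 = rng.normal(size=(2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))

for epoch in range(10000):
    # Step 3: forward pass from input layer through the hidden layer to output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Step 4: error in the outputs.
    error = out - y
    # Step 5: travel back, adjusting weights via the chain rule
    # (sigmoid'(z) = s(z) * (1 - s(z))).
    d_out = error * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0, keepdims=True)
    # Step 6: repeat until the desired output is (approximately) achieved.

print(np.round(out.ravel(), 2))  # typically approaches [0, 1, 1, 0]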
Why Do We Need Backpropagation?
The most prominent advantages of Backpropagation are:
• Backpropagation is fast, simple, and easy to program
• It has no parameters to tune apart from the number of inputs
• It is a flexible method as it does not require prior knowledge about the
network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function to be
learned.
What is a Feed Forward Network?
• A feedforward neural network is an artificial neural network where
the nodes never form a cycle. This kind of neural network has an
input layer, hidden layers, and an output layer. It is the first and
simplest type of artificial neural network.
• Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
• Static Back-propagation
• Recurrent Backpropagation
Static back-propagation:
• It is a kind of backpropagation network that produces a mapping from a static input to a static output. It is useful for solving static classification problems such as optical character recognition.
Recurrent backpropagation:
• Recurrent backpropagation is fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
• The main difference between these two methods is that the mapping is rapid in static back-propagation, while it is non-static in recurrent backpropagation.
Backpropagation Key Points
• Simplifies the network structure by removing weighted links that have the least effect on the trained network
• You need to study a group of input and activation values to develop the
relationship between the input and hidden unit layers.
• It helps to assess the impact that a given input variable has on a network
output. The knowledge gained from this analysis should be represented in
rules.
• Backpropagation is especially useful for deep neural networks working on
error-prone projects, such as image or speech recognition.
• Backpropagation takes advantage of the chain and power rules, which allows it to function with any number of outputs.
Backpropagation Best Practices
• Backpropagation in a neural network can be explained with the help
of the “Shoe Lace” analogy
Too little tension =
• Not enough constraining and very loose
Too much tension =
• Too much constraint (overtraining)
• Taking too much time (relatively slow process)
• Higher likelihood of breaking
Pulling one lace more than the other =
• Discomfort (bias)
Disadvantages of using Backpropagation
• The actual performance of backpropagation on a specific problem is
dependent on the input data.
• The backpropagation algorithm in data mining can be quite sensitive to noisy data
• You need to use the matrix-based approach for backpropagation
instead of mini-batch.
Support Vector Machine
• A Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. Though it handles regression problems as well, it is best suited for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. The dimension of the hyperplane depends upon the number of features: if the number of input features is two, the hyperplane is just a line; if the number of input features is three, the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the number of features exceeds three.
• Let us consider two independent variables x1, x2 and one dependent variable which is either a blue circle or a red circle (the scatter plot appears in the slide figure, which is not reproduced here).
Support vector machines (SVMs) are a set of supervised learning methods used
for classification, regression, and outliers detection.

The advantages of support vector machines are:


 Effective in high dimensional spaces.
 Still effective in cases where the number of dimensions is greater than the number of samples.
 Uses a subset of training points in the decision function (called support vectors), so it is also
memory efficient.
 Versatile: different Kernel functions can be specified for the decision function. Common kernels are
provided, but it is also possible to specify custom kernels.
The disadvantages of support vector machines include:
 If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
 SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).
Classification
• SVC, NuSVC and LinearSVC are classes capable of performing binary and multi-class classification on
a dataset.
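• A minimal classification sketch with SVC; the toy points are invented:

from sklearn import svm

# Invented 2-D points in two classes, to illustrate the SVC interface.
X = [[0, 0], [0, 1], [1, 1], [2, 2]]
y = [0, 0, 1, 1]

clf = svm.SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[2, 1]]))  # -> [1]
print(clf.support_vectors_)   # the subset of training points used in the decision function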
Regression
• The method of Support Vector Classification can be extended to solve regression problems. This
method is called Support Vector Regression.

• The model produced by support vector classification (as described above) depends only on a subset
of the training data because the cost function for building the model does not care about training
points that lie beyond the margin. Analogously, the model produced by Support Vector Regression
depends only on a subset of the training data, because the cost function ignores samples whose
prediction is close to their target.

• There are three different implementations of Support Vector Regression: SVR, NuSVR, and LinearSVR. LinearSVR provides a faster implementation than SVR but only considers the linear kernel, while NuSVR implements a slightly different formulation than SVR and LinearSVR. See Implementation details for further details.
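• A minimal regression sketch with SVR; the toy data and parameter values are invented:

from sklearn import svm

# Invented near-linear data, to illustrate the SVR interface.
X = [[0], [1], [2], [3]]
y = [0.0, 0.8, 1.9, 3.1]

reg = svm.SVR(kernel="linear", C=1.0, epsilon=0.1)
reg.fit(X, y)
print(reg.predict([[1.5]]))  # roughly 1.5 for this data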
Complexity
• Support Vector Machines are powerful tools, but their compute and storage requirements increase rapidly with the number of training vectors. The core of an SVM is a quadratic programming problem (QP), separating support vectors from the rest of the training data. The QP solver used by the libsvm-based implementation scales between O(n_features × n_samples^2) and O(n_features × n_samples^3), depending on how efficiently the libsvm cache is used in practice (dataset dependent). If the data is very sparse, n_features should be replaced by the average number of non-zero features in a sample vector.
• For the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more
efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples
and/or features.
Kernel functions
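• As noted above, different kernel functions can be specified for the decision function. In scikit-learn, built-in kernels are selected by name, and a custom kernel can be passed as a callable; a minimal sketch with invented data:

import numpy as np
from sklearn import svm

X = [[0, 0], [0, 1], [1, 1], [2, 2]]
y = [0, 0, 1, 1]

# Built-in kernels are selected by name.
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    svm.SVC(kernel=kernel).fit(X, y)

# A custom kernel is a callable returning the Gram matrix between two sets of
# inputs; here, a plain linear kernel.
def my_kernel(A, B):
    return np.dot(A, np.transpose(B))

clf = svm.SVC(kernel=my_kernel).fit(X, y)
print(clf.predict([[2, 1]]))  # -> [1]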
