
Data Mining

B.Tech. IV Year I Semester

UNIT – III
Classification and Prediction

Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends. A classifier
constructs a model from a training set, based on the values (class labels) of a classifying
attribute, and then uses that model to classify new data.

Classification predicts categorical (discrete, unordered) labels.

Prediction models continuous-valued functions.

Definition & Procedure

Given a set of records (called the training set):

• Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned to a class as accurately as possible.

Usually, the given data set is divided into a training set and a test set: the training set is used
to build the model and the test set is used to validate it. The accuracy of the model is
determined on the test set.

For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a prediction model to predict the expenditures of potential customers on
computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered
value, as opposed to a categorical label.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.

Typical Applications
Document categorization
Credit approval
Target marketing
Medical diagnosis
Treatment effectiveness analysis

Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: the training data (observations, measurements, etc.) are accompanied by
labels indicating the class of each observation.

New data are classified based on the training set.

Unsupervised learning (clustering)

The class labels of the training data are unknown.

Given a set of measurements, observations, etc., the aim is to establish the existence
of classes or clusters in the data.

Classification—A Two-Step Process

⚫ Model construction: describing a set of predetermined classes

◦ Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute.
◦ The set of tuples used for model construction is the training set.
◦ The model is represented as classification rules, decision trees, or mathematical
formulae.
⚫ Model usage: classifying future or unknown objects
◦ Estimate the accuracy of the model:
▪ The known label of each test sample is compared with the classified result
from the model.
▪ The accuracy rate is the percentage of test set samples that are correctly
classified by the model.
▪ The test set is independent of the training set; otherwise over-fitting will occur.
◦ If the accuracy is acceptable, use the model to classify data tuples whose class
labels are not known.
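
As an illustration (not part of the original notes), the two steps can be sketched with scikit-learn; the dataset, split ratio, and choice of classifier below are arbitrary assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                    # labeled tuples

# Hold out an independent test set so accuracy is not measured on training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 1 (model construction): learn the model from the training set.
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Step 2 (model usage): estimate accuracy on the test set; if acceptable,
# the model can then classify tuples whose class labels are not known.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```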
Issues Regarding Classification and Prediction

Data Preparation
 Data Cleaning - This refers to the preprocessing of data in order to remove or reduce
noise (by applying smoothing techniques) and the treatment of missing values (e.g., by
replacing a missing value with the most commonly occurring value for that attribute, or
with the most probable value based on statistics).
o Preprocess data in order to reduce noise and handle missing values
 Relevance Analysis (feature selection) - Many of the attributes in the data may be
redundant. Correlation analysis can be used to identify whether any two given attributes
are statistically related. For example, a strong correlation between attributes A1 and A2
would suggest that one of the two could be removed from further analysis. A database
may also contain irrelevant attributes. Attribute subset selection can be used in these
cases to find a reduced set of attributes such that the resulting probability distribution of
the data classes is as close as possible to the original distribution obtained using all
attributes. Hence, relevance analysis, in the form of correlation analysis and attribute
subset selection, can be used to detect attributes that do not contribute to the
classification or prediction task. Such analysis can help improve classification efficiency
and scalability.
o Remove the irrelevant or redundant attributes
 Data Transformation - The data may be transformed by normalization; Normalization
involves scaling all values for a given attribute so that they fall within a small specified
range, such as -1 to +1 or 0 to 1. The data can also be transformed by generalizing it to
higher-level concepts. Concept hierarchies may be used for this purpose. This is
particularly useful for continuous valued attributes. For example, numeric values for the
attribute income can be generalized to discrete ranges, such as low, medium, and high.
Similarly, categorical attributes, like street, can be generalized to higher-level concepts,
like city.
o Generalize and/or normalize data
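
For illustration, here is a minimal Python sketch of min-max normalization and concept-hierarchy generalization; the income values and the low/medium/high thresholds are made-up assumptions:

```python
# Min-max normalization of an attribute to the range [0, 1].
# The income values below are made-up, illustrative numbers.
income = [23000, 45000, 61000, 38000, 90000]

lo, hi = min(income), max(income)
normalized = [(v - lo) / (hi - lo) for v in income]
print(normalized)                       # every value now lies in [0, 1]

# Generalizing the same attribute to higher-level concepts
# (a simple concept hierarchy; the thresholds are arbitrary).
def income_level(value):
    if value < 40000:
        return "low"
    elif value < 70000:
        return "medium"
    return "high"

print([income_level(v) for v in income])   # ['low', 'medium', 'medium', 'low', 'high']
```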
Performance Analysis

 Predictive accuracy
o Ability to correctly classify new or previously unseen data
 Speed and scalability
o Time to construct the model
o Time to use the model
 Robustness
o Ability to make correct predictions given noisy data or missing values
 Scalability
o Ability to construct the classifier or predictor efficiently given large amounts of data
 Interpretability
o Level of understanding and insight provided by the model

Each of these criteria is discussed in more detail below.

Comparing Classification and Prediction Methods

 Accuracy
o The accuracy of a classifier refers to the ability of a given classifier to correctly
predict the class label of new or previously unseen data (i.e., tuples without class
label information).
o The accuracy of a predictor refers to how well a given predictor can guess the value
of the predicted attribute for new or previously unseen data.
 Speed:
o This refers to the computational costs involved in generating and using the given
classifier or predictor.
 Robustness:
o This is the ability of the classifier or predictor to make correct predictions given
noisy data or data with missing values.
 Scalability:
o This refers to the ability to construct the classifier or predictor efficiently given
large amounts of data.
 Interpretability:
o This refers to the level of understanding and insight that is provided by the
classifier or predictor. Interpretability is subjective and therefore more difficult to
assess.

Classification by Decision Tree Induction


 Decision tree induction is the learning of decision trees from class-labeled training tuples.
 A decision tree is a flowchart-like tree structure, where
o Each internal node denotes a test on an attribute.
o Each branch represents an outcome of the test.
o Each leaf node holds a class label.
o The topmost node in a tree is the root node.

Decision Tree
 The construction of decision tree classifiers does not require any domain knowledge or
parameter setting.
 Decision trees can handle high dimensional data.
 Their representation of acquired knowledge in tree form is intuitive and generally easy to
assimilate by humans.
 The learning and classification steps of decision tree induction are simple and fast.
 In general, decision tree classifiers have good accuracy.
 Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy, and
molecular biology.

Decision Tree Induction Algorithm


 The algorithm is called with three parameters:
o Data partition

o Attribute list

o Attribute selection method

 Basic Algorithm (A Greedy Algorithm)


o The tree is constructed in a top-down, recursive, divide-and-conquer manner.
o The tree starts as a single node, N, representing the training tuples in D.
o At the start, all the training tuples are at the root. If the tuples in D are all of the same
class, then node N becomes a leaf and is labeled with that class.
o Otherwise, the algorithm calls the attribute selection method to determine the
splitting criterion.
o The splitting criterion tells us which attribute to test at node N by determining the
“best” way to separate or partition the tuples in D into individual classes.
o Tuples are partitioned recursively based on the selected attributes.
 Test attributes are selected on the basis of a heuristic or statistical measure (e.g.,
information gain)
 Conditions for stopping partitioning
o All samples for a given node belong to the same class
o There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
o There are no samples left

Partitioning Scenarios:

There are three possible scenarios for the splitting attribute. Let A be the splitting attribute. A has v
distinct values, {a1, a2, …, av}, based on the training data.

 A is discrete-valued:
o In this case, the outcomes of the test at node N correspond directly to the known
values of A.
o A branch is created for each known value, aj, of A and labeled with that value.
o A need not be considered in any future partitioning of the tuples.
 A is continuous-valued:
o In this case, the test at node N has two possible outcomes, corresponding to the
conditions A <= split_point and A > split_point, respectively, where split_point is the
split point returned by the attribute selection method as part of the splitting criterion.
 A is discrete-valued and a binary tree must be produced:
o The test at node N is of the form “A ∈ SA?”, where SA is the splitting subset for A, returned
by the attribute selection method as part of the splitting criterion. It is a subset of the
known values of A.

Example: Training Dataset (the standard AllElectronics customer data; class label: buys_computer)

RID  age     income  student  credit_rating  buys_computer
1    <=30    high    no       fair           no
2    <=30    high    no       excellent      no
3    31..40  high    no       fair           yes
4    >40     medium  no       fair           yes
5    >40     low     yes      fair           yes
6    >40     low     yes      excellent      no
7    31..40  low     yes      excellent      yes
8    <=30    medium  no       fair           no
9    <=30    low     yes      fair           yes
10   >40     medium  yes      fair           yes
11   <=30    medium  yes      excellent      yes
12   31..40  medium  no       excellent      yes
13   31..40  high    yes      fair           yes
14   >40     medium  no       excellent      no

Attribute Selection Measure: Information gain

 Select the attribute with the highest information gain


 S contains si tuples of class Ci for i = 1, …, m; s is the total number of tuples.
 Entropy (expected information) of the set of tuples; it measures how informative a node is:

  I(s1, s2, …, sm) = − Σi (si / s) log2(si / s)

 Entropy after choosing attribute A with values {a1, a2, …, av}, where sij is the number of
tuples of class Ci in the subset with A = aj:

  E(A) = Σj ((s1j + … + smj) / s) · I(s1j, …, smj)

 Information gained by branching on attribute A:

  Gain(A) = I(s1, s2, …, sm) − E(A)

 Information Gain Computation


o Class P: buys_computer = “yes”; p = 9 samples
o Class N: buys_computer = “no”; n = 5 samples
o The expected information:
  I(p, n) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
o Compute the entropy for age:

  age <=30: 2 “yes”, 3 “no” → I(2, 3) = 0.971
  age 31..40: 4 “yes”, 0 “no” → I(4, 0) = 0
  age >40: 3 “yes”, 2 “no” → I(3, 2) = 0.971

  E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

Hence
Gain(age) = I(p, n) − E(age) = 0.940 − 0.694 = 0.246
Similarly,
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
We say that age is more informative than income, student, and credit_rating. So age would be
chosen as the root of the tree.
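
To make the calculation concrete, here is a short Python sketch (not part of the original notes) that reproduces the figures above; the entropy function follows the I(s1, …, sm) definition given earlier:

```python
from math import log2

def info(counts):
    """Expected information I(s1, ..., sm) for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 9 "yes" and 5 "no" tuples overall
i_all = info([9, 5])
print(round(i_all, 3))                      # 0.94

# class counts (yes, no) within each age partition: <=30, 31..40, >40
age_partitions = [(2, 3), (4, 0), (3, 2)]
e_age = sum(sum(p) / 14 * info(p) for p in age_partitions)
print(round(e_age, 3))                      # 0.694

print(round(i_all - e_age, 3))              # ~0.247 (0.246 with the rounded values above)
```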
Overfitting and Tree Pruning
⚫ Overfitting: An induced tree may overfit the training data
◦ Too many branches, some may reflect anomalies due to noise or outliers
◦ Poor accuracy for unseen samples
⚫ Two approaches to avoid overfitting
◦ Prepruning: halt tree construction early; do not split a node if this would result in
the goodness measure falling below a threshold
▪ It is difficult to choose an appropriate threshold
◦ Postpruning: remove branches from a “fully grown” tree, producing a sequence of
progressively pruned trees
▪ Use a set of data different from the training data to decide which is the
“best pruned tree”
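
As a hedged illustration, both ideas map onto parameters of scikit-learn's DecisionTreeClassifier; the specific threshold values below are arbitrary assumptions:

```python
from sklearn.tree import DecisionTreeClassifier

# Prepruning: stop splitting early when the tree gets too deep or when a
# split would improve the impurity measure by less than a threshold.
prepruned = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.01)

# Postpruning: grow the tree fully, then prune it back with
# cost-complexity pruning (a larger ccp_alpha removes more branches).
postpruned = DecisionTreeClassifier(ccp_alpha=0.02)

# Both would then be fit on the training data and compared on held-out data,
# e.g. prepruned.fit(X_train, y_train).score(X_test, y_test)
```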

Enhancements to Basic Decision Tree Induction


⚫ Allow for continuous-valued attributes
◦ Dynamically define new discrete-valued attributes that partition the continuous
attribute value into a discrete set of intervals
⚫ Handle missing attribute values
◦ Assign the most common value of the attribute (see the short sketch after this list)
◦ Assign a probability to each of the possible values
⚫ Attribute construction
◦ Create new attributes based on existing ones that are sparsely represented
◦ This reduces fragmentation, repetition, and replication
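
A small illustrative sketch of the first strategy (assigning the most common value) using pandas; the column name and values are made up:

```python
import pandas as pd

# Made-up column with a missing value; replace it with the attribute's
# most common value (the mode), as suggested above.
df = pd.DataFrame({"credit_rating": ["fair", "excellent", None, "fair"]})
most_common = df["credit_rating"].mode()[0]          # "fair"
df["credit_rating"] = df["credit_rating"].fillna(most_common)
print(df)
```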
Bayesian Classification

 Bayesian classifiers are statistical classifiers.


 They can predict class membership probabilities, such as the probability that a given tuple
belongs to a particular class.
 Bayesian classification is based on Bayes’ theorem.

Bayes’ Theorem: Basics

 Let X be a data tuple whose class label is unknown


 In Bayesian terms, X is considered “evidence”; it is described by measurements
made on a set of n attributes.
 Let H be a hypothesis that X belongs to class C
 For classification problems, we want to determine the posterior probability P(H|X) and the
prior probability P(H).
 Posterior probability:
o P(H|X): the probability that the hypothesis H holds given the observed data
sample X
 Prior Probability:
o P(H): prior probability of hypothesis H
o It is the initial probability before we observe any data
o It reflects the background knowledge
 P(X): probability that sample data is observed
 P(X|H) : probability of observing the sample X, given that the hypothesis H holds
 Bayes’ theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X).
 Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’
theorem:

  P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as

  posterior = (likelihood × prior) / evidence

Naïve Bayesian Classification


The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …,xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict
that X belongs to the class having the highest posterior probability, conditioned on X. That
is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

  P(Ci|X) > P(Cj|X)  for 1 <= j <= m, j != i

By Bayes’ theorem, the maximum posteriori hypothesis is

  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naïve
assumption of class-conditional independence is made. This presumes that the values of
the attributes are conditionally independent of one another, given the class label of the
tuple. Thus

  P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-valued.
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if

  P(X|Ci) P(Ci) > P(X|Cj) P(Cj)  for 1 <= j <= m, j != i
Example:

Naïve Bayesian Classifier: Training Dataset

Using the training dataset shown earlier:


Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Now, predict the class label for the data sample X (i.e., whether the customer will buy a computer):
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
⚫ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
⚫ Compute P(X|Ci) for each class

o P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222


o P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
o P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
o P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
o P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
o P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
o P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
o P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

⚫ X = (age <= 30 , income = medium, student = yes, credit_rating = fair)

o P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044


P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
o P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
⚫ Therefore, X belongs to class (“buys_computer = yes”)
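
The same calculation can be written as a short Python sketch (illustrative only; the probabilities are the ones estimated above):

```python
# Prior and conditional probabilities estimated above from the training data.
p_yes, p_no = 9 / 14, 5 / 14

likelihood_yes = 0.222 * 0.444 * 0.667 * 0.667   # P(X | buys_computer = yes)
likelihood_no  = 0.600 * 0.400 * 0.200 * 0.400   # P(X | buys_computer = no)

score_yes = likelihood_yes * p_yes   # ~0.028
score_no  = likelihood_no  * p_no    # ~0.007

print("yes" if score_yes > score_no else "no")   # -> "yes"
```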
Bayesian Belief Networks

A Bayesian network (also known as a Bayes network, belief network, or decision network) is a
probabilistic graphical model that represents a set of variables and their conditional
dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event
that occurred and predicting the likelihood that any one of several possible known causes was
the contributing factor. For example, a Bayesian network could represent the probabilistic
relationships between diseases and symptoms. Given symptoms, the network can be used to
compute the probabilities of the presence of various diseases.

Graphical Model:

Formally, Bayesian networks are directed acyclic graphs (DAGs) whose nodes represent variables
in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters
or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no
path connects one node to another) represent variables that are conditionally independent of
each other. Each node is associated with a probability function that takes, as input, a particular
set of values for the node's parent variables, and gives (as output) the probability (or probability
distribution, if applicable) of the variable represented by the node.

Nodes: random variables

Links: dependency
 In an example network where X and Y are the parents of Z, and Y is the parent of P,
there is no direct dependency between Z and P.
 A Bayesian network has no loops or cycles.

 We must specify the conditional probability distribution for each node.


 If the variables are discrete, each node is described by a table (Conditional Probability
Table (CPT)) which lists the probability that the child node takes on each of its different
values for each combination of values of its parents.
Example:

 Consider an alarm ‘A’ installed in the house of a person ‘X’. The alarm node has two parent
nodes, burglary ‘B’ and fire ‘F’, either of which can trigger it. The alarm node is in turn the
parent of two nodes, ‘P1’ and ‘P2’, representing two neighbours who may call person ‘X’
when they hear the alarm.
 Upon a burglary or a fire, ‘P1’ and ‘P2’ call person ‘X’. There are a few caveats: ‘P1’
sometimes forgets to call ‘X’ even after hearing the alarm, and ‘P2’ sometimes fails to call
‘X’ because he can only hear the alarm from a certain distance.

Q) Find the probability that ‘P1’ is true (P1 has called ‘X’) and ‘P2’ is true (P2 has called ‘X’) when the
alarm ‘A’ rang, but no burglary ‘B’ and no fire ‘F’ occurred.
=> Find P(P1, P2, A, ~B, ~F), where P1, P2, and A are ‘true’ events and ‘~B’ and ‘~F’ are ‘false’ events.

Burglary ‘B’ –
 P (B=T) = 0.001 (‘B’ is true i.e. burglary has occurred)
 P (B=F) = 0.999 (‘B’ is false i.e. burglary has not occurred)

Fire ‘F’ –
 P (F=T) = 0.002 (‘F’ is true i.e. fire has occurred)
 P (F=F) = 0.998 (‘F’ is false i.e. fire has not occurred)
Alarm ‘A’ –
B F P (A=T) P (A=F)
T T 0.95 0.05
T F 0.94 0.06
F T 0.29 0.71
F F 0.001 0.999

 The alarm ‘A’ node can be ‘true’ or ‘false’ (i.e. may have rung or may not have rung). It has
two parent nodes burglary ‘B’ and fire ‘F’ which can be ‘true’ or ‘false’ (i.e. may have occurred
or may not have occurred) depending upon different conditions.

Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95

 The person ‘P1’ node can be ‘true’ or ‘false’ (i.e. may have called the person ‘X’ or not). It has
a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e. may have rung or may not
have rung, upon burglary ‘B’ or fire ‘F’).

Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99

 The person ‘P2’ node can be ‘true’ or ‘false’ (i.e., P2 may have called the person ‘X’ or not). It has
a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e., it may have rung or may not have
rung, upon burglary ‘B’ or fire ‘F’).

Solution:

With respect to the question, P(P1, P2, A, ~B, ~F), we need the probability of ‘P1’, which we find
with regard to its parent node, the alarm ‘A’. Similarly, to get the probability of ‘P2’, we find it with
regard to its parent node ‘A’.

We find the probability of the alarm node ‘A’ with regard to ‘~B’ and ‘~F’, since burglary ‘B’ and
fire ‘F’ are the parent nodes of ‘A’.
From the conditional probability tables above, we can deduce:
P(P1, P2, A, ~B, ~F)
= P(P1|A) * P(P2|A) * P(A|~B, ~F) * P(~B) * P(~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
≈ 0.00076
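
For illustration, the same joint probability can be computed in Python with the CPTs written as dictionaries (the encoding below is an assumption about how one might represent the tables, not part of the original example):

```python
# Prior probabilities and CPTs copied from the tables above.
p_b = {True: 0.001, False: 0.999}                 # P(B)
p_f = {True: 0.002, False: 0.998}                 # P(F)
p_a = {(True, True): 0.95, (True, False): 0.94,   # P(A=T | B, F)
       (False, True): 0.29, (False, False): 0.001}
p_p1 = {True: 0.95, False: 0.05}                  # P(P1=T | A)
p_p2 = {True: 0.80, False: 0.01}                  # P(P2=T | A)

# P(P1, P2, A, ~B, ~F) = P(P1|A) * P(P2|A) * P(A|~B,~F) * P(~B) * P(~F)
joint = p_p1[True] * p_p2[True] * p_a[(False, False)] * p_b[False] * p_f[False]
print(round(joint, 6))   # 0.000758, i.e. about 0.00076
```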
K-Nearest Neighbors Algorithm
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine
learning algorithm that can be used to solve both classification and regression problems.
A supervised machine learning algorithm (as opposed to an unsupervised machine learning
algorithm) is one that relies on labeled input data to learn a function that produces an
appropriate output when given new unlabeled data.
An unsupervised machine learning algorithm makes use of input data without any labels —in
other words, no teacher (label) telling the child (computer) when it is right or when it has made a
mistake so that it can self-correct.

Unlike supervised learning that tries to learn a function that will allow us to make predictions
given some new unlabeled data, unsupervised learning tries to learn the basic structure of the
data to give us more insight into the data.

K-Nearest Neighbors

The KNN algorithm assumes that similar things exist in close proximity; in other words, similar
things are near to each other. KNN captures this idea of similarity (sometimes called distance,
proximity, or closeness) by calculating the distance between points. The straight-line distance
(also called the Euclidean distance) is a popular and familiar choice.

The KNN Algorithm

1. Load the data


2. Initialize K to your chosen number of neighbors
3. For each example in the data
1. Calculate the distance between the query example and the current example from
the data.
2. Add the distance and the index of the example to an ordered collection
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending
order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If regression, return the mean of the K labels
8. If classification, return the mode of the K labels
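
A compact illustrative sketch of these steps in plain Python; the Euclidean distance is used, the mode of the labels is returned for classification, and the sample points are made up:

```python
from collections import Counter
from math import dist          # straight-line (Euclidean) distance, Python 3.8+

def knn_classify(training_data, query, k):
    """training_data is a list of (feature_vector, label) pairs."""
    # Steps 3-4: compute the distance to every example and sort by distance.
    neighbors = sorted(training_data, key=lambda item: dist(item[0], query))
    # Steps 5-6: keep the labels of the K closest examples.
    top_k_labels = [label for _, label in neighbors[:k]]
    # Step 8: classification -> return the most common label (the mode).
    return Counter(top_k_labels).most_common(1)[0][0]

# Made-up 2-D points labeled "A" or "B".
data = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(data, (2, 1), k=3))   # -> "A"
```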

Choosing the right value for K

To select the K that’s right for your data, we run the KNN algorithm several times with different
values of K and choose the K that reduces the number of errors we encounter while maintaining
the algorithm’s ability to accurately make predictions when it’s given data it hasn’t seen before.
Here are some things to keep in mind:

1. As we decrease the value of K to 1, our predictions become less stable.


2. Inversely, as we increase the value of K, our predictions become more stable due to majority
voting / averaging, and thus, more likely to make more accurate predictions (up to a certain
point)
3. In cases where we are taking a majority vote among labels, we usually make K an odd
number to have a tiebreaker.
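
One common way to do this is cross-validation over a range of candidate K values; the sketch below uses scikit-learn, and the toy dataset and range of K values are arbitrary assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd K values (odd K avoids ties in the majority vote) and keep the best.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)
print("best K:", best_k, "cross-validated accuracy:", round(scores[best_k], 3))
```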

Advantages

1. The algorithm is simple and easy to implement.


2. There’s no need to build a model, tune several parameters, or make additional assumptions.
3. The algorithm is versatile. It can be used for classification, regression, and search.

Disadvantages

1. The algorithm gets significantly slower as the number of examples and/or predictors
(independent variables) increases.
