UNIT – III
Classification and Prediction
Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends. Classification
constructs a model from a training set, based on the values (class labels) of a classifying
attribute, and then uses that model to classify new data.
The aim is to find a model for the class attribute as a function of the values of the other attributes.
• Goal: Previously unseen records should be assigned to a class as accurately as possible.
Usually, the given data set is divided into training and test set, with training set used to build the
model and test set used to validate it. The accuracy of the model is determined on the test set.
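The following sketch illustrates this train/test workflow. It assumes the scikit-learn library is available; the dataset and classifier are purely illustrative.

# Minimal sketch: split labeled data into training and test sets,
# build a classifier on the training set, and measure accuracy on
# the held-out test set. Dataset and model choice are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 30% of the records as the independent test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Accuracy = fraction of test records whose known label matches
# the label predicted by the model.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))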
For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a prediction model to predict the expenditures of potential customers on
computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered
value, as opposed to a categorical label.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Typical Applications
Document categorization
Credit approval
Target marketing
Medical diagnosis
Treatment effectiveness analysis
The known label of each test sample is compared with the classification result produced by
the model
The accuracy rate is the percentage of test set samples that are correctly classified
by the model
The test set is independent of the training set but is drawn from the same probability distribution
Issues Regarding Classification and Prediction
Data Preparation
Data Cleaning - This refers to the preprocessing of data in order to remove or reduce
noise (by applying smoothing techniques) and the treatment of missing values (e.g., by
replacing a missing value with the most commonly occurring value for that attribute, or
with the most probable value based on statistics).
o Preprocess data in order to reduce noise and handle missing values
Relevance Analysis (feature selection) - Many of the attributes in the data may be
redundant. Correlation analysis can be used to identify whether any two given attributes
are statistically related. For example, a strong correlation between attributes A1 and A2
would suggest that one of the two could be removed from further analysis. A database
may also contain irrelevant attributes. Attribute subset selection can be used in these
cases to find a reduced set of attributes such that the resulting probability distribution of
the data classes is as close as possible to the original distribution obtained using all
attributes. Hence, relevance analysis, in the form of correlation analysis and attribute
subset selection, can be used to detect attributes that do not contribute to the
classification or prediction task. Such analysis can help improve classification efficiency
and scalability.
o Remove the irrelevant or redundant attributes
Data Transformation - The data may be transformed by normalization. Normalization
involves scaling all values for a given attribute so that they fall within a small specified
range, such as -1 to +1 or 0 to 1. The data can also be transformed by generalizing it to
higher-level concepts. Concept hierarchies may be used for this purpose. This is
particularly useful for continuous valued attributes. For example, numeric values for the
attribute income can be generalized to discrete ranges, such as low, medium, and high.
Similarly, categorical attributes, like street, can be generalized to higher-level concepts,
like city.
o Generalize and/or normalize data
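As a minimal sketch of these two transformations, the following assumes illustrative income values; the low/medium/high cutoffs are assumptions for demonstration, not standard thresholds.

# Sketch of the two transformations described above: min-max
# normalization into [0, 1], and generalizing a numeric attribute
# to higher-level concepts. Cutoffs are illustrative assumptions.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale all values of an attribute into [new_min, new_max].
    Assumes the attribute is not constant (max > min)."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span
            for v in values]

def generalize_income(value):
    """Map a numeric income to a discrete range (assumed cutoffs)."""
    if value < 30000:
        return "low"
    elif value < 70000:
        return "medium"
    return "high"

incomes = [25000, 48000, 91000, 62000]
print(min_max_normalize(incomes))               # scaled into [0, 1]
print([generalize_income(v) for v in incomes])  # ['low', 'medium', 'high', 'medium']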
Performance Analysis
Predictive Accuracy:
o The accuracy of a classifier refers to the ability of a given classifier to correctly
predict the class label of new or previously unseen data (i.e., tuples without class
label information). The accuracy of a predictor refers to how well a given predictor
can guess the value of the predicted attribute for new or previously unseen data.
Speed:
o This refers to the computational costs involved in generating and using the given
classifier or predictor: the time to construct the model and the time to use it.
Robustness:
o This is the ability of the classifier or predictor to make correct predictions given
noisy data or data with missing values.
Scalability:
o This refers to the ability to construct the classifier or predictor efficiently given
large amounts of data.
Interpretability:
o This refers to the level of understanding and insight that is provided by the
classifier or predictor. Interpretability is subjective and therefore more difficult to
assess.
Decision Tree
The construction of decision tree classifiers does not require any domain knowledge or
parameter setting.
Decision trees can handle high dimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally easy to
assimilate by humans.
The learning and classification steps of decision tree induction are simple and fast.
In general, decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy, and
molecular biology.
Partitioning Scenarios:
There are three possible scenarios for partitioning tuples based on the splitting attribute. Let A
be the splitting attribute. A has v distinct values, {a1, a2, …, av}, based on the training data.
A is discrete-valued:
o In this case, the outcomes of the test at node N correspond directly to the known
values of A.
o A branch is created for each known value, aj, of A and labeled with that value.
o A need not be considered in any future partitioning of the tuples.
A is continuous-valued:
o In this case, the test at node N has two possible outcomes, corresponding to the
conditions A <= split_point and A > split_point, respectively, where split_point is the
split point returned by the attribute selection method as part of the splitting criterion.
A is discrete-valued and a binary tree must be produced:
o The test at node N is of the form “A ∈ SA?”, where SA is the splitting subset for A, returned
by the attribute selection method as part of the splitting criterion. It is a subset of the
known values of A.
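The three scenarios can be summarized as branch tests in code. In this sketch the attribute names, split point, and splitting subset are all illustrative assumptions.

# Sketch of the three branch tests at a node N; attribute names,
# values, split point, and splitting subset are assumptions.

tuple_ = {"income": "medium", "age": 38}

# 1. A is discrete-valued: one branch per known value of A.
branch = tuple_["income"]            # follows the branch labeled "medium"

# 2. A is continuous-valued: two branches, A <= split_point and A > split_point.
split_point = 35
branch = "left" if tuple_["age"] <= split_point else "right"

# 3. A is discrete-valued and a binary tree is required: test "A in S_A?",
#    where S_A is the splitting subset returned by the selection method.
S_A = {"low", "medium"}
branch = "left" if tuple_["income"] in S_A else "right"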
Hence
Gain(age) = I(p,n) − E(age) = 0.940 − 0.694 = 0.246
Similarly,
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
We say that age is more informative than income, student, and credit_rating. So age would be
chosen as the root of the tree.
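These numbers can be reproduced directly. The following is a minimal sketch assuming the class and partition counts of the standard AllElectronics example (9 "yes" and 5 "no" tuples overall; age partitions the tuples into (yes, no) counts of (2, 3), (4, 0), and (3, 2)).

import math

def info(counts):
    """Expected information (entropy) for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Assumed counts from the standard AllElectronics example.
I_pn = info([9, 5])
age_partitions = [(2, 3), (4, 0), (3, 2)]
total = 14
E_age = sum((p + n) / total * info([p, n]) for p, n in age_partitions)

print(f"{I_pn:.3f} {E_age:.3f} {I_pn - E_age:.3f}")
# 0.940 0.694 0.247 (the 0.246 above comes from subtracting the
# already-rounded 0.694 from 0.940)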
Overfitting and Tree Pruning
⚫ Overfitting: An induced tree may overfit the training data
◦ Too many branches, some may reflect anomalies due to noise or outliers
◦ Poor accuracy for unseen samples
⚫ Two approaches to avoid overfitting
◦ Prepruning: Halt tree construction early—do not split a node if this would result in
the goodness measure falling below a threshold
▪ It is difficult to choose an appropriate threshold
◦ Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
▪ Use a set of data different from the training data to decide which is the
“best pruned tree”
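As one concrete instance of postpruning (an assumed illustration, not the only approach), scikit-learn can grow a full tree and then generate a sequence of progressively pruned trees via cost-complexity pruning; data held out from training picks the "best pruned tree".

# Sketch of postpruning: grow a full tree, generate progressively
# pruned trees, and keep the one that scores best on held-out data.
# Assumes scikit-learn is available; the dataset is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Each ccp_alpha corresponds to one tree in the pruning sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val))

print("Validation accuracy of best pruned tree:", best.score(X_val, y_val))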
Bayes’ Theorem:
Basics
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …,xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict
that X belongs to the class having the highest posterior probability, conditioned on X. That
is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j ≠ i.
By Bayes’ theorem, P(Ci|X) = P(X|Ci)P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive
assumption of class conditional independence is made. This presumes that the values of
the attributes are conditionally independent of one another, given the class label of the
tuple. Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-
valued.
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 <= j <= m, j ≠ i.
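A minimal sketch of these steps for categorical attributes follows; the tiny training set is an illustrative assumption, not the full textbook table.

# Minimal naive Bayesian classifier for categorical attributes.
# Training data is an illustrative assumption.
from collections import Counter, defaultdict

train = [
    ({"student": "yes", "credit": "fair"},      "buys"),
    ({"student": "yes", "credit": "excellent"}, "buys"),
    ({"student": "no",  "credit": "fair"},      "buys"),
    ({"student": "no",  "credit": "fair"},      "no"),
    ({"student": "no",  "credit": "excellent"}, "no"),
]

class_counts = Counter(label for _, label in train)
# value_counts[(class, attribute, value)] = number of training tuples
value_counts = defaultdict(int)
for x, label in train:
    for attr, val in x.items():
        value_counts[(label, attr, val)] += 1

def classify(x):
    """Pick the class Ci maximizing P(X|Ci)P(Ci), where
    P(X|Ci) is the product of P(xk|Ci) over the attributes."""
    def score(ci):
        prior = class_counts[ci] / len(train)
        likelihood = 1.0
        for attr, val in x.items():
            likelihood *= value_counts[(ci, attr, val)] / class_counts[ci]
        return prior * likelihood
    return max(class_counts, key=score)

print(classify({"student": "yes", "credit": "fair"}))  # -> "buys"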
Bayesian Networks
A Bayesian network (also known as a Bayes network, belief network, or decision network) is a
probabilistic graphical model that represents a set of variables and their conditional
dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event
that occurred and predicting the likelihood that any one of several possible known causes was
the contributing factor. For example, a Bayesian network could represent the probabilistic
relationships between diseases and symptoms. Given symptoms, the network can be used to
compute the probabilities of the presence of various diseases.
Graphical Model:
Formally, Bayesian networks are directed acyclic graphs (DAGs) whose nodes represent variables
in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters
or hypotheses. Edges represent conditional dependencies; nodes that are not connected (no
path connects one node to another) represent variables that are conditionally independent of
each other. Each node is associated with a probability function that takes, as input, a particular
set of values for the node's parent variables, and gives (as output) the probability (or probability
distribution, if applicable) of the variable represented by the node.
Consider a Bayesian network for a burglar alarm: an alarm ‘A’ is a node, say installed in the
house of a person ‘X’, that can be triggered by two events, burglary ‘B’ and fire ‘F’; these are
the parent nodes of the alarm node. The alarm node is, in turn, the parent of two nodes,
‘P1’ and ‘P2’, representing two people who may call ‘X’ when they hear the alarm.
When a burglary or fire sets off the alarm, ‘P1’ and ‘P2’ are each supposed to call person
‘X’. There are complications, however: ‘P1’ sometimes forgets to call ‘X’ even after hearing
the alarm, because he tends to be forgetful, and ‘P2’ sometimes fails to call ‘X’, because he
is able to hear the alarm only from a certain distance.
Q) Find the probability that ‘P1’ is true (P1 has called ‘X’), ‘P2’ is true (P2 has called ‘X’) when the
alarm ‘A’ rang, but no burglary ‘B’ and fire ‘F’ has occurred.
=> P(P1, P2, A, ~B, ~F) [where P1, P2 & A are ‘true’ events and ‘~B’ & ‘~F’ are ‘false’ events]
Burglary ‘B’ –
P (B=T) = 0.001 (‘B’ is true i.e. burglary has occurred)
P (B=F) = 0.999 (‘B’ is false i.e. burglary has not occurred)
Fire ‘F’ –
P (F=T) = 0.002 (‘F’ is true i.e. fire has occurred)
P (F=F) = 0.998 (‘F’ is false i.e. fire has not occurred)
Alarm ‘A’ –
B F P (A=T) P (A=F)
T T 0.95 0.05
T F 0.94 0.06
F T 0.29 0.71
F F 0.001 0.999
The alarm ‘A’ node can be ‘true’ or ‘false’ (i.e. may have rung or may not have rung). It has
two parent nodes burglary ‘B’ and fire ‘F’ which can be ‘true’ or ‘false’ (i.e. may have occurred
or may not have occurred) depending upon different conditions.
Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95
The person ‘P1’ node can be ‘true’ or ‘false’ (i.e. may have called the person ‘X’ or not). It has
a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e. may have rung or may not
have rung, upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99
The person ‘P2’ node can be ‘true’ or ‘false’ (i.e. may have called the person ‘X’ or not). It has
a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e. may have rung or may not have
rung, upon burglary ‘B’ or fire ‘F’).
We find the probability of alarm ‘A’ node with regard to ‘~B’ & ‘~F’ since burglary ‘B’ and fire ‘F’
are parent nodes of alarm ‘A’.
From the conditional probability tables above, we can deduce –
P (P1, P2, A, ~B, ~F)
= P (P1|A) * P (P2|A) * P (A|~B,~F) * P (~B) * P (~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
≈ 0.00076
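The same calculation can be checked in code; the CPT values are those tabulated above.

# Joint probability P(P1, P2, A, ~B, ~F) from the CPTs above:
# each node contributes P(node | parents).

P_B = {True: 0.001, False: 0.999}       # P(B)
P_F = {True: 0.002, False: 0.998}       # P(F)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=True | B, F)
P_P1 = {True: 0.95, False: 0.05}        # P(P1=True | A)
P_P2 = {True: 0.80, False: 0.01}        # P(P2=True | A)

# Query: P1=True, P2=True, A=True, B=False, F=False.
b, f, a = False, False, True
joint = P_P1[a] * P_P2[a] * P_A[(b, f)] * P_B[b] * P_F[f]
print(round(joint, 5))  # ~0.00076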
K-Nearest Neighbors Algorithm
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine
learning algorithm that can be used to solve both classification and regression problems.
A supervised machine learning algorithm (as opposed to an unsupervised machine learning
algorithm) is one that relies on labeled input data to learn a function that produces an
appropriate output when given new unlabeled data.
An unsupervised machine learning algorithm makes use of input data without any labels —in
other words, no teacher (label) telling the child (computer) when it is right or when it has made a
mistake so that it can self-correct.
Unlike supervised learning that tries to learn a function that will allow us to make predictions
given some new unlabeled data, unsupervised learning tries to learn the basic structure of the
data to give us more insight into the data.
K-Nearest Neighbors
The KNN algorithm assumes that similar things exist in close proximity; in other words, similar
things are near to each other. KNN captures this idea of similarity (sometimes called distance,
proximity, or closeness) mathematically, by calculating the distance between points. The
straight-line distance (also called the Euclidean distance) is a popular and familiar choice.
To select the K that’s right for your data, we run the KNN algorithm several times with different
values of K and choose the K that reduces the number of errors we encounter while maintaining
the algorithm’s ability to accurately make predictions when it’s given data it hasn’t seen before.
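A minimal sketch of KNN classification with Euclidean distance follows; the labeled points, the query, and the choice of k are illustrative assumptions.

# Minimal KNN classifier: predict the majority label among the k
# training points closest (in Euclidean distance) to the query.
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; query: a point."""
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Illustrative labeled points.
train = [((1.0, 1.0), "risky"), ((1.2, 0.8), "risky"),
         ((5.0, 5.0), "safe"),  ((5.2, 4.8), "safe"), ((4.9, 5.1), "safe")]
print(knn_predict(train, (5.0, 4.9), k=3))  # -> "safe"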
Here are some things to keep in mind:
Advantages
o The algorithm is simple and easy to implement.
o There is no need to build a model, tune several parameters, or make additional assumptions.
o The algorithm is versatile: it can be used for both classification and regression.
Disadvantages
o The algorithm gets significantly slower as the number of examples and/or attributes
increases, because every prediction requires computing distances to all stored training records.
o Performance is sensitive to the choice of K and to the scale of the attributes, so features
usually need to be normalized.