Classification Problem
• Problem statement:
– Given features X1, X2,…, Xn
– Predict a label Y
• Definition (Classification): the task of learning a target
function f that maps each attribute set X to one of the
predefined class labels Y.
Example
[Table: the 'Play Tennis' dataset, with attributes Day, Outlook, Temperature, Humidity, Wind and class label PlayTennis]
• Medical Diagnosis
– Given a list of symptoms, predict whether a patient has
disease X or not
• Weather
– Based on temperature, humidity, etc., predict whether it will rain
tomorrow
Learning
• A classification technique is a systematic approach to building classification models
from an input dataset
• Each technique employs a learning algorithm to identify a model that best fits
the relationship between the attribute set and class label of the input data.
• Formally, a computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.
• Thus a learning system is characterized by:
❑ task T
❑ experience E, and
❑ performance measure P
Performance of Classification
• Evaluation is based on the counts of test records correctly and
incorrectly predicted by the model
• These counts are tabulated in a table known as a confusion matrix
                        Predicted Class
                     Class = 1   Class = 0
Actual   Class = 1      f11         f10
Class    Class = 0      f01         f00

• The same counts in information-retrieval terms (true/false positives and negatives):

                  Relevant   Nonrelevant
Retrieved            tp          fp
Not Retrieved        fn          tn
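As an illustration (my sketch, not part of the slides), the usual evaluation metrics fall straight out of these four counts; the names tp, fp, fn, tn follow the second table above, and the numbers are made up:

```python
# Hypothetical counts, following the retrieval-style confusion matrix above.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)  # fraction of correct predictions
precision = tp / (tp + fp)                   # correct among predicted positives
recall    = tp / (tp + fn)                   # found among actual positives

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```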
Thomas Bayes (1702–1761)
A word about the Bayesian framework
• Allows us to combine observed data and prior knowledge
– This means that any kind of objects (e.g. time series, trees, etc.)
can be classified, based on a probabilistic model specification
– Find out the probability of the previously unseen instance
belonging to each class, then simply pick the most probable class.
Another Application
• Digit Recognition
[Figure: an image of a handwritten digit is fed into a classifier, which outputs '5']
• X1, …, Xn ∈ {0, 1} (black vs. white pixels)
• Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)
The Bayes Classifier
• A good strategy is to predict:

  P(Y | X1, …, Xn) = P(X1, …, Xn | Y) · P(Y) / P(X1, …, Xn)

  (likelihood × prior, divided by a normalization constant)
• Why did this help? Well, we think that we might be able to
specify how features are “generated” by the class label
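To make the "features are generated by the class label" idea concrete, here is a small illustrative sketch; the prior and per-pixel probabilities are invented for the illustration:

```python
import random

# Generative story: first sample the label Y from the prior, then sample
# each binary pixel Xi from the class-conditional P(Xi = 1 | Y).
# All parameter values below are made up for illustration.
prior = {"5": 0.5, "6": 0.5}
p_pixel_on = {"5": [0.9, 0.1, 0.8], "6": [0.2, 0.7, 0.6]}  # P(Xi = 1 | Y)

def generate_example():
    y = random.choices(list(prior), weights=list(prior.values()))[0]
    x = [int(random.random() < p) for p in p_pixel_on[y]]
    return x, y

print(generate_example())  # e.g. ([1, 0, 1], '5')
```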
Bayes Classifiers
• Bayesian classifiers use Bayes’ theorem, which says
p(cj | d) = p(d | cj) p(cj) / p(d)
• To classify, we simply compute this posterior for each class and predict the
class with the greater value
Model Parameters
• For the Bayes classifier, we need to “learn” two functions, the
likelihood and the prior
Model Parameters
• The problem with explicitly modeling P(X1,…,Xn|Y) is that
there are usually way too many parameters:
– We’ll run out of space
– We’ll run out of time
– And we’ll need tons of training data (which is usually not
available)
Why is this useful?
• # of parameters for modeling P(X1,…,Xn | Y), with n binary features and a binary label:
▪ Full joint model: 2(2^n − 1)
▪ Under the naïve Bayes assumption of conditional independence,
P(X1,…,Xn | Y) = P(X1 | Y) · … · P(Xn | Y): only 2n
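A quick numeric check of these two counts (my illustration, picking n = 30 binary features):

```python
# Parameters needed to specify P(X1,...,Xn | Y) for n binary features
# and a binary label Y; n = 30 is chosen only for illustration.
n = 30
full_joint  = 2 * (2**n - 1)    # a full distribution over 2^n outcomes per class
naive_bayes = 2 * n             # one Bernoulli parameter per feature per class
print(full_joint, naive_bayes)  # 2147483646 vs. 60
```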
Example
This is Officer Drew (who arrested me in 1997). Is Officer Drew a male or a female?

Luckily, we have a small database with names and sex; we can use it to apply Bayes rule:

  p(cj | d) = p(d | cj) p(cj) / p(d)

  Name      Sex
  Drew      Male
  Claudia   Female
  Drew      Female
  Drew      Female
  Alberto   Male
  Karin     Female
  Nina      Female
  Sergio    Male
Assume that we have two classes
c1 = male, and c2 = female.
Officer Drew is blue-eyed, over 170 cm tall, and has long hair:

  p(officer drew | Female) = 2/5 × 3/5 × …
  p(officer drew | Male) = 2/3 × 2/3 × …
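As a sketch of the single-feature version of this example (my code, not from the slides), the Bayes-rule comparison for the name 'Drew' can be computed directly from the counts in the database above:

```python
from collections import Counter

# The small name/sex database from the slide.
data = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"),
        ("Drew", "Female"), ("Alberto", "Male"), ("Karin", "Female"),
        ("Nina", "Female"), ("Sergio", "Male")]

sex_counts = Counter(sex for _, sex in data)  # Female: 5, Male: 3

def posterior_score(name, sex):
    # p(name | sex) * p(sex); the common denominator p(name) cancels
    # when comparing the two classes.
    likelihood = sum(1 for n, s in data if n == name and s == sex) / sex_counts[sex]
    prior = sex_counts[sex] / len(data)
    return likelihood * prior

for sex in ("Male", "Female"):
    print(sex, posterior_score("Drew", sex))
# Male 0.125, Female 0.25 -> Officer Drew is more probably Female
```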
Example: 'Play Tennis' data
[Table: the standard 14-day 'Play Tennis' dataset, with attributes Day, Outlook, Temperature, Humidity, Wind and class label PlayTennis]
• Answer:

  P(PlayTennis = yes) = 9/14 = 0.64
  P(PlayTennis = no) = 5/14 = 0.36
  P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
  P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
  etc.

• For the new instance x = (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong):

  P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
  P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206

• Answer: PlayTennis(x) = no
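A minimal sketch (my code) of this computation; the 14 rows below are the standard version of the dataset (Mitchell, 1997) and reproduce exactly the counts used above:

```python
from collections import Counter

# Standard 14-row Play Tennis dataset; each row is
# (Outlook, Temperature, Humidity, Wind, PlayTennis).
rows = [
    ("sunny","hot","high","weak","no"),       ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"),   ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"),    ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"),   ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rain","mild","high","strong","no"),
]
labels = Counter(r[-1] for r in rows)  # yes: 9, no: 5

def score(x, label):
    # P(label) * product of P(attribute value | label) over all attributes.
    s = labels[label] / len(rows)
    for i, value in enumerate(x):
        s *= sum(1 for r in rows if r[i] == value and r[-1] == label) / labels[label]
    return s

x = ("sunny", "cool", "high", "strong")
print({c: round(score(x, c), 4) for c in ("yes", "no")})
# {'yes': 0.0053, 'no': 0.0206} -> predict PlayTennis(x) = no
```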
Naïve Bayes Classifier With Continuous Valued
Attribute
Example: Continuous-valued Features
– Temperature is naturally a continuous-valued attribute.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate mean and variance for each class
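A minimal sketch (my code) of the usual Gaussian treatment: estimate a per-class mean and variance from the temperatures above, then use the normal density as P(temperature | class):

```python
import math

# Per-class temperature observations from the example above.
temps = {
    "yes": [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8],
    "no":  [27.3, 30.1, 17.4, 29.5, 15.1],
}

def mean_var(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
    return m, v

def gaussian_likelihood(x, m, v):
    # Normal density, used as the class-conditional P(temperature = x | class).
    return math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

for label, xs in temps.items():
    m, v = mean_var(xs)
    print(label, round(m, 2), round(v, 2), round(gaussian_likelihood(22.0, m, v), 4))
```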
  X1   X2   P(Y=0 | X1,X2)   P(Y=1 | X1,X2)
  0    0          1                0
  0    1          0                1
  1    0          0                1
  1    1          1                0
Relevant Issues
• Violation of the Independence Assumption
– For many real-world tasks, P(X1, …, Xn | C) ≠ P(X1 | C) · … · P(Xn | C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
– If no training example has the feature value Xj = ajk for class ci, then P̂(Xj = ajk | C = ci) = 0
– In this circumstance, P̂(x1 | ci) × … × P̂(ajk | ci) × … × P̂(xn | ci) = 0 during testing
– As a remedy, conditional probabilities are re-estimated with the m-estimate:

  P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)

  nc : number of training examples for which Xj = ajk and C = ci
  n  : number of training examples for which C = ci
  p  : prior estimate (usually, p = 1/t for t possible values of Xj)
  m  : weight given to the prior (number of "virtual" examples, m ≥ 1)
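A small sketch (my code) of this smoothed estimate; note that with m = t and p = 1/t it reduces to the familiar Laplace (add-one) correction:

```python
def m_estimate(n_c, n, t, m=1.0):
    """Smoothed estimate of P(Xj = ajk | C = ci).

    n_c : training examples with Xj = ajk and C = ci
    n   : training examples with C = ci
    t   : number of possible values of Xj (so the prior is p = 1/t)
    m   : weight given to the prior ("virtual" examples)
    """
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

# A zero count no longer produces a zero probability:
print(m_estimate(n_c=0, n=9, t=3, m=3.0))  # (0 + 3*(1/3)) / (9 + 3) = 1/12
```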
Summary
• Naïve Bayes: the conditional independence assumption
– Training is very easy and fast: it only requires considering each
attribute in each class separately
– Testing is straightforward: just look up tables or calculate
conditional probabilities from the estimated distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers, even
when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate for a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…