
Learning

Classification Problem
• Problem statement:
– Given features X1, X2,…, Xn
– Predict a label Y
• Definition (Classification): the task of learning a target
function f that maps each attribute set X to one of the
predefined class labels Y.
Example
Day    Outlook   Temperature  Humidity  Wind    PlayTennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No
Classification problem

• Training data: examples of the form (d, h(d))
– where d is a data object to classify (the input)
– and h(d) ∈ {1, …, K} is the correct class label for d
• Goal: given dnew, predict h(dnew)
Things We’d Like to Do
• Spam Classification
– Given an email, predict whether it is spam or not

• Medical Diagnosis
– Given a list of symptoms, predict whether a patient has
disease X or not

• Weather
– Based on temperature, humidity, etc., predict whether it will rain
tomorrow
Learning
• A classification technique is a systematic approach to building classification models
from an input dataset
• Each technique employs a learning algorithm to identify the model that best fits
the relationship between the attribute set and the class label of the input data.
• Formally, a computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.
• Thus a learning system is characterized by:
❑ task T
❑ experience E, and
❑ performance measure P
Performance of Classification
• Evaluation is based on the counts of test records correctly and
incorrectly predicted by the model
• These counts are tabulated in a table known as a confusion matrix

                         Predicted Class
                         Class = 1   Class = 0
Actual Class  Class = 1  f11         f10
              Class = 0  f01         f00

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
         = (f11 + f00) / (f11 + f10 + f01 + f00)

• Accuracy will yield misleading results if the data set is unbalanced
• For example, if there were 95 cats and only 5 dogs in the data, a classifier might label
every observation as a cat.
• Such a classifier would have a 100% recognition rate for the cat class but a 0% recognition
rate for the dog class, yet its overall accuracy would still be 95%.
Confusion Matrix
                         Predicted Class
                         Class = 1   Class = 0
Actual Class  Class = 1  TP          FN
              Class = 0  FP          TN

• True Positive: you predicted positive and it's true.
• True Negative: you predicted negative and it's true.
• False Positive (Type 1 Error): you predicted positive and it's false.
• False Negative (Type 2 Error): you predicted negative and it's false.
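To make these four counts concrete, here is a minimal sketch in plain Python (the example label lists are made up purely for illustration):

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally TP, FN, FP, TN for a binary task from paired label lists."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1        # predicted positive, actually positive
        elif a == positive:
            fn += 1        # predicted negative, actually positive
        elif p == positive:
            fp += 1        # predicted positive, actually negative
        else:
            tn += 1        # predicted negative, actually negative
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
accuracy = (tp + tn) / (tp + fn + fp + tn)   # (f11 + f00) / total
print(tp, fn, fp, tn, accuracy)              # 2 1 1 1 0.6
```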
Performance of Classification contd.

• In addition to classification accuracy, there are two other metrics for
performance evaluation
• Precision (also called positive predictive value) is the fraction of relevant
instances among the retrieved instances, i.e. out of all the instances we
predicted as positive, how many are actually positive
• Recall (also known as sensitivity) is the fraction of relevant instances that
were retrieved out of the total number of relevant instances, i.e. out of all the
actual positive instances, how many we predicted correctly. It should be as high
as possible.

• Example: Suppose a computer program for recognizing dogs in photographs
identifies eight dogs in a picture containing 12 dogs and some cats. Of the eight
dogs identified, five actually are dogs (true positives), while the rest are cats
(false positives).

• Answer: The program's precision is 5/8 while its recall is 5/12.


Performance of Classification contd.

• In simple terms, high precision means that an algorithm returned substantially
more relevant results than irrelevant ones, while high recall means that an
algorithm returned most of the relevant results

               Relevant   Nonrelevant
Retrieved      tp         fp
Not Retrieved  fn         tn

• Precision P = tp/(tp + fp)


• Recall R = tp/(tp + fn)
• F-Measure F = 2*R*P/(R+P)
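Plugging the dog-recognition example above into these formulas (tp = 5, fp = 3, fn = 7), a minimal sketch:

```python
tp, fp, fn = 5, 3, 7        # 5 true dogs of 8 flagged; 12 dogs total, so 7 missed

precision = tp / (tp + fp)  # 5/8  = 0.625
recall    = tp / (tp + fn)  # 5/12 ~ 0.417
f_measure = 2 * recall * precision / (recall + precision)

print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")  # P=0.625 R=0.417 F=0.500
```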
Classification Techniques

• Decision Tree based Methods
• Rule-based Methods
• Artificial Neural Networks
• Naïve Bayes Classifier
• Support Vector Machines
Naïve Bayes Classifier

Thomas Bayes
1702 - 1761
A word about the Bayesian framework
• Allows us to combine observed data and prior knowledge

• It is a generative (model based) approach, which offers a useful
conceptual framework

– This means that any kind of objects (e.g. time series, trees, etc.)
can be classified, based on a probabilistic model specification
– Find out the probability of the previously unseen instance
belonging to each class, then simply pick the most probable class.
Another Application
• Digit Recognition

[Figure: a pixel image fed into a classifier, which outputs the digit "5"]

• X1, …, Xn ∈ {0, 1} (black vs. white pixels)
• Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)
The Bayes Classifier
• A good strategy is to predict the most probable class given the observed
features, i.e. the class y that maximizes P(Y = y | X1, …, Xn)

– (for example: what is the probability that the image
represents a 5 given its pixels?)

• So … How do we compute that?


The Bayes Classifier
• Use Bayes Rule!

P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

– the numerator combines the likelihood P(X1, …, Xn | Y) and the prior P(Y);
the denominator is a normalization constant
• Why did this help? Well, we think that we might be able to
specify how features are “generated” by the class label
Bayes Classifiers
• Bayesian classifiers use Bayes’ theorem, which says
p(cj | d) = p(d | cj) p(cj) / p(d)

• p(cj | d) = probability of instance d being in class cj;
this is what we are trying to compute
• p(d | cj) = probability of generating instance d given class cj;
we can imagine that being in class cj causes you to have feature d
with some probability
• p(cj) = probability of occurrence of class cj;
this is just how frequent the class cj is in our database
• p(d) = probability of instance d occurring;
this can actually be ignored, since it is the same for all classes
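A minimal sketch of this decomposition, with hypothetical priors and likelihoods purely to show the arithmetic; note that dividing by p(d) rescales the scores but never changes which class wins:

```python
# Hypothetical numbers, purely for illustration
priors      = {"c1": 0.6, "c2": 0.4}     # p(cj)
likelihoods = {"c1": 0.10, "c2": 0.30}   # p(d | cj) for one observed instance d

scores = {c: likelihoods[c] * priors[c] for c in priors}  # numerators of Bayes rule
p_d = sum(scores.values())                                # p(d), the normalization constant
posteriors = {c: s / p_d for c, s in scores.items()}      # p(cj | d)

print(posteriors)                    # {'c1': 0.333..., 'c2': 0.666...}
print(max(scores, key=scores.get))   # 'c2' -- same winner with or without dividing by p(d)
```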
The Bayes Classifier
• Let’s expand this for our digit recognition task: compute
P(Y = 5 | X1, …, Xn) and P(Y = 6 | X1, …, Xn)

• To classify, we’ll simply compute these two probabilities and predict based
on which one is greater
Model Parameters
• For the Bayes classifier, we need to “learn” two functions, the
likelihood and the prior

• How many parameters are required to specify the prior for
our digit recognition example?
– Just one: P(Y = 5), since P(Y = 6) = 1 − P(Y = 5)
Model Parameters
• How many parameters are required to specify the likelihood P(X1, …, Xn | Y)?
– (supposing that each image is 30×30 pixels, so n = 900)
Model Parameters
• The problem with explicitly modeling P(X1,…,Xn|Y) is that
there are usually way too many parameters:
– We’ll run out of space
– We’ll run out of time
– And we’ll need tons of training data (which is usually not
available)
Why is this useful?
• # of parameters for modeling P(X1, …, Xn | Y):

▪ 2(2^n − 1)

• # of parameters for modeling P(X1 | Y), …, P(Xn | Y):

▪ 2n
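As a quick sanity check of these counts for the 30×30 digit example (n = 900, two classes), a sketch:

```python
n = 900                       # 30x30 binary pixels
full_joint = 2 * (2**n - 1)   # 2 classes, each needing 2^n - 1 free parameters
naive      = 2 * n            # 2 classes x one Bernoulli parameter per pixel
print(len(str(full_joint)))   # 272 digits: hopelessly many parameters
print(naive)                  # 1800
```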
Example
This is Officer Drew (who arrested me in 1997).
Is Officer Drew a Male or Female?
Luckily, we have a small database with names and sex.
We can use it to apply Bayes rule…

Name     Sex
Drew     Male
Claudia  Female
Drew     Female
Drew     Female
Alberto  Male
Karin    Female
Nina     Female
Sergio   Male

p(cj | d) = p(d | cj) p(cj) / p(d)
Assume that we have two classes: c1 = male and c2 = female.

We have a person whose sex we do not know, say "drew" or d.
Classifying drew as male or female is equivalent to asking which
is more probable, i.e. which is greater:
p(male | drew) or p(female | drew)

What is the probability of being called


“drew” given that you are a male?
What is the probability
of being a male?
p(male | drew) = p(drew | male ) p(male)
p(drew) What is the probability of
being named “drew”?
(actually irrelevant, since it is
that same for all classes)
Name     Sex
Drew     Male
Claudia  Female
Drew     Female
Drew     Female
Alberto  Male
Karin    Female
Nina     Female
Sergio   Male

p(male | drew) = (1/3 * 3/8) / (3/8)     numerator = 0.125
p(female | drew) = (2/5 * 5/8) / (3/8)   numerator = 0.250

Since the denominator p(drew) = 3/8 is the same for both classes, we only
need to compare the numerators: 0.250 > 0.125, so Officer Drew is more
likely to be a Female.

Officer Drew IS a female!

Officer Drew

p(male | drew) = 1/3 * 3/8 = 0.125


3/8 3/8

p(female | drew) = 2/5 * 5/8 = 0.250


3/8 3/8
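A minimal sketch of this computation in plain Python, using the name/sex table above:

```python
from collections import Counter

data = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"), ("Drew", "Female"),
        ("Alberto", "Male"), ("Karin", "Female"), ("Nina", "Female"), ("Sergio", "Male")]

sex_counts = Counter(sex for _, sex in data)
for c in ("Male", "Female"):
    prior = sex_counts[c] / len(data)                                 # p(cj)
    likelihood = sum(1 for n, s in data
                     if n == "Drew" and s == c) / sex_counts[c]       # p(drew | cj)
    print(c, likelihood * prior)  # numerator of Bayes rule; p(drew) cancels
# Male 0.125, Female 0.25  ->  predict Female
```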
The Naïve Bayes Assumption
• So far we have only considered Bayes Classification when we have one
attribute
• What can we do if our data d has several attributes?
• Naïve Bayes assumption: Attributes that describe data instances are
conditionally independent given the classification hypothesis
P(d | h) = P(a1, …, aT | h) = ∏t=1..T P(at | h)

– it is a simplifying assumption; obviously it may be violated in reality
– in spite of that, it works well in practice
• The Bayesian classifier that uses the Naïve Bayes assumption and
computes the MAP hypothesis is called Naïve Bayes classifier
Name     Over 170cm  Eye    Hair length  Sex
Drew     No          Blue   Short        Male
Claudia  Yes         Brown  Long         Female
Drew     No          Blue   Long         Female
Drew     No          Blue   Long         Female
Alberto  Yes         Brown  Short        Male
Karin    No          Blue   Long         Female
Nina     Yes         Brown  Short        Female
Sergio   Yes         Blue   Long         Male
• To simplify the task, naïve Bayesian classifiers
assume attributes have independent distributions,
and thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * … * p(dn|cj)

p(officer drew | cj) = p(over_170cm = yes | cj) * p(eye = blue | cj) * …

Officer Drew is blue-eyed, over 170cm tall, and has long hair:

p(officer drew | Female) = 2/5 * 3/5 * …
p(officer drew | Male) = 2/3 * 2/3 * …
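A minimal sketch of this product in plain Python; for completeness it also multiplies in the hair-length factor and the class prior, both computed from the same table above:

```python
# Rows from the table above: (name, over_170cm, eye, hair_length, sex)
data = [("Drew", "No", "Blue", "Short", "Male"),     ("Claudia", "Yes", "Brown", "Long", "Female"),
        ("Drew", "No", "Blue", "Long", "Female"),    ("Drew", "No", "Blue", "Long", "Female"),
        ("Alberto", "Yes", "Brown", "Short", "Male"),("Karin", "No", "Blue", "Long", "Female"),
        ("Nina", "Yes", "Brown", "Short", "Female"), ("Sergio", "Yes", "Blue", "Long", "Male")]

def cond_prob(attr_index, value, sex):
    """p(attribute = value | sex), estimated by counting."""
    in_class = [row for row in data if row[4] == sex]
    return sum(1 for row in in_class if row[attr_index] == value) / len(in_class)

# Officer Drew: over 170cm = Yes, eye = Blue, hair = Long
evidence = [(1, "Yes"), (2, "Blue"), (3, "Long")]
for sex in ("Male", "Female"):
    score = sum(1 for row in data if row[4] == sex) / len(data)  # prior p(sex)
    for idx, val in evidence:
        score *= cond_prob(idx, val, sex)   # naive Bayes: multiply per-attribute likelihoods
    print(sex, round(score, 4))
# Male 0.0556, Female 0.12  ->  predict Female
```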
Example. ‘Play Tennis’ data
Day    Outlook   Temperature  Humidity  Wind    PlayTennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No
Example. ‘Play Tennis’ data

• Based on the examples in the table, classify the following datum x:
x = (Outlook = Sunny, Temp = Cool, Hum = High, Wind = Strong)
• That means: play tennis or not?
Example. ‘Play Tennis’ data

• Answer:

P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
etc.

P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206

⇒ answer: PlayTennis(x) = no
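The same computation as a minimal Python sketch over the Play Tennis table above:

```python
# Play Tennis rows: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
        ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
        ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Weak", "Yes"),
        ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
        ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No")]

x = ("Sunny", "Cool", "High", "Strong")            # the query instance

for label in ("Yes", "No"):
    rows = [r for r in data if r[4] == label]
    score = len(rows) / len(data)                  # prior P(label)
    for i, value in enumerate(x):                  # multiply P(attribute_i = value | label)
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    print(label, round(score, 4))
# Yes 0.0053, No 0.0206  ->  predict No
```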
Naïve Bayes Classifier with Continuous-Valued Attributes
Example: Continuous-valued Features
– Temperature is naturally continuous-valued.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class (the σ values below come from
the sample variance, i.e. dividing by N − 1):

μ = (1/N) Σn xn,   σ² = (1/(N−1)) Σn (xn − μ)²

μYes = 21.64, σYes = 2.35
μNo = 23.88, σNo = 7.09

– Learning phase: output two Gaussian models for P(temp | C):

P̂(x | Yes) = (1 / (2.35 √(2π))) exp(−(x − 21.64)² / (2 · 2.35²))
P̂(x | No) = (1 / (7.09 √(2π))) exp(−(x − 23.88)² / (2 · 7.09²))
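A minimal sketch of this learning phase in Python; it uses the sample variance (dividing by N − 1), which is what reproduces the slide's σ values:

```python
import math

temps_yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
temps_no  = [27.3, 30.1, 17.4, 29.5, 15.1]

def fit_gaussian(xs):
    """Return (mean, std), using the sample variance (N - 1)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_yes, sd_yes = fit_gaussian(temps_yes)   # ~ (21.64, 2.35)
mu_no,  sd_no  = fit_gaussian(temps_no)    # ~ (23.88, 7.09)
print(gaussian_pdf(22.0, mu_yes, sd_yes),  # P^(temp = 22 | Yes)
      gaussian_pdf(22.0, mu_no,  sd_no))   # P^(temp = 22 | No)
```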
Outputting Probabilities
• What’s nice about Naïve Bayes (and generative models in
general) is that it returns probabilities
– These probabilities can tell us how confident the algorithm
is
– So… don’t throw away those probabilities!
Exclusive-OR Example

• For an example where conditional independence fails:
– Y = XOR(X1, X2)

X1  X2  P(Y=0|X1,X2)  P(Y=1|X1,X2)
0   0   1             0
0   1   0             1
1   0   0             1
1   1   1             0
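A small sketch (assuming the four XOR rows occur equally often) of why naïve Bayes fails here: every per-feature conditional equals 1/2, so both classes always get the same score:

```python
# The four equally likely XOR rows: (x1, x2, y) with y = x1 XOR x2
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def p_xi_given_y(i, value, y):
    in_class = [r for r in rows if r[2] == y]
    return sum(1 for r in in_class if r[i] == value) / len(in_class)

for x1, x2, y in rows:
    for label in (0, 1):
        prior = 0.5                                                    # two rows per class
        score = prior * p_xi_given_y(0, x1, label) * p_xi_given_y(1, x2, label)
        print((x1, x2), "class", label, "score", score)
# Every score is 0.125: naive Bayes cannot distinguish the classes at all
```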
Relevant Issues
• Violation of Independence Assumption
– For many real-world tasks, P(X1, …, Xn | Y) ≠ P(X1 | Y) ⋯ P(Xn | Y)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem
– If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk | C = ci) = 0
– In this circumstance, P̂(x1 | ci) ⋯ P̂(ajk | ci) ⋯ P̂(xn | ci) = 0 during testing
– For a remedy, the conditional probabilities are re-estimated with

P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)

nc : number of training examples for which Xj = ajk and C = ci
n : number of training examples for which C = ci
p : prior estimate (usually, p = 1/t for t possible values of Xj)
m : weight given to the prior (number of "virtual" examples, m ≥ 1)
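A minimal sketch of this m-estimate (the example counts below are hypothetical):

```python
def m_estimate(nc, n, t, m=1):
    """Smoothed conditional probability P(Xj = ajk | C = ci).

    nc : count of class-ci examples with Xj = ajk
    n  : count of class-ci examples
    t  : number of possible values of Xj (so the prior p = 1/t)
    m  : weight given to the prior ("virtual" examples)
    """
    p = 1 / t
    return (nc + m * p) / (n + m)

# Unsmoothed, 0/9 would zero out the whole product; the m-estimate stays positive:
print(m_estimate(nc=0, n=9, t=3, m=3))   # (0 + 3*(1/3)) / (9 + 3) = 1/12
```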
Summary
• Naïve Bayes: the conditional independence assumption
– Training is very easy and fast: just count each attribute value
in each class separately
– Testing is straightforward: just look up tables or calculate
conditional probabilities with the estimated distributions
• A popular generative model
– Performance competitive with most state-of-the-art classifiers, even
when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate as a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…
