Lecture 12-14
Learning Outcomes
• Bayes’ Theorem
• Naïve Bayes Classifier
• Bayesian Belief Network
• Markov Model
Probability Theory
Experiments, Outcomes, Events and Random Variables
Experiment: An experiment is any activity from which results are obtained. A random
experiment is one in which the outcomes, or results, cannot be predicted with certainty.
Experiment Examples:
1. Toss a coin twice
2. Roll a die
Mutually Exclusive Events: Consider a fair six-sided die as before, only in addition to the numbers 1 through 6 on each face, we have the property that the even-numbered faces are colored red, and the odd-numbered faces are colored green. Let event A be rolling a green face, and event B be rolling a 6. Then Pr[A] = 1/2 and Pr[B] = 1/6, as in our previous example. But it is obvious that events A and B cannot simultaneously occur, since rolling a 6 means the face is red, and rolling a green face means the number showing is odd.
Therefore Pr[A and B] = 0.
Joint Probability
• The joint probability of X and Y, p(X, Y), is the probability that X and Y simultaneously assume particular values
– If X, Y are independent, p(X, Y) = p(X) p(Y)
• Marginal Probability: p(X) = \sum_Y p(X, Y)
• Conditional Probability: p(X | Y) = p(X, Y) / p(Y)
Sum and Product Rule
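These two rules, in the notation above:

Sum rule:     p(X) = \sum_{Y} p(X, Y)
Product rule: p(X, Y) = p(Y \mid X)\, p(X)

Applying the product rule in both orders and dividing by p(X) gives Bayes’ theorem, which the next slides use.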
Bayes’ Theorem
Example 1/4
• Consider a simple example of two colored boxes, each containing fruit (apples shown in green and oranges shown in orange)
• The red box is picked 40% of the time, the blue box 60% of the time
• After picking a box, remove one item of fruit from the box
• P(B=r) = 4/10 and P(B=b) = 6/10
• Picking a box is a mutually exclusive event (either red or blue is chosen)
Example 2/4
Example 3/4
Example 4/4
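A sketch of the computation the Example 2/4–4/4 slides walk through, assuming (as in the standard version of this example) that the red box holds 2 apples and 6 oranges and the blue box holds 3 apples and 1 orange; these fruit counts are an assumption, not stated above. Let F denote the fruit picked, with o = orange:

P(F{=}o \mid B{=}r) = 6/8 = 3/4, \quad P(F{=}o \mid B{=}b) = 1/4

P(F{=}o) = P(F{=}o \mid B{=}r)\,P(B{=}r) + P(F{=}o \mid B{=}b)\,P(B{=}b) = \tfrac{3}{4}\cdot\tfrac{4}{10} + \tfrac{1}{4}\cdot\tfrac{6}{10} = \tfrac{9}{20}

P(B{=}r \mid F{=}o) = \frac{P(F{=}o \mid B{=}r)\,P(B{=}r)}{P(F{=}o)} = \frac{3/10}{9/20} = \frac{2}{3}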
Example 2
Bayes’ Theorem
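Example-1 below refers to an equation in hypothesis/evidence notation; in that notation Bayes’ theorem reads

P(R \mid e) = \frac{P(e \mid R)\, P(R)}{P(e)} = \frac{P(e \mid R)\, P(R)}{\sum_{R'} P(e \mid R')\, P(R')}

where R is the hypothesis (e.g., Disease) and e is the evidence (e.g., the test result).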
Bayes’ Theorem (Example-1)
Problem: Suppose you have been tested positive for a disease; what is the probability that you actually have the disease? It depends on the accuracy and sensitivity of the test, and on the background (prior) probability of the disease.
Solution: Let
P(Test=+ve | Disease=true) = 0.95, so the false negative rate is
P(Test=-ve | Disease=true) = 0.05. Let the false positive rate be
P(Test=+ve | Disease=false) = 0.05.
Suppose the disease is rare: P(Disease=true) = 0.01 (1%).
Let D denote Disease (R in the above equation) and
"T=+ve" denote the positive Test (e in the above equation). Then
Bayes’ Theorem (Example-2)
Probabilistic Graphical Model
Joint Probability Distribution
(Role of Graphs)
• Consider an arbitrary joint distribution
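For example, for three variables the product rule gives one possible factorization, and the directed graph records exactly this choice:

p(a, b, c) = p(c \mid a, b)\, p(b \mid a)\, p(a)

with a node for each variable and an edge from each conditioning variable to the variable it conditions.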
The naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values <a1, a2, …, an>. The learner is asked to predict the target value, or classification, for this new instance.
Naïve Bayes Classifier
Abstractly, the probability model for a classifier is a conditional model P(C | F1, F2, …, Fn) over a dependent class variable C with a small number of outcomes or classes, conditioned on several feature variables F1, …, Fn.
Naïve Bayes formula:
P(C | F1, F2, …, Fn) = [P(C) × P(F1|C) × P(F2|C) × … × P(Fn|C)] / P(F1, F2, …, Fn)
and the predicted class is argmax_c P(C=c) × P(F1|C=c) × … × P(Fn|C=c).
Since P(F1, F2, …, Fn) is common to all classes, we do not need to evaluate the denominator for comparisons.
Naïve Bayes Classifier
Tennis-Example
Naïve Bayes Classifier
Problem:
Use training data from above to classify the following instances:
a) <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
b) <Outlook=overcast, Temperature=cool, Humidity=high, Wind=strong>
Naïve Bayes Classifier
Answer to (a):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22
P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = 0.20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
Naïve Bayes Classifier
P(yes) x P(sunny|yes) x P(cool|yes) x P(high|yes) x P(strong|yes) = 0.0053
P(no) x P(sunny|no) x P(cool|no) x P(high|no) x P(strong|no) = 0.0206
So the class for this instance is ‘no’. We can normalize the probability:
P(no | x) = 0.0206 / (0.0206 + 0.0053) ≈ 0.795
Naïve Bayes Classifier
Answer to (b):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44
P(Outlook=overcast|PlayTennis=no) = 0/5 = 0
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = 0.20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
Because P(Outlook=overcast|PlayTennis=no) = 0, the product for ‘no’ is 0, while the product for ‘yes’ is 0.64 × 0.44 × 0.33 × 0.33 × 0.33 ≈ 0.0101, so the class for this instance is ‘yes’. (Zero counts like this are usually smoothed, e.g. with the m-estimate shown later.)
Example
• Example: Play Tennis
Example
• Learning Phase
Outlook    Play=Yes  Play=No
Sunny      2/9       3/5
Overcast   4/9       0/5
Rain       3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity   Play=Yes  Play=No
High       3/9       4/5
Normal     6/9       1/5

Wind     Play=Yes  Play=No
Strong   3/9       3/5
Weak     6/9       2/5
– MAP rule: for x’ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
P(Yes|x’) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x’) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Since 0.0206 > 0.0053, the MAP rule predicts Play=No.
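A minimal Python sketch of the learning and MAP steps above. The rows are Mitchell’s standard PlayTennis data, which reproduce every fraction in the table; the encoding and helper names are illustrative, not from the lecture:

from collections import Counter, defaultdict

# PlayTennis training set: (Outlook, Temperature, Humidity, Wind, label)
data = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

# Learning phase: class counts and per-attribute conditional counts
priors = Counter(row[-1] for row in data)          # {'Yes': 9, 'No': 5}
cond = defaultdict(Counter)                        # (attr_index, class) -> value counts
for *attrs, label in data:
    for i, v in enumerate(attrs):
        cond[(i, label)][v] += 1

def posterior(x, label):
    """Unnormalized P(label) * prod_i P(x_i | label), skipping the common denominator."""
    p = priors[label] / len(data)
    for i, v in enumerate(x):
        p *= cond[(i, label)][v] / priors[label]
    return p

x = ("Sunny", "Cool", "High", "Strong")
scores = {c: posterior(x, c) for c in priors}
print(scores)                       # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))  # 'No'

Note that Counter returns 0 for unseen attribute values, which is exactly the zero-probability problem of answer (b); the m-estimate below smooths it.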
– m-estimate of conditional probability (to avoid zero counts):

\hat{P}(X_j = a_{jk} \mid C = c_i) = \frac{n_c + m p}{n + m}

where n is the number of training examples with C = c_i, n_c the number of those with X_j = a_{jk}, p a prior estimate, and m the equivalent sample size.

– For continuous attributes, model each class-conditional density as normal:

\hat{P}(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\left( -\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2} \right)

\mu_{ji}: mean (average) of attribute values X_j of examples for which C = c_i
\sigma_{ji}: standard deviation of attribute values X_j of examples for which C = c_i
– Learning Phase: for X = (X1, …, Xn), C = c1, …, cL
Output: n × L normal distributions and P(C = ci), i = 1, …, L
– Test Phase: for a new instance X′ = (X1′, …, Xn′)
• Calculate conditional probabilities with all the normal distributions
• Apply the MAP rule to make a decision
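A minimal from-scratch sketch of these two phases for continuous attributes; the function names and toy data are illustrative assumptions:

import math
from collections import defaultdict

def fit(X, y):
    """Learning phase: per class, a prior and per-feature (mean, std)."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for c, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        stds = [max(math.sqrt(sum((v - m) ** 2 for v in col) / n), 1e-9)
                for col, m in zip(zip(*rows), means)]
        model[c] = (n / len(X), means, stds)
    return model

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict(model, x):
    """Test phase: MAP rule over prior * product of normal densities."""
    def score(c):
        prior, means, stds = model[c]
        p = prior
        for v, mu, s in zip(x, means, stds):
            p *= normal_pdf(v, mu, s)
        return p
    return max(model, key=score)

# Toy 1-D example (assumed data): class 0 around 1.0, class 1 around 5.0
X = [[0.9], [1.1], [1.0], [4.8], [5.2], [5.0]]
y = [0, 0, 0, 1, 1, 1]
m = fit(X, y)
print(predict(m, [1.2]), predict(m, [4.9]))   # expect: 0 1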
Conclusions
• Naïve Bayes is based on the independence assumption
– Training is very easy and fast: just consider each attribute in each class separately
– Testing is straightforward: just look up tables or calculate conditional probabilities with normal distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate as a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…
Bayesian Belief Network
Expert Systems: Rule Based Systems
Conditional probability table P(A|B,E) for the Alarm node:

B      E      A      P(A|B,E)
false  false  false  0.999
false  false  true   0.001
false  true   false  0.71
false  true   true   0.29
true   false  false  0.06
true   false  true   0.94
true   true   false  0.05
true   true   true   0.95
A Directed Acyclic Graph
[Figure: Burglary → Alarm ← Earthquake]
Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node. The parameters are the probabilities in these conditional probability distributions. Because we have discrete random variables, we have conditional probability tables (CPTs), like the P(A|B,E) table above.
A Set of Parameters
The conditional probability distribution for Alarm stores the probability distribution for Alarm given the values of Burglary and Earthquake. For a given combination of values of the parents (B and E in this example), the entries for P(A=true|B,E) and P(A=false|B,E) must add up to 1, e.g. P(A=true|B=false,E=false) + P(A=false|B=false,E=false) = 1.
If you have a Boolean variable with k Boolean parents, how big is the conditional probability table? How many entries are independently specifiable? (The table has 2^(k+1) entries, of which 2^k are independently specifiable, since each pair of complementary entries must sum to 1.)
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
– V is a set of vertices.
– E is a set of directed edges joining vertices. No loops of any length are allowed.
Each vertex carries a conditional probability distribution, and the joint distribution factorizes as

P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{parents}(X_i))
The Full Joint Distribution
P(x_1, \ldots, x_n)
  = P(x_n \mid x_{n-1}, \ldots, x_1)\, P(x_{n-1}, \ldots, x_1)   (chain rule)
  = \prod_{i=1}^{n} P(x_i \mid x_{i-1}, \ldots, x_1)
  = \prod_{i=1}^{n} P(x_i \mid \mathrm{parents}(x_i))
To be able to do this, we need two things:
1. Parents(Xi) ⊆ {Xi−1, …, X1}
This is easy: we just label the nodes according to the partial order in the graph.
2. We need Xi to be conditionally independent of its predecessors given its parents.
This can be done when constructing the network: choose parents that directly influence Xi.
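As a concrete instance using the Alarm CPT above, and assuming the usual priors P(B=true) = 0.001 and P(E=true) = 0.002 (these priors are assumptions, not given on the slides):

P(B{=}\text{false}, E{=}\text{false}, A{=}\text{true}) = P(A{=}\text{true} \mid B{=}\text{false}, E{=}\text{false})\, P(B{=}\text{false})\, P(E{=}\text{false}) = 0.001 \times 0.999 \times 0.998 \approx 0.000997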
Example
[Network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls]
What is the probability that there is a burglary given that John calls? (0.0162)
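A sketch that reproduces this number (up to rounding) by brute-force enumeration. Only the Alarm CPT is given above; the priors and the JohnCalls CPT used here (P(B)=0.001, P(E)=0.002, P(J|A)=0.90, P(J|¬A)=0.05) are the standard values for this example and are assumptions:

from itertools import product

P_B = {True: 0.001, False: 0.999}   # prior on Burglary (assumed)
P_E = {True: 0.002, False: 0.998}   # prior on Earthquake (assumed)
P_A = {  # P(Alarm=true | B, E), from the CPT above
    (False, False): 0.001, (False, True): 0.29,
    (True, False): 0.94,   (True, True): 0.95,
}
P_J = {True: 0.90, False: 0.05}     # P(JohnCalls=true | Alarm) (assumed)

def joint(b, e, a, j):
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    return P_B[b] * P_E[e] * pa * pj

# P(Burglary=true | JohnCalls=true), summing out Earthquake and Alarm
num = sum(joint(True, e, a, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True) for b, e, a in product([True, False], repeat=3))
print(num / den)   # ~0.0163 (the slide quotes 0.0162)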
Conditional Independence
To determine if I(X, Y | E): ignore the directions of the arrows and find all paths between X and Y. Now pay attention to the arrows and determine whether each path is blocked according to the 3 cases below. If all the paths are blocked, X and Y are d-separated given E, which means they are conditionally independent given E.
Blocking
[Figure: paths between X and Y through intermediate nodes Z, with evidence node E shaded]
d-Separation - Example
[Network: Battery → Radio, Battery → Ignition, Ignition → Moves]
• Moves and Battery are independent given that Ignition is known
• Moves and Radio are independent given Ignition (or given Battery): the path Radio ← Battery → Ignition → Moves is blocked
Case 1
What does it mean for a path to be blocked? There are 3 cases.
Case 1: There exists a node N on the path such that
• It is in the evidence set E (shaded grey)
• The arcs putting N in the path are “tail-to-tail”.
[Figure: X ← N → Y, with N shaded, i.e., in the evidence set E]
(By contrast, in the head-to-head configuration X → N ← Y, the path between X and Y is blocked by N precisely when N is not in the evidence set.)
Case 3 (Explaining Away)
[Network: Burglary → Alarm ← Earthquake]
Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes
Earth obviously doesn’t care if your house is currently being broken into
While you are on vacation, one of your nice neighbors calls and lets you know your
alarm went off
Case 3 (Explaining Away)
[Network: Burglary → Alarm ← Earthquake]
But if you knew that a medium-sized earthquake happened, then you’re relieved: it’s probably not a burglar.
The earthquake “explains away” the hypothetical burglar.
This means that Burglary and Earthquake are not independent given Alarm.
But Burglary and Earthquake are independent given no evidence, i.e., learning about an earthquake when you know nothing about the status of your alarm doesn’t give you any information about the burglary.
Holmes and Watson: “Icy roads” Example
Holmes and Watson: “Wet grass” Example
Holmes and Watson: “Burglar alarm” Example
Conditional Independence
• a is independent of b given c: p(a | b, c) = p(a | c)
• Equivalently: p(a, b | c) = p(a | c) p(b | c)
• Notation: a ⊥⊥ b | c
Conditional Independence: Example 1
Conditional Independence: Example 2
Conditional Independence: Example 3
[Figure: probability tables with P(Rain) = 0.2, P(Dry) = 0.8 and conditional values 0.6/0.4 over Low/High]