
MACHINE LEARNING

Bayesian Belief Network and Naïve Bayes Classifier

Presented by: Dr. S. Nadeem Ahsan


(Slides are adapted from the book Pattern Recognition and Machine Learning by Bishop, and from the book Machine Learning by Tom Mitchell)
Welcome!!

Lecture 12-14
Learning Outcomes
• Bayes Theorem
• Naïve Bayes Classifier
• Bayesian Belief Network
• Markov Model
Probability Theory
Experiments, Outcomes, Events and
Random Variables
Experiment: An experiment is any activity from which results are obtained. A random
experiment is one in which the outcomes, or results, cannot be predicted with certainty.

Experiment Examples:
1. Toss a coin twice
2. Roll a die

Sample space: The set of all possible outcomes of an experiment


S = {HH, HT, TH, TT}

Event: a subset of possible outcomes


A={HH}, B={HT, TH}

Random Variable: a variable whose value represents the outcome of a random experiment


Definition of Probability

• Probability of an event: a number Pr(A) assigned to an event A, satisfying the following axioms


– Axiom 1: Pr(A) ≥ 0
– Axiom 2: Pr(S) = 1
– Axiom 3: For every sequence of disjoint events, Pr(⋃i Ai) = Σi Pr(Ai)
– Example: Pr(A) = n(A)/N
Independent and Mutually Exclusive Events
If two events A and B are independent, then
Pr[A and B] = Pr[A]Pr[B]
That is, the probability that both A and B occur is equal to the probability that A occurs times the probability that B occurs.

If A and B are mutually exclusive, then


Pr[A and B] = 0;
that is, the probability that both A and B occur is zero.
Independent and Mutually Exclusive Events
(Examples)
• Independent Events: Consider a fair coin and a fair six-sided die. Let event A be obtaining heads, and event B be rolling a 6. Then we can reasonably assume that events A and B are independent, because the outcome of one does not affect the outcome of the other. The probability that both A and B occur is Pr[A and B] = Pr[A]Pr[B] = (1/2)(1/6) = 1/12. Since this value is not zero, events A and B cannot be mutually exclusive.

• Mutually Exclusive Events: Consider a fair six-sided die as before, only in addition to the numbers 1 through 6 on each face, we have the property that the even-numbered faces are colored red and the odd-numbered faces are colored green. Let event A be rolling a green face, and event B be rolling a 6. Then Pr[A] = 1/2 and Pr[B] = 1/6 as in our previous example. But it is obvious that events A and B cannot simultaneously occur, since rolling a 6 means the face is red, and rolling a green face means the number showing is odd.
Therefore Pr[A and B] = 0.
Joint Probability
• Joint probability of X and Y, p(X, Y), is the probability that X and Y simultaneously assume particular values
– If X, Y are independent, p(X, Y) = p(X)p(Y)

• Roll a die, toss a coin:

p(X = 3, Y = heads) = p(X = 3)p(Y = heads) = 1/6 × 1/2 = 1/12
Marginal and Conditional Probability
• The probability that X takes the value xi and Y takes the value yj is written p(X=xi, Y=yj) and is called the joint probability of X=xi and Y=yj:
p(X=xi, Y=yj) = nij / N
• The probability that X takes the value xi irrespective of the value of Y is written p(X=xi) = ci / N, where ci = Σj nij

• Marginal Probability (sum rule): p(X=xi) = Σj p(X=xi, Y=yj) = ci / N

• Conditional Probability: p(Y=yj | X=xi) = nij / ci
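As a quick illustration of these definitions, the sketch below builds joint, marginal, and conditional probabilities from a small made-up table of counts nij (the numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

# Hypothetical count table n[i, j]: rows are values of X, columns are values of Y.
n = np.array([[3, 1],
              [2, 4]], dtype=float)
N = n.sum()                          # total number of trials

joint = n / N                        # p(X=x_i, Y=y_j) = n_ij / N
p_x = joint.sum(axis=1)              # marginal p(X=x_i) = c_i / N (sum rule)
p_y_given_x = joint / p_x[:, None]   # conditional p(Y=y_j | X=x_i) = n_ij / c_i

print("joint:\n", joint)
print("marginal p(X):", p_x)
print("conditional p(Y|X):\n", p_y_given_x)
# Product rule check: p(X, Y) = p(Y|X) p(X)
assert np.allclose(joint, p_y_given_x * p_x[:, None])
```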
Sum and Product Rule
Bayes’ Theorem
Example 1/4
• Consider a simple example of two colored boxes, each containing fruit (apples shown in green and oranges shown in orange)
• The red box is picked 40% of the time
• The blue box is picked 60% of the time
• After picking a box, remove an item from the box
• P(B=r) = 4/10 and P(B=b) = 6/10
• Mutually exclusive events (we pick one box, either red or blue)
Example 2/4
Example 3/4
Example 4/4
Example 2
Bayes’ Theorem
Bayes’ Theorem (Example-1)
Problem: Suppose you have tested positive for a disease; what is the probability that you actually have the disease? It depends on the accuracy and sensitivity of the test, and on the background (prior) probability of the disease.
Solution: Let
P(Test=+ve | Disease=true) = 0.95, so the false negative rate is
P(Test=-ve | Disease=true) = 0.05. Let
P(Test=+ve | Disease=false) = 0.05 (the false positive rate).
Suppose the disease is rare: P(Disease=true) = 0.01 (1%).
Let D denote Disease and "T=+ve" denote a positive test. Then, by Bayes' theorem,
P(D=true | T=+ve) = P(T=+ve | D=true) P(D=true) / P(T=+ve)
Bayes’ Theorem (Example-2)
Probabilistic Graphical Model
Probabilistic Graphical Model
Joint Probability Distribution
(Role of Graphs)
• Consider an arbitrary joint distribution p(x1, …, xK)

• By successive application of the product rule:

p(x1, …, xK) = p(xK | x1, …, xK−1) ⋯ p(x2 | x1) p(x1)
Bayesian Network (Directed Graph Model)
Bayesian Network (Directed Graph Model)
We can write the joint probability distribution in the form p(x) = ∏k p(xk | pa(xk)), where pa(xk) denotes the set of parents of xk in the graph.
Bayesian Network (Directed Graph Model)
Application of Bayes’ Theorem and Probabilistic Graphical Models
Bayesian Classification
Why use Bayesian classification?
• Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
Naïve Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class.
• This greatly reduces the computation cost: we only need to count the class distribution and per-class attribute counts.
Naïve Bayes Classifier
• The probabilistic model of the NBC is to find the probability of a certain class given multiple attribute values, which are assumed to be conditionally independent given the class.

• The naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values <a1, a2, …, an>. The learner is asked to predict the target value, or classification, for this new instance.
Naïve Bayes Classifier
• Abstractly, the probability model for the classifier is a conditional model P(C | F1, F2, …, Fn)
• over a dependent class variable C with a small number of outcomes or classes, conditioned on several feature variables F1, …, Fn.
• Naïve Bayes formula:
P(C | F1, F2, …, Fn) = [P(C) × P(F1|C) × P(F2|C) × … × P(Fn|C)] / P(F1, F2, …, Fn)
and the predicted class is the argmax over c of the numerator.
• Since P(F1, F2, …, Fn) is common to all classes, we do not need to evaluate the denominator for comparisons.
Naïve Bayes Classifier
Tennis-Example
Naïve Bayes Classifier

Problem:
Use training data from above to classify the following instances:
a) <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
b) <Outlook=overcast, Temperature=cool, Humidity=high, Wind=strong>
Naïve Bayes Classifier
Answer to (a):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22
P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = 0.20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
Naïve Bayes Classifier
P(yes) × P(sunny|yes) × P(cool|yes) × P(high|yes) × P(strong|yes) = 0.0053

P(no) × P(sunny|no) × P(cool|no) × P(high|no) × P(strong|no) = 0.0206

So the class for this instance is ‘no’. We can normalize the probability by:
0.0206 / (0.0206 + 0.0053) = 0.795
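The same calculation as a short script, using exact fractions read off the training data above (reproducing 0.0053, 0.0206, and the normalized 0.795):

```python
from fractions import Fraction as F

# Class priors and conditional probabilities from the tennis training data above.
p_yes, p_no = F(9, 14), F(5, 14)
cond_yes = {"sunny": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "strong": F(3, 9)}
cond_no  = {"sunny": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "strong": F(3, 5)}

score_yes, score_no = p_yes, p_no
for a in ("sunny", "cool", "high", "strong"):
    score_yes *= cond_yes[a]
    score_no *= cond_no[a]

print(float(score_yes), float(score_no))          # ~0.0053 vs ~0.0206 -> predict 'no'
print(float(score_no / (score_yes + score_no)))   # normalized ~0.795
```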
Naïve Bayes Classifier
Answer to (b):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44
P(Outlook=overcast|PlayTennis=no) = 0/5 = 0
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = 0.20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
Example
• Example: Play Tennis
Example
• Learning Phase
Outlook    Play=Yes  Play=No        Temperature  Play=Yes  Play=No
Sunny      2/9       3/5            Hot          2/9       2/5
Overcast   4/9       0/5            Mild         4/9       2/5
Rain       3/9       2/5            Cool         3/9       1/5

Humidity   Play=Yes  Play=No        Wind         Play=Yes  Play=No
High       3/9       4/5            Strong       3/9       3/5
Normal     6/9       1/5            Weak         6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14


Example
• Test Phase
– Given a new instance,
x’ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9      P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9   P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9      P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9        P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                   P(Play=No) = 5/14

– MAP rule
P(Yes|x’) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x’) ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)] P(Play=No) = 0.0206

Given the fact that P(Yes|x’) < P(No|x’), we label x’ as “No”.


Relevant Issues
• Violation of the Independence Assumption
– For many real-world tasks, P(X1, …, Xn | C) ≠ P(X1|C) ⋯ P(Xn|C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
– If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk | C = ci) = 0
– In this circumstance, the whole product P̂(x1|ci) ⋯ P̂(ajk|ci) ⋯ P̂(xn|ci) = 0 during test
– As a remedy, conditional probabilities are estimated with the m-estimate:

P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)

nc : number of training examples for which Xj = ajk and C = ci
n  : number of training examples for which C = ci
p  : prior estimate (usually p = 1/t for t possible values of Xj)
m  : weight given to the prior (number of "virtual" examples, m ≥ 1)
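A small sketch of the m-estimate; the zero-count case used below (Outlook=overcast with PlayTennis=no) is taken from the tennis data earlier in these slides:

```python
def m_estimate(n_c, n, t, m=1.0):
    """P_hat(X_j = a_jk | C = c_i) using the m-estimate shown above.

    n_c : count of training examples with X_j = a_jk and C = c_i
    n   : count of training examples with C = c_i
    t   : number of possible values of X_j (so the prior p = 1/t)
    m   : equivalent sample size (number of "virtual" examples)
    """
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

# Outlook=overcast never occurs with PlayTennis=no in the training data.
# The raw estimate would be 0/5 = 0; the m-estimate keeps it strictly positive.
print(m_estimate(n_c=0, n=5, t=3))   # ~0.056 instead of 0
```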
Relevant Issues
• Continuous-valued Input Attributes
– The attribute can take on an unbounded (continuous) range of values
– The conditional probability is modeled with the normal distribution:

P̂(Xj | C = ci) = 1 / (√(2π) σji) · exp( −(Xj − μji)² / (2σji²) )

μji : mean of the attribute values Xj over the examples for which C = ci
σji : standard deviation of the attribute values Xj over the examples for which C = ci

– Learning Phase: for X = (X1, …, Xn) and C = c1, …, cL,
Output: n × L normal distributions and P(C = ci), i = 1, …, L
– Test Phase: for a new instance X’ = (X’1, …, X’n)
• Calculate the conditional probabilities with all the normal distributions
• Apply the MAP rule to make a decision
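A minimal Gaussian naïve Bayes sketch following the learning/test phases above; the tiny data set is made up purely to exercise the code:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Learning phase: per class, estimate mean/std of each attribute plus the class prior."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.std(axis=0, ddof=1), len(Xc) / len(X))
    return params

def predict_gaussian_nb(params, x):
    """Test phase: evaluate the normal densities and apply the MAP rule (in log space)."""
    best_class, best_score = None, -np.inf
    for c, (mu, sigma, prior) in params.items():
        log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
        score = np.log(prior) + log_pdf.sum()
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up data: two continuous attributes, two classes.
X = np.array([[1.0, 2.1], [0.9, 1.8], [1.2, 2.0], [3.0, 0.5], [3.2, 0.7], [2.9, 0.4]])
y = np.array([0, 0, 0, 1, 1, 1])
print(predict_gaussian_nb(fit_gaussian_nb(X, y), np.array([1.1, 1.9])))   # -> 0
```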
Conclusions
• Naïve Bayes is based on the independence assumption
– Training is very easy and fast: just consider each attribute in each class separately
– Testing is straightforward: just look up tables or calculate conditional probabilities with normal distributions
• A popular generative model
– Performance is competitive with state-of-the-art classifiers even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate as a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…
Bayesian Belief Network
Expert Systems: Rule Based Systems

• 1960s – Rule Based Systems
• Model human expertise using IF .. THEN rules or production rules.
• Combine the rules (or knowledge base) with an inference engine to reason about the world.
• Given certain observations, produce conclusions.
• Relatively successful but limited.
Uncertainty

• Rule based systems failed to handle uncertainty
• Only dealt with true or false facts
• Partly overcome using certainty factors
• However, other problems remained: no differentiation between causal rules and diagnostic rules.
Normative Expert Systems
• Model the domain rather than the expert
• Classical probability used rather than an ad-hoc calculus
• Expert support rather than an expert model
• 1980s – more powerful computers make complex probability calculations feasible
• Bayesian networks introduced (Pearl 1986), e.g. MUNIN. Bayesian networks have been the most important contribution to the field of AI in the last 10 years
• Provide a way to represent knowledge in an uncertain domain and a way to reason about this knowledge
• Many applications: medicine, factories, help desks, spam filtering, etc.
A Bayesian Network
A Bayesian network is made up of two parts:
1. A directed acyclic graph
2. A set of parameters

Graph: Burglary → Alarm ← Earthquake

B      P(B)           E      P(E)
false  0.999          false  0.998
true   0.001          true   0.002

B      E      A      P(A|B,E)
false  false  false  0.999
false  false  true   0.001
false  true   false  0.71
false  true   true   0.29
true   false  false  0.06
true   false  true   0.94
true   true   false  0.05
true   true   true   0.95
A Directed Acyclic Graph
Graph: Burglary → Alarm ← Earthquake

• The nodes are random variables (which can be discrete or continuous)
• Arrows connect pairs of nodes (X is a parent of Y if there is an arrow from node X to node Y)
• Intuitively, an arrow from node X to node Y means X has a direct influence on Y (we can say X has a causal effect on Y)
• It is easy for a domain expert to determine these relationships
• The absence/presence of arrows will be made more precise later on
A Set of Parameters
Graph: Burglary → Alarm ← Earthquake, with the prior tables P(B), P(E) and the CPT P(A|B,E) shown on the previous slide.

• Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node
• The parameters are the probabilities in these conditional probability distributions
• Because we have discrete random variables, we have conditional probability tables (CPTs)
A Set of Parameters
The conditional probability distribution for Alarm stores the probability distribution for Alarm given the values of Burglary and Earthquake:

B      E      A      P(A|B,E)
false  false  false  0.999
false  false  true   0.001
false  true   false  0.71
false  true   true   0.29
true   false  false  0.06
true   false  true   0.94
true   true   false  0.05
true   true   true   0.95

For a given combination of values of the parents (B and E in this example), the entries for P(A=true|B,E) and P(A=false|B,E) must add up to 1, e.g. P(A=true|B=false,E=false) + P(A=false|B=false,E=false) = 1.

If you have a Boolean variable with k Boolean parents, how big is the conditional probability table?
How many entries are independently specifiable?
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
– V is a set of vertices.
– E is a set of directed edges joining vertices. No loops of any length are allowed.

Each vertex in V contains the following information:
– The name of a random variable
– A probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.
Semantics of Bayesian Networks
Two ways to view Bayes nets:
1. A representation of a joint probability distribution
2. An encoding of a collection of conditional independence statements
Bayesian Network Example
Graph: Weather (no parents or children); Cavity → Toothache, Cavity → Catch

Notation: I(A, B | C) means A is conditionally independent of B given C, i.e. P(A|B,C) = P(A|C).

• Weather is independent of the other variables, I(Weather, Cavity): P(Weather) = P(Weather|Cavity) = P(Weather|Catch) = P(Weather|Toothache)
• Toothache and Catch are conditionally independent given Cavity
• I(Toothache, Catch | Cavity), meaning P(Toothache|Catch, Cavity) = P(Toothache|Cavity)
A Representation of the Full Joint Distribution
• We will use the following abbreviations:
– P(x1, …, xn) for P(X1 = x1 ∧ … ∧ Xn = xn)
– parents(Xi) for the values of the parents of Xi
• From the Bayes net, we can calculate:

P(x1, …, xn) = ∏_{i=1}^{n} P(xi | parents(Xi))
The Full Joint Distribution
P(x1, …, xn)
= P(xn | xn−1, …, x1) P(xn−1, …, x1)                                  (chain rule)
= P(xn | xn−1, …, x1) P(xn−1 | xn−2, …, x1) P(xn−2, …, x1)            (chain rule)
= …
= P(xn | xn−1, …, x1) P(xn−1 | xn−2, …, x1) ⋯ P(x2 | x1) P(x1)        (chain rule)
= ∏_{i=1}^{n} P(xi | xi−1, …, x1)
= ∏_{i=1}^{n} P(xi | parents(xi))      ← we’ll look at this step more closely
The Full Joint Distribution
∏_{i=1}^{n} P(xi | xi−1, …, x1) = ∏_{i=1}^{n} P(xi | parents(xi))

To be able to do this, we need two things:
1. parents(Xi) ⊆ {Xi−1, …, X1}
This is easy – we just label the nodes according to the partial order in the graph
2. We need Xi to be conditionally independent of its predecessors given its parents
This can be done when constructing the network: choose parents that directly influence Xi.
Example
Graph: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls

P(JohnCalls, MaryCalls, Alarm, Burglary, Earthquake)
= P(JohnCalls | Alarm) P(MaryCalls | Alarm) P(Alarm | Burglary, Earthquake) P(Burglary) P(Earthquake)
Example

What is the probability that there is a burglary given that John calls? (0.0162)
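A sketch of this query by inference by enumeration. The slides give the CPTs for Burglary, Earthquake, and Alarm but not for JohnCalls, so the values P(J=true|A=true)=0.90 and P(J=true|A=false)=0.05 below are assumed (they are commonly used for this network and reproduce the quoted 0.0162 up to rounding). MaryCalls is an unobserved leaf, so it sums out and can be omitted.

```python
from itertools import product

# CPTs from the slides; the JohnCalls CPT is assumed (see note above).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=true | A), assumed values

def joint(b, e, a, j):
    """Full joint via the factorization P(B) P(E) P(A|B,E) P(J|A)."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    return P_B[b] * P_E[e] * p_a * p_j

# Inference by enumeration: P(Burglary=true | JohnCalls=true)
num = sum(joint(True, e, a, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True) for b, e, a in product([True, False], repeat=3))
print(round(num / den, 4))   # ~0.0163, matching the quoted 0.0162 up to rounding
```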
Conditional Independence

• To do inference in a belief network we have to know whether two sets of variables are conditionally independent given a set of evidence.
• The method for doing this is called Direction-Dependent Separation, or d-Separation.
• d-Separation determines whether a set of nodes X is independent of another set Y given a third set E.
Conditional Independence
We can look at the actual graph structure and determine conditional independence relationships.

A node X is conditionally independent of its non-descendants, given its parents.

A1, A2, … : non-descendants of C
B1, B2, … : parents of C
D1, D2, … : descendants of C

C is independent of C's non-descendants given C's parents:

p(C | A1, …, B1, …) = p(C | B1, …)
d-Separation
• If every undirected path from a node in X to a node in Y is d-separated by E, then X and Y are conditionally independent given E. We will use the notation I(X, Y | E) to mean that X and Y are conditionally independent given E.
• Theorem [Verma and Pearl 1988]: If a set of evidence variables E d-separates X and Y in the Bayesian network's graph, then I(X, Y | E).
• A set of nodes E d-separates two sets of nodes X and Y if every undirected path from a node in X to a node in Y is blocked given E. A path is blocked given a set of nodes E if it contains a node Z such that:
a) Z is in E and Z has one arrow on the path leading in and one leading out.
b) Z is in E and has both path arrows leading out.
c) Neither Z nor any descendant of Z is in E, and both path arrows lead in to Z.
How to determine d-Separation

• To determine whether I(X, Y | E), ignore the directions of the arrows and find all paths between X and Y
• Now pay attention to the arrows. Determine whether each path is blocked according to the 3 cases
• If all the paths are blocked, X and Y are d-separated given E
• Which means they are conditionally independent given E
Blocking
[Diagram: undirected paths from X to Y, each passing through a node Z, with the evidence set E shaded]
d-Separation - Example
Graph: Battery → Radio, Battery → Ignition, Ignition → Starts, Petrol → Starts, Starts → Moves

• Moves and Battery are independent given that Ignition is known
• Moves and Radio are independent if it is known that the Battery works
• Petrol and Radio are independent given no evidence, but are dependent given evidence of Starts
Case 1
• What does it mean for a path to be blocked? There are 3 cases.
• Case 1: There exists a node N on the path such that
– it is in the evidence set E (shaded grey), and
– the arcs putting N on the path are “tail-to-tail”.

X ← N → Y

X = “Owns expensive car”   N = “Rich”   Y = “Owns expensive home”

The path between X and Y is blocked by N.
Case 2
There exists a node N on the path such that
– it is in the evidence set E, and
– the arcs putting N on the path are “tail-to-head”.

X → N → Y

X = Education   N = Job   Y = Rich

The path between X and Y is blocked by N.

Case 3
There exists a node N on the path such that
– it is NOT in the evidence set E (not shaded),
– neither are any of its descendants, and
– the arcs putting N on the path are “head-to-head”.

X → N ← Y

The path between X and Y is blocked by N. (Note that N is not in the evidence set.)
Case 3 (Explaining Away)

Graph: Burglary → Alarm ← Earthquake

• Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes
• The Earth obviously doesn't care whether your house is currently being broken into
• While you are on vacation, one of your nice neighbors calls and lets you know your alarm went off
Case 3 (Explaining Away)
Graph: Burglary → Alarm ← Earthquake

• But if you knew that a medium-sized earthquake happened, then you are probably relieved that it is probably not a burglar
• The earthquake “explains away” the hypothetical burglar
• This means that Burglary and Earthquake are not independent given Alarm.
• But Burglary and Earthquake are independent given no evidence, i.e. learning about an earthquake when you know nothing about the status of your alarm doesn't give you any information about the burglary
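A numerical sketch of explaining away, using only the Burglary/Earthquake/Alarm CPTs given earlier in the deck: the posterior probability of a burglary drops sharply once the earthquake is also observed.

```python
from itertools import product

# CPTs from the burglary/earthquake/alarm slides earlier in this deck.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A_true = {(True, True): 0.95, (True, False): 0.94,
            (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)

def joint(b, e, a):
    p_a = P_A_true[(b, e)] if a else 1 - P_A_true[(b, e)]
    return P_B[b] * P_E[e] * p_a

# P(Burglary=true | Alarm=true): sum the joint over Earthquake.
p_b_given_a = (sum(joint(True, e, True) for e in (True, False))
               / sum(joint(b, e, True) for b, e in product((True, False), repeat=2)))

# P(Burglary=true | Alarm=true, Earthquake=true): the earthquake "explains away" the alarm.
p_b_given_ae = joint(True, True, True) / sum(joint(b, True, True) for b in (True, False))

print(round(p_b_given_a, 3), round(p_b_given_ae, 4))   # ~0.374 vs ~0.0033
```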
Holmes and Watson: “Icy roads” Example

Holmes and Watson: “Wet grass” Example

Holmes and Watson: “Burglar alarm” Example

Conditional Independence

• a is independent of b given c: p(a | b, c) = p(a | c)

• Equivalently: p(a, b | c) = p(a | c) p(b | c)

• Notation: a ⫫ b | c
Conditional Independence: Example 1
Conditional Independence: Example 1
Conditional Independence: Example 2
Conditional Independence: Example 2
Conditional Independence: Example 3

• Note: this is the opposite of Example 1, with c unobserved.


Conditional Independence: Example 3

Note: this is the opposite of Example 1, with c observed.


“Am I out of fuel?”

B = Battery (0 = flat, 1 = fully charged)

F = Fuel Tank (0 = empty, 1 = full)

G = Fuel Gauge Reading (0 = empty, 1 = full)
“Am I out of fuel?”

The probability of an empty tank is increased by observing G = 0.


“Am I out of fuel?”

The probability of an empty tank is reduced by observing B = 0.

This is referred to as “explaining away”.
Markov Models
Introduction to Markov Models
• Set of states: {s1, s2, …, sN}
• The process moves from one state to another, generating a sequence of states si1, si2, …, sik, …
• Markov chain property: the probability of each subsequent state depends only on what was the previous state:
P(sik | si1, si2, …, sik−1) = P(sik | sik−1)
• To define a Markov model, the following probabilities have to be specified:
transition probabilities aij = P(si | sj) and initial probabilities πi = P(si)
Example of Markov Model
[Diagram: two-state chain over ‘Rain’ and ‘Dry’ with self-transitions 0.3 and 0.8 and cross-transitions 0.7 and 0.2]

• Two states: ‘Rain’ and ‘Dry’.

• Transition probabilities: P(‘Rain’|‘Rain’) = 0.3, P(‘Dry’|‘Rain’) = 0.7, P(‘Rain’|‘Dry’) = 0.2, P(‘Dry’|‘Dry’) = 0.8
• Initial probabilities: say P(‘Rain’) = 0.4, P(‘Dry’) = 0.6.
Calculation of sequence probability
• By the Markov chain property, the probability of a state sequence can be found by the formula:
P(si1, si2, …, sik) = P(sik | si1, …, sik−1) P(si1, …, sik−1)
= P(sik | sik−1) P(si1, …, sik−1) = …
= P(sik | sik−1) P(sik−1 | sik−2) ⋯ P(si2 | si1) P(si1)
• Suppose we want to calculate the probability of a sequence of states in our example, {‘Dry’, ’Dry’, ’Rain’, ’Rain’}:
P({‘Dry’, ’Dry’, ’Rain’, ’Rain’})
= P(‘Rain’|’Rain’) P(‘Rain’|’Dry’) P(‘Dry’|’Dry’) P(‘Dry’)
= 0.3 × 0.2 × 0.8 × 0.6 = 0.0288
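A minimal sketch of this calculation for the Rain/Dry chain above:

```python
# Transition and initial probabilities from the Rain/Dry example above.
trans = {("Rain", "Rain"): 0.3, ("Rain", "Dry"): 0.7,
         ("Dry", "Rain"): 0.2, ("Dry", "Dry"): 0.8}   # trans[(s, s')] = P(s' | s)
init = {"Rain": 0.4, "Dry": 0.6}

def sequence_probability(states):
    """P(s_1, ..., s_k) = P(s_1) * prod_t P(s_t | s_{t-1}) by the Markov property."""
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    return p

print(sequence_probability(["Dry", "Dry", "Rain", "Rain"]))   # 0.6*0.8*0.2*0.3 = 0.0288
```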
Hidden Markov Models.
• Set of states: {s1, s2, …, sN}
• The process moves from one state to another, generating a sequence of states si1, si2, …, sik, …
• Markov chain property: the probability of each subsequent state depends only on what was the previous state:
P(sik | si1, si2, …, sik−1) = P(sik | sik−1)
• States are not visible, but each state randomly generates one of M observations (or visible states) {v1, v2, …, vM}
• To define a hidden Markov model, the following probabilities have to be specified:
• matrix of transition probabilities A = (aij), aij = P(si | sj)
• matrix of observation probabilities B = (bi(vm)), bi(vm) = P(vm | si)
• initial probabilities π = (πi), πi = P(si). The model is represented by M = (A, B, π).
Example of Hidden Markov Model
[Diagram: hidden states ‘Low’ and ‘High’ atmospheric pressure with transition probabilities 0.3, 0.7, 0.2, 0.8; each state emits the observations ‘Rain’ and ‘Dry’ with probabilities 0.6/0.4 (Low) and 0.4/0.6 (High)]

Two states: ‘Low’ and ‘High’ atmospheric pressure.
Example of Hidden Markov Model
1. Two states : ‘Low’ and ‘High’ atmospheric pressure.
2. Two observations : ‘Rain’ and ‘Dry’.
3. Transition probabilities: P(‘Low’|‘Low’) = 0.3, P(‘High’|‘Low’) = 0.7, P(‘Low’|‘High’) = 0.2, P(‘High’|‘High’) = 0.8
4. Observation probabilities: P(‘Rain’|‘Low’) = 0.6, P(‘Dry’|‘Low’) = 0.4, P(‘Rain’|‘High’) = 0.4, P(‘Dry’|‘High’) = 0.6.
5. Initial probabilities: say P(‘Low’) = 0.4, P(‘High’) = 0.6.
Calculation of observation sequence probability
• Suppose we want to calculate the probability of a sequence of observations in our example, {‘Dry’, ’Rain’}.
• Consider all possible hidden state sequences:
P({‘Dry’,’Rain’}) = P({‘Dry’,’Rain’}, {‘Low’,’Low’}) + P({‘Dry’,’Rain’}, {‘Low’,’High’}) + P({‘Dry’,’Rain’}, {‘High’,’Low’}) + P({‘Dry’,’Rain’}, {‘High’,’High’})
where the first term is:
P({‘Dry’,’Rain’}, {‘Low’,’Low’})
= P({‘Dry’,’Rain’} | {‘Low’,’Low’}) P({‘Low’,’Low’})
= P(‘Dry’|’Low’) P(‘Rain’|’Low’) P(‘Low’) P(‘Low’|’Low’)
= 0.4 × 0.6 × 0.4 × 0.3 = 0.0288
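A brute-force sketch of this sum over all hidden state sequences, using the parameters above (with the corrected value P(‘Dry’|‘High’) = 0.6 so that the observation probabilities of each state sum to one). For longer sequences the forward algorithm computes the same quantity efficiently.

```python
from itertools import product

# HMM parameters from the Low/High atmospheric pressure example above.
states = ["Low", "High"]
init = {"Low": 0.4, "High": 0.6}
trans = {("Low", "Low"): 0.3, ("Low", "High"): 0.7,
         ("High", "Low"): 0.2, ("High", "High"): 0.8}
emit = {("Low", "Rain"): 0.6, ("Low", "Dry"): 0.4,
        ("High", "Rain"): 0.4, ("High", "Dry"): 0.6}

def observation_probability(obs):
    """Sum P(obs, path) over every possible hidden state sequence (brute force)."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = init[path[0]] * emit[(path[0], obs[0])]
        for t in range(1, len(obs)):
            p *= trans[(path[t - 1], path[t])] * emit[(path[t], obs[t])]
        total += p
    return total

print(observation_probability(["Dry", "Rain"]))   # 0.0288 + 0.0448 + 0.0432 + 0.1152 = 0.232
```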
Hidden Markov Model
Table 1 – State sequence probabilities for the Hot/Cold weather example
Review Questions
1. Is it true that the Naïve Bayes classifier is based on the assumption that input features are independent of each other?
2. What is Bayes' theorem?
3. What is marginal probability?
4. How do we compute a joint probability distribution using a graphical network model?
Thank you
