
Machine Learning

Graphical Models

Lecturer: Duc Dung Nguyen, PhD.


Contact: nddung@hcmut.edu.vn

Faculty of Computer Science and Engineering


Ho Chi Minh City University of Technology

Contents

1. Bayesian Networks (revisited)

2. Naive Bayes Classifier (revisited)

3. Hidden Markov Models



Bayesian Networks (revisited)
Bayesian Networks

Graphical models represent relationships between events. A directed edge A → B means that B depends on A.




Bayesian Networks

Advantages of graphical modeling:

• Conditional independence:
  p(D|C, E, A, B) = p(D|C)
• Factorization:
  p(A, B, C, D, E) = p(D|C) p(E|C) p(C|A, B) p(A) p(B)

The factorization replaces one large joint distribution with small conditional probabilities that are easier to compute, as the sketch below illustrates.
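To make the benefit concrete, here is a minimal Python sketch that evaluates the joint probability through its factors. All probability tables are invented for illustration; only the factorization structure comes from the slide.

# Illustrative conditional probability tables (all numbers invented).
p_A = {True: 0.3, False: 0.7}                                   # p(A)
p_B = {True: 0.6, False: 0.4}                                   # p(B)
p_C = {(a, b): {True: pc, False: 1 - pc}                        # p(C | A, B)
       for (a, b), pc in {(True, True): 0.9, (True, False): 0.5,
                          (False, True): 0.4, (False, False): 0.1}.items()}
p_D = {True: {True: 0.8, False: 0.2},
       False: {True: 0.3, False: 0.7}}                          # p(D | C)
p_E = {True: {True: 0.7, False: 0.3},
       False: {True: 0.2, False: 0.8}}                          # p(E | C)

def joint(a, b, c, d, e):
    # p(A,B,C,D,E) = p(D|C) p(E|C) p(C|A,B) p(A) p(B)
    return p_D[c][d] * p_E[c][e] * p_C[(a, b)][c] * p_A[a] * p_B[b]

print(joint(True, False, True, True, False))  # one entry of the joint table

Five small tables are stored instead of one table with 2^5 = 32 entries; with more variables the savings grow exponentially.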


Naive Bayes Classifier (revisited)
Naive Bayes Classifier

• Each instance x is described by a conjunction of attribute values ⟨a1, a2, ..., an⟩.

• The task is to assign the most probable class c to an instance:

  c_NB = arg max_{c∈C} p(a1, a2, ..., an |c) p(c)
       = arg max_{c∈C} ∏_{i=1..n} p(ai |c) p(c)
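A minimal sketch of this decision rule, assuming the prior p(c) and the per-attribute likelihoods p(ai |c) have already been estimated. Log probabilities are used to avoid underflow with many attributes; the table layout p_attr[c][i][a] is a hypothetical choice for this sketch.

import math

def classify_nb(attrs, p_class, p_attr):
    """Return arg max_c p(c) * prod_i p(a_i | c).

    p_class[c] is the prior p(c); p_attr[c][i][a] is p(a_i = a | c)
    (a layout assumed for this sketch, not prescribed by the lecture).
    """
    best_c, best_logp = None, float("-inf")
    for c, prior in p_class.items():
        # sum of log probabilities instead of a product of probabilities
        logp = math.log(prior) + sum(math.log(p_attr[c][i][a])
                                     for i, a in enumerate(attrs))
        if logp > best_logp:
            best_c, best_logp = c, logp
    return best_c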


Naive Bayes Classifier

(Figure: the Naive Bayes network, in which the class node C is the sole parent of every attribute node A1, ..., An, so the attributes are conditionally independent given the class.)


Naive Bayes Classifier

Joint distribution: p(C, A1 , A2 , ..., An )




Naive Bayes Classifier

Naive Bayes is a generative model:

• It models a joint distribution: p(C, A)
• It can generate any distribution on C and A: once we have the joint distribution of the data, we can generate new data from it.

In contrast to a discriminative model (e.g., a CRF):

• Conditional distribution: p(C|A)
• It discriminates C given A.
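To illustrate the generative side, here is a sketch that draws labelled instances from the joint p(C, A): first sample a class from the prior, then each attribute from its class-conditional distribution. The same hypothetical table layout as in classify_nb above is assumed.

import random

def sample_nb(p_class, p_attr):
    # c ~ p(C), then a_i ~ p(A_i | c) for each attribute independently
    c = random.choices(list(p_class), weights=list(p_class.values()))[0]
    attrs = [random.choices(list(table), weights=list(table.values()))[0]
             for table in p_attr[c]]
    return c, attrs

A discriminative model such as a CRF has no analogue of sample_nb: it only scores C given A.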


Hidden Markov Models
Hidden Markov Models

• Introduction
• Example
• Independence assumptions
• Forward algorithm
• Viterbi algorithm
• Training
• Application to NER



Hidden Markov Models

HMMs capture dependencies between successive data points in a sequence.

• One of the most popular graphical models.
• A dynamic extension of Bayesian networks.
• A sequential extension of the Naive Bayes classifier.


Hidden Markov Models

Example:

• Your possible appearance before the exam (the observations): tired, hungover, scared, fine.
• Your possible activity the night before (the hidden states to guess): TV, pub, party, study.
• Given the sequence of observations of your appearance, guess what you did on the previous nights.

A model:

• Your appearance depends on what you did the night before.
• Your activity on a given night depends on what you did on previous nights.


Hidden Markov Models

• A finite set of possible observations.
• A finite set of possible hidden states.
• Goal: predict the most probable sequence of hidden states {y1, y2, ..., yT} for a given sequence of observations {x1, x2, ..., xT}. By Bayes' theorem:

  p(y|x) = p(x|y) p(y) / p(x)

  where y denotes the state sequence and x the observation sequence.
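Concretely, an HMM is specified by an initial state distribution, a transition matrix, and an emission matrix. Below is the exam example encoded in Python; the lecture does not give the actual numbers, so every probability here is an invented placeholder. Later sketches reuse these tables.

states = ["TV", "pub", "party", "study"]               # hidden: last night's activity
obs_symbols = ["tired", "hungover", "scared", "fine"]  # observed: your appearance

# initial state distribution p(y1)  (placeholder: uniform)
pi = {s: 0.25 for s in states}

# transition probabilities p(y_t | y_{t-1})  (placeholder: uniform)
trans = {s: {s2: 0.25 for s2 in states} for s in states}

# emission probabilities p(x_t | y_t)  (invented numbers; each row sums to 1)
emit = {
    "TV":    {"tired": 0.4, "hungover": 0.1, "scared": 0.1, "fine": 0.4},
    "pub":   {"tired": 0.3, "hungover": 0.4, "scared": 0.1, "fine": 0.2},
    "party": {"tired": 0.3, "hungover": 0.5, "scared": 0.1, "fine": 0.1},
    "study": {"tired": 0.2, "hungover": 0.1, "scared": 0.4, "fine": 0.3},
}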


Hidden Markov Models

(Figure: an example HMM, from Marsland, S. (2009) Machine Learning: An Algorithmic Perspective.)


Hidden Markov Models

HMM conditional independence assumptions:

• The state at time t depends only on the state at time t − 1:
  p(yt |yt−1, Z) = p(yt |yt−1)
• The observation at time t depends only on the state at time t:
  p(xt |yt, Z) = p(xt |yt)

where Z is any set of other variables in the sequence.



Hidden Markov Models

HMM is a generative model:

• Joint distribution:

  p(Y, X) = p(y1, y2, ..., yT, x1, x2, ..., xT) = ∏_{t=1..T} p(yt |yt−1) p(xt |yt)

  with the convention p(y1 |y0) = p(y1).

• It can generate any distribution on Y and X (HMMs were the top-layer model in classical speech recognition, for example).

In contrast to a discriminative model (e.g., a CRF):

• Conditional distribution: p(Y |X)
• It discriminates Y given X.
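Because the joint factorizes this way, a sequence can be generated by ancestral sampling: draw y1, then alternately emit xt from p(xt |yt) and step to the next state via p(yt+1 |yt). A sketch using the pi/trans/emit tables defined earlier:

import random

def sample_hmm(pi, trans, emit, T):
    def draw(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]
    ys, xs = [], []
    y = draw(pi)                   # y1 ~ p(y1), the convention p(y1|y0) = p(y1)
    for _ in range(T):
        ys.append(y)
        xs.append(draw(emit[y]))   # x_t ~ p(x_t | y_t)
        y = draw(trans[y])         # y_{t+1} ~ p(y_{t+1} | y_t)
    return ys, xs

print(sample_hmm(pi, trans, emit, 5))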

Hidden Markov Model

Forward algorithm:

• Compute the joint probability that the state at time t is yt and the sequence of observations in the first t steps is {x1, x2, ..., xt}:

  αt(yt) = p(yt, x1, x2, ..., xt)

• Bayes' theorem gives:

  p(yt |x1, x2, ..., xt) = p(yt, x1, x2, ..., xt) / p(x1, x2, ..., xt) = αt(yt) / p(x1, x2, ..., xt)

• For a fixed observation sequence {x1, x2, ..., xt}, the state yt with the highest αt(yt) is therefore the most likely one.



Hidden Markov Models

Forward algorithm:

αt(yt) = p(yt, x1, x2, ..., xt)
       = Σ_{yt−1} p(yt, yt−1, x1, x2, ..., xt)
       = Σ_{yt−1} p(xt |yt, yt−1, x1, x2, ..., xt−1) p(yt, yt−1, x1, x2, ..., xt−1)
       = Σ_{yt−1} p(xt |yt) p(yt |yt−1, x1, x2, ..., xt−1) p(yt−1, x1, x2, ..., xt−1)
       = Σ_{yt−1} p(xt |yt) p(yt |yt−1) p(yt−1, x1, x2, ..., xt−1)
       = p(xt |yt) Σ_{yt−1} p(yt |yt−1) αt−1(yt−1)

Base case: α1(y1) = p(y1, x1) = p(x1 |y1) p(y1)


Hidden Markov Models

Forward algorithm (the recursion, sketched in code below):

αt(yt) = p(xt |yt) Σ_{yt−1} p(yt |yt−1) αt−1(yt−1)
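A direct transcription of this recursion, assuming the dict-of-dicts tables (pi, trans, emit) introduced above. Only the current vector of α values is kept, so the cost is O(T · |states|²).

def forward(xs, states, pi, trans, emit):
    # base case: alpha_1(y1) = p(x1 | y1) p(y1)
    alpha = {y: emit[y][xs[0]] * pi[y] for y in states}
    for x in xs[1:]:
        # alpha_t(y) = p(x_t | y) * sum_{y'} p(y | y') alpha_{t-1}(y')
        alpha = {y: emit[y][x] * sum(trans[yp][y] * alpha[yp] for yp in states)
                 for y in states}
    return alpha  # alpha_T(y) = p(y, x1, ..., xT) for each state y

Summing the returned values gives p(x1, ..., xT), and dividing by that sum recovers p(yT |x1, ..., xT) as in the Bayes step above. In practice log probabilities or rescaling are used to avoid underflow on long sequences.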


Hidden Markov Models

Viterbi algorithm:

• Find the most probable sequence of states {y1, y2, ..., yT} given a sequence of observations {x1, x2, ..., xT}:

  Y* = arg max_Y p(Y |X) = arg max_Y p(Y, X)


Hidden Markov Models

• Viterbi algorithm:

  max_{y1:T} p(y1, y2, ..., yT, x1, x2, ..., xT)
    = max_{yT} max_{y1:T−1} p(y1, y2, ..., yT, x1, x2, ..., xT)
    = max_{yT} max_{y1:T−1} {p(xT |yT) p(yT |yT−1) p(y1, ..., yT−1, x1, x2, ..., xT−1)}
    = max_{yT} max_{yT−1} {p(xT |yT) p(yT |yT−1) max_{y1:T−2} p(y1, ..., yT−1, x1, x2, ..., xT−1)}

• Dynamic programming (a code sketch follows):
  • Compute
    arg max_{y1} p(y1, x1) = arg max_{y1} p(x1 |y1) p(y1)
  • For each t from 2 to T, and for each state yt, compute:
    arg max_{y1:t−1} p(y1, y2, ..., yt, x1, x2, ..., xt)
  • Select
    arg max_{y1:T} p(y1, y2, ..., yT, x1, x2, ..., xT)


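A sketch of the dynamic program, using the same hypothetical HMM tables as before. delta holds the probability of the best state sequence ending in each state at the current step; back-pointers recover the arg max path.

def viterbi(xs, states, pi, trans, emit):
    # step 1: delta_1(y) = p(x1 | y) p(y)
    delta = {y: emit[y][xs[0]] * pi[y] for y in states}
    backptr = []
    # step 2: for t = 2..T, maximize over the previous state
    for x in xs[1:]:
        prev = delta
        delta, bp = {}, {}
        for y in states:
            best = max(states, key=lambda yp: prev[yp] * trans[yp][y])
            delta[y] = emit[y][x] * prev[best] * trans[best][y]
            bp[y] = best
        backptr.append(bp)
    # step 3: select the best final state and follow the back-pointers
    y = max(states, key=lambda s: delta[s])
    path = [y]
    for bp in reversed(backptr):
        y = bp[y]
        path.append(y)
    return list(reversed(path))

For example, viterbi(["tired", "hungover", "fine"], states, pi, trans, emit) returns a most probable 3-state activity sequence under the placeholder numbers, which is the shape of the homework exercise.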


Hidden Markov Models

• Could the results from the forward algorithm be used for the Viterbi algorithm? Note that the forward recursion sums over the previous state while Viterbi maximizes over it, so picking the highest αt(yt) at each step yields the most likely state at each time, not the most likely sequence of states.


Hidden Markov Models

Training HMMs:

• Topology is designed beforehand.


• Parameters to be learned: emission and transition probabilities.
• Supervised or unsupervised training.



Hidden Markov Models

Supervised learning (a counting sketch follows):

• Training data: paired sequences of states and observations (y1, y2, ..., yT, x1, x2, ..., xT).
• p(yi) = number of sequences starting with yi / number of all sequences.
• p(yj |yi) = number of (yi, yj) transitions / number of all transitions out of yi.
• p(xj |yi) = number of times xj is emitted from yi / number of occurrences of yi.
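These counting estimates translate directly into code. A minimal sketch, assuming training pairs of aligned state and observation sequences; Counter entries default to zero, so unseen events get probability 0 (real systems would smooth the counts).

from collections import Counter

def train_supervised(pairs, states, obs_symbols):
    """pairs: list of (state_sequence, observation_sequence), aligned."""
    start, occ, trans_from = Counter(), Counter(), Counter()
    trans_c, emit_c = Counter(), Counter()
    for ys, xs in pairs:
        start[ys[0]] += 1                      # sequence-initial states
        for yp, y in zip(ys, ys[1:]):
            trans_c[(yp, y)] += 1              # transition counts
            trans_from[yp] += 1
        for y, x in zip(ys, xs):
            emit_c[(y, x)] += 1                # emission counts
            occ[y] += 1
    n = len(pairs)
    pi = {y: start[y] / n for y in states}
    trans = {yp: {y: trans_c[(yp, y)] / max(trans_from[yp], 1) for y in states}
             for yp in states}
    emit = {y: {x: emit_c[(y, x)] / max(occ[y], 1) for x in obs_symbols}
            for y in states}
    return pi, trans, emit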


Hidden Markov Models

Supervised learning example:





Hidden Markov Models

Unsupervised learning:

• Only observation sequences are available

• Iterative improvement of model parameters.


• How?





Hidden Markov Models

Unsupervised learning:

• Initialize the estimated parameters.
• For each observation sequence, compute the most probable state sequence using the Viterbi algorithm.
• Update the parameters using supervised learning on the obtained paired state-observation sequences.
• Repeat until convergence (see the sketch below).

Hidden Markov Models

Application to NER:

• Example: "Facebook CEO Zuckerberg visited Vietnam".
  ORG = "Facebook"
  PER = "Zuckerberg"
  LOC = "Vietnam"
  NIL = "CEO", "visited"
• States = class labels
• Observations = words + features


Hidden Markov Models

Application to NER:

(Figure: the name-finder HMM of Bikel, D. M. et al. (1997), Nymble: a high-performance learning name-finder.)
Hidden Markov Models

Application to NER:

• What if a name is a multi-word phrase?
• Example: "... John von Neumann is ..."
  B-PER = "John"
  I-PER = "von", "Neumann"
  O = "is"
• BIO notation: {B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC, O}
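With BIO tags as HMM states, the decoded tag sequence still has to be turned back into entity spans: a B- tag opens a span, I- tags extend it, and O closes it. A small sketch of that post-processing step:

def extract_entities(tagged):
    """tagged: list of (word, tag) pairs, e.g. as decoded by viterbi()."""
    entities, current, label = [], [], None
    for word, tag in tagged + [("", "O")]:   # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if current:                       # close any open span
                entities.append((label, " ".join(current)))
            current, label = ([word], tag[2:]) if tag.startswith("B-") else ([], None)
        else:                                 # an "I-" tag continues the span
            current.append(word)
    return entities

print(extract_entities([("John", "B-PER"), ("von", "I-PER"),
                        ("Neumann", "I-PER"), ("is", "O")]))
# -> [('PER', 'John von Neumann')]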


Homework

• Readings
  • Marsland, S. (2009) Machine Learning: An Algorithmic Perspective. Chapter 15 (graphical models).
  • Bikel, D. M. et al. (1997) Nymble: a high-performance learning name-finder.
• Homework
  • Apply the Viterbi algorithm to find the most probable 3-state sequence in the looking-activity example from the lecture.
  • Write a program to carry out the unsupervised learning example for HMMs in the lecture. Discuss the results, in particular the convergence of the process.
