
Machine Learning

Graphical Models

Lecturer: Duc Dung Nguyen, PhD.


Contact: nddung@hcmut.edu.vn

Faculty of Computer Science and Engineering


Ho Chi Minh City University of Technology

Contents

1. Bayesian Networks (revisited)

2. Naive Bayes Classifier (revisited)

3. Hidden Markov Models



Bayesian Networks (revisited)
Bayesian Networks

Graphical models represent relationships between events. A directed edge A → B means that B depends on A.




Bayesian Networks

Advantages of graphical modeling:

• Conditional independence:
  p(D|C, E, A, B) = p(D|C)
• Factorization:
  p(A, B, C, D, E) = p(D|C) p(E|C) p(C|A, B) p(A) p(B)

The factorization replaces one large joint distribution with small conditional probabilities that are easier to compute, as the sketch below illustrates.
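To make the benefit concrete, here is a minimal Python sketch that evaluates the joint probability through its factors. All probability tables are invented for illustration; only the factorization structure comes from the slide.

# Illustrative conditional probability tables (all numbers invented).
p_A = {True: 0.3, False: 0.7}                                   # p(A)
p_B = {True: 0.6, False: 0.4}                                   # p(B)
p_C = {(a, b): {True: pc, False: 1 - pc}                        # p(C | A, B)
       for (a, b), pc in {(True, True): 0.9, (True, False): 0.5,
                          (False, True): 0.4, (False, False): 0.1}.items()}
p_D = {True: {True: 0.8, False: 0.2},
       False: {True: 0.3, False: 0.7}}                          # p(D | C)
p_E = {True: {True: 0.7, False: 0.3},
       False: {True: 0.2, False: 0.8}}                          # p(E | C)

def joint(a, b, c, d, e):
    # p(A,B,C,D,E) = p(D|C) p(E|C) p(C|A,B) p(A) p(B)
    return p_D[c][d] * p_E[c][e] * p_C[(a, b)][c] * p_A[a] * p_B[b]

print(joint(True, False, True, True, False))  # one entry of the joint table

Five small tables are stored instead of one table with 2^5 = 32 entries; with more variables the savings grow exponentially.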


Naive Bayes Classifier (revisited)
Naive Bayes Classifier

• Each instance x is described by a conjunction of attribute values ⟨a1, a2, ..., an⟩.

• The task is to assign the most probable class c to an instance:

  c_NB = arg max_{c∈C} p(a1, a2, ..., an |c) p(c)
       = arg max_{c∈C} ∏_{i=1..n} p(ai |c) p(c)
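A minimal sketch of this decision rule, assuming the prior p(c) and the per-attribute likelihoods p(ai |c) have already been estimated. Log probabilities are used to avoid underflow with many attributes; the table layout p_attr[c][i][a] is a hypothetical choice for this sketch.

import math

def classify_nb(attrs, p_class, p_attr):
    """Return arg max_c p(c) * prod_i p(a_i | c).

    p_class[c] is the prior p(c); p_attr[c][i][a] is p(a_i = a | c)
    (a layout assumed for this sketch, not prescribed by the lecture).
    """
    best_c, best_logp = None, float("-inf")
    for c, prior in p_class.items():
        # sum of log probabilities instead of a product of probabilities
        logp = math.log(prior) + sum(math.log(p_attr[c][i][a])
                                     for i, a in enumerate(attrs))
        if logp > best_logp:
            best_c, best_logp = c, logp
    return best_c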


Naive Bayes Classifier

(Figure: the Naive Bayes network, in which the class node C is the sole parent of every attribute node A1, ..., An, so the attributes are conditionally independent given the class.)


Naive Bayes Classifier

Joint distribution: p(C, A1 , A2 , ..., An )




Naive Bayes Classifier

Naive Bayes is a generative model:

• It models a joint distribution: p(C, A)
• It can generate any distribution on C and A: once we have the joint distribution of the data, we can generate new data from it.

In contrast to a discriminative model (e.g., a CRF):

• Conditional distribution: p(C|A)
• It discriminates C given A.
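To illustrate the generative side, here is a sketch that draws labelled instances from the joint p(C, A): first sample a class from the prior, then each attribute from its class-conditional distribution. The same hypothetical table layout as in classify_nb above is assumed.

import random

def sample_nb(p_class, p_attr):
    # c ~ p(C), then a_i ~ p(A_i | c) for each attribute independently
    c = random.choices(list(p_class), weights=list(p_class.values()))[0]
    attrs = [random.choices(list(table), weights=list(table.values()))[0]
             for table in p_attr[c]]
    return c, attrs

A discriminative model such as a CRF has no analogue of sample_nb: it only scores C given A.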


Hidden Markov Models
Hidden Markov Models

• Introduction
• Example
• Independence assumptions
• Forward algorithm
• Viterbi algorithm
• Training
• Application to NER



Hidden Markov Models

HMMs capture dependencies between successive data points in a sequence.

• One of the most popular graphical models.
• A dynamic extension of Bayesian networks.
• A sequential extension of the Naive Bayes classifier.


Hidden Markov Models

Example:

• Your possible appearance before the exam (the observations): tired, hungover, scared, fine.
• Your possible activity the night before (the hidden states to guess): TV, pub, party, study.
• Given the sequence of observations of your appearance, guess what you did on the previous nights.

A model:

• Your appearance depends on what you did the night before.
• Your activity on a given night depends on what you did on previous nights.


Hidden Markov Models

• A finite set of possible observations.
• A finite set of possible hidden states.
• Goal: predict the most probable sequence of hidden states {y1, y2, ..., yT} for a given sequence of observations {x1, x2, ..., xT}. By Bayes' theorem:

  p(y|x) = p(x|y) p(y) / p(x)

  where y denotes the state sequence and x the observation sequence.
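Concretely, an HMM is specified by an initial state distribution, a transition matrix, and an emission matrix. Below is the exam example encoded in Python; the lecture does not give the actual numbers, so every probability here is an invented placeholder. Later sketches reuse these tables.

states = ["TV", "pub", "party", "study"]               # hidden: last night's activity
obs_symbols = ["tired", "hungover", "scared", "fine"]  # observed: your appearance

# initial state distribution p(y1)  (placeholder: uniform)
pi = {s: 0.25 for s in states}

# transition probabilities p(y_t | y_{t-1})  (placeholder: uniform)
trans = {s: {s2: 0.25 for s2 in states} for s in states}

# emission probabilities p(x_t | y_t)  (invented numbers; each row sums to 1)
emit = {
    "TV":    {"tired": 0.4, "hungover": 0.1, "scared": 0.1, "fine": 0.4},
    "pub":   {"tired": 0.3, "hungover": 0.4, "scared": 0.1, "fine": 0.2},
    "party": {"tired": 0.3, "hungover": 0.5, "scared": 0.1, "fine": 0.1},
    "study": {"tired": 0.2, "hungover": 0.1, "scared": 0.4, "fine": 0.3},
}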


Hidden Markov Models

(Figure: an example HMM, from Marsland, S. (2009) Machine Learning: An Algorithmic Perspective.)


Hidden Markov Models

HMM conditional independence assumptions:

• The state at time t depends only on the state at time t − 1:
  p(yt |yt−1, Z) = p(yt |yt−1)
• The observation at time t depends only on the state at time t:
  p(xt |yt, Z) = p(xt |yt)

where Z is any set of other variables in the sequence.



Hidden Markov Models

HMM is a generative model:

• Joint distribution:

  p(Y, X) = p(y1, y2, ..., yT, x1, x2, ..., xT) = ∏_{t=1..T} p(yt |yt−1) p(xt |yt)

  with the convention p(y1 |y0) = p(y1).

• It can generate any distribution on Y and X (HMMs were the top-layer model in classical speech recognition, for example).

In contrast to a discriminative model (e.g., a CRF):

• Conditional distribution: p(Y |X)
• It discriminates Y given X.
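Because the joint factorizes this way, a sequence can be generated by ancestral sampling: draw y1, then alternately emit xt from p(xt |yt) and step to the next state via p(yt+1 |yt). A sketch using the pi/trans/emit tables defined earlier:

import random

def sample_hmm(pi, trans, emit, T):
    def draw(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]
    ys, xs = [], []
    y = draw(pi)                   # y1 ~ p(y1), the convention p(y1|y0) = p(y1)
    for _ in range(T):
        ys.append(y)
        xs.append(draw(emit[y]))   # x_t ~ p(x_t | y_t)
        y = draw(trans[y])         # y_{t+1} ~ p(y_{t+1} | y_t)
    return ys, xs

print(sample_hmm(pi, trans, emit, 5))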

Hidden Markov Model

Forward algorithm:

• Compute the joint probability that the state at time t is yt and the sequence of observations in the first t steps is {x1, x2, ..., xt}:

  αt(yt) = p(yt, x1, x2, ..., xt)

• Bayes' theorem gives:

  p(yt |x1, x2, ..., xt) = p(yt, x1, x2, ..., xt) / p(x1, x2, ..., xt) = αt(yt) / p(x1, x2, ..., xt)

• For a fixed observation sequence {x1, x2, ..., xt}, the state yt with the highest αt(yt) is therefore the most likely one.



Hidden Markov Models

Forward algorithm:

αt(yt) = p(yt, x1, x2, ..., xt)
       = Σ_{yt−1} p(yt, yt−1, x1, x2, ..., xt)
       = Σ_{yt−1} p(xt |yt, yt−1, x1, x2, ..., xt−1) p(yt, yt−1, x1, x2, ..., xt−1)
       = Σ_{yt−1} p(xt |yt) p(yt |yt−1, x1, x2, ..., xt−1) p(yt−1, x1, x2, ..., xt−1)
       = Σ_{yt−1} p(xt |yt) p(yt |yt−1) p(yt−1, x1, x2, ..., xt−1)
       = p(xt |yt) Σ_{yt−1} p(yt |yt−1) αt−1(yt−1)

Base case: α1(y1) = p(y1, x1) = p(x1 |y1) p(y1)


Hidden Markov Models

Forward algorithm (the recursion, sketched in code below):

αt(yt) = p(xt |yt) Σ_{yt−1} p(yt |yt−1) αt−1(yt−1)
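A direct transcription of this recursion, assuming the dict-of-dicts tables (pi, trans, emit) introduced above. Only the current vector of α values is kept, so the cost is O(T · |states|²).

def forward(xs, states, pi, trans, emit):
    # base case: alpha_1(y1) = p(x1 | y1) p(y1)
    alpha = {y: emit[y][xs[0]] * pi[y] for y in states}
    for x in xs[1:]:
        # alpha_t(y) = p(x_t | y) * sum_{y'} p(y | y') alpha_{t-1}(y')
        alpha = {y: emit[y][x] * sum(trans[yp][y] * alpha[yp] for yp in states)
                 for y in states}
    return alpha  # alpha_T(y) = p(y, x1, ..., xT) for each state y

Summing the returned values gives p(x1, ..., xT), and dividing by that sum recovers p(yT |x1, ..., xT) as in the Bayes step above. In practice log probabilities or rescaling are used to avoid underflow on long sequences.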


Hidden Markov Models

Viterbi algorithm:

• Find the most probable sequence of states {y1, y2, ..., yT} given a sequence of observations {x1, x2, ..., xT}:

  Y* = arg max_Y p(Y |X) = arg max_Y p(Y, X)


Hidden Markov Models

• Viterbi algorithm:

  max_{y1:T} p(y1, y2, ..., yT, x1, x2, ..., xT)
    = max_{yT} max_{y1:T−1} p(y1, y2, ..., yT, x1, x2, ..., xT)
    = max_{yT} max_{y1:T−1} {p(xT |yT) p(yT |yT−1) p(y1, ..., yT−1, x1, x2, ..., xT−1)}
    = max_{yT} max_{yT−1} {p(xT |yT) p(yT |yT−1) max_{y1:T−2} p(y1, ..., yT−1, x1, x2, ..., xT−1)}

• Dynamic programming (a code sketch follows):
  • Compute
    arg max_{y1} p(y1, x1) = arg max_{y1} p(x1 |y1) p(y1)
  • For each t from 2 to T, and for each state yt, compute:
    arg max_{y1:t−1} p(y1, y2, ..., yt, x1, x2, ..., xt)
  • Select
    arg max_{y1:T} p(y1, y2, ..., yT, x1, x2, ..., xT)


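A sketch of the dynamic program, using the same hypothetical HMM tables as before. delta holds the probability of the best state sequence ending in each state at the current step; back-pointers recover the arg max path.

def viterbi(xs, states, pi, trans, emit):
    # step 1: delta_1(y) = p(x1 | y) p(y)
    delta = {y: emit[y][xs[0]] * pi[y] for y in states}
    backptr = []
    # step 2: for t = 2..T, maximize over the previous state
    for x in xs[1:]:
        prev = delta
        delta, bp = {}, {}
        for y in states:
            best = max(states, key=lambda yp: prev[yp] * trans[yp][y])
            delta[y] = emit[y][x] * prev[best] * trans[best][y]
            bp[y] = best
        backptr.append(bp)
    # step 3: select the best final state and follow the back-pointers
    y = max(states, key=lambda s: delta[s])
    path = [y]
    for bp in reversed(backptr):
        y = bp[y]
        path.append(y)
    return list(reversed(path))

For example, viterbi(["tired", "hungover", "fine"], states, pi, trans, emit) returns a most probable 3-state activity sequence under the placeholder numbers, which is the shape of the homework exercise.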


Hidden Markov Models

• Could the results from the forward algorithm be used for the Viterbi algorithm? Note that the forward recursion sums over the previous state while Viterbi maximizes over it, so picking the highest αt(yt) at each step yields the most likely state at each time, not the most likely sequence of states.


Hidden Markov Models

Training HMMs:

• Topology is designed beforehand.


• Parameters to be learned: emission and transition probabilities.
• Supervised or unsupervised training.



Hidden Markov Models

Supervised learning (a counting sketch follows):

• Training data: paired sequences of states and observations (y1, y2, ..., yT, x1, x2, ..., xT).
• p(yi) = number of sequences starting with yi / number of all sequences.
• p(yj |yi) = number of (yi, yj) transitions / number of all transitions out of yi.
• p(xj |yi) = number of times xj is emitted from yi / number of occurrences of yi.
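These counting estimates translate directly into code. A minimal sketch, assuming training pairs of aligned state and observation sequences; Counter entries default to zero, so unseen events get probability 0 (real systems would smooth the counts).

from collections import Counter

def train_supervised(pairs, states, obs_symbols):
    """pairs: list of (state_sequence, observation_sequence), aligned."""
    start, occ, trans_from = Counter(), Counter(), Counter()
    trans_c, emit_c = Counter(), Counter()
    for ys, xs in pairs:
        start[ys[0]] += 1                      # sequence-initial states
        for yp, y in zip(ys, ys[1:]):
            trans_c[(yp, y)] += 1              # transition counts
            trans_from[yp] += 1
        for y, x in zip(ys, xs):
            emit_c[(y, x)] += 1                # emission counts
            occ[y] += 1
    n = len(pairs)
    pi = {y: start[y] / n for y in states}
    trans = {yp: {y: trans_c[(yp, y)] / max(trans_from[yp], 1) for y in states}
             for yp in states}
    emit = {y: {x: emit_c[(y, x)] / max(occ[y], 1) for x in obs_symbols}
            for y in states}
    return pi, trans, emit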


Hidden Markov Models

Supervised learning example:





Hidden Markov Models

Unsupervised learning:

• Only observation sequences are available

• Iterative improvement of model parameters.


• How?





Hidden Markov Models

Unsupervised learning:

• Initialize the estimated parameters.
• For each observation sequence, compute the most probable state sequence using the Viterbi algorithm.
• Update the parameters using supervised learning on the obtained paired state-observation sequences.
• Repeat until convergence (see the sketch below).

Hidden Markov Models

Application to NER:

• Example: "Facebook CEO Zuckerberg visited Vietnam".
  ORG = "Facebook"
  PER = "Zuckerberg"
  LOC = "Vietnam"
  NIL = "CEO", "visited"
• States = class labels
• Observations = words + features


Hidden Markov Models

Application to NER:

(Figure: the name-finder HMM of Bikel, D. M. et al. (1997), Nymble: a high-performance learning name-finder.)
Hidden Markov Models

Application to NER:

• What if a name is a multi-word phrase?
• Example: "... John von Neumann is ..."
  B-PER = "John"
  I-PER = "von", "Neumann"
  O = "is"
• BIO notation: {B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC, O}
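With BIO tags as HMM states, the decoded tag sequence still has to be turned back into entity spans: a B- tag opens a span, I- tags extend it, and O closes it. A small sketch of that post-processing step:

def extract_entities(tagged):
    """tagged: list of (word, tag) pairs, e.g. as decoded by viterbi()."""
    entities, current, label = [], [], None
    for word, tag in tagged + [("", "O")]:   # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if current:                       # close any open span
                entities.append((label, " ".join(current)))
            current, label = ([word], tag[2:]) if tag.startswith("B-") else ([], None)
        else:                                 # an "I-" tag continues the span
            current.append(word)
    return entities

print(extract_entities([("John", "B-PER"), ("von", "I-PER"),
                        ("Neumann", "I-PER"), ("is", "O")]))
# -> [('PER', 'John von Neumann')]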


Homework

• Readings
  • Marsland, S. (2009) Machine Learning: An Algorithmic Perspective. Chapter 15 (graphical models).
  • Bikel, D. M. et al. (1997) Nymble: a high-performance learning name-finder.
• Homework
  • Apply the Viterbi algorithm to find the most probable 3-state sequence in the looking-activity example from the lecture.
  • Write a program to carry out the unsupervised learning example for HMMs in the lecture. Discuss the results, in particular the convergence of the process.
