Gaussian Mixture Model and Hidden Markov Model

Gaussian Mixture Model (GMM)
and
Hidden Markov Model (HMM)

Samudravijaya K
Majority of the slides are taken from S.Umeshs tutorial on ASR (WiSSAP 2006).
1 of 88
Pattern Recognition
Training
Model Generation Pattern Matching

Output
Input
Signal Processing
Testing
GMM: static patterns HMM: sequential patterns
WiSSAP 2009: Tutorial on GMM and HMM, Samudravijaya K
2 of 88
Basic Probability
Joint and Conditional probability p(A, B) = p(A|B) p(B) = p(B|A) p(A)
Bayes rule p(B|A) p(A) p(A|B) = p(B) If Ais are mutually exclusive events, p(B) =
i
p(B|Ai) p(Ai) p(B|A) p(A) i p(B|Ai) p(Ai )

3 of 88
p(A|B) =
Normal Distribution
Many phenomenon are described by Gaussian pdf 1 1 p(x|) = exp 2 (x )2 2 2 2 pdf is parameterised by = [, 2] where mean = and variance= 2. A convenient pdf : second order statistics is sucient. Example: Heights of Pygmies Gaussian pdf with = 4f t & std-dev() = 1f t OR: Heights of bushmen Gaussian pdf with = 6f t & std-dev() = 1f t Question:If we arbitrarily pick a person from a population what is the probability of the height being a particular value?
WiSSAP 2009: Tutorial on GMM and HMM, Samudravijaya K 4 of 88
(1)
If I pick arbitrarily a Pygmy, say x, then 1 1 exp (41 4)2 Pr(Height of x=41) = 2.1 2.1 Note: Here mean and variances are xed, only the observations, x, change. Also see: P r(x = 41) P r(x = 5)
(2)
AND
P r(x = 41) P r(x = 3)

5 of 88
Conversely: Given a persons height is 41 Person is more likely to be a pygmy than bushman. If we observe heights of many persons say 36, 41, 38, 45, 47, 4, 65 and all are from same population (i.e. either pygmy or bushmen.) then more certain we are that the population is pygmy. More the observations better will be our decision
Likelihood Function
x[0], x[1], . . . , x[N 1] set of independent observations from pdf parameterised by . Previous Example: x[0], x[1], . . . , x[N 1] are heights observed and is the mean of density which is unknown ( 2 assumed known).
N
L(X; ) = p(x0 . . . xN 1 ; ) =
i=0
p(xi; ) 1 1 exp 2 2 (2 2)N |2

N
i=0
(xi )2
(3)
L(X; ) is a function of and is called Likelihood Function Given: x0 . . . xN 1, what can we say about value of , i.e. best estimate of .
Maximum Likelihood Estimation

Example: We know height of a person x[0] = 44. Most likely to have come from which pdf = 3, 46 or 6 ?
Maximum of L(x[0]; = 3), L(x[0]; 46) and L(x[0]; = 6) choose = 46. If is just a parameter, we will choose arg max L(x[0]; ).
Maximum Likelihood Estimator

1 2 . . m1
Given x[0], x[1], . . . , x[N 1] and pdf parameterised by =

N
We form Likelihood function L(X; ) =
p(xi; )
i=0
M LE = arg max L(X; )
For height problem: can show ()M LE =

1 N
xi
Estimate of mean of Gaussian = sample mean of measured heights.

Bayesian Estimation
MLE is assumed unknown but deterministic Bayesian Approach: is assumed random with pdf p() Prior Knowledge. p(|x) =
Aposterior
p(x|)p() p(x|) p() p(x)
Prior
Height problem: Unknown mean is random pdf Gaussian N (, 2) p() = 1 exp 2 ( )2 2 2 2 1
2 + n 2x Then : ()Bayesian = 2 + n 2 Weighted average of sample mean and a prior mean

Gaussian Mixture Model
p(x) = p(x|N (1; 1)) + (1 ) p(x|N (2; 2))

M
p(x) =
m=1
wm p(x|N (m; m)),
wi = 1
Characteristics of GMM: Just like ANNs are universal approximators of functions, GMMs are universal approximators of densities (provided sucient no. of mixtures are used); true for diagonal GMMs as well.
General Assumption in GMM
Assume that there are M components. Each component generates data from a Gaussian with mean m and covariance matrix m.
GMM
Consider the following probability density function shown in solid blue
It is useful to parameterise or model this seemingly arbitrary blue pdf
13 of 88
Gaussian Mixture Model (Contd.)
Actually pdf is a mixture of 3 Gaussians, i.e. p(x) = c1N (x; 1, 1) + c2N (x; 2, 2) + c3N (x; 3, 3 ) and pdf parameters: c1, c2, c3, 1, 2, 3, 1, 2, 3 ci = 1 (4)
14 of 88
Observation from GMM

Experiment: An urn contains balls of 3 dierent colurs: red, blue or green. Behind a curtain, a person picks a ball from urn If red ball generate x[i] from N (x; 1, 1) If blue ball generate x[i] from N (x; 2, 2) If green ball generate x[i] from N (x; 3, 3)
We have access only to observations x[0], x[1], . . . , x[N 1]
Therefore : p(x[i]; ) = c1N (x; 1, 1) + c2N (x; 2, 2) + c3N (x; 3, 3) but we do not which urn x[i] comes from! Can we estimate component = [c1 c2 c3 1 2 3 1 2 3]T from the observations?
N
arg max p(X; ) = arg max

i=1
p(xi; )
(5)
15 of 88
Estimation of Parameters of GMM

Easier Problem: We know the component for each observation
Obs: Comp. x[0] 1 x[1] 2 x[2] 2 x[3] 1 x[4] 3 x[5] 1 x[6] 3 x[7] 3 x[8] 2 x[9] 2 x[10] 3 x[11] 3 x[12] 3
X1 = {x[0], x[3], x[5] }
belong to
p1(x; 1, 1 ) p2(x; 2, 2)
X2 = {x[1], x[2], x[8], x[9] }
belongs to
X3 = {x[3], x[6], x[7], x[10], x[11], x[12]} belongs to p3(x; 3, 3)
From: X1 = {x[0], x[3], x[5] }
3 c1 = 13 1 and 1 = 3 {x[0] + x[3] + x[5]} 2 1 = 1 3
(x[0] 1)2 + (x[2] 1)2 + (x[5] 1)2
In practice we do not know which observation come from which pdf. How do we solve for arg max p(X; ) ?
Incomplete & Complete Data

x[0], x[1], . . . , x[N 1] incomplete data, Introduce another set of variables y[0], y[1], . . . , y[N 1] such that y[i] = 1 if x[i] p1, y[i] = 2 if x[i] p2 and y[i] = 3 if x[i] p3
Obs: Comp. miss: x[0] 1 y[0] x[1] 2 y[1] x[2] 2 y[2] x[3] 1 y[3] x[4] 3 y[4] x[5] 1 y[5] x[6] 3 y[6] x[7] 3 y[7] x[8] 2 y[8] x[9] 2 y[9] x[10] 3 y[10] x[11] 3 y[11] x[12] 3 y[12]
y[i] = missing dataunobserved data information about component z = (x; y) is complete data observations and which density they come from p(z; ) = p(x, y; ) M LE = arg max p(z; ) Question: But how do we nd which observation belongs to which density ?
Given observation x[0] and g , what is the probability of x[0] coming from rst distribution?
p (y[0] = 1|x[0]; g )
= = =
p(y[0]=1,x[0]; g ) p(x[0]; g ) p(x[0]|y[0]=1;1 ,1 ) p(y[0]=1) 3 g j=1 p(x[0]|y[0]=j; ) p(y[0]=j) p(x[0]|y[0]=1,1 ,1 ) c1 g g g g g g p(x[0]|y[0]=1;1 ,1 ) c1 +p(x[0]|y[0]=2;2 ,2 ) c2 +...
g g g g g
All parameters are known we can calculate p(y[0] = 1|x[0]; g ) (Similarly calculate p(y[0] = 2|x[0]; g ), p(y[0] = 3|x[0]; g ) Which density? y[0] = arg max p(y[0] = i|x[0]; g ) Hard allocation
i
Parameter Estimation for Hard Allocation

x[0] 0.5 0.25 0.25
y[0]=1
p(y[j] = 1|x[j]; g ) p(y[j] = 2|x[j]; g ) p(y[j] = 3|x[j]; g )

Hard Assign.
x[1] 0.6 0.3 0.1

y[1]=1
x[2] 0.2 0.75 0.05

y[2]=2
x[3] 0.1 0.3 0.6

y[3]=3
x[4] 0.2 0.7 0.1

y[4]=2
x[5] 0.4 0.5 0.1

y[5]=2
x[6] 0.2 0.6 0.2

y[6]=2
Updated Parameters: c1 = Similarly (for Gaussian) nd:
2 7
c2 =
2 i, i
4 7
c3 =
1 7
(dierent from initial guess!)
for ith pdf
19 of 88
Parameter Estimation for Soft Assignment

x[0] 0.5 0.25 0.25 x[1] 0.6 0.3 0.1 x[2] 0.2 0.75 0.05 x[3] 0.1 0.3 0.6 x[4] 0.2 0.7 0.1 x[5] 0.4 0.5 0.1 x[6] 0. 2 0.6 0.2
p(y[j] = 1|x[j]; g ) p(y[j] = 2|x[j]; g ) p(y[j] = 3|x[j]; g )
Example: Prob. of each sample belonging to component 1 p(y[0] = 1|x[0]; g ), p(y[1] = 1|x[1]; g ) p(y[2] = 1|x[2]; g ), Average probability that a sample belongs to Comp.#1 is cnew 1 = = 1 N
N
p(y[i] = 1|x[i]; g )
i=1
0.5 + 0.6 + 0.2 + 0.1 + 0.2 + 0.4 + 0.2 2.2 = 7 7
20 of 88
Soft Assignment Estimation of Means & Variances

Recall: Prob. of sample j belonging to component i p(y[j] = i|x[j]; g ) Soft Assignment: Parameters estimated by taking weighted average ! new 1 =
N i=1 xi p(y[i] = N i=1 p(y[i] = 1
1 | x[i]; g ) | x[i]; g )
2 (1 )new
N 2 i=1(xi 1) p(y[i] N i=1 p(y[i] = 1 |
= 1 | x[i]; g ) x[i]; g )
These are updated parameters starting with initial guess g
21 of 88
Maximum Likelihood Estimation of Parameters of GMM

g g g 1. Make initial guess of parameters: g = cg , cg , cg , g , g , g , 1 , 2 , 3 1 2 3 1 2 3
2. Knowing parameters g , nd Prob. of sample xi belonging to j th component.

p[y[i] = j | x[i]; ]
g
for i = 1, 2, . . . N no. of observations for j = 1, 2, . . . M no. of components
3.
bnew cj
N 1 X p(y[i] = j | x[i]; g ) = N i=1
4.
new j =
PN
i=1 xi p(y[i] = PN i=1 p(y[i] = j
j | x[i]; g ) | x[i]; g )
5.
2 (j )new
PN
i=1 (xi
PN
1 )2 p(y[i] = j | x[i]; g ) b p(y[i] = j | x[i]; g )
i=1
6. Go back to (2) and repeat until convergence
22 of 88
23 of 88
24 of 88
Live demonstration: http://www.neurosci.aist.go.jp/ akaho/MixtureEM.html Practical Issues E.M. can get stuck in local minima. EM is very sensitive to initial conditions; a good initial guess helps; k-means algorithm is used prior to application of EM algorithm
Size of a GMM
Bayesian Information Criterion (BIC) value of a GMM can be dened as follows: d logN BIC(G | X) = logp(X | G) 2 where G represent the GMM with the ML parameter conguration d represents the number of parameters in G N is the size of the dataset the rst term is the log-likelihood term; the second term is the model complexity penalty term. BIC selects the best GMM corresponding to the largest BIC value by trading o these two terms.
source: Boosting GMM and Its Two Applications, F.Wang, C.Zhang and N.Lu in N.C.Oza et al. (Eds.) LNCS 3541, pp. 12-21, 2005
The BIC criterion can discover the true GMM size eectively as shown in the gure.
27 of 88
Maximum A Posteriori (MAP)
Sometimes, it is dicult to get sucient number of examples for robust estimation of parameters. However, one may have access to large number of similar examples which can be utilized. Adapt the target distribution from such a distribution. For example, adapt a speaker independent model to a new speaker using small amount of adaptation data.
28 of 88
MAP Adaptation
ML Estimation M LE = arg max p(X|)
MAP Estimation M AP = arg max p(|X)
p(X|) p() = arg max p(X) = arg max p(X|) p()
p() is the a priori distribution of parameters . A conjugate prior is chosen such that the corresponding posterior belongs to the same functional family as the prior.
Simple Implementation of MAP-GMMs
Source: Statistical Machine Learning from Data: GMM; Samy Bengio
30 of 88
Simple Implementation
Train a prior model p with a large amount of available data (say, from multiple speakers). Adapt the parameters to a new speaker using some adaptation data (X). Let = [0, 1] be a parameter that describes the faith on the prior model. Adapted weight of j th mixture
wj =
p wj
+ (1 )
p(j|xi)
i
Here is a normalization factor such that
wj = 1.
31 of 88
Simple Implementation (contd.)
means
j =
p j
+ (1 )
i p(j|xi)
xi i p(j|xi)
Weighted average of sample mean and a prior mean variances
j =
p j
pp j j
+ (1 )
i p(j|xi)
xi xi j j i p(j|xi)
32 of 88
HMM
Primary role of speech signal is to carry a message; sequence of sounds (phonemes) encode a sequence of words. The acoustic manifestation of a phoneme is mostly determined by: Conguration of articulators (jaw, tongue, lip) physiology and emotional state of speaker Phonetic context HMM models sequential patterns; speech is a sequential pattern Most text dependent speaker recognition systems use HMMs Text verication involves verication/recognition of phonemes
33 of 88
Phoneme recognition
Consider two phonemes classes /aa/ and /iy/. Problem: Determine to which class a given sound belongs. Processing of speech signal results in a sequence of feature (observation) vectors: o1, . . . , oT (say MFCC vectors) We say the speech is /aa/ if: p(aa|O) > p(iy|O) Using Bayes Rule AcousticM odel P riorP rob
p(O|iy) p(iy) p(O|aa) p(aa) V s. p(O) p(O)

Given p(O|aa), p(aa), p(O|iy) and p(iy) which is more probable ?
Parameter Estimation of Acoustic Model

How do we nd the density function paa(.) and piy (.). We assume a parametric model: paa() parameterised by aa pij () parameterised by iy
Training Phase: Collect many examples of /aa/ being said Compute corresponding observations o1, . . . , oTaa Use the Maximum Likelihood Principle aa = arg max p(O; aa)
aa
Recall: if the pdf is modelled as a Gaussian Mixture Model . then we use EM Algorithm
Modelling of Phoneme
Our Articulators are moving from a conguration To enunciate /aa/ in a word for previous phoneme to /aa/ and then proceeding to move to conguration of next phoneme. Can think of 3 distinct time periods: Transition from previous phoneme Steady state Transition to next phoneme Features for 3 time-interval are quite dierent Use dierent density functions to model the three time intervals model as paa1 (; aa1) paa2 (; aa2) paa3 (; aa3) Also need to model the time durations of these time-intervals transition probs.
Stochastic Model (HMM)
a11
1
p(f)
a12
2
p(f) p(f)
f(Hz)
f(Hz)
f(Hz)
37 of 88
HMM Model of Phoneme

Use term Statefor each of the three time periods. Prob. of ot from j th state, i.e. paaj (ot; aaj ) denoted as bj (ot)

Observation, ot, is generated by which state density? Only observations are seen, the state-sequence is hidden Recall: In GMM, the mixture component is hidden
Probability of Observation
Recall: To classify, we evaluate P r(O|) where are parameters of models In /aa/ Vs /iy/ calculation: P r(O|aa) Vs P r(O|iy ) Example: 2-state HMM model and 3 observations o1 o2 o3
Model parameters are assumed known: Transition Prob. a11, a12, a21, and a22 model time durations State density b1(ot) and b2(ot). 2 bj (ot) are usually modelled as single Gaussian with parameter j , j or by GMMs
Probability of Observation through one Path
T = 3 observations and N = 2 nodes 8 paths thru 2 nodes for 3 observations Example: Path P1 through states 1, 1, 1. P r{O|P1, } = b1(o1) b1(o2) b1(o3) Prob. of Path P1 = P r{P1|} = a01 a11 a11 P r{O, P1|} = P r{O|P1, } P r{P1|} = a01b1(o1).a11b1(o2).a11b1(o3)
Probability of Observation
Path
o1
1 1 1 1 2 1 1 1
o2
1 1 2 2 1 1 1 1
o3
1 2 1 2 1 2 1 2
p(O, Pi |) a01 b1 (o1).a11 b1 (o2).a11b1 (o3 ) a01 b1 (o1).a11 b1 (o2).a12 b2 (o3 ) a01 b1 (o1).a12 b2 (o2).a21 b1 (o3 ) a01 b1 (o1).a12 b2 (o2).a22 b2 (o3 ) a02 b2 (o1).a21 b1 (o2).a11b1 (o3 ) a02 b2 (o1).a21 b1 (o2).a12 b2 (o3 ) a02 b2 (o1).a22 b1 (o2).a21 b1 (o3 ) a02 b2 (o1).a22 b1 (o2).a22 b2 (o3 )
P1 P2 P3 P4 P5 P6 P7 P8
p(O|) =
Pi
P {O, Pi|} =
Pi
P {O|Pi, } P {Pi, }
Forward Algorithm Avoid Repeat Calculations: Two Multiplications a01b1(o1)a11b1(o2).a11b1(o3) + a02b2(o1).a21b1(o2).a11b1(o3) =[a01.b1(o1)a11b1(o2) + a02b2(o1)a21b1(o2)]a11b1(o3) One Multiplication
Forward Algorithm Recursion
Let :1(t = 1) = a01b1(o1) Let :2(t = 1) = a02b2(o2) Recursion : 1(t = 2) = [a01b1(o1).a11 + a02b2(o1).a21].b1(o2) = [1(t = 1).a11 + 2(t = 1).a21].b1(o2)
42 of 88
General Recursion in Forward Algorithm

j (t) = i(t 1)aij .bj (ot) = P {o1, o2, . . . ot, st = j|} Note Sum of probabilities of all paths ending at j (t) node j at time t with partial observation sequence o1, o2, . . . , ot The probability of the entire observation (o1, o2, . . . , oT ), therefore, is
N
p(0|) =
j=1 N
P {o1, o2, . . . , oT , ST = j|} j (T )
=
j=1
where N=No. of nodes

Backward Algorithm
analogous to Forward, but coming from the last time instant T Example: a01b1(o1).a11b1(o2).a11b1(o3) + a01b1(o1).a11b1(o2)a12b2(o3) + . . . = [a01.b1(o1).a11b1(o2)].(a11b1(o3) + a12b2(o3))
1(t = 2) = p{o3|st=2 = 1, } = p{o3, st=3 = 1|st=2 = 1; } + p{o3, st=3 = 2|st=2 = 1, } p{o3|st=3 = 1, st=2 = 1, }.p{st=3 = 1|st=2 = 1, } + = p{o3|st=3 = 2, st=2 = 1, }.p{st=3 = 2|st=2 = 1, }
= b1(o3).a11 + b2(o3).a12
44 of 88
General Recursion in Backward Algorithm
Given that we are at node j at time t j (t) Sum of probabilities of all paths such that partial sequence ot+1, . . . , oT are observed
i(t) =
j=1
[aij bj (ot+1)]
Going to each node from i
th
j (t + 1)
Prob. of observation ot+2 . . . oT given now we are in j th node at t + 1
node
= p{ot+1, . . . , ot|st = i, }
45 of 88
Estimation of Parameters of HMM Model

Given known Model parameters, : Evaluated p(O|) useful for classication Ecient Implementation: Use Forward or Backward Algo. Given set of observation vectors, ot how do we estimate parameters of HMM? Do not know which states ot come from Analogous to GMM do not know which component Use a special case of EM Baum-Welch Algorithm Use following relations from Forward/Backward
N
p(O|) = N (T ) = 1(T ) =
j=1
j (t)j (t)
46 of 88
Parameter Estimation for Known State Sequence

Assume each state is modelled as a single Gaussian: j = Sample mean of observations assigned to state j. 2 j = Variance of the observations assigned to state j. and Trans. Prob. from state i to j = No. of times transition was made from i to j Total number of times we made transition from i
In practice since we do not know which state generated the observation So we will do probabilistic assignment.
47 of 88
Review of GMM Parameter Estimation

Do not know which component of the GMM generated output observation. Given initial model parameters g , and observation sequence x1, . . . , xT . Find probability xi comes from component j Soft Assignment p[component = 1|xi; g ] = So, re-estimation equations are:
b Cj new bj = 1X p(comp = j|xi ; g ) T i=1 PT g i=1 xi p(comp = j|xi ; ) PT g i=1 p(comp = j|xi ; )
T
p[component = 1, xi|g ] p[xi|g ]
j b2
PT
b 2 i=1 (xi j ) p(comp = j|xi ; g ) PT g i=1 p(comp = j|xi ; )
A similar analogy holds for hidden Markov models

Baum-Welch Algorithm
Here: We do not know which observation ot comes from which state si Again like GMM we will assume initial guess parameter g Then prob. of being in state=i at time=t and state=j at time=t+1 is
p{qt = i, qt+1 = j, O|g } t(i, j) = p{qt = i, qt+1 = j|O, g } = p{O|g }

where
p{O|g } = N (T ) =
i i(T )
49 of 88
Baum-Welch Algorithm
Then prob. of being in state=i at time=t and state=j at time=t+1 is p{qt = i, qt+1 = j, O|g } t(i, j) = p{qt = i, qt+1 = j|O, g } = p{O|g } where p{O|g } = N (T ) =
i i (T )
From ideas of Forward-Backward Algorithm, numerator is
p{qt = i, qt+1 = j, O|g } = i(t).aij bj (ot+1).j (t + 1) i(t).aij bj (ot+1)j (t + 1) So t(i, j) = N (t)
50 of 88
Estimating Transition Probability

Trans. Prob. from state i to j = No. of times transition was made from i to j Total number of times we made transition from i
t(i, j) prob. of being in state=i at time=t and state=j at time=t+1 If we average t(i, j) over all time-instants, we get the number of times the system was in ith state and made a transition to j th state. So, a revised estimation of transition probability is
anew ij
=
T t=1 (
T 1 t=1 t (i, j) N
t(i, j) )
j=1
all transitions out of i at time=t
51 of 88
Estimating State-Density Parameters

Analogous to GMM: which observation belonged to which component, New estimates for the state pdf parameters are (assuming single Gaussian)
i =
T t=1 i(t)ot T t=1 i(t) T t=1 i(t)(ot i)(ot T t=1 i(t)
i)T
These are weighted averages weighted by Prob. of being in state j at t Given observation HMM model parameters estimated iteratively p(O|) evaluated eciently by Forward/Backward algorithm
Viterbi Algorithm
Given the observation sequence, the goal is to nd corresponding state-sequence that generated it there are many possible combination (N T ) of state sequence many paths. One possible criterion : Choose the state sequence corresponding to path that with maximum probability
max P {O, Pi|}

i
Word : represented as sequence of phones Phone : represented as sequence states Optimal state-sequence Optimal phone-sequence Word sequence
53 of 88
Viterbi Algorithm and Forward Algorithm

Recall Forward Algorithm : We found probability over each path and summed over all possible paths
NT
i=1
p{O, Pi|}
Viterbi is just special case of Forward algo. instead of sum of prob. of all paths choose path with max prob.
At each node
In Practice: p(O|) approximated by Viterbi (instead of Forward Algo.)
54 of 88
Viterbi Algorithm
55 of 88
Decoding
Recall: Desired transcription W obtained by maximising W = arg max p(W |O)

W
Search over all possible W astronomically large! Viterbi Search nd most likely path through a HMM Sequence of phones (states) which is most probable Mostly: most probable sequence of phones correspond to most probable sequence of words
56 of 88
Training of a Speech Recognition System
HMM parameters estimated using large databases 100 hours Parameters estimated using Maximum Likelihood Criterion
57 of 88
Recognition
58 of 88
Speaker Recognition
Spectra (formants) of a given sound are dierent for dierent speakers.
Spectra of 2 speakers for one frame of /iy/
Derive speaker dependent model of a new speaker by MAP adaptation of SpeakerIndependent (SI) model using small amount of adaptation data; use for speaker recognition.
59 of 88
References
Pattern Classication, R.O.Duda, P.E.Hart and D.G.Stork, John Wiley, 2001. Introduction to Statistical Pattern Recognition, K.Fukunaga, Academic Press, 1990. The EM Algorithm and Extensions, Georey J. McLachlan and Thriyambakam Krishnan, Wiley-Interscience; 2nd edition, 2008. ISBN-10: 0471201707 The EM Algorithm and Extensions, Georey J. McLachlan and Thriyambakam Krishnan, Wiley-Interscience; 2nd edition, 2008. ISBN-10: 0471201707 Fundamentals of Speech Recognition, Lawrence Rabiner & Biing-Hwang Juang, Englewood Clis NJ: PTR Prentice Hall (Signal Processing Series), c1993, ISBN 0-13-015157-2
Hidden Markov models for speech recognition, X.D. Huang, Y. Ariki, M.A. Jack. Edinburgh: Edinburgh University Press, c1990. Statistical methods for speech recognition, F.Jelinek, The MIT Press, Cambridge, MA., 1998. Maximum Likelihood from incomplete data via the em algorithm, J. Royal Statistical Soc. 39(1), pp. 1-38, 1977. Maximum a Posteriori Estimation for Multivariate Guassian Mixture Observation of Markov Chains, J.-L.Gauvain and C.-H.Lee, IEEE Trans. SAP, 2(2), pp. 291298, 1994. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, J.A.Bilmes, ISCI, TR-97-021. Boosting GMM and Its Two Applications, F.Wang, C.Zhang and N.Lu in N.C.Oza et al. (Eds.) LNCS 3541, pp. 12-21, 2005.

Gaussian Mixture Model and Hidden Markov Model

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Gaussian Mixture Model and Hidden Markov Model

Uploaded by

Copyright:

Available Formats

Gaussian Mixture Model (GMM)

Hidden Markov Model (HMM)

Model Generation Pattern Matching

GMM: static patterns HMM: sequential patterns

WiSSAP 2009: Tutorial on GMM and HMM, Samudravijaya K

p(B|Ai) p(Ai) p(B|A) p(A) i p(B|Ai) p(Ai )

WiSSAP 2009: Tutorial on GMM and HMM, Samudravijaya K

P r(x = 41) P r(x = 3)

p(xi; ) 1 1 exp 2 2 (2 2)N |2

Maximum Likelihood Estimation

WiSSAP 2009: Tutorial on GMM and HMM, Samudravijaya K 8 of 88

Maximum Likelihood Estimator

Given x[0], x[1], . . . , x[N 1] and pdf parameterised by =

We form Likelihood function L(X; ) =

M LE = arg max L(X; )

For height problem: can show ()M LE =

Estimate of mean of Gaussian = sample mean of measured heights.

p(x|)p() p(x|) p() p(x)

Height problem: Unknown mean is random pdf Gaussian N (, 2) p() = 1 exp 2 ( )2 2 2 2 1

2 + n 2x Then : ()Bayesian = 2 + n 2 Weighted average of sample mean and a prior mean