Speech Recognition — GMM, HMM

Jonathan Hui · Sep 9, 2019 · 14 min read


Before the Deep Learning (DL) era of speech recognition, HMM and GMM were two must-learn technologies for speech recognition. Now, there are hybrid systems that combine HMM with Deep Learning and there are systems that are HMM-free. We have more design choices now. However, for many generative models, HMM remains important. But regardless of its status, speech recognition helps us better understand the application of HMM and GMM in the ML context. So stop the long face and let's spend some time on it.

Automatic Speech Recognition (ASR)


Let’s get a high-level overview first. The diagram below is a high-level architecture for
speech recognition that links HMM (Hidden Markov Model) with speech recognition.

Starting from an audio clip, we slide windows of 25 ms width and 10 ms apart to extract MFCC features. For each window frame, 39 MFCC parameters will be extracted. The primary objective of speech recognition is to build a statistical model to infer the text sequence W (say “cat sits on a mat”) from a sequence of feature vectors X.
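As a concrete illustration, here is a minimal sketch of this feature extraction step using the librosa library, where the 39 dimensions come from 13 MFCC coefficients plus their first and second derivatives (deltas). The file name and the exact parameter choices are assumptions for illustration, not taken from the article.

```python
# A minimal sketch of 39-dimensional MFCC extraction with librosa:
# 13 static coefficients plus delta and delta-delta features.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000)            # hypothetical audio file
win = int(0.025 * sr)                                  # 25 ms window
hop = int(0.010 * sr)                                  # 10 ms step
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=win, hop_length=hop)
delta = librosa.feature.delta(mfcc)                    # first-derivative features
delta2 = librosa.feature.delta(mfcc, order=2)          # second-derivative features
features = np.vstack([mfcc, delta, delta2]).T          # shape: (frames, 39)
print(features.shape)
```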

One approach looks for all possible sequences of words (with limited maximum length)
and finds one that matches the input acoustic features the best.

This model depends on building a language model P(W), a pronunciation lexicon model,
and an acoustic model P(X|W) (a generative model) as below.

Modified from source

A pronunciation model can use tables to convert words to phones, or rely on a corpus that is already transcribed with phonemes. The acoustic model is about modeling a sequence of feature vectors given a sequence of phones instead of words. But we will continue to use the notation p(X|W) for the acoustic model. Just be aware.

The language model is about the likelihood of the word sequence. For example, “I watch a movie” will be more likely than “I you movie watch” or “I watch an apple”. It predicts the next word given the previous words. If we approximate it with a first-order Markov chain, the next word will depend on the current word only. We can estimate it by counting the occurrence of word pairs in a corpus.
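As a sketch of that counting idea, the snippet below estimates bigram probabilities from a toy corpus. The corpus and the unsmoothed maximum-likelihood estimate are illustrative assumptions, not the article's recipe.

```python
# A minimal sketch of a bigram language model estimated by counting word pairs.
from collections import Counter

corpus = ["i watch a movie", "i watch a game", "you watch a movie"]  # toy corpus
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]   # add sentence boundaries
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def p_next(word, prev):
    """P(word | prev) by maximum likelihood (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_next("watch", "i"))   # high: "i watch" is common in the toy corpus
print(p_next("you", "i"))     # zero: "i you" never occurs
```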

By combining the acoustic model and the language model, we search for the text
sequence with the maximum likelihood.
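In symbols (using the article's notation, and assuming the standard formulation in which the constant denominator P(X) drops out of the maximization), this search is:

$$W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)$$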

This approach sounds indirect and the search looks inefficient or impossible. But p(X|W)
is much easier to model in speech recognition. The distribution of features for a phone can
be modeled with a Gaussian Mixture Model (GMM). We will learn it with training data.
The transition between phones and the corresponding observable can be modeled with
the Hidden Markov Model (HMM). So if we can find an optimal way to search the phone
sequence efficiently, this may not sound bad after all.

An HMM model is composed of hidden variables and observables. The top nodes below represent the phones and the bottom nodes represent the corresponding observables (the audio features). The horizontal arrows demonstrate the transition in the phone sequence for the true label “she just …”.

In speech recognition, the observable can be represented by the 39 MFCC features extracted from the corresponding audio frame. The good news is that with this HMM model, we do not need to search the phone sequences one-by-one. Otherwise, the complexity grows exponentially with the number of phones. With the Viterbi algorithm or other HMM methods, we can find the optimal sequence in polynomial time. We will come back to this later.

The diagram below is a possible realization of an Automatic Speech Recognition (ASR) system.


Combining information on the lexicon, the acoustic model and the language model, we
can find the optimal phone sequence with the Viterbi decoder.

Modified from source (O is the same as X here)

Let’s do a quick recap: we can model the acoustic model P(X|W) with an HMM. The arrows in an HMM model represent phone transitions or links to observables. To model the audio features that we observe, we learn a GMM model from the training data. So let’s understand HMM and GMM in the general context first.

Hidden Markov Model (HMM)



A Markov chain contains all the possible states of a system and the probability of
transiting from one state to another.

A first-order Markov chain assumes that the next state depends on the current state only.
For simplicity, we often call it a Markov chain.

This model will be much easier to handle. However, in many ML systems, not all states are
observable and we call these states hidden states or internal states. Some may treat them
as latent factors for the inputs. For example, it may not be easy to know whether I am
happy or sad. My internal state will be {H or S}. But we can get some hints from what we
observe. For example, when I am happy I have a 0.2 chance that I watch a movie, but
when I am sad, that chance goes up to 0.4. The probability of observing an observable
given an internal state is called the emission probability. The probability of transiting
from one internal state to another is called the transition probability.
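To make this concrete, the happy/sad example can be written down as a tiny HMM. Only the movie-watching emission probabilities (0.2 when happy, 0.4 when sad) come from the text above; the other numbers are made up for illustration.

```python
# The happy/sad example written out as HMM parameters.
# Only the 0.2 and 0.4 movie emissions come from the text; the rest is assumed.
import numpy as np

states = ["H", "S"]                    # hidden states: happy, sad
observations = ["movie", "other"]      # what we can observe

pi = np.array([0.6, 0.4])              # initial state distribution (assumed)
A = np.array([[0.7, 0.3],              # transition probabilities P(next | current), assumed
              [0.4, 0.6]])
B = np.array([[0.2, 0.8],              # emission probabilities P(observation | state)
              [0.4, 0.6]])             # row H: watch movie 0.2; row S: 0.4 (from text)
```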


For speech recognition, the observable is the content in each audio frame. We can use the
MFCC parameters to represent it. Let’s see what we can do with an HMM.

Likelihood with the forward algorithm

An HMM is modeled by the transition and the emission probability.

Once an HMM model is learned, we can use the forward algorithm to calculate the likelihood of our observations. Our objective is summing the probabilities of the observations over all possible state sequences:

But we have to do this smartly. We cannot sum over all possible state sequences one at a time. It has exponential complexity.

Our strategy will employ divide-and-conquer. If we can express the calculation recursively, we can break the problem down into intermediate steps. In HMM, we solve the problem at time t by using the result from time t-1 and/or t+1. A circle below represents an HMM hidden state j at time t. So even though the number of state sequences increases exponentially with time, we can solve it in linear time (with respect to T) if we can express the calculation recursively in time.

This is the idea of dynamic programming that breaks the exponential curse. At time t, the probability of our observations up to time t is:

Let’s rename the term underlined in red as αt(j) (forward probability) and check if we
can express it recursively. Since the current observation depends on the current state only,
α can be expressed as:
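Written in the standard textbook notation (transition probability $a_{ij}$ and emission probability $b_j(o_t)$; this is the usual convention, assumed here rather than copied from the article's figure), the recursion reads:

$$\alpha_t(j) = b_j(o_t) \sum_{i=1}^{k} \alpha_{t-1}(i)\, a_{ij}, \qquad \alpha_1(j) = \pi_j\, b_j(o_1)$$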

So it does have a recursive relationship. Below are the steps in calculating the likelihood of the observations given the model λ using recursion. Instead of summing over each state sequence independently, we calculate α from time step one to the end (time T). If there are k internal states, the complexity will only be O(k²T), instead of exponential.

Here is an example in which we start with the initial state distribution on the left. Then we propagate the value of α to the right. We compute α for every state and repeat it for every time step.
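A minimal sketch of this forward pass for a discrete-emission HMM is shown below (NumPy, with the toy happy/sad numbers reused; all values are illustrative assumptions, not the article's implementation).

```python
# A minimal sketch of the forward algorithm for a discrete-emission HMM.
import numpy as np

def forward(pi, A, B, obs):
    """Return alpha (T x k) and the total likelihood P(obs | model)."""
    k, T = len(pi), len(obs)
    alpha = np.zeros((T, k))
    alpha[0] = pi * B[:, obs[0]]                      # initialization with the start distribution
    for t in range(1, T):
        # alpha_t(j) = b_j(o_t) * sum_i alpha_{t-1}(i) * a_ij
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()

# Toy happy/sad parameters (illustrative values)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.8], [0.4, 0.6]])
alpha, likelihood = forward(pi, A, B, obs=[0, 1, 0])   # observations: movie, other, movie
print(likelihood)
```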


Next, given the HMM model, how can we find the internal states given the sequence of observations? This process is called decoding. This is particularly interesting for speech recognition. If we have an audio clip, the internal states represent the phones. Speech recognition can be viewed as finding these internal states given the audio clip.

Decoding (finding the internal states — Viterbi algorithm)

Again, we want to express our components recursively. Given the state is j at time t, vt(j) is
the joint probability of the observation sequence with the optimal state sequence.


So not only can it be done, the equation is similar to the forward algorithm except the summation is replaced by the maximum function. Instead of summing over all possible state sequences as in the forward algorithm, the Viterbi algorithm takes the most likely path.

Modified from source

Finding the internal states that maximize the likelihood of observations is similar to the
likelihood method. We just replace the summation with the maximum function.


In this algorithm, we also record the maximum path that leads to each node at time t (the red arrow above), i.e. we backtrace the optimal path for each node. For example, we transition from a happy state H at t=1 to the happy state H at t=2.
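Below is a minimal sketch of Viterbi decoding in the same setting: the sum of the forward pass becomes a max, and back-pointers record the best predecessor so the optimal path can be backtraced. The code and its conventions are illustrative assumptions, not the article's implementation.

```python
# A minimal sketch of Viterbi decoding for a discrete-emission HMM.
import numpy as np

def viterbi(pi, A, B, obs):
    k, T = len(pi), len(obs)
    v = np.zeros((T, k))                    # v_t(j): best joint probability ending in state j
    back = np.zeros((T, k), dtype=int)      # back-pointer to the best predecessor state
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1, :, None] * A      # scores[i, j] = v_{t-1}(i) * a_ij
        back[t] = scores.argmax(axis=0)     # best predecessor i for each state j
        v[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtrace the optimal path from the best final state
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), v[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.8], [0.4, 0.6]])
print(viterbi(pi, A, B, obs=[0, 1, 0]))     # most likely state sequence and its probability
```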

Source

Learning (Baum–Welch algorithm/forward-backward algorithm)

Now, it comes to the hard part. How can we learn the HMM model? This can be done by the Baum–Welch algorithm (forward-backward algorithm), which learns the transition and the emission probability. This task sounds like mission impossible since both probabilities are highly tangled in our calculation. But from some perspective, if we know the state occupation probability (the state distribution at time t), we can derive the emission probability and the transition probability. If we know these two probabilities, we can derive the state distribution at time t. This is the chicken-and-egg problem we discussed in the EM algorithm. The EM algorithm solves the problem in iterative steps. In each step, we optimize one set of latent variables while fixing the others. Imagine that each iteration step improves the solution. Even for a continuous space, we work with limited precision and therefore, there are finite states to explore and improve. So, if we keep iterating, the solution will converge.

So it is not surprising that the Baum–Welch algorithm is a special case of the EM algorithm.

Let’s get familiar with the following new notations.


We are already familiar with α (the forward probability) from the forward algorithm. β (the backward probability) is its close cousin in the reverse direction (the probability of seeing all the coming observations given a state i at time t). We can express it recursively, similar to α but in the reverse direction (a.k.a. the backward algorithm).

To learn the HMM model, we need to know which states we are in to explain the observations best. That will be the state occupation probability γ — the probability of being in state i at time t given all the observations.

With the HMM model parameters fixed, we can apply the forward and backward algorithms to calculate α and β from the observations. γ can then be calculated by simply multiplying α with β and renormalizing.

ξ is the probability of transiting from state i to j after time t given all the observations. It can be computed from α and β similarly.
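A minimal sketch of these two quantities, computed from α and β exactly as described (discrete emissions, NumPy conventions assumed, not the article's code), might look like this:

```python
# A minimal sketch of the state-occupation probability gamma and the
# transition probability xi, computed from alpha and beta.
# alpha, beta have shape (T, k); A is the transition matrix, B the emission
# matrix, obs the observation indices.
import numpy as np

def gamma_xi(alpha, beta, A, B, obs):
    T, k = alpha.shape
    likelihood = alpha[-1].sum()                     # P(obs | model)
    gamma = alpha * beta / likelihood                # gamma_t(i), shape (T, k)
    xi = np.zeros((T - 1, k, k))
    for t in range(T - 1):
        # xi_t(i, j) proportional to alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        xi[t] = alpha[t, :, None] * A * B[:, obs[t + 1]] * beta[t + 1]
        xi[t] /= likelihood
    return gamma, xi
```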


Intuitively, with a fixed HMM model, we refine the state occupation probability (γ) and
the transition (ξ) with the given observations.

Here comes the chicken-and-egg part. Once the distributions of γ and ξ (θ₂) are refined, we can perform a point estimate of the best transition and emission probabilities (θ₁: a, b).

We fix one set of parameters to improve others and continue the iteration until the
solution converges.

The EM algorithm is usually defined as:

Here, the E-step establishes p(γ, ξ | x, a, b). Then, the M-step finds a, b that roughly
maximize the objective below.

Here is the recap of the algorithm:


So, given all the observations in the training data, the Baum–Welch algorithm can learn the HMM model. However, remember to keep an open mind. In speech recognition, the problem is far more complex and many solutions do not scale well.

Acoustic model


Modified from source

In ASR, we can use a pronunciation table to produce the phones for the text sequence Y.
Next, we need to create an acoustic model for these phones.

Decades of research have been done on phonetics. Experts can identify vowels and
consonants from reading the spectrograms directly.


Source

But again we need a much denser representation of the acoustic model so we can
determine the likelihood of an audio feature vector X given the phone P(X|phone).

With MFCC, we extract 39 features from an audio frame. Let’s simplify the picture and
assume that there is only one feature for each frame. For the state “sh” (phone), the value
of this feature can be modeled with a normal distribution.


To extend the concept to 39 features, we just need a Multivariate normal distribution with
39 variables. The following diagram visualizes a bivariate normal distribution for two
variables.

Source (Bivariate normal distribution)

Here is the definition of the Multivariate normal distribution.


where Σ is the covariance matrix which measures the correlations among variables. MFCC parameters have a nice property: they are relatively independent of each other. Therefore, the non-diagonal elements of Σ can simply be set to zero.

But, it is too hard to think multi-dimensionally. So we will stick with a 1-D example for
illustration. The likelihood p(x| q) of the observed feature x will be calculated as how far
it is from the peak of the normal distribution q:

Given different phones, we can calculate the corresponding probability density values and classify the observation as the phone with the highest value. To learn this Gaussian distribution, we can simply estimate it from the training data points xᵢ.

These equations can be proven by maximizing the likelihood of the training data.


Source

So this Gaussian model is easy to learn from training data and gives us a nice likelihood model P(x | μ, σ²). In the context of speech recognition, we can learn the Gaussian model (μ, σ²) for each phone. This serves as the likelihood probability. It also acts as the emission probability in an HMM.
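As a sketch with made-up 1-D features, learning one Gaussian per phone and scoring a new frame could look like the following (the phones, feature values, and the use of scipy are assumptions for illustration):

```python
# A minimal sketch of a single-Gaussian acoustic model per phone,
# fitted by maximum likelihood from toy 1-D training features.
import numpy as np
from scipy.stats import norm

training = {                                   # toy 1-D feature values per phone
    "sh": np.array([2.1, 2.3, 1.9, 2.2, 2.0]),
    "iy": np.array([0.4, 0.6, 0.5, 0.3, 0.7]),
}
params = {ph: (x.mean(), x.std()) for ph, x in training.items()}   # MLE mu, sigma

def classify(x):
    """Pick the phone whose Gaussian gives the observed feature the highest density."""
    return max(params, key=lambda ph: norm.pdf(x, *params[ph]))

print(classify(2.05))   # -> "sh"
```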

Unfortunately, this concept is naive even if we use a multivariate Gaussian distribution. If it were true, it would be much simpler to learn a foreign spoken language. The likelihood is more complex than a single-peak bell curve. To address this, we switch to a Gaussian Mixture Model (GMM). This allows the distribution to be multimodal, i.e. we allow a feature to have a few possible values. This provides the flexibility for variants in speech.

For example, the GMM on the right combines three Gaussian distributions with different weights to form a new probability density (a 3-component GMM). The model is still pretty dense: 6 Gaussian parameters plus 3 weights.

GMM acoustic model

Intuitively, the feature value of a specific phone can be observed near one of the m modes. But some values may be more likely than others. So we introduce weights to indicate which are more likely. With the internal HMM state being j, the likelihood of the observed feature vector is:
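In the usual notation (mixture weights $c_{jm}$ summing to one over the M components of state j; this is the standard form, assumed here rather than quoted from the article), the emission density is:

$$b_j(x) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}\!\left(x;\, \mu_{jm}, \Sigma_{jm}\right), \qquad \sum_{m=1}^{M} c_{jm} = 1$$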


To learn a GMM, say a 2-component GMM, we feed features extracted from the training data to fit the parameters of these two clusters. Conceptually, we start with an initial or random guess of these parameters. We find which cluster each data sample should belong to. Then we re-calculate the cluster parameters according to the associated data points.

Yes, we will use the EM algorithm to iterate the solution until it converges. In EM, we use a soft assignment rather than a hard assignment. For a hard assignment, we assign each data sample to a specific cluster (a point estimate). In a soft assignment, it is a probability distribution, i.e. the chance that a sample belongs to each cluster. Then we recompute the cluster parameters according to this soft assignment. Since we have covered this a few times, we will not elaborate further on how it is trained.
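For reference, here is a minimal sketch of fitting a 2-component GMM with EM using scikit-learn's GaussianMixture; the library choice and the toy data are assumptions, not part of the article.

```python
# A minimal sketch of EM-based GMM fitting with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy 1-D features drawn from two clusters
x = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(1.5, 0.8, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, covariance_type="diag").fit(x)
print(gmm.weights_, gmm.means_.ravel())          # mixture weights and component means
print(gmm.predict_proba([[0.0]]))                # soft assignment for one sample
print(gmm.score_samples([[0.0]]))                # log-likelihood of the frame under the fitted GMM
```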

To recap, given a phone, we can learn the distribution of the observable feature vector using a GMM. This probability distribution allows us to compute the likelihood of our speech segment given a phone P(x|s) — this is also the emission probability given an internal state in HMM.

Vector quantization
Throughout the process, we try to model a denser representation of the acoustic signals. GMM is one popular approach. Alternatively, after we extract a training set of feature vectors from our corpus, we group these features into k clusters, say using k-means clustering. This creates a codebook of size k to encode an audio frame.

k=3 for 2-D data

With this index, we can start using it to train an HMM. Once the model is trained, we can also use it to decode audio clips. This method is called Vector Quantization and was used in early research. But it is less popular compared to GMM. Therefore, we just want you to be aware of it.
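A minimal sketch of this codebook idea with scikit-learn's KMeans follows; the codebook size and the random stand-in data are illustrative assumptions.

```python
# A minimal sketch of vector quantization: cluster training feature vectors
# with k-means to build a codebook, then encode each frame as the index of
# its nearest centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_features = rng.normal(size=(1000, 39))     # stand-in for corpus MFCC frames
codebook = KMeans(n_clusters=256, n_init=10).fit(train_features)

new_clip = rng.normal(size=(120, 39))            # stand-in for a new audio clip
codes = codebook.predict(new_clip)               # one codebook index per frame
print(codes[:10])                                # discrete symbols usable for HMM training
```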

Thoughts
GMM models the observed probability distribution of the feature vector given a phone. It
provides a principled method to measure “distance” between a phone and our observed
audio frame.

On the other hand, HMM provides a principled model of how states transition and are observed. The probability of the observations can be modeled with HMM as
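That expression (reconstructed here in the standard form, since the original image is not reproduced) is:

$$p(X) = \sum_{h} p(X \mid h)\, p(h)$$

where the sum runs over the hidden state sequences h.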

Equation source

where h is the hidden state (phone). The likelihood of features given a phone can be modeled with GMM.

Next
We are ready to apply our knowledge of HMM and GMM to develop the basic components of speech recognition.

Speech Recognition — Acoustic, Lexicon & Language Model


Speech recognition can be viewed as finding the best sequence of
words (W) according to the acoustic, the pronunciation…
medium.com

Credits & references


Automatic Speech Recognition

Speech Recognition

Spoken Language Processing

End-to-end models for Speech Processing

Basic concepts of speech recognition

Phones and Phonemes

Clip art credits


Wave file
