Speech Recognition — GMM, HMM

Jonathan Hui · Sep 9, 2019 · 14 min read


Before the Deep Learning (DL) era of speech recognition, HMM and GMM were two must-learn technologies for speech recognition. Now, there are hybrid systems that combine HMM with Deep Learning and there are systems that are HMM-free. We have more design choices now. However, for many generative models, HMM remains important. But regardless of its status, speech recognition helps us better understand the application of HMM and GMM in the ML context. So stop the long face and let's spend some time on it.

Automatic Speech Recognition (ASR)


Let’s get a high-level overview first. The diagram below is a high-level architecture for
speech recognition that links HMM (Hidden Markov Model) with speech recognition.

Starting from an audio clip, we slide windows of 25 ms width and 10 ms apart to extract MFCC features. For each window frame, 39 MFCC parameters will be extracted. The primary objective of speech recognition is to build a statistical model to infer the text sequence W (say “cat sits on a mat”) from a sequence of feature vectors X.
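As a concrete illustration, here is a minimal sketch of this feature extraction step using the librosa library, where the 39 dimensions come from 13 MFCC coefficients plus their first and second derivatives (deltas). The file name and the exact parameter choices are assumptions for illustration, not taken from the article.

```python
# A minimal sketch of 39-dimensional MFCC extraction with librosa:
# 13 static coefficients plus delta and delta-delta features.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000)            # hypothetical audio file
win = int(0.025 * sr)                                  # 25 ms window
hop = int(0.010 * sr)                                  # 10 ms step
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=win, hop_length=hop)
delta = librosa.feature.delta(mfcc)                    # first-derivative features
delta2 = librosa.feature.delta(mfcc, order=2)          # second-derivative features
features = np.vstack([mfcc, delta, delta2]).T          # shape: (frames, 39)
print(features.shape)
```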

One approach looks for all possible sequences of words (with limited maximum length)
and finds one that matches the input acoustic features the best.

This model depends on building a language model P(W), a pronunciation lexicon model,
and an acoustic model P(X|W) (a generative model) as below.

Modified from source

A pronunciation model can use tables to convert words to phones, or rely on a corpus that is already transcribed with phonemes. The acoustic model is about modeling a sequence of feature vectors given a sequence of phones instead of words. But we will continue to use the notation p(X|W) for the acoustic model. Just be aware.

The language model is about the likelihood of the word sequence. For example, “I watch a movie” will be more likely than “I you movie watch” or “I watch an apple”. It predicts the next word given the previous words. If we approximate it with a first-order Markov chain, the next word will depend on the current word only. We can estimate it by counting the occurrence of word pairs in a corpus.
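As a sketch of that counting idea, the snippet below estimates bigram probabilities from a toy corpus. The corpus and the unsmoothed maximum-likelihood estimate are illustrative assumptions, not the article's recipe.

```python
# A minimal sketch of a bigram language model estimated by counting word pairs.
from collections import Counter

corpus = ["i watch a movie", "i watch a game", "you watch a movie"]  # toy corpus
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]   # add sentence boundaries
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def p_next(word, prev):
    """P(word | prev) by maximum likelihood (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_next("watch", "i"))   # high: "i watch" is common in the toy corpus
print(p_next("you", "i"))     # zero: "i you" never occurs
```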

By combining the acoustic model and the language model, we search for the text
sequence with the maximum likelihood.
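In symbols (using the article's notation, and assuming the standard formulation in which the constant denominator P(X) drops out of the maximization), this search is:

$$W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)$$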

This approach sounds indirect and the search looks inefficient or impossible. But p(X|W)
is much easier to model in speech recognition. The distribution of features for a phone can
be modeled with a Gaussian Mixture Model (GMM). We will learn it with training data.
The transition between phones and the corresponding observable can be modeled with
the Hidden Markov Model (HMM). So if we can find an optimal way to search the phone
sequence efficiently, this may not sound bad after all.

An HMM model is composed of hidden variables and observables. The top nodes below represent the phones and the bottom nodes represent the corresponding observables (the audio features). The horizontal arrows demonstrate the transition in the phone sequence for the true label “she just …”.

In speech recognition, the observable can be represented by the 39 MFCC features extracted from the corresponding audio frame. The good news is that with this HMM model, we do not need to search the phone sequences one-by-one. Otherwise, the complexity grows exponentially with the number of phones. With the Viterbi algorithm or other HMM methods, we can find the optimal sequence in polynomial time. We will come back to this later.

The diagram below is a possible realization of an Automatic Speech Recognition (ASR) system.


Combining information on the lexicon, the acoustic model and the language model, we
can find the optimal phone sequence with the Viterbi decoder.

Modified from source (O is the same as X here)

Let’s do a quick recap: we can model the acoustic model P(X|W) with an HMM. The arrows in an HMM model represent phone transitions or links to observables. To model the audio features that we observe, we learn a GMM model from the training data. So let’s understand HMM and GMM in the general context first.

Hidden Markov Model (HMM)



A Markov chain contains all the possible states of a system and the probability of
transiting from one state to another.

A first-order Markov chain assumes that the next state depends on the current state only.
For simplicity, we often call it a Markov chain.

This model will be much easier to handle. However, in many ML systems, not all states are
observable and we call these states hidden states or internal states. Some may treat them
as latent factors for the inputs. For example, it may not be easy to know whether I am
happy or sad. My internal state will be {H or S}. But we can get some hints from what we
observe. For example, when I am happy I have a 0.2 chance that I watch a movie, but
when I am sad, that chance goes up to 0.4. The probability of observing an observable
given an internal state is called the emission probability. The probability of transiting
from one internal state to another is called the transition probability.
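To make this concrete, the happy/sad example can be written down as a tiny HMM. Only the movie-watching emission probabilities (0.2 when happy, 0.4 when sad) come from the text above; the other numbers are made up for illustration.

```python
# The happy/sad example written out as HMM parameters.
# Only the 0.2 and 0.4 movie emissions come from the text; the rest is assumed.
import numpy as np

states = ["H", "S"]                    # hidden states: happy, sad
observations = ["movie", "other"]      # what we can observe

pi = np.array([0.6, 0.4])              # initial state distribution (assumed)
A = np.array([[0.7, 0.3],              # transition probabilities P(next | current), assumed
              [0.4, 0.6]])
B = np.array([[0.2, 0.8],              # emission probabilities P(observation | state)
              [0.4, 0.6]])             # row H: watch movie 0.2; row S: 0.4 (from text)
```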


For speech recognition, the observable is the content in each audio frame. We can use the
MFCC parameters to represent it. Let’s see what we can do with an HMM.

Likelihood with the forward algorithm

An HMM is modeled by the transition and the emission probability.

Once an HMM model is learned, we can use the forward algorithm to calculate the likelihood of our observations. Our objective is summing the probabilities of the observations over all possible state sequences:

But we have to do this smartly. We cannot sum over all possible state sequences one at a time. It has exponential complexity.

Our strategy will employ divide-and-conquer. If we can express the calculation recursively, we can break the problem down into intermediate steps. In HMM, we solve the problem at time t by using the result from time t-1 and/or t+1. A circle below represents an HMM hidden state j at time t. So even though the number of state sequences increases exponentially with time, we can solve it in linear time (with respect to T) if we can express the calculation recursively in time.

This is the idea of dynamic programming that breaks the exponential curse. At time t, the probability of our observations up to time t is:

Let’s rename the term underlined in red as αt(j) (forward probability) and check if we
can express it recursively. Since the current observation depends on the current state only,
α can be expressed as:
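Written in the standard textbook notation (transition probability $a_{ij}$ and emission probability $b_j(o_t)$; this is the usual convention, assumed here rather than copied from the article's figure), the recursion reads:

$$\alpha_t(j) = b_j(o_t) \sum_{i=1}^{k} \alpha_{t-1}(i)\, a_{ij}, \qquad \alpha_1(j) = \pi_j\, b_j(o_1)$$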

So it does have a recursive relationship. Below are the steps in calculating the likelihood of the observations given the model λ using recursion. Instead of summing over each state sequence independently, we calculate α from time step one to the end (time T). If there are k internal states, the complexity will only be O(k²T), instead of exponential.

Here is an example in which we start with the initial state distribution on the left. Then we propagate the value of α to the right. We compute α for every state and repeat it for every time step.
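A minimal sketch of this forward pass for a discrete-emission HMM is shown below (NumPy, with the toy happy/sad numbers reused; all values are illustrative assumptions, not the article's implementation).

```python
# A minimal sketch of the forward algorithm for a discrete-emission HMM.
import numpy as np

def forward(pi, A, B, obs):
    """Return alpha (T x k) and the total likelihood P(obs | model)."""
    k, T = len(pi), len(obs)
    alpha = np.zeros((T, k))
    alpha[0] = pi * B[:, obs[0]]                      # initialization with the start distribution
    for t in range(1, T):
        # alpha_t(j) = b_j(o_t) * sum_i alpha_{t-1}(i) * a_ij
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()

# Toy happy/sad parameters (illustrative values)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.8], [0.4, 0.6]])
alpha, likelihood = forward(pi, A, B, obs=[0, 1, 0])   # observations: movie, other, movie
print(likelihood)
```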


Next, given the HMM model, how can we find the internal states given the sequence of observations? This process is called decoding. This is particularly interesting for speech recognition. If we have an audio clip, the internal states represent the phones. Speech recognition can be viewed as finding these internal states given the audio clip.

Decoding (finding the internal states — Viterbi algorithm)

Again, we want to express our components recursively. Given the state is j at time t, vt(j) is
the joint probability of the observation sequence with the optimal state sequence.


So not only can it be done, the equation is similar to the forward algorithm except the summation is replaced by the maximum function. Instead of summing over all possible state sequences as in the forward algorithm, the Viterbi algorithm takes the most likely path.

Modified from source

Finding the internal states that maximize the likelihood of observations is similar to the
likelihood method. We just replace the summation with the maximum function.


In this algorithm, we also record the maximum path that leads to each node at time t (the red arrow above), i.e. we backtrace the optimal path for each node. For example, we transition from a happy state H at t=1 to the happy state H at t=2.
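Below is a minimal sketch of Viterbi decoding in the same setting: the sum of the forward pass becomes a max, and back-pointers record the best predecessor so the optimal path can be backtraced. The code and its conventions are illustrative assumptions, not the article's implementation.

```python
# A minimal sketch of Viterbi decoding for a discrete-emission HMM.
import numpy as np

def viterbi(pi, A, B, obs):
    k, T = len(pi), len(obs)
    v = np.zeros((T, k))                    # v_t(j): best joint probability ending in state j
    back = np.zeros((T, k), dtype=int)      # back-pointer to the best predecessor state
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1, :, None] * A      # scores[i, j] = v_{t-1}(i) * a_ij
        back[t] = scores.argmax(axis=0)     # best predecessor i for each state j
        v[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtrace the optimal path from the best final state
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), v[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.8], [0.4, 0.6]])
print(viterbi(pi, A, B, obs=[0, 1, 0]))     # most likely state sequence and its probability
```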

Source

Learning (Baum–Welch algorithm/forward-backward algorithm)

Now, it comes to the hard part. How can we learn the HMM model? This can be done by the Baum–Welch algorithm (forward-backward algorithm), which learns the transition and the emission probability. This task sounds like mission impossible since both probabilities are highly tangled in our calculation. But from some perspective, if we know the state occupation probability (the state distribution at time t), we can derive the emission probability and the transition probability. If we know these two probabilities, we can derive the state distribution at time t. This is the chicken-and-egg problem we discussed in the EM algorithm. The EM algorithm solves the problem in iterative steps. In each step, we optimize one set of latent variables while fixing the others. Imagine that each iteration step improves the solution. Even for a continuous space, we work with limited precision and therefore, there are finite states to explore and improve. So, if we keep iterating, the solution will converge.

So it is not surprising that the Baum–Welch algorithm is a special case of the EM algorithm.

Let’s get familiar with the following new notations.


We are already familiar with α (the forward probability) from the forward algorithm. β (the backward probability) is its close cousin in the reverse direction (the probability of seeing all the coming observations given a state i at time t). We can express it recursively, similar to α but in the reverse direction (a.k.a. the backward algorithm).

To learn the HMM model, we need to know which states we are in to explain the observations best. That will be the state occupation probability γ — the probability of being in state i at time t given all the observations.

With the HMM model parameters fixed, we can apply the forward and backward algorithms to calculate α and β from the observations. γ can then be calculated by simply multiplying α with β and renormalizing.

ξ is the probability of transiting from state i to j after time t given all the observations. It can be computed from α and β similarly.
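A minimal sketch of these two quantities, computed from α and β exactly as described (discrete emissions, NumPy conventions assumed, not the article's code), might look like this:

```python
# A minimal sketch of the state-occupation probability gamma and the
# transition probability xi, computed from alpha and beta.
# alpha, beta have shape (T, k); A is the transition matrix, B the emission
# matrix, obs the observation indices.
import numpy as np

def gamma_xi(alpha, beta, A, B, obs):
    T, k = alpha.shape
    likelihood = alpha[-1].sum()                     # P(obs | model)
    gamma = alpha * beta / likelihood                # gamma_t(i), shape (T, k)
    xi = np.zeros((T - 1, k, k))
    for t in range(T - 1):
        # xi_t(i, j) proportional to alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        xi[t] = alpha[t, :, None] * A * B[:, obs[t + 1]] * beta[t + 1]
        xi[t] /= likelihood
    return gamma, xi
```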


Intuitively, with a fixed HMM model, we refine the state occupation probability (γ) and
the transition (ξ) with the given observations.

Here comes the chicken-and-egg part. Once the distributions of γ and ξ (θ₂) are refined, we can perform a point estimate of the best transition and emission probabilities (θ₁: a, b).

We fix one set of parameters to improve others and continue the iteration until the
solution converges.

The EM algorithm is usually defined as:

Here, the E-step establishes p(γ, ξ | x, a, b). Then, the M-step finds a, b that roughly
maximize the objective below.

Here is the recap of the algorithm:


So, given all the observations in the training data, the Baum–Welch algorithm can learn the HMM model. However, remember to keep an open mind. In speech recognition, the problem is far more complex and many solutions do not scale well.

Acoustic model


Modified from source

In ASR, we can use a pronunciation table to produce the phones for the text sequence Y.
Next, we need to create an acoustic model for these phones.

Decades of research have been done on phonetics. Experts can identify vowels and
consonants from reading the spectrograms directly.


Source

But again we need a much denser representation of the acoustic model so we can
determine the likelihood of an audio feature vector X given the phone P(X|phone).

With MFCC, we extract 39 features from an audio frame. Let’s simplify the picture and
assume that there is only one feature for each frame. For the state “sh” (phone), the value
of this feature can be modeled with a normal distribution.


To extend the concept to 39 features, we just need a Multivariate normal distribution with
39 variables. The following diagram visualizes a bivariate normal distribution for two
variables.

Source (Bivariate normal distribution)

Here is the definition of the Multivariate normal distribution.


where Σ is the covariance matrix which measures the correlations among variables. MFCC parameters have a nice property: they are relatively independent of each other. Therefore, the non-diagonal elements of Σ can simply be set to zero.

But, it is too hard to think multi-dimensionally. So we will stick with a 1-D example for
illustration. The likelihood p(x| q) of the observed feature x will be calculated as how far
it is from the peak of the normal distribution q:

Given different phones, we can calculate the corresponding probability density values and classify the observation as the phone with the highest value. To learn this Gaussian distribution, we can simply estimate it from the training data points xᵢ.

These equations can be proven by maximizing the likelihood of the training data.


Source

So this Gaussian model is easy to learn from training data and gives us a nice likelihood model P(x | μ, σ²). In the context of speech recognition, we can learn the Gaussian model (μ, σ²) for each phone. This serves as the likelihood probability. It also acts as the emission probability in an HMM.
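As a sketch with made-up 1-D features, learning one Gaussian per phone and scoring a new frame could look like the following (the phones, feature values, and the use of scipy are assumptions for illustration):

```python
# A minimal sketch of a single-Gaussian acoustic model per phone,
# fitted by maximum likelihood from toy 1-D training features.
import numpy as np
from scipy.stats import norm

training = {                                   # toy 1-D feature values per phone
    "sh": np.array([2.1, 2.3, 1.9, 2.2, 2.0]),
    "iy": np.array([0.4, 0.6, 0.5, 0.3, 0.7]),
}
params = {ph: (x.mean(), x.std()) for ph, x in training.items()}   # MLE mu, sigma

def classify(x):
    """Pick the phone whose Gaussian gives the observed feature the highest density."""
    return max(params, key=lambda ph: norm.pdf(x, *params[ph]))

print(classify(2.05))   # -> "sh"
```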

Unfortunately, this concept is naive even if we use a multivariate Gaussian distribution. If it were true, it would be much simpler to learn a foreign spoken language. The likelihood is more complex than a single-peak bell curve. To address this, we switch to a Gaussian Mixture Model (GMM). This allows the distribution to be multimodal, i.e. we allow a feature to have a few possible values. This provides the flexibility for variants in speech.

For example, the GMM on the right combines three Gaussian distributions with different weights to form a new probability density (a 3-component GMM). The model is still pretty dense: 6 Gaussian parameters plus 3 weights.

GMM acoustic model

Intuitively, the feature value of a specific phone can be observed near one of the m modes. But some values may be more likely than others. So we introduce weights to indicate which are more likely. With the internal HMM state being j, the likelihood of the observed feature vector is:
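In the usual notation (mixture weights $c_{jm}$ summing to one over the M components of state j; this is the standard form, assumed here rather than quoted from the article), the emission density is:

$$b_j(x) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}\!\left(x;\, \mu_{jm}, \Sigma_{jm}\right), \qquad \sum_{m=1}^{M} c_{jm} = 1$$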


To learn a GMM, say a 2-component GMM, we feed features extracted from the training data to fit the parameters of these two clusters. Conceptually, we start with an initial or random guess of these parameters. We find which cluster each data sample should belong to. Then we re-calculate the cluster parameters according to the associated data points.

Yes, we will use the EM algorithm to iterate the solution until it converges. In EM, we use a soft assignment rather than a hard assignment. For a hard assignment, we assign each data sample to a specific cluster (a point estimate). In a soft assignment, it is a probability distribution, i.e. the chance that a sample belongs to each cluster. Then we recompute the cluster parameters according to this soft assignment. Since we have covered this a few times, we will not elaborate further on how it is trained.
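For reference, here is a minimal sketch of fitting a 2-component GMM with EM using scikit-learn's GaussianMixture; the library choice and the toy data are assumptions, not part of the article.

```python
# A minimal sketch of EM-based GMM fitting with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy 1-D features drawn from two clusters
x = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(1.5, 0.8, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, covariance_type="diag").fit(x)
print(gmm.weights_, gmm.means_.ravel())          # mixture weights and component means
print(gmm.predict_proba([[0.0]]))                # soft assignment for one sample
print(gmm.score_samples([[0.0]]))                # log-likelihood of the frame under the fitted GMM
```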

To recap, given a phone, we can learn the distribution of the observable feature vector using a GMM. This probability distribution allows us to compute the likelihood of our speech segment given a phone P(x|s) — this is also the emission probability given an internal state in HMM.

Vector quantization
Throughout the process, we try to model a denser representation of the acoustic signals. GMM is one popular approach. Alternatively, after we extract a training set of feature vectors from our corpus, we group these features into k clusters, say using k-means clustering. This creates a codebook of size k to encode an audio frame.

k=3 for 2-D data

With this index, we can start using it to train an HMM. Once the model is trained, we can also use it to decode audio clips. This method is called Vector Quantization and was used in early research. But it is less popular compared to GMM. Therefore, we just want you to be aware of it.
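A minimal sketch of this codebook idea with scikit-learn's KMeans follows; the codebook size and the random stand-in data are illustrative assumptions.

```python
# A minimal sketch of vector quantization: cluster training feature vectors
# with k-means to build a codebook, then encode each frame as the index of
# its nearest centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_features = rng.normal(size=(1000, 39))     # stand-in for corpus MFCC frames
codebook = KMeans(n_clusters=256, n_init=10).fit(train_features)

new_clip = rng.normal(size=(120, 39))            # stand-in for a new audio clip
codes = codebook.predict(new_clip)               # one codebook index per frame
print(codes[:10])                                # discrete symbols usable for HMM training
```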

Thoughts
GMM models the observed probability distribution of the feature vector given a phone. It
provides a principled method to measure “distance” between a phone and our observed
audio frame.

On the other hand, HMM provides a principled model of how states transition and are observed. The probability of the observations can be modeled with HMM as
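That expression (reconstructed here in the standard form, since the original image is not reproduced) is:

$$p(X) = \sum_{h} p(X \mid h)\, p(h)$$

where the sum runs over the hidden state sequences h.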

Equation source

where h is the hidden state (phone). The likelihood of features given a phone can be modeled with GMM.

Next
We are ready to apply our knowledge of HMM and GMM to develop the basic components of speech recognition.

Speech Recognition — Acoustic, Lexicon & Language Model


Speech recognition can be viewed as finding the best sequence of
words (W) according to the acoustic, the pronunciation…
medium.com

Credits & references


Automatic Speech Recognition

Speech Recognition

Spoken Language Processing

End-to-end models for Speech Processing

Basic concepts of speech recognition

Phones and Phonemes

Clip art credits


Wave file
