
CMP4205: Audio and Speech Signal Processing

AUTOMATIC SPEECH RECOGNITION (ASR)

Cosmas Mwikirize, Ph.D


Department of Electrical and Computer Engineering, Makerere
University

© Cosmas Mwikirize, 2023


Essential Community

https://thesoundofai.slack.com/
Automatic Speech Recognition: Systems (Hardware/Software) that can analyze,
classify and recognize speech signals.

Technology that allows human beings to use their voices to speak with a computer
interface in a way that, in its most sophisticated variations, resembles normal
human conversation.

Important in mobile and security applications, e.g. speech-to-text, Alexa, Siri, etc.

Typical pipeline: Acoustic Processing → Feature Extraction → Classification/Recognition
The ASR Problem

• There is no single ASR problem


• The problem depends on many factors
− Microphone: close-mic, throat-mic, microphone array, audio-visual
− Sources: band-limited, background noise, reverberation
− Speaker: speaker dependent, speaker independent
− Language: open/closed vocabulary, vocabulary size,
read/spontaneous speech
− Output: Transcription, speaker id, keywords
A legacy example: Template-Based ASR
• Originally only worked for isolated words
• Performs best when training and testing conditions
match
• For each word we want to recognize, we store a
template or example based on actual data
• Each test utterance is checked against the templates
to find the best match
• Uses the Dynamic Time Warping (DTW) algorithm
Dynamic Time Warping
• Create a local distance (cost) matrix for the two utterances
• Use dynamic programming to find the lowest-cost path
through it (see the sketch below)
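A minimal sketch of DTW in Python (numpy assumed), comparing two feature sequences such as MFCC frames; the frame-distance choice (Euclidean) and the function name are illustrative assumptions:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping: accumulated cost of the lowest-cost alignment.

    seq_a, seq_b: numpy arrays of shape (frames, features),
    e.g. a stored template and a test utterance.
    """
    n, m = len(seq_a), len(seq_b)
    # Local cost matrix: Euclidean distance between every pair of frames
    cost = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=-1)

    # Accumulated cost via dynamic programming
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m]
```

In template-based ASR, the test utterance is scored against every stored template with dtw_distance and the template with the smallest cost wins.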
Modern Statistical ASR
Step 1: Feature Calculation

• As in any data-driven (and increasingly machine learning) task,
the data must be represented in some format
• Cepstral features have been found to perform well
• Mel-frequency cepstral coefficients (MFCC) are the
most common variety
Essential reading: Alim SA, Rashid NKA. Some Commonly Used Speech Feature Extraction
Algorithms. https://www.intechopen.com/chapters/63970
So what is a Cepstrum?

Many associated wordplays: "cepstrum" from spectrum, "quefrency" from frequency, "liftering" from filtering.


Mathematical Formulation: Computing the Cepstrum
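One common way to state the computation: the (real) cepstrum of a frame x[n] is c[n] = IDFT( log |DFT(x[n])| ). A minimal numpy sketch, assuming x is a single windowed frame of speech:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one windowed speech frame:
    c[n] = IDFT( log |DFT(x[n])| )."""
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # small floor avoids log(0)
    return np.fft.ifft(log_magnitude).real
```

Because the log magnitude spectrum is real and symmetric, the inverse DFT is real up to numerical error; taking .real simply discards the residual imaginary part.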
Visualizing the Cepstrum

The log spectrum itself looks like a continuous signal with some harmonic components.
Can we treat it as if it were a time-domain signal?
Visualizing the Cepstrum

[Figure: cepstrum plotted against quefrency (ms)]

Quefrency: the inverse of the distance between successive lines in a Fourier transform, measured in seconds. It gives an idea of pitch.

Why is this important?
• Info on phonemes
• Info on pitch
The Cepstrum
• One way to think about this
– Separating the source and filter
– Speech waveform is created by
• A glottal source waveform, which
• Passes through a vocal tract that, because of its shape, has
a particular filtering characteristic
• Articulatory facts:
– The vocal cord vibrations create harmonics
– The mouth is an amplifier
– Depending on shape of oral cavity, some harmonics
are amplified more than others
Understanding the cepstrum

[Figure: vocal tract response]
Separating the Components

We need to remove the high-quefrency components associated with
the glottal (source) response, i.e. apply a low-pass filter in the
cepstral domain (a low-pass "lifter").

This leads us to the concept of the MFCC


MFCC

• All the steps are perceptually relevant
• The result gives formant information about the speech
• The pipeline goes from the spectrum, through the mel filter bank
and log compression, to the Discrete Cosine Transform, which gives
real-valued coefficients (an end-to-end sketch follows)
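For orientation, here is a minimal end-to-end sketch; the librosa library and the file name speech.wav are assumptions for illustration:

```python
import librosa

# Load audio at its native sampling rate (sr=None keeps the file's rate)
signal, sr = librosa.load("speech.wav", sr=None)

# 13 MFCCs per frame (coefficient 0 is often treated as an energy-like term)
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
```

The individual steps (pre-emphasis, windowing, DFT, mel filter bank, log, DCT) are worked through in the slides that follow.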


MFCC—Alternative (Detailed) Implementation
Pre-Emphasis

• Pre-emphasis: boosting the energy in the high frequencies
• Q: Why do this?
• A: The spectrum for voiced segments has more
energy at lower frequencies than higher
frequencies.
– This is called spectral tilt
– Spectral tilt is caused by the nature of the glottal pulse
• Boosting high-frequency energy gives more information to the
acoustic model
– Improves recognition performance
Example of pre-emphasis

• Before and after pre-emphasis


– Spectral slice from the vowel [aa]

Example: pass the signal through a first-order finite impulse response (FIR) filter; this
increases the amplitude of the high-frequency bands and decreases the amplitude of the lower bands (see the sketch below).
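A minimal numpy sketch of such a first-order FIR pre-emphasis filter, y[n] = x[n] − α·x[n−1], using the commonly quoted coefficient α = 0.97 (the signal array is assumed to be a 1-D numpy array of samples):

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    """First-order FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```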
MFCC
Windowing

Slide from Bryan Pellom


Windowing
• Why divide speech signal into successive
overlapping frames?
– Speech is not a stationary signal; we want information
about a small enough region that the spectral
information is a useful cue.
• Frames
– Frame size: typically 10-25 ms
– Frame shift: the length of time between successive
frames, typically 5-10 ms
Common window shapes

• Rectangular window
• Hamming window

[Figures: the window shapes in the time domain and in the frequency domain]
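A sketch of framing and windowing, assuming a 25 ms frame size and a 10 ms shift at a 16 kHz sampling rate, and a signal at least one frame long:

```python
import numpy as np

def frame_and_window(signal, sr=16000, frame_ms=25, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sr * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
    frame_shift = int(sr * shift_ms / 1000)  # e.g. 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)  # w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)
```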
MFCC
Discrete Fourier Transform
• Input:
– Windowed signal x[n]…x[m]
• Output:
– For each of N discrete frequency bands
– A complex number X[k] representing magnitude and
phase of that frequency component in the original
signal
• Discrete Fourier Transform (DFT)

• Standard algorithm for computing DFT:


– Fast Fourier Transform (FFT) with complexity N*log(N)
– In general, choose N=512 or 1024
Discrete Fourier Transform: computing a spectrum
• A 24 ms Hamming-windowed signal
– And its spectrum as computed by DFT (plus
other smoothing)
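A sketch of the spectrum computation for the windowed frames, zero-padded to N = 512 FFT points (the power normalisation by N is one common convention):

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Magnitude-squared spectrum of each windowed frame.

    rfft keeps only the n_fft//2 + 1 non-redundant frequency bins,
    since the input frames are real-valued.
    """
    spectrum = np.fft.rfft(frames, n=n_fft)   # complex X[k]: magnitude and phase
    return (np.abs(spectrum) ** 2) / n_fft    # power per frequency bin
```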
MFCC
Mel-scale
• Human hearing is not equally sensitive to all
frequency bands
• Less sensitive at higher frequencies, roughly >
1000 Hz
• i.e. human perception of frequency is non-linear:
Mel-scale
• A mel is a unit of pitch
– Definition: pairs of sounds that are perceptually equidistant
in pitch are separated by an equal number of mels
• The Mel-scale is approximately linear below 1 kHz and
logarithmic above 1 kHz
• Definition (one common analytic form is given in the sketch below):
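One widely used analytic form (several variants exist, so treat this as one common choice) is mel(f) = 2595 · log10(1 + f / 700). In code:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping from mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```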
Mel Filter Bank Processing

• Mel filter bank
– Filters uniformly spaced below 1 kHz
– Spaced on a logarithmic scale above 1 kHz
Mel-filter Bank Processing
• Apply the bank of filters, spaced according to the Mel scale, to
the spectrum
• Each filter output is the sum of its filtered spectral
components
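A sketch of building and applying a triangular mel filter bank (assumed parameters: 26 filters, 512-point FFT, 16 kHz sampling rate), reusing hz_to_mel, mel_to_hz, power_spectrum and frames from the earlier sketches:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters with centre frequencies uniformly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bin_idx = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bin_idx[m - 1], bin_idx[m], bin_idx[m + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# Each filter output is the sum of the spectral components it passes
filterbank_energies = power_spectrum(frames) @ mel_filterbank().T
```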
MFCC
Log energy computation
• Compute the logarithm of the squared
magnitude of the output of the Mel filter bank
Log energy computation
• Why log energy?
– The logarithm compresses the dynamic range of values
• Human response to signal level is logarithmic
• Humans are less sensitive to slight differences in amplitude
at high amplitudes than at low amplitudes
– Makes frequency estimates less sensitive to slight variations
in the input (e.g. power variation due to the speaker's mouth
moving closer to the microphone)
– Phase information is not helpful in speech
MFCC
Mel Frequency cepstrum
• The cepstrum requires Fourier analysis
• But we’re going from frequency space back to
time
• So we actually apply the inverse DFT

Details for signal processing gurus: since the log power spectrum
is real and symmetric, the inverse DFT reduces to a Discrete Cosine
Transform (DCT)
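Putting the last two steps together, a sketch of the log compression followed by the DCT, keeping the first 12 coefficients; it reuses filterbank_energies from the earlier sketch and assumes scipy's DCT-II:

```python
import numpy as np
from scipy.fftpack import dct

# Log of the mel filter bank energies (small floor avoids log(0))
log_energies = np.log(filterbank_energies + 1e-10)

# DCT along the filter axis; keep the first 12 cepstral coefficients
mfcc = dct(log_energies, type=2, axis=-1, norm="ortho")[:, :12]
```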
Another advantage of the Cepstrum

• DCT produces highly uncorrelated features


• Simpler to model (acoustic modelling)
– Simply modelled by linear combinations of Gaussian
density functions with diagonal covariance matrices
• In general we’ll just use the first 12 cepstral
coefficients (we don’t want the later ones which
have e.g. the F0 spike)
MFCC
Dynamic Cepstral Coefficients

• The cepstral coefficients do not capture energy

• So we add an energy feature

• Also, we know that the speech signal is not constant (slope of
formants, change from stop burst to release).

• So we want to add the changes in features (the slopes).

• We call these delta features

• We also add double-delta acceleration features


Delta and double-delta
• Derivative: in order to obtain temporal information
Typical MFCC features

• Window size: 25ms


• Window shift: 10ms
• Pre-emphasis coefficient: 0.97
• MFCC:
– 12 MFCC (mel frequency cepstral coefficients)
– 1 energy feature
– 12 delta MFCC features
– 12 double-delta MFCC features
– 1 delta energy feature
– 1 double-delta energy feature
• Total 39-dimensional features
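A sketch of assembling the 39-dimensional vector, reusing the mfcc, frames and power_spectrum names from the earlier sketches; a simple gradient stands in for the deltas (real front ends usually use a regression over a few surrounding frames):

```python
import numpy as np

def simple_delta(features):
    """Frame-to-frame slope estimate along the time axis."""
    return np.gradient(features, axis=0)

# 12 MFCCs per frame plus 1 log-energy feature -> 13 static features
energy = np.log(np.sum(power_spectrum(frames), axis=-1, keepdims=True) + 1e-10)
static = np.hstack([mfcc, energy])                      # (n_frames, 13)

delta = simple_delta(static)                            # (n_frames, 13)
delta_delta = simple_delta(delta)                       # (n_frames, 13)

features_39 = np.hstack([static, delta, delta_delta])   # (n_frames, 39)
```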
Why is MFCC so popular?
• Efficient to compute
• Incorporates a perceptual Mel frequency scale
• Separates the source and filter

• IDFT(DCT) decorrelates the features


– Improves diagonal assumption in HMM modeling

• Alternatives
– Linear Prediction Coefficients (LPC)
– Linear Prediction Cepstral Coefficients (LPCC)
– Line Spectral Frequencies (LSF)
– Discrete Wavelet Transform (DWT)
– Perceptual Linear Prediction (PLP)
Step 2: Acoustic Model
• For each frame of data, we need some way of
describing the likelihood of it belonging to any of our
classes
• Two methods are commonly used
− Multilayer perceptron (MLP) gives the likelihood of a class
given the data
− Gaussian Mixture Model (GMM) gives the likelihood of the
data given a class
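A sketch of the GMM option, assuming scikit-learn's GaussianMixture; the random arrays are placeholders for real labelled 39-dimensional feature frames of one acoustic class:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder training data for one class: (n_frames, 39) feature vectors.
# In a real system these come from frames labelled with that class.
class_frames = np.random.randn(500, 39)

# Diagonal covariances keep the model simple, which is why
# decorrelated (DCT'd) features are helpful here.
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(class_frames)

# log p(frame | class): the likelihood of the data given the class
frame_loglik = gmm.score_samples(np.random.randn(10, 39))
```

One such model is trained per class, and each incoming frame is scored against all of them.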
Step 3: Pronunciation Model
• While the pronunciation model can be very
complex, it is typically just a dictionary
• The dictionary contains the valid pronunciations
for each word
• Examples:
− Cat: k ae t
− Dog: d ao g
− Fox: f aa k s
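In its simplest form the pronunciation model is just a lookup table; a minimal sketch using the example entries above (the function name is illustrative):

```python
# Word -> list of phones (ARPAbet-style), as in the examples above
pronunciations = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "ao", "g"],
    "fox": ["f", "aa", "k", "s"],
}

def phones_for(word):
    """Return the valid pronunciation for a word, if it is in the dictionary."""
    return pronunciations.get(word.lower())
```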
Step 4: Language Model
• Now we need some way of representing the
likelihood of any given word sequence
• Many methods exist, but n-grams are the most
common
• N-gram models are trained by simply
counting the occurrences of word sequences in a
training set
N-grams
• A unigram is the probability of any word in
isolation
• A bigram is the probability of a word
given the previous word
• Higher order ngrams continue in a similar
fashion
• A backoff probability is used for any unseen
data
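A sketch of training a bigram model by counting word pairs; the tiny in-line corpus is a stand-in for real training data, and the constant backoff is a crude stand-in for a properly estimated backoff probability:

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def bigram_prob(word, prev, backoff=1e-4):
    """P(word | prev) from counts; fall back to a small constant if unseen."""
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    return backoff

print(bigram_prob("cat", "the"))  # 2/3 in this toy corpus
```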
How do we put it together?

• We now have models to represent the three parts of our equation
• We need a framework to join these models
together
• The standard framework used is the Hidden
Markov Model (HMM)
Markov Model

• A state model using the Markov property
– The Markov property states that the future
depends only on the present state
• Models the likelihood of transitions between
states in a model
• Given the model, we can determine the
likelihood of any sequence of states
Hidden Markov Model

• Similar to a Markov model, except the states
are hidden
• We now have observations tied to the
individual states
• We no longer know the exact state sequence
given the data
• Allows for the modeling of an underlying
unobservable process
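A sketch of the Viterbi algorithm, which recovers the most likely hidden state sequence from per-frame observation log-likelihoods (for example, the acoustic-model scores of Step 2); the transition matrix and prior are assumed inputs:

```python
import numpy as np

def viterbi(log_obs, log_trans, log_prior):
    """Most likely state path through an HMM.

    log_obs:   (n_frames, n_states) log p(observation_t | state)
    log_trans: (n_states, n_states) log p(state_j | state_i)
    log_prior: (n_states,)          log p(initial state)
    """
    n_frames, n_states = log_obs.shape
    score = np.zeros((n_frames, n_states))
    back = np.zeros((n_frames, n_states), dtype=int)

    score[0] = log_prior + log_obs[0]
    for t in range(1, n_frames):
        cand = score[t - 1][:, None] + log_trans        # every previous-state option
        back[t] = np.argmax(cand, axis=0)               # best predecessor per state
        score[t] = cand[back[t], np.arange(n_states)] + log_obs[t]

    # Trace back the best path from the best final state
    path = [int(np.argmax(score[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```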
