Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis: Speech Technology: A Practical

Speech Technology: A Practical
Introduction
Topic: Spectrogram, Cepstrum

and Mel-Frequency Analysis
Kishore Prahallad
Email: skishore@cs.cmu.edu
Carnegie Mellon University
&
International Institute of Information Technology Hyderabad
1
Speech Technology - Kishore Prahallad (skishore@cs.cmu.edu)
Topics
Spectrogram
Cepstrum
Mel-Frequency Analysis
Mel-Frequency Cepstral Coefficients
2
Spectrogram
3
Speech signal represented as a sequence of spectral vectors
FFT
FFT
FFT
Spectrum
4
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT FFT
FFT
Spectrum
5
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT FFT
FFT
Spectrum
Amp.
Hz
6
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT FFT
FFT
Spectrum
Rotate it by 90 degrees
Hz
Amplitude
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT FFT
FFT
Spectrum
Hz
MAP spectral amplitude to a grey level (0255) value. 0 represents black and 255
represents white.
Higher the amplitude, darker the
corresponding region.
8
Amplitude
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT FFT
FFT
Spectrum
Hz
Time
Time Vs Frequency
FFT
FFT
FFT
FFT
representation of a speech
signal is referred to as
spectrogram
Spectrum
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT FFT
FFT
Hz
Time
10
Some Real Spectrograms

Dark regions indicate
peaks (formants)
in the spectrum
11
Why we are bothered about

spectrograms
Phones and their
properties are
better observed
in spectrogram
12

spectrograms
Sounds can be
identified much
better by the
Formants and by
their transitions
13

spectrograms
Sounds can be
identified much
better by the
Formants and by
their transitions
Hidden Markov
Models implicitly
model these
spectrograms to
perform speech
recognition
14
Usefulness of Spectrogram
Time-Frequency representation of the speech signal
Spectrogram is a tool to study speech sounds (phones)
Phones and their properties are visually studied by phoneticians
Hidden Markov Models implicitly model spectrograms for speech to
text systems
Useful for evaluation of text to speech systems
A high quality text to speech system should produce synthesized
speech whose spectrograms should nearly match with the natural
sentences.
15
Cepstral Analysis
16
A Sample Speech Spectrum

dB
Frequency (Hz)
Peaks denote dominant frequency

components in the speech signal
Peaks are referred to as formants
Formants carry the identity of the sound
17
What we want to Extract?

Spectral Envelope
Formants and a smooth curve connecting them
This Smooth curve is referred to as spectral envelope
dB
Frequency (Hz)
18
Spectral Envelope
Spectrum
Spectral
Envelope
Spectral
details
19
Spectral Envelope
Spectrum
log X[k]
Spectral
Envelope
log H[k]
Spectral
details
log E[k]
20
Spectral Envelope
Spectrum
log X[k]
log X[k] = log H[k] + log E[k]
Spectral
Envelope
log H[k]
Spectral
details
log E[k]
1. Our goal: We want to

separate spectral
envelope and spectral
details from the
spectrum.
2. i.e Given log X[k],
obtain log H[k] and log
E[k], such that
log X[k] = log H[k] + log
E[k]
21
How to achieve this

separation ?
22
Play a Mathematical Trick

Spectrum
Trick: Take FFT of

the spectrum!!
An FFT on spectrum
referred to as Inverse
FFT (IFFT).
Spectral
Envelope
Note: We are dealing

with spectrum in log
domain (part of the
trick)
IFFT of log spectrum
would represent the
signal in pseudofrequency axis
Spectral
details
23

Spectrum
Spectral
Envelope
A pseudo-frequency
axis
Spectral
details
24

Spectrum
Low Freq.
region
High Freq.
region
Spectral
Envelope
A pseudo-frequency
axis
Spectral
details
25

Spectrum
Low Freq.
region
High Freq.
region
Spectral
Envelope
IFFT
A pseudo-frequency
axis
Spectral
details
26

Spectrum
Low Freq.
region
HighTreat
Freq.this as a
regionsine wave
with 4 cycles
per sec.
Spectral
Envelope
IFFT
A pseudo-frequency
axis
Spectral
details
27
Gives a peak
at 4 Hz in
frequency
axis
Low Freq.
region
Spectrum
HighTreat
Freq.this as a
regionsine wave
with 4 cycles
per sec.
Spectral
Envelope
IFFT
A pseudo-frequency
axis
Spectral
details
28
Gives a peak
at 4 Hz in
frequency
axis
Low Freq.
region
Spectrum
HighTreat
Freq.this as a
regionsine wave
with 4 cycles
per sec.
Spectral
Envelope
IFFT
A pseudo-frequency
axis
Spectral
details
29

Spectrum
Low Freq.
region
High Freq.
region
Spectral
Envelope
IFFT
A pseudo-frequency
axis
Spectral
details
30

Gives a peak
at 100 Hz in
frequency
Low Freq.axis
region
Spectrum
High Freq.
region
Treat this as a
sine wave with
100 cycles per
sec.
Spectral
Envelope
IFFT
A pseudo-frequency
axis
Spectral
details
31

Spectrum
Low Freq.
region
High Freq.
region
Spectral
Envelope
IFFT
IFFT
A pseudo-frequency
axis
Spectral
details
32

Spectrum
Spectral
Envelope
A pseudo-frequency
axis
Spectral
details
33

log X[k] = log H[k] + log E[k] Spectrum
IFFT
log H[k]
Spectral
Envelope
log E[k]
A pseudo-frequency
axis
Spectral
details
34

x[k] = h[k] + e[k]
IFFT
log H[k]
Spectral
Envelope
log E[k]
A pseudo-frequency
axis
Spectral
details
35

x[k] = h[k] + e[k]
IFFT
log H[k]
In practice all you
have access to only
log X[k] and hence
you can obtain x[k]
Spectral
Envelope
log E[k]
A pseudo-frequency
axis
Spectral
details
36

x[k] = h[k] + e[k]
IFFT
log H[k]
If you know x[k]
Filter the low
frequency region to get
h[k]
Spectral
Envelope
log E[k]
A pseudo-frequency
axis
Spectral
details
37

x[k] = h[k] + e[k]
IFFT
A pseudo-frequency
axis
x[k] is referred to as Cepstrum

h[k] is obtained by considering
the low frequency region of x[k].
h[k] represents the spectral
envelope and is widely used as
feature for speech recognition
log H[k]
Spectral
Envelope
log E[k]
Spectral
details
38
Cepstral Analysis
X [ k ] = H [ k ] E[ k ]
|| X [k ] || = || H [k ] || || E[k ] ||
|| . || denotes magnitude
Take Log on both sides
log || X [k ] || = log ||H [k ] || + log || E[k ] ||
Taking inverse FFT on both sides
x[k ] = h[k ] + e[k ]
39
40
Review: What we did

We captured spectral envelope (curve
connecting all formants)
BUT: Perceptual experiments say human ear
concentrates on certain regions rather than
using whole of the spectral envelope.
dB
Frequency (Hz)
41
Mel-Frequency analysis of speech is
based on human perception experiments
It is observed that human ear acts as filter
It concentrates on only certain frequency
components
These filters are non-uniformly spaced on

the frequency axis
More filters in the low frequency regions
Less no. of filters in high frequency regions
42
Mel-Frequency Filters
43
Mel-Frequency Filters
More no. of filters in low
freq. region
Lesser no. of filters in

high freq. region
44
Mel-Frequency Cepstral
Coefficients (MFCC)
Spectrum Mel-Filters Mel-Spectrum
Say log X[k] = log (Mel-Spectrum)
NOW perform Cepstral analysis on log X[k]
log X[k] = log H[k] + log E[k]
Taking IFFT
x[k] = h[k] + e[k]
Cepstral coefficients h[k] obtained for Melspectrum are referred to as Mel-Frequency

Cepstral Coefficients often denoted by *MFCC*
45
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT FFT
FFT
Spectrum
Mel-Filters
Cepstral Analy.
46
Speech signal represented as a sequence of CEPSTRAL vectors
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT
FFT FFT
FFT
Spectrum
Cepstral
Vectors
47
Why we are going to use MFCC

Speech synthesis
Used for joining two speech segments S1 and S2

Represent S1 as a sequence of MFCC
Represent S2 as a sequence of MFCC
Join at the point where MFCCs of S1 and S2 have
minimal Euclidean distance
Used in speech recognition

MFCC are mostly used features in state-of-art speech
recognition system
48
Summary: Process of Feature

Extraction
Speech is analyzed over short analysis window
For each short analysis window a spectrum is obtained
using FFT
Spectrum is passed through Mel-Filters to obtain MelSpectrum
Cepstral analysis is performed on Mel-Spectrum to
obtain Mel-Frequency Cepstral Coefficients
Thus speech is represented as a sequence of Cepstral
vectors
It is these Cepstral vectors which are given to pattern
classifiers for speech recognition purpose
49
Additional Reading
Chapter 6
Pg: 273 281
Pg: 304 311
Pg: 314 - 316
50

Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis: Speech Technology: A Practical

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis: Speech Technology: A Practical

Uploaded by

Copyright:

Available Formats

Speech Technology: A Practical

Topic: Spectrogram, Cepstrum

Speech signal represented as a sequence of spectral vectors

Speech signal represented as a sequence of spectral vectors

Speech signal represented as a sequence of spectral vectors

Speech signal represented as a sequence of spectral vectors

Speech signal represented as a sequence of spectral vectors

Speech signal represented as a sequence of spectral vectors

Speech signal represented as a sequence of spectral vectors

Some Real Spectrograms

Why we are bothered about

Why we are bothered about

Why we are bothered about

A Sample Speech Spectrum

Peaks denote dominant frequency

What we want to Extract?

1. Our goal: We want to

Speech Technology - Kishore Prahallad (skishore@cs.cmu.edu)

How to achieve this

Play a Mathematical Trick

Trick: Take FFT of

Note: We are dealing

Speech Technology - Kishore Prahallad (skishore@cs.cmu.edu)

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

Play a Mathematical Trick

log X[k] = log H[k] + log E[k] Spectrum

Play a Mathematical Trick

log X[k] = log H[k] + log E[k] Spectrum

Play a Mathematical Trick

log X[k] = log H[k] + log E[k] Spectrum

Play a Mathematical Trick

log X[k] = log H[k] + log E[k] Spectrum

x[k] is referred to as Cepstrum

Review: What we did

These filters are non-uniformly spaced on

Lesser no. of filters in

Cepstral coefficients h[k] obtained for Melspectrum are referred to as Mel-Frequency

Speech signal represented as a sequence of spectral vectors

Speech signal represented as a sequence of CEPSTRAL vectors

Why we are going to use MFCC

Used for joining two speech segments S1 and S2

Used in speech recognition

Summary: Process of Feature

You might also like