
Speech Technology: A Practical Introduction

Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis
Kishore Prahallad
Email: skishore@cs.cmu.edu
Carnegie Mellon University
&
International Institute of Information Technology Hyderabad
Speech Technology - Kishore Prahallad (skishore@cs.cmu.edu)

Topics

Spectrogram
Cepstrum
Mel-Frequency Analysis
Mel-Frequency Cepstral Coefficients


Spectrogram


Speech signal represented as a sequence of spectral vectors

[Figure: the speech signal is cut into short analysis windows; an FFT of each window gives one spectrum.]


[Figure: the spectrum of each window, plotted as amplitude (Amp.) against frequency (Hz).]

Rotate each spectrum by 90 degrees, so that frequency (Hz) runs along the vertical axis and amplitude along the horizontal axis.

Map the spectral amplitude to a grey-level value (0-255), where 0 represents black and 255 represents white.
The higher the amplitude, the darker the corresponding region.

[Figure: the grey-level spectra of successive windows placed side by side, with frequency (Hz) on the vertical axis and time on the horizontal axis.]

The time vs. frequency representation of a speech signal is referred to as a spectrogram.
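The construction described above (window, FFT, grey-level mapping, stacking over time) can be sketched in Python. This is a minimal sketch rather than the lecture's own code: it assumes NumPy, a Hann window, a 256-sample window with a hop of 128 samples, and a log (dB) scale before the grey-level mapping. The names `spectrogram` and `to_grey` are hypothetical.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Return an array of shape (num_frames, frame_len // 2 + 1) of magnitude spectra."""
    window = np.hanning(frame_len)  # assumed window; the slides do not name one
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # one spectrum per analysis window

def to_grey(spec):
    """Map amplitude to 0..255 so that higher amplitude is darker (0 = black)."""
    log_spec = 20.0 * np.log10(spec + 1e-10)      # assumed dB scale
    lo, hi = log_spec.min(), log_spec.max()
    scaled = (log_spec - lo) / (hi - lo + 1e-12)  # 0 = weakest, about 1 = strongest
    return np.round(255.0 * (1.0 - scaled)).astype(np.uint8)

# Synthetic input: a 1 kHz tone sampled at 8 kHz for one second
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)   # rows are time, columns are frequency bins
grey = to_grey(spec)       # grey-level image of the spectrogram
```

Each row of `grey` is one rotated, grey-coded spectrum; displaying the transposed array puts time on the horizontal axis and frequency on the vertical axis.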

Some Real Spectrograms


Dark regions indicate peaks (formants) in the spectrum.

Why do we care about spectrograms?

Phones and their properties are better observed in a spectrogram.

Sounds can be identified much better by the formants and by their transitions.

Hidden Markov Models implicitly model these spectrograms to perform speech recognition.

Usefulness of Spectrogram
Time-frequency representation of the speech signal
A spectrogram is a tool to study speech sounds (phones)
Phones and their properties are visually studied by phoneticians
Hidden Markov Models implicitly model spectrograms for speech-to-text systems
Useful for evaluation of text-to-speech systems
A high-quality text-to-speech system should produce synthesized speech whose spectrograms nearly match those of the natural sentences.

Cepstral Analysis


A Sample Speech Spectrum


[Figure: a speech spectrum, amplitude in dB against frequency in Hz.]

Peaks denote dominant frequency components in the speech signal.
Peaks are referred to as formants.
Formants carry the identity of the sound.

What do we want to extract?

Spectral envelope
The formants and a smooth curve connecting them.
This smooth curve is referred to as the spectral envelope.

Spectral Envelope
[Figure: the spectrum shown together with its spectral envelope and its spectral details.]

Spectrum: log X[k]
Spectral envelope: log H[k]
Spectral details: log E[k]

log X[k] = log H[k] + log E[k]

1. Our goal: separate the spectral envelope and the spectral details from the spectrum.
2. That is, given log X[k], obtain log H[k] and log E[k] such that log X[k] = log H[k] + log E[k].

How do we achieve this separation?

Play a Mathematical Trick


Trick: take the FFT of the spectrum!
An FFT applied to a spectrum is referred to as an inverse FFT (IFFT).

Note: we are dealing with the spectrum in the log domain (part of the trick).
The IFFT of the log spectrum represents the signal on a pseudo-frequency axis.

[Figure: the IFFT maps the spectrum, the spectral envelope and the spectral details onto a pseudo-frequency axis, which has a low-frequency region and a high-frequency region.]

Spectral envelope: treat this as a sine wave with 4 cycles per second.
Its IFFT gives a peak at 4 Hz on the pseudo-frequency axis, in the low-frequency region.

Spectral details: treat this as a sine wave with 100 cycles per second.
Its IFFT gives a peak at 100 Hz on the pseudo-frequency axis, in the high-frequency region.
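The sine-wave intuition can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the lecture: it uses NumPy, a hypothetical curve length of 1024 points, and reads "cycles per second" as "cycles across the whole curve". A slowly varying curve (4 cycles) and a rapidly varying curve (100 cycles) are added, and the IFFT of the sum shows one peak for each.

```python
import numpy as np

n = 1024                      # assumed number of points on the curve
k = np.arange(n)

slow = np.cos(2 * np.pi * 4 * k / n)     # stands in for the spectral envelope
fast = np.cos(2 * np.pi * 100 * k / n)   # stands in for the spectral details
mixed = slow + 0.3 * fast                # the two components added together

def peak_bin(curve):
    """Index of the largest IFFT magnitude in the first half of the axis."""
    magnitude = np.abs(np.fft.ifft(curve))
    return int(np.argmax(magnitude[: n // 2]))

slow_peak = peak_bin(slow)    # position of the slow component
fast_peak = peak_bin(fast)    # position of the fast component
mixed_magnitude = np.abs(np.fft.ifft(mixed))
```

Because the two components land in different regions of the pseudo-frequency axis, keeping only the low region of `mixed_magnitude` retains the slow component and discards the fast one.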


Taking the IFFT of
log X[k] = log H[k] + log E[k]
gives, on the pseudo-frequency axis,
x[k] = h[k] + e[k]

In practice, all you have access to is log X[k], and hence you can obtain x[k].

If you know x[k], filter the low-frequency region to get h[k].

x[k] is referred to as the cepstrum.
h[k] is obtained by considering the low-frequency region of x[k].
h[k] represents the spectral envelope and is widely used as a feature for speech recognition.

Cepstral Analysis
X[k] = H[k] E[k]
||X[k]|| = ||H[k]|| ||E[k]||
|| . || denotes magnitude
Taking the log of both sides:
log ||X[k]|| = log ||H[k]|| + log ||E[k]||
Taking the inverse FFT of both sides:
x[k] = h[k] + e[k]
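The derivation above can be sketched in Python as follows. This is a minimal sketch under assumptions not stated in the slides: it uses NumPy, keeps the first `n_keep = 20` cepstral values (plus their mirror images) as the "low-frequency region", and runs on a random test frame. The name `separate_envelope` is hypothetical.

```python
import numpy as np

def separate_envelope(frame, n_keep=20):
    """Split log|X[k]| into log H[k] (envelope) and log E[k] (details)."""
    if n_keep < 2 or 2 * n_keep - 1 > len(frame):
        raise ValueError("n_keep must be at least 2 and small relative to the frame")
    log_x = np.log(np.abs(np.fft.fft(frame)) + 1e-10)  # log magnitude spectrum
    x = np.fft.ifft(log_x).real                        # cepstrum x[k]

    # Keep only the low region of the pseudo-frequency axis. The cepstrum of a
    # real signal is symmetric, so the mirrored tail is kept as well.
    keep = np.zeros(len(x))
    keep[:n_keep] = 1.0
    keep[-(n_keep - 1):] = 1.0

    h = x * keep                                       # envelope part h[k]
    e = x - h                                          # details part e[k]
    log_h = np.fft.fft(h).real                         # back to the log spectrum
    log_e = np.fft.fft(e).real
    return log_x, log_h, log_e

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)                       # stand-in for a speech frame
log_x, log_h, log_e = separate_envelope(frame)
```

By construction `log_h + log_e` reproduces `log_x`, and `log_h` varies much more slowly along the frequency axis than `log_x` does.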

Mel-Frequency Analysis


Review: What we did


We captured the spectral envelope (the curve connecting all formants).
BUT: perceptual experiments say the human ear concentrates on certain regions rather than using the whole of the spectral envelope.

Mel-Frequency Analysis
Mel-Frequency analysis of speech is based on human perception experiments
It is observed that the human ear acts as a filter
It concentrates on only certain frequency components

These filters are non-uniformly spaced on the frequency axis
More filters in the low-frequency regions
Fewer filters in the high-frequency regions

Mel-Frequency Filters


More filters in the low-frequency region
Fewer filters in the high-frequency region
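The non-uniform spacing can be illustrated by placing filter centres at equal steps on a mel scale. The slides do not give a formula, so the sketch below assumes the commonly used mapping mel = 2595 log10(1 + f / 700), NumPy, 20 filters and an 8 kHz upper limit; `hz_to_mel`, `mel_to_hz` and `mel_centres` are hypothetical names.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Assumed Hz-to-mel mapping (a commonly used formula, not from the slides)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_centres(n_filters=20, f_max=8000.0):
    """Centre frequencies (Hz) of filters spaced uniformly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mel_points[1:-1])   # drop the two outer edge points

centres = mel_centres()
n_low = int(np.sum(centres < 4000.0))    # filters centred in the lower half band
n_high = int(np.sum(centres >= 4000.0))  # filters centred in the upper half band
```

Equal steps in mel become widening steps in Hz, so more of the centres fall in the lower half of the band than in the upper half.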

Mel-Frequency Cepstral Coefficients (MFCC)
Spectrum → Mel-Filters → Mel-Spectrum
Say log X[k] = log (Mel-Spectrum)
NOW perform cepstral analysis on log X[k]
log X[k] = log H[k] + log E[k]
Taking the IFFT:
x[k] = h[k] + e[k]

The cepstral coefficients h[k] obtained for the Mel-Spectrum are referred to as Mel-Frequency Cepstral Coefficients, often denoted by MFCC.
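The chain above can be sketched for a single frame as follows. This is a sketch under several assumptions, not the lecture's implementation: NumPy, a Hamming window, a power spectrum, triangular filters spaced with the common 2595 log10(1 + f / 700) mel formula, 26 filters and 13 coefficients. It also swaps in a DCT-II for the IFFT step, which is common practice for a real, short log Mel-Spectrum. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)   # assumed mel formula

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters, shape (n_filters, n_fft // 2 + 1)."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((n_filters, bin_hz.size))
    for i in range(n_filters):
        left, centre, right = edges_hz[i : i + 3]
        rising = (bin_hz - left) / (centre - left)
        falling = (right - bin_hz) / (right - centre)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(frame, fs=16000, n_filters=26, n_coeffs=13):
    """Spectrum -> Mel-Filters -> log Mel-Spectrum -> low-order cepstral coefficients."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    mel_spectrum = mel_filterbank(n_filters, n_fft, fs) @ spectrum
    log_mel = np.log(mel_spectrum + 1e-10)
    # DCT-II used in place of the IFFT (assumption); keeping only the first
    # n_coeffs outputs corresponds to keeping the low region h[k].
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n_filters)
    return basis @ log_mel

fs = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(512) / fs)   # synthetic test frame
coeffs = mfcc_frame(frame, fs=fs)
```

Scaling the frame's amplitude shifts every log Mel-Spectrum value by the same constant, which in this sketch changes only `coeffs[0]`; the remaining coefficients describe the shape of the envelope.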

Speech signal represented as a sequence of spectral vectors

[Figure: the FFT spectrum of each window is passed through the Mel-Filters and then through cepstral analysis.]

Speech signal represented as a sequence of CEPSTRAL vectors

[Figure: the spectrum of each window is converted into one cepstral vector.]

Why are we going to use MFCC?

Speech synthesis
Used for joining two speech segments S1 and S2
Represent S1 as a sequence of MFCCs
Represent S2 as a sequence of MFCCs
Join at the point where the MFCCs of S1 and S2 have minimal Euclidean distance

Speech recognition
MFCCs are the most widely used features in state-of-the-art speech recognition systems
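The join criterion for synthesis can be sketched as a search for the closest pair of MFCC vectors. The slides state only the criterion, so the details here are assumptions: NumPy, an exhaustive search over every frame of S1 against every frame of S2, and random arrays standing in for real MFCC sequences. The name `best_join` is hypothetical.

```python
import numpy as np

def best_join(mfcc_s1, mfcc_s2):
    """Return (i, j, distance) for the closest frame pair between S1 and S2.

    mfcc_s1 has shape (frames_1, n_coeffs), mfcc_s2 has shape (frames_2, n_coeffs).
    """
    diff = mfcc_s1[:, None, :] - mfcc_s2[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))        # Euclidean distance, all pairs
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return int(i), int(j), float(dist[i, j])

# Stand-in data: 40 and 50 frames of 13 coefficients each
rng = np.random.default_rng(0)
s1 = rng.standard_normal((40, 13))
s2 = rng.standard_normal((50, 13))
s2[7] = s1[31]                 # plant one identical frame so the answer is known

i, j, d = best_join(s1, s2)    # join S1 at frame i to S2 at frame j
```

A practical system would likely restrict the search to frames near the segment boundaries; the exhaustive search is used here only to keep the sketch short.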

Summary: Process of Feature Extraction
Speech is analyzed over short analysis windows
For each short analysis window, a spectrum is obtained using the FFT
The spectrum is passed through Mel-Filters to obtain the Mel-Spectrum
Cepstral analysis is performed on the Mel-Spectrum to obtain Mel-Frequency Cepstral Coefficients
Thus speech is represented as a sequence of cepstral vectors
It is these cepstral vectors that are given to pattern classifiers for speech recognition

Additional Reading
Chapter 6
Pg: 273 - 281
Pg: 304 - 311
Pg: 314 - 316

