
Speech Technology: A Practical Introduction

Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis


Kishore Prahallad
Email: skishore@cs.cmu.edu
Carnegie Mellon University & International Institute of Information Technology Hyderabad
1 Speech Technology - Kishore Prahallad (skishore@cs.cmu.edu)

Topics
Spectrogram
Cepstrum
Mel-Frequency Analysis
Mel-Frequency Cepstral Coefficients

Spectrogram


Speech signal represented as a sequence of spectral vectors

[Figure: the speech signal is cut into short frames; an FFT of each frame gives a spectrum.]


[Figure: one spectrum from the sequence, with axes Amp. (amplitude) and Hz (frequency).]

Rotate the spectrum by 90 degrees.

[Figure: the rotated spectrum, with axes Hz and Amplitude.]

Map the spectral amplitude to a grey level (0-255) value. 0 represents black and 255 represents white. The higher the amplitude, the darker the corresponding region.
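The grey-level mapping can be sketched in a few lines. This is only an illustration, assuming NumPy and a simple linear min-max scaling; the slides do not say how the amplitude range is scaled, and the function name `amplitude_to_grey` is made up for this example.

```python
import numpy as np

def amplitude_to_grey(amplitudes):
    """Map spectral amplitudes to grey levels in 0..255 (0 = black, 255 = white)."""
    a = np.asarray(amplitudes, dtype=float)
    lo, hi = a.min(), a.max()
    if hi == lo:
        # Flat input: no contrast to show, so return mid-grey (an arbitrary choice).
        return np.full(a.shape, 128, dtype=np.uint8)
    scaled = (a - lo) / (hi - lo)  # 0.0 for the smallest value, 1.0 for the largest
    # Invert so that a higher amplitude gives a darker (smaller) grey value.
    return np.round(255 * (1.0 - scaled)).astype(np.uint8)

grey = amplitude_to_grey([0.0, 0.5, 2.0])
```

The largest amplitude maps to 0 (black) and the smallest to 255 (white), which matches the convention stated on the slide.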

[Figure: the grey-level spectra placed one after another, with axes Hz and Time.]

The time vs. frequency representation of a speech signal is referred to as a spectrogram.
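The construction above (one FFT per short frame, spectra stacked along time) could look roughly like this. It is a sketch assuming NumPy; the frame length, hop size and Hann window are illustrative choices that do not come from the slides.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Return FFT magnitudes with shape (n_frames, frame_len // 2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)  # tapering window; an assumption, not from the slides
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames and window each one.
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One spectrum per frame: rows are time, columns are frequency.
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: a 1 kHz tone sampled at 8 kHz peaks in the same frequency bin in every frame.
fs = 8000
t = np.arange(fs) / fs
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bins = spec.argmax(axis=1)
```

Each row of `spec` is one spectral vector; mapping the values to grey levels and displaying the transposed array gives the familiar picture with time on one axis and Hz on the other.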

Some Real Spectrograms


Dark regions indicate peaks (formants) in the spectrum


Why we care about spectrograms

Phones and their properties are better observed in a spectrogram.

Sounds can be identified much better by their formants and by the transitions between formants.

Hidden Markov Models implicitly model these spectrograms to perform speech recognition.

Usefulness of Spectrogram
Time-frequency representation of the speech signal
Spectrogram is a tool to study speech sounds (phones)
Phones and their properties are visually studied by phoneticians
Hidden Markov Models implicitly model spectrograms for speech-to-text systems
Useful for evaluation of text-to-speech systems
A high-quality text-to-speech system should produce synthesized speech whose spectrograms nearly match those of natural sentences.

Cepstral Analysis


A Sample Speech Spectrum


[Figure: a sample speech spectrum, dB against Frequency (Hz).]

Peaks denote dominant frequency components in the speech signal.
Peaks are referred to as formants.
Formants carry the identity of the sound.

What do we want to extract? The spectral envelope

Formants and a smooth curve connecting them.
This smooth curve is referred to as the spectral envelope.

[Figure: spectrum with its spectral envelope, dB against Frequency (Hz).]

Spectral Envelope

Spectrum: log X[k]
Spectral envelope: log H[k]
Spectral details: log E[k]

log X[k] = log H[k] + log E[k]

1. Our goal: we want to separate the spectral envelope and the spectral details from the spectrum.
2. That is, given log X[k], obtain log H[k] and log E[k] such that log X[k] = log H[k] + log E[k].

How do we achieve this separation?

Play a Mathematical Trick

Trick: take the FFT of the spectrum. Because the transform is applied to a spectrum rather than to a time signal, it is referred to as the inverse FFT (IFFT).
Note: we are dealing with the spectrum in the log domain (this is part of the trick).
The IFFT of the log spectrum represents the signal on a pseudo-frequency axis.

[Figure: the spectrum shown together with its spectral envelope and spectral details.]

[Figure: the IFFT maps the spectrum, the spectral envelope and the spectral details onto a pseudo-frequency axis with a low-frequency region and a high-frequency region.]

Treat the spectral envelope as a sine wave with 4 cycles per second. Its IFFT gives a peak at 4 Hz on the pseudo-frequency axis, in the low-frequency region.

Treat the spectral details as a sine wave with 100 cycles per second. Its IFFT gives a peak at 100 Hz on the pseudo-frequency axis, in the high-frequency region.


Taking the IFFT of log X[k] = log H[k] + log E[k] gives

x[k] = h[k] + e[k]

In practice, all you have access to is log X[k], and hence you can obtain x[k].
If you know x[k], filter the low-frequency region to get h[k].

x[k] is referred to as the cepstrum.
h[k] is obtained by considering the low-frequency region of x[k].
h[k] represents the spectral envelope and is widely used as a feature for speech recognition.
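A minimal sketch of this separation, assuming NumPy. The function names, the cutoff `n_keep=20` and the small offset inside the log are illustrative choices and are not taken from the slides; the input here is random noise standing in for a speech frame.

```python
import numpy as np

def cepstrum(frame):
    """Cepstrum x[k]: inverse FFT of the log magnitude spectrum of one frame."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)  # small offset avoids log(0)
    return np.fft.ifft(log_mag).real

def spectral_envelope(frame, n_keep=20):
    """Keep the low region of the cepstrum (h[k]) and map it back to log H[k]."""
    x = cepstrum(frame)
    h = np.zeros_like(x)
    h[:n_keep] = x[:n_keep]
    if n_keep > 1:
        # The cepstrum of a real frame is symmetric, so keep the mirrored half too.
        h[-(n_keep - 1):] = x[-(n_keep - 1):]
    return np.fft.fft(h).real  # smoothed log spectrum

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)  # stand-in for one windowed speech frame
env = spectral_envelope(frame, n_keep=20)
```

A smaller `n_keep` gives a smoother envelope; keeping every coefficient simply reproduces the original log spectrum.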

Cepstral Analysis
X[k] = H[k] E[k]
||X[k]|| = ||H[k]|| ||E[k]||, where ||.|| denotes magnitude
Taking the log on both sides:
log ||X[k]|| = log ||H[k]|| + log ||E[k]||
Taking the inverse FFT on both sides:
x[k] = h[k] + e[k]
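The derivation can be checked numerically. The sketch below assumes NumPy and uses two random sequences as stand-ins for the envelope part and the detail part (the names `h_sig` and `e_sig` are made up for this example); it forms a signal whose spectrum is the product of theirs and confirms that the cepstra add.

```python
import numpy as np

def cepstrum(signal):
    """Inverse FFT of the log magnitude spectrum."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(signal)))).real

rng = np.random.default_rng(0)
n = 64
h_sig = rng.standard_normal(n)  # stand-in for the envelope part
e_sig = rng.standard_normal(n)  # stand-in for the detail part

# Build a signal whose spectrum is the product X[k] = H[k] E[k].
x_sig = np.fft.ifft(np.fft.fft(h_sig) * np.fft.fft(e_sig)).real

lhs = cepstrum(x_sig)                    # x[k]
rhs = cepstrum(h_sig) + cepstrum(e_sig)  # h[k] + e[k]
```

Up to floating-point rounding, `lhs` and `rhs` agree element by element, which is the statement x[k] = h[k] + e[k].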

Mel-Frequency Analysis


Review: What we did


We captured the spectral envelope (the curve connecting all formants).
BUT: perceptual experiments indicate that the human ear concentrates on certain regions rather than using the whole of the spectral envelope.

[Figure: spectral envelope, dB against Frequency (Hz).]

Mel-Frequency Analysis
Mel-frequency analysis of speech is based on human perception experiments.
It is observed that the human ear acts as a filter:
It concentrates on only certain frequency components.

These filters are non-uniformly spaced on the frequency axis:
More filters in the low-frequency regions
Fewer filters in the high-frequency regions

Mel-Frequency Filters

[Figure: bank of mel-frequency filters along the frequency axis.]
More filters in the low-frequency region
Fewer filters in the high-frequency region
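One common way to obtain such a non-uniform spacing is to place triangular filters at equal steps on a mel scale. The sketch below assumes NumPy and the widely used conversion mel = 2595 log10(1 + f/700); that formula, the triangular shape, and the parameter values are assumptions for illustration and are not given in the slides.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=512, fs=16000):
    """Triangular filters, shape (n_filters, n_fft // 2 + 1), equally spaced in mel."""
    # n_filters + 2 edge points, equally spaced on the mel axis, converted back to Hz.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * fs / n_fft  # frequency of each FFT bin
    bank = np.zeros((n_filters, len(bin_hz)))
    for i in range(n_filters):
        left, centre, right = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - left) / (centre - left)
        falling = (right - bin_hz) / (right - centre)
        # Triangle: rises from left to centre, falls from centre to right, zero elsewhere.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank, edges_hz[1:-1]

bank, centres = mel_filterbank()
n_low = int((centres < 4000).sum())    # filters centred below 4 kHz
n_high = int((centres >= 4000).sum())  # filters centred at or above 4 kHz
```

With these settings most filter centres fall in the lower half of the 0 to 8 kHz range, and the gap between neighbouring centres grows with frequency.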

Mel-Frequency Cepstral Coefficients (MFCC)


Spectrum -> Mel-Filters -> Mel-Spectrum
Say log X[k] = log (Mel-Spectrum)
NOW perform cepstral analysis on log X[k]:
log X[k] = log H[k] + log E[k]
Taking the IFFT:
x[k] = h[k] + e[k]

Cepstral coefficients h[k] obtained for the Mel-Spectrum are referred to as Mel-Frequency Cepstral Coefficients, often denoted by *MFCC*.

[Figure: each spectrum in the sequence is passed through the Mel-Filters and then through cepstral analysis.]

Speech signal represented as a sequence of CEPSTRAL vectors

[Figure: each spectrum in the sequence is replaced by its cepstral vector.]

Why we are going to use MFCC

Speech synthesis
Used for joining two speech segments S1 and S2
Represent S1 as a sequence of MFCC
Represent S2 as a sequence of MFCC
Join at the point where the MFCCs of S1 and S2 have minimal Euclidean distance

Speech recognition
MFCCs are the most widely used features in state-of-the-art speech recognition systems
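The join-point rule for synthesis can be sketched as a search over frame pairs. This assumes NumPy and an exhaustive search over every pair of frames; the slides do not say which frames are candidates, and the function name `best_join` and the toy two-dimensional vectors are made up for this example.

```python
import numpy as np

def best_join(mfcc_s1, mfcc_s2):
    """Return (i, j, distance) for the closest pair of frames, one from each segment."""
    a = np.asarray(mfcc_s1, dtype=float)  # shape (n_frames_1, n_coeffs)
    b = np.asarray(mfcc_s2, dtype=float)  # shape (n_frames_2, n_coeffs)
    # Pairwise Euclidean distances between every frame of S1 and every frame of S2.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    i, j = np.unravel_index(dist.argmin(), dist.shape)
    return int(i), int(j), float(dist[i, j])

# Toy "MFCC" vectors with two coefficients per frame.
s1 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
s2 = [[5.0, 5.0], [2.0, 2.5], [9.0, 9.0]]
i, j, d = best_join(s1, s2)
```

Here frame 2 of S1 and frame 1 of S2 are the closest pair, so the join would be made there.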

Summary: Process of Feature Extraction


Speech is analyzed over short analysis windows
For each short analysis window, a spectrum is obtained using the FFT
The spectrum is passed through Mel-Filters to obtain the Mel-Spectrum
Cepstral analysis is performed on the Mel-Spectrum to obtain Mel-Frequency Cepstral Coefficients
Thus speech is represented as a sequence of cepstral vectors
It is these cepstral vectors that are given to pattern classifiers for speech recognition
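The whole process can be put together in one short sketch. It assumes NumPy, a 16 kHz sampling rate, 25 ms windows with a 10 ms hop, a Hamming window, 26 triangular filters spaced with the common formula mel = 2595 log10(1 + f/700), and 13 retained coefficients; none of these values come from the slides. The cepstral step follows the slides in using an inverse FFT, whereas many toolkits use a DCT at this point.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the mel axis (assumed mel formula)."""
    mel_max = 2595.0 * np.log10(1.0 + (fs / 2.0) / 700.0)
    mel_points = np.linspace(0.0, mel_max, n_filters + 2)
    edges = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)  # filter edges in Hz
    bins = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((n_filters, len(bins)))
    for i in range(n_filters):
        up = (bins - edges[i]) / (edges[i + 1] - edges[i])
        down = (edges[i + 2] - bins) / (edges[i + 2] - edges[i + 1])
        bank[i] = np.clip(np.minimum(up, down), 0.0, None)
    return bank

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    """Return one cepstral vector per analysis window, shape (n_frames, n_ceps)."""
    signal = np.asarray(signal, dtype=float)
    bank = mel_filterbank(n_filters, n_fft, fs)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    out = np.empty((n_frames, n_ceps))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame, n_fft))  # spectrum via FFT
        mel_spectrum = bank @ spectrum                # pass through the mel filters
        log_mel = np.log(mel_spectrum + 1e-10)        # log domain; offset avoids log(0)
        # Inverse FFT of the log mel-spectrum, treated as one half of a symmetric spectrum.
        cep = np.fft.irfft(log_mel)
        out[t] = cep[:n_ceps]                         # keep the low region
    return out

rng = np.random.default_rng(0)
features = mfcc(rng.standard_normal(16000))  # one second of noise at 16 kHz
```

Each row of `features` is one cepstral vector, so the signal ends up as a sequence of such vectors, ready to be handed to a classifier.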

Additional Reading
Chapter 6
Pg: 273-281
Pg: 304-311
Pg: 314-316
