Speech Analysis

by

Dr Philip Jackson

lecturer in speech & audio

Department of Electronic Engineering.

http://www.ee.surrey.ac.uk/Teaching/Courses/eem.ssr

What’s the point of analysing speech?

• Speech analysis, or speech processing,

transforms a speech waveform into a

representation that is suitable for

extracting its features:

• Human visual inspection

– e.g., by a speech scientist, speech therapist,

or forensic phonetician

• Computer analysis

– e.g., for automatic speech recognition,

speaker recognition, or paralinguistic

processing

And what does that mean?

• Suitable could be:

– amenable to human visual inspection

– using a small number of bits per

second (for transmission or storage)

– compatible with the models in a

speech recognizer

– in line with our understanding of

human auditory processing

Cochlear section

• Cochlea, or inner

ear, has a spiral

form:

– vestibular canal

– basilar

membrane

– tympanic canal

– auditory nerve

Response of the cochlea

Basilar membrane

• sound travels along the basilar membrane

• the membrane vibrates at the position matching each frequency

• this activates the auditory nerve fibres at that position

Short-term spectrum

• Represents the distribution of power with respect to frequency over a time interval centred at time t, like a vertical slice through the spectrogram

• From a source-filter perspective, it gives

us some information about the shape of

the vocal tract at time t

• From a human speech perception view,

it provides similar information to that

sent from the cochlea to the auditory

nerve

Computing the ST-spectrum

• Analogue-to-Digital (A/D) conversion

– convert the analogue signal from the

microphone into a digital signal

• Windowing

– select a short section of speech,

centred at time t, and smooth

• Frequency analysis

– estimate the distribution of power with

respect to frequency
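The windowing and frequency-analysis steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes the signal has already been digitised, and the function name `short_term_spectrum`, the 25 ms default window and the choice of a Hamming window are assumptions made for the example.

```python
import numpy as np

def short_term_spectrum(x, sample_rate, centre_time, window_duration=0.025):
    """Power spectrum of a short section of x centred at centre_time (seconds)."""
    x = np.asarray(x, dtype=float)
    n_win = int(round(window_duration * sample_rate))   # window length in samples
    centre = int(round(centre_time * sample_rate))      # centre sample index
    start = centre - n_win // 2
    if start < 0 or start + n_win > len(x):
        raise ValueError("window extends beyond the signal")
    # Windowing: select the section and taper its edges (Hamming is an assumed choice)
    frame = x[start:start + n_win] * np.hamming(n_win)
    # Frequency analysis: DFT of the real frame, then modulus squared
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(n_win, d=1.0 / sample_rate)
    return freqs, power

# Usage: a 1 kHz tone sampled at 8 kHz should peak at 1 kHz
t = np.arange(8000) / 8000.0
freqs, power = short_term_spectrum(np.sin(2 * np.pi * 1000.0 * t), 8000, 0.5)
peak_hz = freqs[np.argmax(power)]
```

Taking `10 * np.log10(power)` of the result would give a log-power spectrum in decibels.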

Waterfall display

Speech spectrogram

Derived formant tracks

A/D conversion

• Sampling measures the speech signal at regular intervals, indexed by n

• Quantisation encodes each sample $x_n$ with a discrete value
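As a rough sketch of these two steps, the code below samples a continuous-time function at regular intervals and then maps each sample to one of 2^b discrete codes. The function name `sample_and_quantise`, the assumption that the signal lies in the range -1 to 1, and the uniform quantiser are illustrative choices, not part of the original slides.

```python
import numpy as np

def sample_and_quantise(signal_fn, duration, sample_rate, n_bits):
    """Sample signal_fn(t) at sample_rate Hz and quantise uniformly to n_bits."""
    n = np.arange(int(round(duration * sample_rate)))   # sample index n
    x = signal_fn(n / sample_rate)                      # sampling: x_n = x(n / f_S)
    levels = 2 ** n_bits
    step = 2.0 / levels                                 # assumes x lies in [-1, 1]
    # Quantisation: integer code for each sample, clipped to the valid range
    codes = np.clip(np.floor((x + 1.0) / step), 0, levels - 1).astype(int)
    x_hat = (codes + 0.5) * step - 1.0                  # value each code represents
    return codes, x_hat

# Usage: 10 ms of a 440 Hz tone, 8 kHz sample rate, 8 bits per sample
codes, x_hat = sample_and_quantise(
    lambda t: np.sin(2 * np.pi * 440.0 * t), 0.01, 8000, 8)
```

With this quantiser, the reconstruction error of each sample is at most half a quantisation step.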

Sample rate

• Nyquist’s theorem: a signal band-limited to B Hz must be sampled at a rate of at least 2B samples per second to be encoded faithfully

• Human ear is sensitive up to about 20 kHz (hence the 44.1 kHz rate for CDs)

• But for speech:

– high-quality needs 10 kHz bandwidth, i.e., 20

kHz sample rate

– bandwidth can be reduced to ~4 kHz (8 kHz

rate), for telephone quality

– e.g., 8-bit PCM at 8 kHz = 8 × 8000 bits per second = 64 kbps
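The arithmetic on this slide can be checked with a few lines of code. The helper names `nyquist_rate` and `pcm_bit_rate` are hypothetical and exist only for this illustration.

```python
def nyquist_rate(bandwidth_hz):
    """Minimum sample rate (samples per second) for a signal band-limited to bandwidth_hz."""
    return 2 * bandwidth_hz

def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed PCM in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

telephone_rate = nyquist_rate(4000)             # 4 kHz bandwidth
telephone_bps = pcm_bit_rate(telephone_rate, 8)  # 8-bit PCM
high_quality_rate = nyquist_rate(10000)         # 10 kHz bandwidth
```

Here `telephone_bps` corresponds to the 64 kbps figure quoted for telephone-quality speech.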

CD-quality: $f_S = 44.1$ kHz

High-quality speech: $f_S = 20$ kHz

Telephone speech: $f_S = 8$ kHz

Window functions

Frequency analysis

• Discrete Fourier Transform (DFT) is applied to the windowed digital waveform $\{x(n) : n = 0, \ldots, N-1\}$.

• With an N-sample window, an N-point complex spectrum is obtained, $\{X(k) : k = 0, \ldots, N-1\}$.

• The modulus squared gives the power spectrum, $|X(k)|^2$

• The logarithm gives the log-power spectrum, $\log |X(k)|^2$

Discrete Fourier transform

• over a finite period of time

• sampled at regular intervals

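Forward transform:

$$X(k) = \sum_{n=0}^{N-1} x(n) \left[ \cos\left(\frac{2\pi k n}{N}\right) - j \sin\left(\frac{2\pi k n}{N}\right) \right]$$

Inverse transform:

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) \left[ \cos\left(\frac{2\pi k n}{N}\right) + j \sin\left(\frac{2\pi k n}{N}\right) \right]$$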
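To make the two transforms concrete, here is a direct (and deliberately slow) evaluation of the sums, written with explicit cosine and sine terms. The function names `dft` and `idft` are chosen for this sketch; in practice a fast Fourier transform such as `numpy.fft.fft` computes the same result far more efficiently.

```python
import numpy as np

def dft(x):
    """Forward DFT evaluated directly from its definition."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)                   # time index, as a row
    k = n.reshape(-1, 1)               # frequency index, as a column
    angle = 2.0 * np.pi * k * n / N
    kernel = np.cos(angle) - 1j * np.sin(angle)
    return kernel @ x                  # sum over n for every k

def idft(X):
    """Inverse DFT evaluated directly from its definition."""
    X = np.asarray(X, dtype=complex)
    N = len(X)
    k = np.arange(N)                   # frequency index, as a row
    n = k.reshape(-1, 1)               # time index, as a column
    angle = 2.0 * np.pi * k * n / N
    kernel = np.cos(angle) + 1j * np.sin(angle)
    return (kernel @ X) / N            # sum over k for every n

x_demo = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
X_demo = dft(x_demo)
x_back = idft(X_demo)                  # recovers x_demo up to rounding error
```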

Frequency analysis

• Alternative methods include:

– filter-bank analysis (based on a set of

band-pass filters)

– approximations of the spectral

envelope, e.g., Linear predictive

coding (LPC)
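As one illustration of the second alternative, the sketch below estimates linear-prediction coefficients with the autocorrelation method. It assumes the predictor form x(n) ≈ a1 x(n-1) + ... + ap x(n-p) and solves the normal equations with a general linear solver instead of the Levinson-Durbin recursion normally used; the function name `lpc_autocorrelation` is hypothetical.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Estimate predictor coefficients a_1..a_p by the autocorrelation method."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation of the frame at lags 0..order
    r = np.array([np.dot(frame[:len(frame) - lag], frame[lag:])
                  for lag in range(order + 1)])
    # Toeplitz matrix of autocorrelations, R[i, j] = r[|i - j|]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

# Usage: impulse response of a known two-pole filter,
# h(n) = 1.3 h(n-1) - 0.8 h(n-2) + delta(n)
h = np.zeros(2000)
h[0] = 1.0
h[1] = 1.3
for i in range(2, len(h)):
    h[i] = 1.3 * h[i - 1] - 0.8 * h[i - 2]
coeffs = lpc_autocorrelation(h, 2)     # close to [1.3, -0.8]
```

The recovered coefficients define an all-pole filter whose frequency response approximates the spectral envelope.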

Time-frequency resolution 1

• If the window is long then

– the time resolution is poor

– the number of points, N, is large

– there are N points in the spectrum

– so there is fine frequency resolution

– narrow-band frequency analysis, or

narrow-band spectrum

Narrow-band spectrum

Time-frequency resolution 2

• If the window is short then

– the time resolution is good

– the number of points, N, is small

– there are N points in the spectrum

– so the frequency resolution is coarse

– broad-band frequency analysis, or

broad-band spectrum

Wide-band spectrum

Time-frequency resolution 3

• In summary:

– long window, narrow-band spectrum;

– short window, broad-band spectrum.

• Indeed, the bandwidth-time product cannot fall below a half:

$$BT \geq \frac{1}{2}$$

where $T = N / f_S$ is the window duration and $f_S$ is the sample rate
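The trade-off can be seen numerically by comparing window duration with the spacing between DFT bins for a short and a long window. The function name `window_resolution` and the example window lengths are assumptions for this sketch, and bin spacing is used here as one simple measure of frequency resolution.

```python
def window_resolution(n_samples, sample_rate):
    """Return (window duration in seconds, DFT bin spacing in Hz)."""
    duration = n_samples / sample_rate       # T = N / f_S
    bin_spacing = sample_rate / n_samples    # spacing between the N spectral points
    return duration, bin_spacing

short_T, short_df = window_resolution(80, 8000)    # 10 ms window, 100 Hz spacing
long_T, long_df = window_resolution(400, 8000)     # 50 ms window, 20 Hz spacing
```

The short window localises events in time but spaces its spectral points widely; the long window does the opposite.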

Wide-band and narrow-band spectrograms

Mel-frequency filter bank

• Allocation of DFT bins to filters,

spaced according to the Mel scale:
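A minimal sketch of such an allocation is given below. It assumes the widely used mapping mel = 2595 log10(1 + f / 700) and triangular filters whose edges are equally spaced on the Mel scale; the slides do not state which variant is intended, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Matrix of triangular weights, one row per filter, one column per DFT bin."""
    # Filter edges equally spaced in Mel between 0 Hz and the Nyquist frequency
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        lo, centre, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (centre - lo)
        falling = (hi - bin_freqs) / (hi - centre)
        # Triangle: rises from lo to centre, falls from centre to hi, zero elsewhere
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filter_bank(20, 512, 8000)
```

Multiplying `bank` by a power spectrum of matching length gives one energy value per filter; the higher filters cover more DFT bins than the lower ones.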

The real cepstrum

• Procedure for computing cepstral

coefficients from the magnitude

spectrum:
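One common formulation of this procedure is: take the DFT of the frame, take the logarithm of the magnitude spectrum, then apply the inverse DFT. The sketch below follows that formulation; the small floor added before the logarithm and the function name `real_cepstrum` are assumptions made for the example.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(np.asarray(frame, dtype=float))
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)   # floor avoids log(0)
    return np.fft.ifft(log_magnitude).real

# Usage: a scaled unit impulse has a flat magnitude spectrum,
# so only the zeroth cepstral coefficient is non-zero
impulse = np.zeros(64)
impulse[0] = 2.0
cepstrum = real_cepstrum(impulse)
```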

Mel-frequency cepstrum

• Procedure for computing cepstral

coefficients, based on the output

from Mel-frequency binning:
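A typical version of this procedure takes the logarithm of the Mel filter-bank energies and applies a discrete cosine transform. The sketch below assumes an unnormalised DCT-II and a small floor inside the logarithm; both choices, and the function name `mfcc_from_mel_energies`, are illustrative assumptions.

```python
import numpy as np

def mfcc_from_mel_energies(mel_energies, n_coeffs):
    """Cepstral coefficients from Mel filter-bank energies (log, then DCT-II)."""
    log_energies = np.log(np.asarray(mel_energies, dtype=float) + 1e-12)
    M = len(log_energies)
    m = np.arange(M)                          # filter index
    k = np.arange(n_coeffs).reshape(-1, 1)    # cepstral index
    basis = np.cos(np.pi * k * (m + 0.5) / M)
    return basis @ log_energies

# Usage: equal energy in every filter leaves only the zeroth coefficient non-zero
flat = np.full(20, np.e)
mfcc = mfcc_from_mel_energies(flat, 13)
```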

Summary of Fourier analysis

• Fourier analysis leads to a frequency representation

– good for visualisation

– is reversible

– continuous and discrete time forms

• Wide- and narrow-band spectra obtained

by adjusting frame size

• Windowing

– reduces spectral smearing

– allows for adaptation
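The effect of windowing on spectral smearing can be illustrated by measuring how much of a tone's power leaks away from its spectral peak with and without a tapered window. The tone frequency, window length, guard width and function name `leakage_fraction` below are arbitrary choices for this sketch.

```python
import numpy as np

def leakage_fraction(window, tone_hz=1010.0, sample_rate=8000.0, guard_bins=5):
    """Fraction of power lying more than guard_bins away from the spectral peak."""
    n = np.arange(len(window))
    frame = np.sin(2.0 * np.pi * tone_hz * n / sample_rate) * window
    power = np.abs(np.fft.rfft(frame)) ** 2
    peak = int(np.argmax(power))
    lo = max(peak - guard_bins, 0)
    hi = peak + guard_bins + 1
    return 1.0 - power[lo:hi].sum() / power.sum()

rect_leak = leakage_fraction(np.ones(256))       # no tapering (rectangular window)
hann_leak = leakage_fraction(np.hanning(256))    # tapered (Hann) window
```

With the tone placed between DFT bins, the tapered window should leave less power far from the peak than the rectangular one.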

