

ssr: Speaker & Speech Recognition

Speech Analysis

Dr Philip Jackson
lecturer in speech & audio

Centre for Vision, Speech & Signal Processing,

Department of Electronic Engineering.
What’s the point of analysing speech?
• Speech analysis, or speech processing,
transforms a speech waveform into a
representation that is suitable for
extracting its features:
• Human visual inspection
– e.g., by a speech scientist, speech therapist,
or forensic phonetician
• Computer analysis
– e.g., for automatic speech recognition,
speaker recognition, or paralinguistic
And what does that mean?
• Suitable could be:
– amenable to human visual inspection
– using a small number of bits per
second (for transmission or storage)
– compatible with the models in a
speech recognizer
– in line with our understanding of
human auditory processing
Cochlear section
• Cochlea, or inner
ear, has a spiral
– vestibular canal
– basilar membrane
– tympanic canal
– auditory nerve
Response of the cochlea
Basilar membrane

• sound enters at the stapes

• travels along the basilar membrane
• the membrane vibrates at the position matching the sound's frequency
• activates auditory nerves
Short-term spectrum
• Represents the distribution of power
with respect to frequency over a time
interval centred at time, t, like a vertical
slice through the spectrogram
• From a source-filter perspective, it gives
us some information about the shape of
the vocal tract at time t
• From a human speech perception view,
it provides similar information to that
sent from the cochlea to the auditory
Computing the ST-spectrum
• Analogue-to-Digital (A/D)
– convert the analogue signal from the
microphone into a digital signal
• Windowing
– select a short section of speech,
centred at time t, and smooth
• Frequency analysis
– estimate the distribution of power with
respect to frequency
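The windowing and frequency-analysis steps above can be sketched as follows, taking an already digitised signal as input. This is a minimal illustration, not the lecture's own implementation: the function name, the 25 ms Hamming window and the use of NumPy's FFT are all assumptions.

```python
import numpy as np

def short_term_log_power_spectrum(x, fs, t_centre, win_dur=0.025):
    """Log-power spectrum (dB) of a frame of x centred at t_centre seconds.

    Hypothetical helper: assumes the frame lies fully inside x.
    """
    n_win = int(round(win_dur * fs))                  # window length in samples
    start = int(round(t_centre * fs)) - n_win // 2    # first sample of the frame
    frame = x[start:start + n_win]                    # select a short section
    frame = frame * np.hamming(n_win)                 # taper (smooth) the frame edges
    spectrum = np.fft.rfft(frame)                     # DFT of the real-valued frame
    power = np.abs(spectrum) ** 2                     # power spectrum
    freqs = np.fft.rfftfreq(n_win, d=1.0 / fs)        # frequency of each DFT point (Hz)
    return freqs, 10.0 * np.log10(power + 1e-12)      # small offset guards against log(0)

# Usage: a 1 kHz sine sampled at 8 kHz should peak near 1 kHz.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)
freqs, log_power = short_term_log_power_spectrum(x, fs, t_centre=0.5)
peak_hz = freqs[np.argmax(log_power)]
```

Repeating this for a sequence of centre times t and stacking the results side by side gives a spectrogram.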
Waterfall display
Speech spectrogram
Derived formant tracks
A/D conversion
• Sampling measures the speech signal at
regular intervals, indexed by n
• Quantisation encodes each sample x(n) as
a discrete value
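As an illustration of the two steps, the sketch below samples a sine wave and quantises it uniformly. The 8-bit word length, the 440 Hz test tone, the amplitude range of [-1, 1) and the function name are assumptions chosen for the example.

```python
import numpy as np

def quantise(x, n_bits):
    """Uniformly quantise amplitudes in [-1, 1) to signed integer codes."""
    levels = 2 ** (n_bits - 1)                    # 128 for 8 bits
    codes = np.round(x * levels).astype(int)      # nearest quantisation level
    return np.clip(codes, -levels, levels - 1)    # keep codes within the word length

fs = 8000                                         # sample rate in Hz (assumed)
n = np.arange(16)                                 # sample indices
x = 0.9 * np.sin(2 * np.pi * 440 * n / fs)        # sampled signal x(n)
codes = quantise(x, n_bits=8)                     # discrete values
x_hat = codes / 2 ** 7                            # amplitudes recovered from the codes
max_error = np.max(np.abs(x - x_hat))             # quantisation error
```

Provided the signal does not clip, the error is at most half a quantisation step, here 0.5/128.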


Sample rate
• Nyquist’s theorem: a signal band-
limited to B Hz needs a rate of at least 2B
samples per second to be encoded
faithfully
• The human ear is sensitive up to about 20 kHz
(hence the 44.1 kHz rate for CDs)
• But for speech:
– high-quality needs 10 kHz bandwidth, i.e., 20
kHz sample rate
– bandwidth can be reduced to ~4 kHz (8 kHz
rate), for telephone quality
– e.g., 8-bit PCM at 8 kHz = 64 kbps
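The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and a single channel is assumed.

```python
def min_sample_rate(bandwidth_hz):
    """Minimum sample rate (Hz) for a signal band-limited to bandwidth_hz."""
    return 2 * bandwidth_hz

def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate (bits per second) of uncompressed PCM."""
    return sample_rate_hz * bits_per_sample * channels

telephone_rate = min_sample_rate(4000)            # 8000 Hz
telephone_bps = pcm_bit_rate(telephone_rate, 8)   # 64000 bits/s = 64 kbps
high_quality_rate = min_sample_rate(10000)        # 20000 Hz
```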
CD-quality: f_S = 44.1 kHz
High-quality speech: f_S = 20 kHz
Telephone speech: f_S = 8 kHz
Window functions
Frequency analysis
• Discrete Fourier Transform (DFT) is
applied to the windowed digital waveform
• With an N-sample window, an N-point
complex spectrum is obtained, {X(k):
k = 0, ..., N−1}
• The modulus squared gives the power
spectrum, |X(k)|^2
• The logarithm gives the log-power
spectrum, log |X(k)|^2
Discrete Fourier transform
• over a finite period of time
• sampled at regular intervals

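Forward transform:

\[
X(k) = \sum_{n=0}^{N-1} x(n) \left[ \cos\left(\frac{2\pi kn}{N}\right) - j \sin\left(\frac{2\pi kn}{N}\right) \right]
     = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi kn / N}
\]

Inverse transform:

\[
x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) \left[ \cos\left(\frac{2\pi kn}{N}\right) + j \sin\left(\frac{2\pi kn}{N}\right) \right]
     = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{+j 2\pi kn / N}
\]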
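The forward and inverse transforms can be written directly from their cosine and sine sums, as in the sketch below. It is for illustration only: the function names are hypothetical, the 1/N factor is placed on the inverse, and in practice a fast Fourier transform such as NumPy's np.fft.fft gives the same result far more efficiently.

```python
import numpy as np

def dft(x):
    """Forward DFT: X(k) = sum_n x(n) [cos(2*pi*k*n/N) - j sin(2*pi*k*n/N)]."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)                      # one row per output index k
    angle = 2 * np.pi * k * n / N
    return (np.cos(angle) - 1j * np.sin(angle)) @ x

def idft(X):
    """Inverse DFT: x(n) = (1/N) sum_k X(k) [cos(2*pi*k*n/N) + j sin(2*pi*k*n/N)]."""
    N = len(X)
    k = np.arange(N)
    n = k.reshape(-1, 1)                      # one row per output index n
    angle = 2 * np.pi * n * k / N
    return ((np.cos(angle) + 1j * np.sin(angle)) @ X) / N

x = np.array([1.0, 2.0, 0.0, -1.0])
X = dft(x)            # complex spectrum, N points
x_back = idft(X)      # recovers x, up to rounding error
```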
Frequency analysis
• Alternative methods include:
– filter-bank analysis (based on a set of
band-pass filters)
– approximations of the spectral
envelope, e.g., Linear predictive
coding (LPC)
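As one example of a spectral-envelope method, the sketch below estimates linear-prediction coefficients with the autocorrelation method, solving the normal equations directly. The function name, the model order and the test signal are assumptions; practical implementations usually apply a window first and use the Levinson-Durbin recursion instead of a general linear solve.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Predictor coefficients a[1..order] with x(n) ~ sum_i a[i] * x(n - i)."""
    # Autocorrelation of the frame at lags 0..order
    r = np.array([np.dot(frame[:len(frame) - lag], frame[lag:])
                  for lag in range(order + 1)])
    # Toeplitz matrix of autocorrelations, R[i, j] = r(|i - j|)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Test signal: impulse response of a known all-pole (second-order) filter
true_a = np.array([1.3, -0.6])
h = np.zeros(200)
h[0] = 1.0
h[1] = true_a[0] * h[0]
for n in range(2, len(h)):
    h[n] = true_a[0] * h[n - 1] + true_a[1] * h[n - 2]

est_a = lpc_autocorrelation(h, order=2)   # should be close to true_a
```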
Time-frequency resolution 1
• If the window is long then
– the time resolution is poor
– the number of points, N, is large
– there are N points in the spectrum
– so there is fine frequency resolution
– narrow-band frequency analysis, or
narrow-band spectrum
Narrow-band spectrum
Time-frequency resolution 2
• If the window is short then
– the time resolution is good
– the number of points, N, is small
– there are N points in the spectrum
– so the frequency resolution is coarse
– broad-band frequency analysis, or
broad-band spectrum
Wide-band spectrum
Time-frequency resolution 3
• In summary:
– long window, narrow-band spectrum;
– short window, broad-band spectrum.
• Indeed, the bandwidth-time product
cannot exceed a half:
\[ BT \le \frac{1}{2} \]
where T = N / f_S and f_S is the
sample rate
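The trade-off can be made concrete by computing the window duration T = N / f_S and the spacing between DFT points, f_S / N, for a long and a short window. The 16 kHz sample rate and the window lengths are assumptions chosen for illustration, and the effective frequency resolution also depends on the shape of the window function.

```python
def window_resolution(n_samples, fs):
    """Return (duration in seconds, DFT point spacing in Hz) for an N-sample window."""
    duration = n_samples / fs        # T = N / f_S
    bin_spacing = fs / n_samples     # spacing between adjacent DFT points
    return duration, bin_spacing

fs = 16000
long_T, long_df = window_resolution(512, fs)     # long window: narrow-band analysis
short_T, short_df = window_resolution(64, fs)    # short window: broad-band analysis
```

Here the long window spans 32 ms with points every 31.25 Hz, while the short window spans 4 ms with points every 250 Hz.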
Wide-band and narrow-band spectrograms
Mel-frequency filter bank
• Allocation of DFT bins to filters,
spaced according to the Mel scale:
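One common way to build such a filter bank is sketched below. The slides do not give a formula for the Mel scale, so the mapping m = 2595 log10(1 + f/700), the triangular filter shape, the function names and the parameter values are all assumptions.

```python
import numpy as np

def hz_to_mel(f):
    """Assumed Mel mapping: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, fs):
    """Triangular filter weights, shape (n_filters, n_fft // 2 + 1)."""
    # Filter edges equally spaced on the Mel scale from 0 Hz to fs / 2
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    # Map each edge to the index of a DFT bin
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bin_edges[m - 1], bin_edges[m], bin_edges[m + 1]
        for k in range(lo, mid):                  # rising slope
            bank[m - 1, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):                  # falling slope, peak of 1 at mid
            bank[m - 1, k] = (hi - k) / (hi - mid)
    return bank

bank = mel_filter_bank(n_filters=20, n_fft=512, fs=16000)
```

Multiplying this matrix by a power spectrum of n_fft // 2 + 1 points gives one energy per filter. The filters are narrow at low frequencies and widen towards high frequencies.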
The real cepstrum
• Procedure for computing cepstral
coefficients from the magnitude
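A minimal sketch of the usual definition of the real cepstrum, the inverse DFT of the log-magnitude spectrum, is given below. The function name, the test frame and the small offset inside the logarithm are assumptions.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum of the frame."""
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)   # offset guards against log(0)
    return np.fft.ifft(log_magnitude).real             # imaginary part is rounding error

# Usage on a windowed test frame (assumed: 256 samples, Hamming window)
n = np.arange(256)
frame = np.hamming(256) * np.sin(2 * np.pi * 0.05 * n)
cepstrum = real_cepstrum(frame)
```

For a real-valued frame the log-magnitude spectrum is symmetric, so the cepstrum is real and symmetric too, and only its first half needs to be kept.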
Mel-frequency cepstrum
• Procedure for computing cepstral
coefficients, based on the output
from Mel-frequency binning:
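A common form of this procedure takes the logarithm of the Mel filter-bank energies and applies a discrete cosine transform (DCT). The sketch below assumes the filter-bank energies are already available, and uses an unnormalised DCT-II and 13 coefficients. Those choices and the function names are assumptions, not details taken from the slides.

```python
import numpy as np

def dct_ii(values, n_coeffs):
    """Unnormalised DCT-II: c(q) = sum_m v(m) cos(pi * q * (m + 0.5) / M)."""
    M = len(values)
    m = np.arange(M)
    return np.array([np.sum(values * np.cos(np.pi * q * (m + 0.5) / M))
                     for q in range(n_coeffs)])

def mfcc_from_filter_energies(energies, n_coeffs=13):
    """Cepstral coefficients from Mel filter-bank energies (one per filter)."""
    log_energies = np.log(np.asarray(energies, dtype=float) + 1e-12)
    return dct_ii(log_energies, n_coeffs)

# Usage: a flat set of 20 filter energies, each with log energy equal to 1
energies = np.full(20, np.e)
coeffs = mfcc_from_filter_energies(energies)
```

For a flat spectrum all of the energy appears in the zeroth coefficient and the rest are zero, which reflects how the higher coefficients describe the shape of the spectral envelope.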
Summary of Fourier analysis
• Fourier analysis leads to a frequency representation
– good for visualisation
– is reversible
– has continuous and discrete time forms
• Wide- and narrow-band spectra are obtained
by adjusting the frame size
• Windowing
– reduces spectral smearing
– allows for adaptation