
Speech Signal Analysis and Coding

Dr. Arun Kumar

Centre for Applied Research in Electronics (CARE), IIT Delhi

arunkm@care.iitd.ernet.in
Contents
• Speech Processing Applications

• Speech Signal Understanding


– Speech Production
– Speech Signal Characteristics and Analysis

• Speech Coding
– Coding Standards
– Coder Attributes including Quality Evaluation
– Coding Methodologies
Speech Processing Applications

• Speech Transmission
– Trunk-line telephony
– Wireless telephony
• Speech Storage
– Voice Mail, Voice Memo, Answering machines
• Speech Synthesis
– Text-to-speech synthesis
– Automatic information services
Speech Processing Applications

• Speaker Verification and Identification


– Phone banking
– Secure entry
• Aids for the Handicapped
– Variable rate playback
– Hearing aids
– Reading machine for visually impaired
– Visual display of speech information for hearing impaired
Speech Processing Applications

• Speech Enhancement
– Echo and noise cancellation
• Speech Recognition
– Automatic language translation
• Voice Personality Transformation
– Voice conversion from “source” to “target”
The Speech Signal

“It is the variation of pressure, from atmospheric pressure, as a function of time, caused by traveling waves from the speaker’s mouth (apart from nostrils, cheeks and throat).”
The Intensity Level of Speech

Units:
SPL (Sound Pressure Level) in dB
relative to a reference level.

Reference: $10^{-16}$ W/cm$^2$


- Corresponds to ‘just barely audible’
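As a minimal sketch of how a level in dB follows from the reference above, the snippet below converts an acoustic intensity in W/cm² into decibels relative to $10^{-16}$ W/cm². The function name `intensity_level_db` and the example intensities are illustrative assumptions, not taken from the slides.

```python
import math

I_REF = 1e-16  # reference intensity in W/cm^2 ("just barely audible")

def intensity_level_db(intensity_w_per_cm2):
    """Level in dB relative to I_REF: 10 * log10(I / I_REF)."""
    return 10.0 * math.log10(intensity_w_per_cm2 / I_REF)

# Hypothetical example intensities (W/cm^2), chosen only for illustration.
for intensity in (1e-16, 1e-10, 1e-4):
    print(intensity, intensity_level_db(intensity))
```

Each factor of 10 in intensity adds 10 dB, so an intensity equal to the reference maps to 0 dB.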
The Intensity Level of Speech
[Figure: intensity scale in dB]
120 dB: Airplane
100 dB: Rock concert
80 dB: Heavy traffic
55 to 70 dB: Variations in normal voice level (1 meter distance from mouth)
20 dB: Whisper
0 dB: Just barely audible
The Intensity Level of Speech

• Energy of speech during 1 s


– $2 \times 10^{-5}$ Joules
(It takes 100 Joules to light a 100 W bulb for 1 s)
• Strongest vowel: /a/ as in “talk”
• Weakest vowel: /i/ as in “see”
• Strongest consonant: /r/ as in “run”
• Weakest consonant: /θ/ as in “thin”
Speech & Audio Signal Specs.

Signal Category         Audio Bandwidth (Hz)   Sampling Rate (kHz)   Source Rate (kbps)
Telephone Band Speech   300-3400               8.0                   128
Wideband Speech         50-7000                16.0                  256
Wideband Audio          20-20,000              44.1/48.0             705/768
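The source rates in this table are consistent with 16-bit, single-channel linear PCM. The sketch below reproduces them under that assumption. The helper name `source_rate_kbps` is hypothetical, and the 16 bits/sample and mono defaults are assumptions inferred from the numbers.

```python
def source_rate_kbps(sampling_rate_khz, bits_per_sample=16, channels=1):
    """Uncompressed bit rate in kbps = sampling rate (kHz) x bits/sample x channels."""
    return sampling_rate_khz * bits_per_sample * channels

# Sampling rates from the table; bit depth and channel count are assumed.
for label, fs_khz in [("Telephone band speech", 8.0),
                      ("Wideband speech", 16.0),
                      ("Wideband audio (44.1 kHz)", 44.1),
                      ("Wideband audio (48 kHz)", 48.0)]:
    print(label, source_rate_kbps(fs_khz))
```

At 44.1 kHz the product is 705.6 kbps, which the table lists as 705.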
Speech Articulation by the Vocal System

Reproduced from: D. O’Shaughnessy, Human and machine speech communication, IEEE Press, 2000
Speech Classes by Articulation

• Voiced speech

• Unvoiced speech

• Transient (stop) sounds


Acoustic Analysis of Speech
The relationship between speech sounds (phonemes) and their acoustic realizations

– Waveform

– Spectrum

– Spectrogram
Time Waveform of a Speech Sentence
[Figure: time waveform of the sentence “THIS IS GOOD”; Amplitude (-1 to 0.8) versus Time (s), 0 to 1.4 s, with the segments TH, i, s, i, s, G, O, D labeled along the time axis]
Waveform Analysis of Speech
• Vowels
– High energy, periodic, steady state utterance
• Unvoiced fricatives
– Low energy, noise-like, steady-state utterance
• Voiced fricatives
– Low energy, element of periodicity, steady-state utterance
• Stops
– Transient release, medium to low energy
• Nasals
– Low-to-medium energy, periodic, steady-state utterance
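Two simple frame-level measures make these waveform classes concrete: short-time energy (high for vowels, low for fricatives) and zero-crossing rate (low for periodic sounds, high for noise-like sounds). The sketch below uses synthetic stand-ins, a 150 Hz sine for a vowel and low-level Gaussian noise for an unvoiced fricative. The signals, frame length and function names are illustrative assumptions.

```python
import math
import random

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

fs = 8000          # sampling rate in Hz
n = 240            # 30 ms frame
rng = random.Random(0)

# Synthetic stand-ins, not real speech.
vowel_like = [0.5 * math.sin(2 * math.pi * 150 * t / fs) for t in range(n)]
fricative_like = [0.05 * rng.gauss(0.0, 1.0) for _ in range(n)]

for name, frame in [("vowel-like", vowel_like), ("fricative-like", fricative_like)]:
    print(name, short_time_energy(frame), zero_crossing_rate(frame))
```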
Acoustic Analysis of Vowels

Fundamental frequency F0 / Pitch period

F0             Male     Female
Average (Hz)   132      223
Range (Hz)     50-250   120-500
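One common way to estimate F0 is to pick the lag at which the frame's autocorrelation peaks and convert it to a frequency. The sketch below applies this to a synthetic three-harmonic signal at 132 Hz, the male average in the table. The search range of 50 to 500 Hz follows the table, while the function name, the frame length and the test signal are assumptions. The estimate is limited to integer lags, so it is only approximate.

```python
import math

def estimate_f0(frame, fs, f0_min=50.0, f0_max=500.0):
    """Return fs / lag for the lag that maximizes the autocorrelation."""
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag

fs = 8000
f0_true = 132.0
# Synthetic voiced-like frame (50 ms): fundamental plus two weaker harmonics.
frame = [sum(math.sin(2 * math.pi * h * f0_true * n / fs) / h for h in (1, 2, 3))
         for n in range(400)]
print(estimate_f0(frame, fs))
```

The pitch period is the reciprocal of F0, i.e. the best lag divided by the sampling rate.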
Acoustic Analysis of Consonants

• Stop Consonants
– Momentary blockage of the vocal tract (50-100 ms): Closure phase
– Release burst (shortest acoustic event)
– Voice onset time (VOT)
• Fricatives
– Narrow constriction somewhere in vocal tract
– Turbulent airflow through the constriction
The International Phonetic Alphabet (IPA)
Universal Speech Production Model

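[Block diagram: an impulse train generator feeds a glottal pulse model scaled by the voiced gain; a white noise generator is scaled by the unvoiced gain; a voiced/unvoiced switch selects one of the two excitations, which drives the vocal tract filter followed by the radiation model to give the output speech.]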
Vocal Tract Model

• Time-varying all-pole linear filter excited by a source signal.

e[n] → H(z) = 1/A(z) → s[n]

• H(z) models the vocal tract system.

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{P} a_i z^{-i}}$$
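In the time domain, the all-pole model corresponds to the recursion s[n] = e[n] + Σ a_i s[n-i]. The sketch below implements that recursion with zero initial state and drives it with an impulse train as a crude voiced excitation. The coefficient values, the pitch period of 61 samples and the function names are hypothetical and chosen only for illustration. The filter is held fixed here, whereas the model above is time-varying.

```python
def all_pole_filter(excitation, a):
    """s[n] = e[n] + sum_{i=1..P} a[i-1] * s[n-i], with zero initial state."""
    s = []
    for n, e_n in enumerate(excitation):
        acc = e_n
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc += a_i * s[n - i]
        s.append(acc)
    return s

def impulse_train(length, period):
    """Unit impulses every `period` samples (a crude voiced excitation)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

# Hypothetical stable 2nd-order coefficients (complex poles inside the unit circle).
a = [1.3, -0.8]
speech_like = all_pole_filter(impulse_train(160, 61), a)
print(speech_like[:5])
```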
Voiced Speech Spectrum

[Figure: magnitude spectrum of a voiced speech frame; Mag (dB) versus Frequency (Hz), 0 to 4000 Hz]
Superimposed 2nd-order LP Envelope

[Figure: voiced speech spectrum with the 2nd-order LP spectral envelope superimposed; Mag (dB) versus Frequency (Hz), 0 to 4000 Hz]
Superimposed 2nd, 6th order LP Envelopes

[Figure: voiced speech spectrum with the 2nd- and 6th-order LP spectral envelopes superimposed; Mag (dB) versus Frequency (Hz), 0 to 4000 Hz]
Superimposed 2nd, 6th, & 10th order LP Envelopes

[Figure: voiced speech spectrum with the 2nd-, 6th- and 10th-order LP spectral envelopes superimposed; Mag (dB) versus Frequency (Hz), 0 to 4000 Hz]
Superimposed 2nd, 6th, 10th & 16th order LP Envelopes

[Figure: voiced speech spectrum with the 2nd-, 6th-, 10th- and 16th-order LP spectral envelopes superimposed; Mag (dB) versus Frequency (Hz), 0 to 4000 Hz]
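The LP coefficients behind such envelopes are commonly obtained with the autocorrelation method and the Levinson-Durbin recursion. The sketch below uses the sign convention A(z) = 1 − Σ a_i z^{-i} and prints the normalized prediction error for orders 2, 6, 10 and 16. The test frame is synthetic (harmonics of 132 Hz plus a little noise), and the Hamming window and function names are assumptions, not necessarily the procedure used for the slides.

```python
import math
import random

def autocorrelation(frame, max_lag):
    """r[k] = sum_n frame[n] * frame[n - k] for k = 0..max_lag."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns ([a_1..a_P], prediction error)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)
    return a[1:], err

# Synthetic voiced-like frame (assumption): harmonics of 132 Hz plus noise.
fs = 8000
rng = random.Random(1)
frame = [sum(math.sin(2 * math.pi * h * 132.0 * n / fs) / h for h in range(1, 6))
         + 0.05 * rng.gauss(0.0, 1.0) for n in range(240)]
windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (len(frame) - 1)))
            for n, x in enumerate(frame)]
r = autocorrelation(windowed, 16)
for order in (2, 6, 10, 16):
    _, err = levinson_durbin(r, order)
    print(order, err / r[0])         # normalized prediction error
```

The prediction error cannot increase with order, which matches the way higher-order envelopes follow the spectrum more closely.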
Unvoiced Speech and 10th-order LP Residual

[Figure: top panel, unvoiced speech waveform; bottom panel, its 10th-order LP residual; Amplitude versus Time (ms), 0 to 40 ms]
Voiced Speech and 10th-order LP Residual
[Figure: top panel, voiced speech waveform; bottom panel, its 10th-order LP residual; Amplitude versus Time (ms), 0 to 40 ms]

• Short-term correlation
• Long-term correlation
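The LP residual is the output of the inverse filter A(z): e[n] = s[n] − Σ a_i s[n-i]. This removes the short-term correlation. In voiced speech the residual typically still shows pitch pulses, which is the long-term correlation. The sketch below computes the residual and checks that passing it back through 1/A(z) recovers the signal. The coefficients and the test signal are hypothetical.

```python
import math

def lp_residual(signal, a):
    """Inverse filter A(z): e[n] = s[n] - sum_{i=1..P} a[i-1] * s[n-i]."""
    residual = []
    for n, s_n in enumerate(signal):
        prediction = sum(a_i * signal[n - i]
                         for i, a_i in enumerate(a, start=1) if n - i >= 0)
        residual.append(s_n - prediction)
    return residual

def all_pole_filter(excitation, a):
    """Synthesis filter 1/A(z): s[n] = e[n] + sum_{i=1..P} a[i-1] * s[n-i]."""
    s = []
    for n, e_n in enumerate(excitation):
        acc = e_n
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc += a_i * s[n - i]
        s.append(acc)
    return s

a = [0.9, -0.2]                                            # hypothetical coefficients
signal = [math.sin(0.3 * n) + 0.3 * math.sin(1.1 * n) for n in range(80)]
residual = lp_residual(signal, a)
reconstructed = all_pole_filter(residual, a)
print(max(abs(u - v) for u, v in zip(signal, reconstructed)))
```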
Speech Coding
Coding Rates
• For telephone band (or narrowband) speech:
– Signal Bandwidth: 300-3400 Hz
– Sampling Rate: 8000 Hz
– Resolution: 16 bits / sample linear PCM

• Uncompressed bit rate:
16 bits/sample x 8000 samples/s = 128 kbit/s

• What is the minimum coding rate for transmitting the message information?
Coder Classes according to Bit-Rate

B > 16 kbps         High bit rate coders
4 < B <= 16 kbps    Medium bit rate coders
1 < B <= 4 kbps     Low bit rate coders
B < 1 kbps          Very low bit rate coders
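The bit-rate classes can be written as a small lookup, sketched below. The slide leaves B = 1 kbps exactly unassigned, so this sketch places it in the very low class, which is an assumption. The function name `coder_class` is hypothetical.

```python
def coder_class(bit_rate_kbps):
    """Map a bit rate in kbps to the class names used on the slide."""
    if bit_rate_kbps > 16:
        return "High bit rate"
    if bit_rate_kbps > 4:
        return "Medium bit rate"
    if bit_rate_kbps > 1:
        return "Low bit rate"
    return "Very low bit rate"   # includes B == 1 by assumption

for rate in (64, 16, 8, 4.8, 2.4, 0.8):
    print(rate, coder_class(rate))
```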
Standards Organizations

• ITU-T: International Telecommunication Union, Telecommunication Standardization Sector (UN)
• MPEG: Moving Picture Experts Group (ISO/IEC)
• INMARSAT: International Maritime Satellite Organization
– for geo-synchronous satellites
• US Government: DoD, NATO
• TIA: Telecommunications Industry Association - for North American Telecom standards
• ETSI: European Telecommunications Standards Institute
Speech Coding Standards
Name                      Coding Type           Bit-rate (kbps)   Organization   Year
G.711/G.712               PCM µ-law/A-law       64                ITU-T          1972
G.721/G.723/G.726/G.727   ADPCM                 32/24/40/16       ITU-T          1984/86/88/90
G.728                     LD-CELP               16                ITU-T          1992
G.729                     CS-ACELP              8.0               ITU-T          1995
G.723.1                   ACELP                 6.3/5.3           ITU-T          1995
G.722                     (Wideband) SB-ADPCM   48/56/64          ITU-T          1985
Speech Coding Standards
Name          Coding Type            Bit-rate (kbps)   Organization   Year
G.722.1       (Wideband) Transform   24/32             ITU-T          1999
Inmarsat      IMBE                   4.15              INMARSAT       1990
IS-54 (old)   VSELP                  7.95              TIA            1992
GSM-FR        RPE-LTP                13                GSM            1991
GSM-HR        CELP                   5-6               GSM            1994
GSM-EFR       CELP                   12.2              GSM            1997


Speech Coding Standards

Name           Coding Type   Bit-rate (kbps)   Organization   Year
IS-641 (new)   ACELP         7.4               TIA            1997
Iridium        AMBE          2.4               Iridium        1996
MPEG-4         HVXC          2-4               MPEG/ISO       1999
MPEG-4         CELP          4-24              MPEG/ISO       1999
FS-1015        LPC-10        2.4               US-DoD/NATO    1984
FS-1016        CELP          4.8               US-DoD/NATO    1989
MELP           MELP          2.4               US-DoD/NATO    1996
Coding Methodologies

• Coding Methodologies

– Waveform coding

– Vocoding or parametric coding

– Hybrid coding
Classes according to Coding Type
[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1 to 64 kbps) for three coder classes: parametric coders, hybrid coders, and waveform-approximating coders]
Coding Standards
[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1 to 64 kbps) for parametric, hybrid and waveform-approximating coders, with standards marked on the plot: FS1015, MELP, GSM/2, IS96, G.723.1, G.729, GSM FR, GSM EFR, G.728, G.726, G.711 and linear PCM]
PCM Coding
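[Block diagram: input x[n] → quantizer Q[.] → code index i[n] → reconstructed sample x’[n]]

• Instantaneous, non-uniform quantization
• For signals with time-varying energy, e.g., speech, uniform quantization is inefficient.
• If the signal amplitude is halved, the SQNR of a uniform quantizer falls by about 6 dB.
• SQNR is nearly independent of signal level in a log quantizer.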
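The level dependence can be checked numerically. The sketch below compares an 8-bit uniform quantizer with an 8-bit µ-law quantizer (µ = 255) on a sine wave at two amplitudes, one half of the other. The test signal, the mid-rise quantizer design and the function names are assumptions. This is the continuous µ-law formula, not the segmented G.711 encoding.

```python
import math

MU = 255.0

def mu_compress(x):
    """mu-law compressor for x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def uniform_quantize(x, bits=8):
    """Mid-rise uniform quantizer on [-1, 1)."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = int(math.floor(x / step))
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step

def mu_law_quantize(x, bits=8):
    """Compress, quantize uniformly, then expand."""
    return mu_expand(uniform_quantize(mu_compress(x), bits))

def sqnr_db(signal, quantizer):
    """Signal-to-quantization-noise ratio in dB."""
    noise = sum((s - quantizer(s)) ** 2 for s in signal)
    power = sum(s * s for s in signal)
    return 10.0 * math.log10(power / noise)

def sine(amplitude, n=4000):
    return [amplitude * math.sin(0.37 * k) for k in range(n)]

full, half = sine(0.9), sine(0.45)
print("uniform:", sqnr_db(full, uniform_quantize), sqnr_db(half, uniform_quantize))
print("mu-law :", sqnr_db(full, mu_law_quantize), sqnr_db(half, mu_law_quantize))
```

Halving the amplitude should cost the uniform quantizer roughly 6 dB of SQNR, while the µ-law figure changes only slightly.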
ADPCM Coding
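[Block diagram. Encoder: the prediction x’[n] is subtracted from the input x[n] to give d[n]; Q[.] quantizes d[n] to d’[n], which is sent as the code c[n]; x”[n] = x’[n] + d’[n] feeds the predictor P that produces x’[n]. Decoder: d’[n] is recovered from c[n] and added to the prediction x’[n] from P to give the output x”[n].]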
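The encoder and decoder loops can be sketched with a fixed first-order predictor and a fixed step size, which makes this plain DPCM rather than ADPCM (no adaptation). The predictor coefficient 0.9, the step size and the function names are illustrative assumptions. The key point is that the encoder predicts from its own reconstructed samples, so the decoder can track it exactly.

```python
import math

def dpcm_encode(x, step, a=0.9):
    """Quantize the prediction error d[n] = x[n] - x'[n]; return integer codes."""
    codes = []
    prediction = 0.0
    for sample in x:
        d = sample - prediction
        c = round(d / step)                 # Q[.]: uniform quantizer index
        codes.append(c)
        reconstructed = prediction + c * step   # x''[n] = x'[n] + d'[n]
        prediction = a * reconstructed          # P: first-order predictor
    return codes

def dpcm_decode(codes, step, a=0.9):
    """Mirror of the encoder's local reconstruction loop."""
    out = []
    prediction = 0.0
    for c in codes:
        reconstructed = prediction + c * step
        out.append(reconstructed)
        prediction = a * reconstructed
    return out

x = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
step = 0.02
codes = dpcm_encode(x, step)
y = dpcm_decode(codes, step)
print(max(abs(u - v) for u, v in zip(x, y)))
```

Because quantization happens inside the prediction loop, the reconstruction error stays within half a step and does not accumulate.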
Prediction in the context of Coding
[Figure: top panel, speech signal; bottom panel, its first-difference signal; Amplitude versus Time (ms), 0 to 20 ms]

Signal and first-difference signal
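The benefit of coding the first difference can be expressed as a prediction gain, the ratio of signal power to difference-signal power. The sketch below measures it for two synthetic sines sampled at 8 kHz. The signals and function names are assumptions. The gain is large when adjacent samples are highly correlated and becomes negative (in dB) when they are not.

```python
import math

def first_difference(x):
    """d[n] = x[n] - x[n-1], with x[-1] taken as 0."""
    return [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]

def power(x):
    return sum(v * v for v in x) / len(x)

def prediction_gain_db(x):
    """10 * log10(power of x / power of its first difference)."""
    return 10.0 * math.log10(power(x) / power(first_difference(x)))

fs = 8000
low = [math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]    # slowly varying
high = [math.sin(2 * math.pi * 3000 * n / fs) for n in range(800)]  # rapidly varying
print(prediction_gain_db(low), prediction_gain_db(high))
```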


ADPCM Coding
• DPCM with fixed predictor can give 4-11 dB improvement over PCM.

• PCM with adaptive quantization can give ~5 dB improvement over µ-law non-adaptive PCM.

• DPCM with adaptive prediction can give 10-12 dB improvement over fixed predictor.
Code Excited Linear Prediction (CELP) Coding
• Most coders in 4.8-16 kbps are based on Linear Prediction Analysis-by-Synthesis (LPAS) coding.

• CELP belongs to the LPAS paradigm of speech coding.
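Generic Linear Prediction Analysis-by-Synthesis (LPAS) Coder

[Block diagram: LP analysis of the input speech sets the synthesis filter; the excitation generator drives the synthesis filter; the synthesized output is subtracted from the input speech; an error minimization block adjusts the excitation generator.]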
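The analysis-by-synthesis loop can be illustrated with a toy codebook search: each candidate excitation is passed through the synthesis filter, the optimal gain is computed in closed form, and the candidate with the smallest squared error is kept. This is a minimal sketch under strong simplifications: there is no perceptual weighting, no adaptive (pitch) codebook and no filter memory, and the three-entry codebook and filter coefficient are hypothetical.

```python
def synthesize(excitation, a):
    """Zero-state all-pole synthesis: s[n] = e[n] + sum_i a[i-1] * s[n-i]."""
    s = []
    for n, e_n in enumerate(excitation):
        acc = e_n
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc += a_i * s[n - i]
        s.append(acc)
    return s

def search_codebook(target, codebook, a):
    """Return (index, gain, squared_error) of the best codebook entry."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

a = [0.5]                                   # hypothetical LP coefficient
codebook = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [1.0, -1.0, 0.0, 0.0]]
# Target built from entry 2 with gain 2, so the search should recover both.
target = [2.0 * v for v in synthesize(codebook[2], a)]
index, gain, error = search_codebook(target, codebook, a)
print(index, gain, error)
```

Only the chosen index and gain would need to be transmitted, since the decoder holds the same codebook and filter.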
CELP Decoder

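[Block diagram: the excitation parameters drive the excitation generator; its output passes through the synthesis filter G/A(z), set by the LP and gain parameters, to give the synthesized speech.]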


Speech Quality Measurement

• Speech Quality
– Objective measures
• Segmental SNR
• Itakura-Saito distance measure
• Spectral distortion (SD)
• ITU-T P.862 Recommendation
– Subjective measures
• Mean opinion score (MOS)
• Diagnostic Rhyme Test (DRT)
• Diagnostic Acceptability Measure (DAM)
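As one example of an objective measure, segmental SNR averages the per-frame SNR in dB rather than computing a single SNR over the whole utterance. The sketch below is a minimal version, assuming time-aligned reference and processed signals. The frame length and the rule of skipping frames with zero signal or zero noise energy are assumptions, and practical implementations often also clamp each frame's SNR to a fixed range.

```python
import math

def segmental_snr_db(reference, processed, frame_len=160):
    """Mean of per-frame SNRs (dB) over non-overlapping frames."""
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        out = processed[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - o) ** 2 for r, o in zip(ref, out))
        if signal > 0.0 and noise > 0.0:      # skip degenerate frames
            frame_snrs.append(10.0 * math.log10(signal / noise))
    if not frame_snrs:
        raise ValueError("no usable frames")
    return sum(frame_snrs) / len(frame_snrs)

# Toy example: two 4-sample frames with 20 dB and 40 dB SNR respectively.
reference = [1.0] * 8
processed = [0.9] * 4 + [0.99] * 4
print(segmental_snr_db(reference, processed, frame_len=4))
```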
Absolute Category Rating Tests (MOS)

• Listening quality scale

Excellent 5
Good 4
Fair 3
Poor 2
Bad 1
Diagnostic Rhyme Test

• Measures speech intelligibility


• Listeners are presented with one of two words which differ only in leading consonant
– Examples:
• Meet - Beat
• Than - Dan
• Met - Net
• Jest - Guest
Diagnostic Rhyme Test

• Total possible pairs = 96


• Intelligibility score, S, is given by:

$$S = 100 \times \frac{N_{\text{correct}} - N_{\text{incorrect}}}{N_{\text{test pairs}}}$$

Coder    Rate (kbps)   DRT    MOS
FS1016   4.8           91.7   3.3
G.728    16            93.0   3.9
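The score formula is simple to evaluate directly, as sketched below. Subtracting the incorrect responses corrects for guessing: in a two-choice test, a listener answering at random scores about 0 rather than 50. The function name and the example counts are illustrative, not data from the slides.

```python
def drt_score(n_correct, n_incorrect, n_test_pairs):
    """S = 100 * (N(correct) - N(incorrect)) / N(test pairs)."""
    return 100.0 * (n_correct - n_incorrect) / n_test_pairs

# Hypothetical response counts for the 96 word pairs.
print(drt_score(96, 0, 96))    # every pair identified correctly
print(drt_score(90, 6, 96))    # six errors
print(drt_score(48, 48, 96))   # chance-level responses
```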
Perceptual evaluation of speech quality (PESQ)

• Part of ITU-T P.862 standard


• Objective is to mimic sound perception by persons in real life
• PESQ simulates experiments in which subjects judge speech quality
• Physical signals are mapped to psychophysical representations that match internal representations in the head
Speech Coder Complexity Issues

• Complexity
– Computational complexity
• Simplex/half-duplex/full-duplex real time performance on a single DSP
• Fixed point vs. floating point
• CELP coders are computationally complex
– Memory requirement
• Storage of look-up tables, codebooks etc.
Timing Diagram for various Coding Delays
[Timing diagram, time axis in frame index (0 to 5): input speech frames 1 to 5 are buffered in turn (algorithmic buffering delay); each buffered frame is then encoded (encoder processing delay), its bits are transmitted (bit transmission delay), the frame is decoded (decoder processing delay) and the decoded speech is played back. The sum of the encoder and decoder processing delays is the total processing delay. The total one-way coding delay runs from the start of buffering frame 1 to the start of its playback.]


Thank You!
