Speech Signal Analysis and Coding: Dr. Arun Kumar

Centre for Applied Research in Electronics (CARE), IIT Delhi
arunkm@care.iitd.ernet.in
Contents
• Speech Processing Applications
• Speech Signal Understanding
– Speech Production
– Speech Signal Characteristics and Analysis
• Speech Coding
– Coding Standards
– Coder Attributes including Quality Evaluation
– Coding Methodologies
Speech Processing Applications
• Speech Transmission
– Trunk-line telephony
– Wireless telephony
• Speech Storage
– Voice Mail, Voice Memo, Answering machines
• Speech Synthesis
– Text-to-speech synthesis
– Automatic information services
• Speaker Verification and Identification
– Phone banking
– Secure entry
• Aids for the Handicapped
– Variable rate playback
– Hearing aids
– Reading machine for visually impaired
– Visual display of speech information for hearing impaired
• Speech Enhancement
– Echo and noise cancellation
• Speech Recognition
– Automatic language translation
• Voice Personality Transformation
– Voice conversion from “source” to “target”
The Speech Signal
“It is the variation of pressure, from atmospheric pressure, as a function of time, caused by traveling waves from the speaker’s mouth (apart from nostrils, cheeks and throat).”
The Intensity Level of Speech
Units: SPL (Sound Pressure Level) in dB relative to a reference level.
Reference: 10^-16 W/cm^2 (corresponds to ‘just barely audible’)

Level (dB)   Sound
120          Airplane
100          Rock concert
80           Heavy traffic
55-70        Variations in normal voice level (1 meter distance from mouth)
20           Whisper
0            Just barely audible
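The dB scale above can be sketched as a conversion between acoustic intensity and level. This is a minimal illustration using the reference of 10^-16 W/cm^2 stated on the slide; the function names are hypothetical and chosen only for this example.

```python
import math

# Reference intensity from the slide: 10^-16 W/cm^2 ("just barely audible").
I_REF_W_PER_CM2 = 1e-16

def intensity_level_db(intensity_w_per_cm2):
    """Level in dB relative to the reference intensity."""
    return 10.0 * math.log10(intensity_w_per_cm2 / I_REF_W_PER_CM2)

def intensity_from_db(level_db):
    """Inverse mapping: intensity in W/cm^2 for a given level in dB."""
    return I_REF_W_PER_CM2 * 10.0 ** (level_db / 10.0)

# Every 10 dB step is a factor of 10 in intensity, so the 100 dB span
# between a whisper (20 dB) and an airplane (120 dB) is a factor of 10^10.
ratio_airplane_to_whisper = intensity_from_db(120) / intensity_from_db(20)
print(f"{ratio_airplane_to_whisper:.3g}")
```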
• Energy of speech during 1 s
– 2 x 10^-5 Joules
(It takes 100 Joules to light a 100 W bulb for 1 s)
• Strongest vowel: /a/ as in “talk”
• Weakest vowel: /i/ as in “see”
• Strongest consonant: /r/ as in “run”
• Weakest consonant: /θ/ as in “thin”
Speech & Audio Signal Specs.
Signal Category          Audio Bandwidth (Hz)   Sampling Rate (kHz)   Source Rate (kbps)
Telephone Band Speech    300-3400               8.0                   128
Wideband Speech          50-7000                16.0                  256
Wideband Audio           20-20,000              44.1/48.0             705/768
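The source rates in the table are consistent with single-channel linear PCM at 16 bits per sample, the resolution quoted later under Coding Rates. A short check, assuming that resolution for all three rows (the function name is illustrative only):

```python
def pcm_source_rate_kbps(sampling_rate_khz, bits_per_sample=16):
    """Uncompressed single-channel PCM rate: samples/s times bits/sample."""
    return sampling_rate_khz * bits_per_sample

# Sampling rates (kHz) taken from the table above.
rates = {fs: pcm_source_rate_kbps(fs) for fs in (8.0, 16.0, 44.1, 48.0)}
for fs, kbps in rates.items():
    print(f"{fs} kHz -> {kbps:.1f} kbps")
```

The 44.1 kHz row works out to 705.6 kbps, which the table lists as 705.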
Speech Articulation by the Vocal System
Reproduced from: D. O’Shaughnessy, Human and machine speech communication, IEEE Press, 2000
Speech Classes by Articulation
• Voiced speech
• Unvoiced speech
• Transient (stop) sounds

Acoustic Analysis of Speech
The relationship between speech sounds (phonemes) and their acoustic realizations
– Waveform
– Spectrum
– Spectrogram
Time Waveform of a Speech Sentence
[Figure: time waveform of the sentence “THIS IS GOOD”, amplitude versus time (0 to 1.4 s), with the phoneme segments of each word labeled along the waveform.]
Waveform Analysis of Speech
• Vowels
– High energy, periodic, steady-state utterance
• Unvoiced fricatives
– Low energy, noise-like, steady-state utterance
• Voiced fricatives
– Low energy, element of periodicity, steady-state utterance
• Stops
– Transient release, medium to low energy
• Nasals
– Low-to-medium energy, periodic, steady-state utterance
Acoustic Analysis of Vowels
Fundamental frequency F0 / Pitch period
F0 Male Female
Average (Hz) 132 223
Range (Hz) 50-250 120-500
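F0 and the pitch period are reciprocals: the male average of 132 Hz corresponds to a period of about 7.6 ms, and the female average of 223 Hz to about 4.5 ms. One common way to estimate F0 from a voiced frame is to pick the lag that maximizes the autocorrelation; the slides do not say which estimator was used, so the sketch below is only an illustration. The 50-500 Hz search range is taken from the table, and the test frame is a synthetic harmonic signal, not real speech.

```python
import math

def estimate_f0(frame, fs, f_min=50.0, f_max=500.0):
    """Return the F0 (Hz) whose lag maximizes the frame's autocorrelation."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return fs / best_lag

# Synthetic "voiced" frame: 40 ms at 8 kHz, five harmonics of 125 Hz.
FS = 8000
TRUE_F0 = 125.0
frame = [
    sum(math.sin(2 * math.pi * k * TRUE_F0 * n / FS) / k for k in range(1, 6))
    for n in range(320)
]

f0 = estimate_f0(frame, FS)
pitch_period_ms = 1000.0 / f0
print(f0, pitch_period_ms)
```

Real speech needs more care (voicing decision, octave errors, windowing); the sketch only shows the period-to-lag relationship.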
Acoustic Analysis of Consonants
• Stop Consonants
– Momentary blockage of the vocal tract (50-100 ms): Closure phase
– Release burst (shortest acoustic event)
– Voice onset time (VOT)
• Fricatives
– Narrow constriction somewhere in vocal tract
– Turbulent airflow through the constriction
The International Phonetic Alphabet (IPA)
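Universal Speech Production Model
[Block diagram: in the voiced branch, an impulse train generator drives a glottal pulse model followed by the voiced gain; in the unvoiced branch, a white noise generator is followed by the unvoiced gain. A voiced or unvoiced switch selects one branch as the excitation of the vocal tract filter, which is followed by the radiation model to give the output speech.]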
Vocal Tract Model
• Time-varying all-pole linear filter excited by a source signal.
[Block diagram: e[n] → H(z) = 1/A(z) → s[n]]
• H(z) models the vocal tract system.

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{P} a_i z^{-i}}$$
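In the time domain, H(z) = 1/A(z) corresponds to the recursion s[n] = e[n] + a_1 s[n-1] + ... + a_P s[n-P]. A minimal sketch of that recursion follows, with an impulse train standing in for the voiced excitation; the coefficient values and helper names are illustrative assumptions, not values from the slides.

```python
def all_pole_filter(excitation, a):
    """Filter e[n] through 1/A(z), with A(z) = 1 - sum_i a[i-1] * z^-i."""
    s = []
    for n, e in enumerate(excitation):
        value = e
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                value += a_i * s[n - i]
        s.append(value)
    return s

def impulse_train(length, period):
    """Unit impulses every `period` samples (a crude voiced excitation)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

# First-order example: a single impulse decays geometrically.
impulse_response = all_pole_filter([1.0, 0.0, 0.0, 0.0], [0.5])

# Illustrative second-order resonator driven by a 64-sample pitch period.
voiced = all_pole_filter(impulse_train(256, 64), [1.3, -0.8])
print(impulse_response, len(voiced))
```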
Voiced Speech Spectrum
[Figure: magnitude spectrum of a voiced speech frame, Mag (dB) versus Frequency (Hz), 0 to 4000 Hz.]
Superimposed 2nd-order LP Envelope
[Figure: the same spectrum with the 2nd-order LP envelope superimposed.]
Superimposed 2nd, 6th order LP Envelopes
[Figure: the same spectrum with the 2nd and 6th order LP envelopes superimposed.]
Superimposed 2nd, 6th, & 10th order LP Envelopes
[Figure: the same spectrum with the 2nd, 6th and 10th order LP envelopes superimposed.]
Superimposed 2nd, 6th, 10th & 16th order LP Envelopes
[Figure: the same spectrum with the 2nd, 6th, 10th and 16th order LP envelopes superimposed.]
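Each LP envelope is |1/A(z)| for a predictor of the given order P. A common way to obtain the coefficients a_i is the autocorrelation method solved with the Levinson-Durbin recursion; the slides do not state which method produced these plots, so the following is a sketch under that assumption. It uses the same sign convention as the vocal tract model, A(z) = 1 - sum a_i z^-i.

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of a (windowed) frame."""
    return [
        sum(frame[n] * frame[n + k] for n in range(len(frame) - k))
        for k in range(max_lag + 1)
    ]

def levinson_durbin(r, order):
    """Solve the normal equations; return (a[1..order], prediction error)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err                # reflection coefficient
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Check on a known case: the autocorrelation of a first-order process
# with coefficient 0.9 is r[k] = 0.9**k, so higher orders add nothing.
r = [0.9 ** k for k in range(3)]
a1, e1 = levinson_durbin(r, 1)
a2, e2 = levinson_durbin(r, 2)
print(a1, e1, a2, e2)
```

For real speech the prediction error keeps dropping as the order rises, which is why the higher-order envelopes in the figures follow more of the formant peaks.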
Unvoiced Speech and 10th-order LP Residual
[Figure: two panels, amplitude versus time (0 to 40 ms), showing an unvoiced speech segment and its 10th-order LP residual.]
Voiced Speech and 10th-order LP Residual
[Figure: two panels, amplitude versus time (0 to 40 ms), showing a voiced speech segment and its 10th-order LP residual.]
• Short-term correlation
• Long-term correlation
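The LP residual is obtained by passing the speech through the inverse filter A(z), that is e[n] = s[n] - sum a_i s[n-i]. This removes the short-term correlation; for voiced speech the long-term (pitch) correlation remains visible as periodic pulses in the residual. A minimal sketch of the inverse filter, with made-up sample values chosen so the result can be checked by hand:

```python
def lp_residual(speech, a):
    """Inverse-filter speech with A(z) = 1 - sum_i a[i-1] * z^-i."""
    residual = []
    for n in range(len(speech)):
        prediction = sum(
            a_i * speech[n - i]
            for i, a_i in enumerate(a, start=1)
            if n - i >= 0
        )
        residual.append(speech[n] - prediction)
    return residual

# A geometrically decaying signal is fully predicted by a single
# coefficient of 0.5, so only the initial impulse is left.
residual = lp_residual([1.0, 0.5, 0.25, 0.125], [0.5])
print(residual)
```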
Speech Coding
Coding Rates
• For telephone band (or narrowband) speech:
– Signal Bandwidth: 300-3400 Hz
– Sampling Rate: 8000 Hz
– Resolution: 16 bits / sample linear PCM
• Uncompressed bit rate: 16 bits/sample x 8000 samples/s = 128 kbit/s
• What is the minimum coding rate for transmitting the message information?
Coder Classes according to Bit-Rate
B > 16 kbps         High bit rate coders
4 < B <= 16 kbps    Medium bit rate coders
1 < B <= 4 kbps     Low bit rate coders
B < 1 kbps          Very low bit rate coders
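The classification above can be written as a small lookup. The slide leaves B = 1 kbps unassigned; the sketch below places it in the very low class, which is an assumption made only so the function is total.

```python
def coder_class(bit_rate_kbps):
    """Map a bit rate in kbps to the coder class used on the slide."""
    if bit_rate_kbps > 16:
        return "High bit rate"
    if bit_rate_kbps > 4:
        return "Medium bit rate"
    if bit_rate_kbps > 1:
        return "Low bit rate"
    # Assumption: exactly 1 kbps is treated as very low bit rate.
    return "Very low bit rate"

for rate in (64, 8, 2.4, 0.8):
    print(rate, coder_class(rate))
```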
Standards Organizations
• ITU-T: International Telecommunication Union (UN)
• MPEG: Moving Picture Experts Group (ISO/IEC)
• INMARSAT: International Maritime Satellite Organization
– for geo-synchronous satellites
• US Government: DoD, NATO
• TIA: Telecommunications Industry Association, for North American telecom standards
• ETSI: European Telecommunications Standards Institute
Speech Coding Standards
Name                      Coding Type            Bit-rate (kbps)   Organization   Year
G.711/G.712               PCM µ-law/A-law        64                ITU-T          1972
G.721/G.723/G.726/G.727   ADPCM                  32/24/40/16       ITU-T          1984/86/88/90
G.728                     LD-CELP                16                ITU-T          1992
G.729                     CS-ACELP               8.0               ITU-T          1995
G.723.1                   ACELP                  6.3/5.3           ITU-T          1995
G.722                     (Wideband) SB-ADPCM    48/56/64          ITU-T          1985
G.722.1                   (Wideband) Transform   24/32             ITU-T          1999
Inmarsat                  IMBE                   4.15              INMARSAT       1990
IS-54 (old)               VSELP                  7.95              TIA            1992
GSM-FR                    RPE-LTP                13                GSM            1991
GSM-HR                    CELP                   5-6               GSM            1994
GSM-EFR                   CELP                   12.2              GSM            1997
IS-641 (new)              ACELP                  7.4               TIA            1997
Iridium                   AMBE                   2.4               Iridium        1996
MPEG-4                    HVXC                   2-4               MPEG/ISO       1999
MPEG-4                    CELP                   4-24              MPEG/ISO       1999
FS-1015                   LPC-10                 2.4               US-DoD/NATO    1984
FS-1016                   CELP                   4.8               US-DoD/NATO    1989
MELP                      MELP                   2.4               US-DoD/NATO    1996
Coding Methodologies
• Coding Methodologies
– Waveform coding
– Vocoding or parametric coding
– Hybrid coding
Classes according to Coding Type
[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1 to 64 kbps) for the three coder classes: parametric coders, hybrid coders, and waveform approximating coders.]
Coding Standards
[Figure: the same quality versus bit rate plot with individual standards marked: FS1015, MELP, GSM/2, IS96, G.723.1, GSM FR, G.729, GSM EFR, G.728, G.726, G.711, and linear PCM.]
PCM Coding
[Block diagram: x[n] → Q[.] → x’[n], with quantizer index i[n]]
• Instantaneous, non-uniform quantization
• For signals with time-varying energy, e.g. speech, uniform quantization is inefficient.
• With a uniform quantizer, if the signal amplitude is halved, SQNR falls by 6 dB.
• SQNR is independent of signal level in a log quantizer.
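The effect of signal level on SQNR can be checked numerically. The sketch below compares an 8-bit uniform quantizer with the continuous µ-law curve (µ = 255) followed by the same uniform quantizer. It is an illustration only, not the bit-exact segmented encoding of G.711, and the sine test signal and function names are assumptions made for the example.

```python
import math

MU = 255.0
BITS = 8

def uniform_quantize(x, bits=BITS):
    """Mid-rise uniform quantizer covering [-1, 1]."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = math.floor(x / step)
    index = max(-(levels // 2), min(levels // 2 - 1, index))
    return (index + 0.5) * step

def mu_compress(x):
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def mu_law_quantize(x):
    """Compress, quantize uniformly, then expand."""
    return mu_expand(uniform_quantize(mu_compress(x)))

def sqnr_db(quantizer, amplitude, n_samples=4000):
    """SQNR in dB for a sine of the given peak amplitude."""
    signal = [amplitude * math.sin(0.1234 * k) for k in range(n_samples)]
    p_signal = sum(s * s for s in signal)
    p_noise = sum((s - quantizer(s)) ** 2 for s in signal)
    return 10.0 * math.log10(p_signal / p_noise)

uniform_full = sqnr_db(uniform_quantize, 1.0)
uniform_half = sqnr_db(uniform_quantize, 0.5)
uniform_low = sqnr_db(uniform_quantize, 0.01)
mu_full = sqnr_db(mu_law_quantize, 1.0)
mu_low = sqnr_db(mu_law_quantize, 0.01)
print(uniform_full, uniform_half, uniform_low, mu_full, mu_low)
```

Dropping the level by 40 dB costs the uniform quantizer roughly the full 40 dB of SQNR, while the log quantizer loses far less.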
ADPCM Coding
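[Block diagram, encoder: the prediction x’[n] from predictor P is subtracted from the input x[n] to give the difference d[n]; the quantizer Q[.] produces the code c[n] and the quantized difference d’[n]; the reconstruction x”[n] = x’[n] + d’[n] feeds the predictor P.]
[Block diagram, decoder: the code c[n] is decoded to d’[n] and added to the prediction x’[n] from P to give the output x”[n].]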
Prediction in the context of Coding
[Figure: two panels, amplitude versus time (0 to 20 ms): signal and first-difference signal.]

ADPCM Coding
• DPCM with fixed predictor can give 4-11 dB improvement over PCM.
• PCM with adaptive quantization can give ~5 dB improvement over µ-law non-adaptive PCM.
• DPCM with adaptive prediction can give 10-12 dB improvement over fixed predictor.
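The gain from prediction can be seen in a minimal DPCM loop. The sketch uses the simplest fixed predictor (the previous reconstructed sample, so x’[n] = x”[n-1]) and a fixed step size; the adaptive predictor and quantizer of ADPCM are not modeled, and the step size and test tone are illustrative choices.

```python
import math

def dpcm_encode(x, step):
    """Quantize d[n] = x[n] - x'[n]; the predictor tracks the decoder."""
    codes = []
    prediction = 0.0                      # x'[n]
    for sample in x:
        code = round((sample - prediction) / step)   # c[n]
        codes.append(code)
        prediction += code * step         # x''[n], used as next x'[n]
    return codes

def dpcm_decode(codes, step):
    """Rebuild x''[n] by accumulating the dequantized differences d'[n]."""
    out = []
    prediction = 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

STEP = 0.01
signal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(80)]
codes = dpcm_encode(signal, STEP)
decoded = dpcm_decode(codes, STEP)

max_error = max(abs(s - d) for s, d in zip(signal, decoded))
max_dpcm_code = max(abs(c) for c in codes)
max_pcm_code = max(abs(round(s / STEP)) for s in signal)
print(max_error, max_dpcm_code, max_pcm_code)
```

With the same step size, the difference codes span a much smaller range than direct PCM codes, so fewer bits are needed for the same reconstruction error.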
Code Excited Linear Prediction (CELP) Coding
• Most coders in 4.8-16 kbps are based on Linear Prediction Analysis-by-Synthesis (LPAS) coding.
• CELP belongs to LPAS paradigm of speech coding.
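Generic Linear Prediction Analysis-by-Synthesis (LPAS) Coder
[Block diagram: LP analysis of the input speech sets the synthesis filter; the excitation generator drives the synthesis filter; the synthesized output is subtracted from the input speech, and an error minimization block selects the excitation.]
CELP Decoder
[Block diagram: the excitation parameters drive the excitation generator, whose output passes through the filter G/A(z), set by the LP and gain parameters, to give the synthesized speech.]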
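The analysis-by-synthesis loop can be sketched as an exhaustive codebook search: each candidate excitation is passed through 1/A(z), the best gain is computed in closed form, and the candidate with the smallest squared error wins. This is a simplified illustration; the tiny codebook and the coefficient are made up, and the perceptual weighting, adaptive (pitch) codebook and filter memory handling of practical CELP coders are omitted.

```python
def synthesize(excitation, a):
    """Filter the excitation through 1/A(z), A(z) = 1 - sum a_i z^-i."""
    out = []
    for n, e in enumerate(excitation):
        feedback = sum(
            a_i * out[n - i] for i, a_i in enumerate(a, start=1) if n - i >= 0
        )
        out.append(e + feedback)
    return out

def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the best-matching codevector."""
    best = None
    for index, codevector in enumerate(codebook):
        y = synthesize(codevector, a)
        energy = sum(v * v for v in y)
        correlation = sum(t * v for t, v in zip(target, y))
        gain = correlation / energy       # least-squares optimal gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

# Hypothetical 4-entry codebook and first-order LP coefficient.
CODEBOOK = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, -1, 0, 0],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [0, 1, 0, 0, 1, 0, 0, 1],
]
A = [0.9]

# Target built from entry 2 with gain 0.7, so the search should find it.
target = [0.7 * v for v in synthesize(CODEBOOK[2], A)]
best_index, best_gain, best_error = search_codebook(target, CODEBOOK, A)
print(best_index, best_gain, best_error)
```

Only the index and the quantized gain need to be transmitted, which is what the excitation parameters in the decoder diagram carry.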

Speech Quality Measurement
• Speech Quality
– Objective measures
• Segmental SNR
• Itakura-Saito distance measure
• Spectral distortion (SD)
• ITU-T P.862 Recommendation
– Subjective measures
• Mean opinion score (MOS)
• Diagnostic Rhyme Test (DRT)
• Diagnostic Acceptability Measure (DAM)
Absolute Category Rating Tests (MOS)
• Listening quality scale
Excellent 5
Good 4
Fair 3
Poor 2
Bad 1
Diagnostic Rhyme Test
• Measures speech intelligibility
• Listeners are presented with one of two words which differ only in the leading consonant
– Examples:
• Meet - Beat
• Than - Dan
• Met - Net
• Jest - Guest
Diagnostic Rhyme Test
• Total possible pairs = 96
• Intelligibility score, S, is given by:

$$S = 100 \times \frac{N(\text{correct}) - N(\text{incorrect})}{N(\text{test pairs})}$$
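The score subtracts incorrect responses so that pure guessing on a two-choice test averages to zero. A direct transcription of the formula, with hypothetical response counts:

```python
def drt_score(n_correct, n_incorrect, n_test_pairs):
    """Intelligibility score S = 100 * (correct - incorrect) / pairs."""
    return 100.0 * (n_correct - n_incorrect) / n_test_pairs

perfect = drt_score(96, 0, 96)     # every pair identified
chance = drt_score(48, 48, 96)     # guessing: half right, half wrong
example = drt_score(90, 6, 96)     # hypothetical listener result
print(perfect, chance, example)
```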
Coder Rate (kbps) DRT MOS
FS1016 4.8 91.7 3.3
G.728 16 93.0 3.9
Perceptual evaluation of speech quality (PESQ)
• Defined in ITU-T Recommendation P.862
• Objective is to mimic sound perception by persons in real life
• PESQ simulates experiments in which subjects judge speech quality
• Physical signals are mapped to psychophysical representations that match internal representations in the head
Speech Coder Complexity Issues
• Complexity
– Computational complexity
• Simplex/half-duplex/full-duplex real time performance on a single DSP
• Fixed point vs. floating point
• CELP coders are computationally complex
– Memory requirement
• Storage of look-up tables, codebooks etc.
Timing Diagram for various Coding Delays
[Timing diagram, time axis in frame index 0 to 5. Successive rows show: buffering of input speech frames 1 to 5 (algorithmic buffering delay); encoding of each frame once it has been buffered (encoder processing delay); transmission of the bits of each frame (bit transmission delay); decoding of each frame (decoder processing delay); and playback of the decoded speech. The sum of the encoder and decoder processing delays is the total processing delay. The total one way coding delay is marked across all of these stages.]
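The delay components in the diagram add up to the one way coding delay. The slides give no numerical values, so the figures below are purely hypothetical and serve only to show the bookkeeping.

```python
def one_way_coding_delay_ms(buffering, encode, transmit, decode):
    """Sum of the four delay components named in the timing diagram."""
    return buffering + encode + transmit + decode

def total_processing_delay_ms(encode, decode):
    """Encoder plus decoder processing delay."""
    return encode + decode

# Hypothetical values in ms: 20 ms frame buffering, 10 ms to encode,
# 20 ms to transmit the frame's bits, 5 ms to decode.
total = one_way_coding_delay_ms(buffering=20, encode=10, transmit=20, decode=5)
processing = total_processing_delay_ms(encode=10, decode=5)
print(total, processing)
```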

Thank You!
