and Coding
arunkm@care.iitd.ernet.in
Contents
• Speech Processing Applications
• Speech Coding
– Coding Standards
– Coder Attributes including Quality Evaluation
– Coding Methodologies
Speech Processing Applications
• Speech Transmission
– Trunk-line telephony
– Wireless telephony
• Speech Storage
– Voice mail, voice memo, answering machines
• Speech Synthesis
– Text-to-speech-synthesis
– Automatic information services
Speech Processing Applications
• Speech Enhancement
– Echo and noise cancellation
• Speech Recognition
– Automatic language translation
• Voice Personality Transformation
– Voice conversion from “source” to “target”
The Speech Signal
Units:
SPL (Sound Pressure Level) in dB relative to a reference level.
– 20 dB: whisper
– 0 dB: just barely audible
The Intensity Level of Speech
Reproduced from: D. O’Shaughnessy, Human and machine speech communication, IEEE Press, 2000
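As a small illustration of the dB scale above, the sketch below converts between sound pressure and SPL. The slide only says "a reference level"; the value of 20 micropascals used here is the usual convention for sound in air and is an assumption, as are the function names.

```python
import math

# Assumed reference pressure: 20 micropascals (common convention for air).
P_REF_PA = 20e-6

def spl_db(pressure_pa):
    """Sound pressure level in dB relative to P_REF_PA."""
    return 20.0 * math.log10(pressure_pa / P_REF_PA)

def pressure_from_spl(level_db):
    """Inverse mapping: RMS pressure in pascals for a given SPL."""
    return P_REF_PA * 10.0 ** (level_db / 20.0)

# A 20 dB whisper corresponds to ten times the reference pressure.
whisper_pa = pressure_from_spl(20.0)
print(whisper_pa / P_REF_PA)
```

Each additional 20 dB multiplies the pressure by ten, which is why a logarithmic unit is convenient for the wide range of speech intensities.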
Speech Classes by Articulation
• Voiced speech
• Unvoiced speech
– Waveform
– Spectrum
– Spectrogram
Time Waveform of a Speech Sentence
[Figure: time waveform of the sentence "THIS IS GOOD"; amplitude versus time (s), with phone-level segment labels along the waveform]
Waveform Analysis of Speech
• Vowels
– High energy, periodic, steady state utterance
• Unvoiced fricatives
– Low energy, noise-like, steady-state utterance
• Voiced fricatives
– Low energy, element of periodicity, steady-state utterance
• Stops
– Transient release, medium to low energy
• Nasals
– Low-to-medium energy, periodic, steady-state utterance
Acoustic Analysis of Vowels
F0 Male Female
Average (Hz) 132 223
Range (Hz) 50-250 120-500
Acoustic Analysis of Consonants
• Stop Consonants
– Momentary blockage of the vocal tract (50-100 ms): closure phase
– Release burst (shortest acoustic event)
– Voice-onset time (VOT)
• Fricatives
– Narrow constriction somewhere in the vocal tract
– Turbulent airflow through the constriction
The International Phonetic Alphabet (IPA)
Universal Speech Production Model
[Block diagram]
– Voiced branch: Impulse Train Generator → Glottal Pulse Model → Voiced Gain
– Unvoiced branch: White Noise Generator → Unvoiced Gain
– Voiced or Unvoiced switch → Vocal Tract Filter → Radiation Model → Output speech
Vocal Tract Model
e[n] → H(z) = 1/A(z) → s[n]
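The all-pole model above can be sketched as a direct recursion. The sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p is an assumption (the slide does not define A(z)), and the function name is hypothetical.

```python
def lp_synthesize(excitation, a):
    """Filter e[n] through H(z) = 1/A(z).

    Assumes A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p, so that
    s[n] = e[n] - sum_k a[k-1] * s[n-k], with zero initial state.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * out[n - k]
        out.append(acc)
    return out

# Unit impulse through a single-pole filter 1 / (1 - 0.5 z^-1).
impulse = [1.0, 0.0, 0.0, 0.0]
print(lp_synthesize(impulse, [-0.5]))  # [1.0, 0.5, 0.25, 0.125]
```

In a coder, e[n] would be the excitation signal and the coefficients would be re-estimated for every frame.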
[Figure: magnitude spectrum; Mag (dB) versus Frequency (Hz), 0-4000 Hz]
Superimposed 2nd-order LP Envelope
[Figure: Mag (dB) versus Frequency (Hz), 0-4000 Hz]
Superimposed 2nd, 6th order LP Envelopes
[Figure: Mag (dB) versus Frequency (Hz), 0-4000 Hz]
Superimposed 2nd, 6th & 10th order LP Envelopes
[Figure: Mag (dB) versus Frequency (Hz), 0-4000 Hz]
Superimposed 2nd, 6th, 10th & 16th order LP Envelopes
[Figure: Mag (dB) versus Frequency (Hz), 0-4000 Hz]
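The LP envelopes in the preceding figures can be reproduced in outline as follows. This is a minimal sketch that assumes the autocorrelation method with the Levinson-Durbin recursion (the slides do not say which estimation method was used) and a synthetic two-pole test frame in place of real speech; all function names are hypothetical.

```python
import cmath
import math

def autocorrelation(x, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, err) with A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # prediction error energy
    return a, err

def lp_envelope_db(a, err, freqs_hz, fs):
    """LP envelope 20*log10(G / |A(e^jw)|) with G = sqrt(err)."""
    gain = math.sqrt(err)
    out = []
    for f in freqs_hz:
        w = 2.0 * math.pi * f / fs
        A = sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a))
        out.append(20.0 * math.log10(gain / abs(A)))
    return out

# Synthetic frame (assumption): impulse response of the resonator
# 1 / (1 - 1.6 z^-1 + 0.81 z^-2), standing in for a speech frame.
frame = []
for n in range(200):
    s = 1.0 if n == 0 else 0.0
    if n >= 1:
        s += 1.6 * frame[n - 1]
    if n >= 2:
        s -= 0.81 * frame[n - 2]
    frame.append(s)

fs = 8000
freqs = list(range(0, 4001, 50))
r = autocorrelation(frame, 16)
envelopes = {}
for order in (2, 6, 10, 16):               # orders shown on the slides
    a, err = levinson_durbin(r, order)
    envelopes[order] = lp_envelope_db(a, err, freqs, fs)
    peak_hz = freqs[envelopes[order].index(max(envelopes[order]))]
    print(order, round(err, 6), peak_hz)
```

On real speech, raising the order lets the envelope follow more spectral peaks, which is the effect the superimposed plots illustrate; on this two-pole test frame every order gives essentially the same envelope.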
Unvoiced Speech and 10th-order LP Residual
[Figure: two panels, the unvoiced speech frame and its LP residual; Amplitude versus Time (ms), 0-40 ms]
Voiced Speech and 10th-order LP Residual
[Figure: two panels, the voiced speech frame and its LP residual; Amplitude versus Time (ms), 0-40 ms]
• Short-term correlation
• Long-term correlation
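The two kinds of correlation can be illustrated with a small sketch: inverse filtering with A(z) removes the short-term correlation, and the long-term (pitch) correlation that remains in a voiced residual shows up as an autocorrelation peak at the pitch lag. The synthetic "voiced" signal, the filter coefficients, and the function names are assumptions for illustration; the lag search range follows the 50-500 Hz F0 range quoted earlier.

```python
def lp_residual(speech, a):
    """Inverse filter: e[n] = s[n] + sum_k a[k-1] * s[n-k].

    Assumes A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p.
    """
    res = []
    for n in range(len(speech)):
        acc = speech[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * speech[n - k]
        res.append(acc)
    return res

def pitch_lag(x, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the largest normalized autocorrelation."""
    energy = sum(v * v for v in x)
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

FS = 8000
PERIOD = 80                      # hypothetical pitch period: 100 Hz at 8 kHz
A = [-1.6, 0.81]                 # hypothetical 2nd-order vocal tract filter

# Synthetic voiced speech: impulse train filtered by 1/A(z).
excitation = [1.0 if n % PERIOD == 0 else 0.0 for n in range(400)]
speech = []
for n, e in enumerate(excitation):
    s = e
    for k, ak in enumerate(A, start=1):
        if n - k >= 0:
            s -= ak * speech[n - k]
    speech.append(s)

residual = lp_residual(speech, A)
lag = pitch_lag(residual, FS // 500, FS // 50)   # 500 Hz down to 50 Hz
print(lag, FS / lag)
```

For an unvoiced frame the residual is noise-like and no lag stands out, which matches the difference between the two residual plots.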
Speech Coding
Coding Rates
• For telephone band (or narrowband) speech:
– Signal Bandwidth: 300-3400 Hz
– Sampling Rate: 8000 Hz
– Resolution: 16 bits / sample linear PCM
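The uncompressed bit rate implied by these figures follows from a one-line calculation, sketched below. The comparison against 2.4 kbps uses the rate listed for MELP in the standards table; the function name is hypothetical.

```python
def pcm_bit_rate_bps(sampling_rate_hz, bits_per_sample):
    """Bit rate of uncompressed linear PCM in bits per second."""
    return sampling_rate_hz * bits_per_sample

linear_pcm_bps = pcm_bit_rate_bps(8000, 16)
print(linear_pcm_bps / 1000)            # 128.0 (kbps)

# Compression ratio relative to a 2.4 kbps coder.
ratio = linear_pcm_bps / 2400
print(round(ratio, 1))                  # 53.3
```

This 128 kbps figure is the baseline against which the coding rates of the standards are usually compared.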
Name           Coding Type   Bit-rate (kbps)   Organization   Year
IS-641 (new)   ACELP         7.4               TIA            1997
Iridium        AMBE          2.4               Iridium        1996
MPEG-4         HVXC          2-4               MPEG/ISO       1999
MPEG-4         CELP          4-24              MPEG/ISO       1999
FS-1015        LPC-10        2.4               US-DoD/NATO    1984
FS-1016        CELP          4.8               US-DoD/NATO    1989
MELP           MELP          2.4               US-DoD/NATO    1996
Coding Methodologies
• Coding Methodologies
– Waveform coding
– Hybrid coding
Classes according to Coding Type
[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1-64 kbps) for parametric coders, hybrid coders, and waveform-approximating coders]
Coding Standards
[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1-64 kbps) for individual standards: FS1015, MELP, GSM/2, G.723.1, GSM FR, G.729, GSM EFR, G.728, G.726, G.711, and linear PCM, grouped into parametric and waveform-approximating coders]
PCM Coding
[Block diagram: encoder with quantizer Q[.] and predictor P, and decoder with predictor P; signals x[n], x’[n], x”[n], d’[n], i[n], c[n]]
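The diagram for this slide appears to place a quantizer Q[.] inside a prediction loop with a predictor P. A minimal sketch of that idea is given below as first-order DPCM with a uniform quantizer. The step size, predictor coefficient, test signal, and function names are all assumptions, not values from the slides.

```python
import math

def dpcm_encode(x, step=0.05, coeff=0.9):
    """Quantize the prediction error; the predictor uses reconstructed samples."""
    codes = []
    recon_prev = 0.0
    for sample in x:
        pred = coeff * recon_prev           # prediction from the past output
        idx = int(round((sample - pred) / step))
        codes.append(idx)
        recon_prev = pred + idx * step      # same value the decoder rebuilds
    return codes

def dpcm_decode(codes, step=0.05, coeff=0.9):
    """Rebuild the signal from quantizer indices with the same predictor."""
    out = []
    recon_prev = 0.0
    for idx in codes:
        recon_prev = coeff * recon_prev + idx * step
        out.append(recon_prev)
    return out

# Hypothetical test signal: 200 Hz sine sampled at 8 kHz.
signal = [0.5 * math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(160)]
codes = dpcm_encode(signal)
decoded = dpcm_decode(codes)
max_err = max(abs(s - d) for s, d in zip(signal, decoded))
print(max_err <= 0.025 + 1e-12)
```

Because the encoder predicts from its own reconstructed output, quantization error does not accumulate at the decoder and stays within half a quantizer step.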
Prediction in the context of Coding
[Figure: two waveform panels; Amplitude versus Time (ms), 0-20 ms]
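[Block diagram: Excitation Generator → Synthesis Filter → difference with the input speech (+/-) → Error Minimization, which feeds back to the Excitation Generator]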
CELP Decoder
[Block diagram: Excitation parameters → Excitation Generator → G/A(z) → Synthesized speech]
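The analysis-by-synthesis idea behind the encoder loop and this decoder can be sketched as an exhaustive codebook search: each candidate excitation is passed through 1/A(z), and the index and gain that minimize the squared error against the target are kept. This is a simplified sketch under stated assumptions: a random Gaussian codebook, a hypothetical 40-sample subframe, zero filter state, and no perceptual weighting or adaptive codebook, all of which a real CELP coder would handle differently.

```python
import random

def synthesize(excitation, a):
    """Zero-state filtering through 1/A(z), A(z) = 1 + a[0] z^-1 + ..."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * out[n - k]
        out.append(acc)
    return out

def search_codebook(target, codebook, a):
    """Return (index, gain) minimizing ||target - gain * synthesize(code)||^2."""
    best_idx, best_gain, best_err = None, 0.0, float("inf")
    tt = sum(t * t for t in target)
    for idx, code in enumerate(codebook):
        y = synthesize(code, a)
        ty = sum(t * v for t, v in zip(target, y))
        yy = sum(v * v for v in y)
        if yy == 0.0:
            continue
        err = tt - ty * ty / yy             # error at the optimal gain ty/yy
        if err < best_err:
            best_idx, best_gain, best_err = idx, ty / yy, err
    return best_idx, best_gain

rng = random.Random(0)
SUBFRAME = 40                               # hypothetical subframe length
codebook = [[rng.gauss(0.0, 1.0) for _ in range(SUBFRAME)] for _ in range(64)]
a = [-1.6, 0.81]                            # hypothetical LP coefficients

# Target built from a known entry so the search result can be checked.
target = [2.0 * v for v in synthesize(codebook[17], a)]
index, gain = search_codebook(target, codebook, a)
print(index, round(gain, 3))
```

The decoder needs only the transmitted index, gain, and LP coefficients: it regenerates the excitation from its own copy of the codebook and filters it through G/A(z).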
• Speech Quality
– Objective measures
• Segmental SNR
• Itakura-Saito distance measure
• Spectral distortion (SD)
• ITU-T Recommendation P.862 (PESQ)
– Subjective measures
• Mean opinion score (MOS)
• Diagnostic Rhyme Test (DRT)
• Diagnostic Acceptability Measure (DAM)
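Of the objective measures listed, segmental SNR is the simplest to compute: the SNR is evaluated per frame in dB and then averaged. The sketch below assumes 20 ms frames at 8 kHz and clamps each frame to the range -10 to 35 dB, a common convention that is not stated in the slides; the function name is hypothetical.

```python
import math

def segmental_snr_db(reference, decoded, frame_len=160, lo=-10.0, hi=35.0):
    """Average of per-frame SNRs in dB, each clamped to [lo, hi] (assumed limits)."""
    snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        dec = decoded[start:start + frame_len]
        sig = sum(r * r for r in ref)
        noise = sum((r - d) ** 2 for r, d in zip(ref, dec))
        if sig == 0.0:
            continue                        # skip silent frames
        snr = hi if noise == 0.0 else 10.0 * math.log10(sig / noise)
        snrs.append(min(max(snr, lo), hi))
    return sum(snrs) / len(snrs) if snrs else float("nan")

# Hypothetical example: the "decoded" signal is the reference scaled by 0.9,
# so the error is 10% of the signal amplitude in every frame.
reference = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(800)]
decoded = [0.9 * v for v in reference]
seg_snr = segmental_snr_db(reference, decoded)
print(round(seg_snr, 3))
```

Averaging in the dB domain gives quiet frames the same weight as loud ones, which is the main difference from a single SNR computed over the whole utterance.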
Absolute Category Rating Tests (MOS)
Excellent 5
Good 4
Fair 3
Poor 2
Bad 1
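A MOS is the arithmetic mean of such category ratings over all listeners and test items. The sketch below uses the five-point scale above with a made-up set of votes; the function name and the votes are hypothetical.

```python
ACR_SCALE = {"Excellent": 5, "Good": 4, "Fair": 3, "Poor": 2, "Bad": 1}

def mean_opinion_score(ratings):
    """Average the numeric values of a list of category ratings."""
    scores = [ACR_SCALE[r] for r in ratings]
    return sum(scores) / len(scores)

# Hypothetical votes from five listeners.
votes = ["Good", "Good", "Excellent", "Fair", "Good"]
print(mean_opinion_score(votes))  # 4.0
```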
Diagnostic Rhyme Test
• Complexity
– Computational complexity
• Simplex/half-duplex/full-duplex real-time performance on a single DSP
• Fixed point vs. floating point
• CELP coders are computationally complex
– Memory requirement
• Storage of look-up tables, codebooks etc.
Timing Diagram for various Coding Delays
[Timing diagram, rows from top to bottom]
– Buffer input speech frames 1-5: algorithmic buffering delay
– Encode frames 1-4: encoder processing delay
– Transmit bits of frames 1-3: bit transmission delay
– Decode frames 1-3: decoder processing delay
– Play back decoded speech frames 1-2
The sum of the encoder and decoder processing delays is the total processing delay.
The total one-way coding delay is marked across these stages.
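The delay components named in the timing diagram add up as sketched below. The millisecond values are hypothetical and chosen only to make the arithmetic concrete; the function name is also an assumption.

```python
def coding_delays_ms(buffering_ms, encode_ms, transmit_ms, decode_ms):
    """Return (total processing delay, total one-way coding delay) in ms."""
    processing = encode_ms + decode_ms          # encoder + decoder processing
    one_way = buffering_ms + processing + transmit_ms
    return processing, one_way

# Hypothetical figures: 20 ms frame buffering, 8 ms encoding,
# 20 ms bit transmission, 4 ms decoding.
processing_ms, one_way_ms = coding_delays_ms(20, 8, 20, 4)
print(processing_ms, one_way_ms)  # 12 52
```

Because the buffering term scales with frame length, coders with long frames pay for their lower bit rate with a larger one-way delay.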