You are on page 1of 4

Goldberg, R. G.

"Introduction"
A Practical Handbook of Speech Coders
Ed. Randy Goldberg
Boca Raton: CRC Press LLC, 2000

© 2000 CRC Press LLC


Chapter 1

Introduction

Research, product development, and new applications of speech coding


have all advanced dramatically in the past decade. Research into new
coding methods and enhancement of existing approaches has proceeded
at a fast pace, fueled by the market demand for improved coders. Digital
cellular and satellite telephony, video conferencing, voice messaging, and
Internet voice communications are just a few of the prominent everyday
applications that are driving the demand. The goal is higher quality
speech at a lower transmission bandwidth. The need will continue to
grow with the expansion of remote verbal communication.
In all modern speech coders, the inherently analog speech signal is
first digitized. This sampling process transforms the analog electrical
variations from the recording microphone into a sequence of numbers.
The sequence is processed by an encoder to produce the coded represen-
tation. The coded representation is either transmitted to the decoder, or
stored for future decoding. The decoder reconstructs an approximation
of the original speech signal. As such, speech coding in general is a lossy
compression.
In the most simple example, conventional Pulse Code Modulation
(PCM) (the method used for digital telephone transmission for many
years) relies upon sampling the signal and quantizing it using a suffi-
ciently large range of numbers so that the the error in the digital ap-
proximation of the signal is not objectionable. This coding method
strives for accurate representation of the time waveform.
The amount of information needed to code speech signals can be fur-
ther reduced by taking advantage of the fact that speech is generated
by the human vocal system. The process of simulating constraints of
the human vocal system to perform speech coding is called vocoding
(from voice coding). Efficient vocoders achieve high speech intelligibil-
ity at much lower bit rates than would be possible by coding the speech

© 2000 CRC Press LLC


waveform directly. In the majority of vocoders, the speech signal is seg-
mented, and each segment is considered to be the output response of
the vocal tract to an input excitation signal. The excitation is modeled
as a periodic pulse train, random noise, or an appropriate combination
of both. For every short-time segment of speech, the excitation para-
meters and the parameters of the vocal tract model are determined and
transmitted as the coded speech. The decoder relies on the implicit un-
derstanding of the vocal tract and excitation models to reconstruct the
speech.
Some vocoders perform a frequency analysis. Manipulation of the fre-
quency representation of the data enables easy implementation of many
speech processing functions, including identification and elimination of
perceptually unimportant signal components. The unimportant infor-
mation can be removed, instead of wasting precious transmission data
space by coding it. That saved transmission space can be reallocated to
improve the speech quality of more perceptually crucial regions of the
signal. Therefore, by coupling the effects of the human auditory system
with those of the human vocal system, significant gains in the quality of
reproduced speech can be realized for a given transmission bandwidth.
Beyond the bit rate/quality tradeoff, a practical speech coder must
limit the computational complexity of the algorithm to a reasonable
level for the desired application. For speech coding applications aimed
at real-time or conversational communication, the overall delay must
remain acceptably small. The delay is the time lag from when the speech
signal was spoken at the input to when it is heard at the output. The
total delay is the sum of the transmission delay of the communications
system, the computational delays of the encoder and decoder, and the
algorithmic delay associated with the coding method.
Speech coding can be summarized as the endeavor to reduce the trans-
mission bandwidth (bit rate) of the coded speech through an efficient,
minimal representation of the speech signal while maintaining an ac-
ceptable level of perceived quality of the decoded speech.
This book covers the basics of speech production, perception, and
digital signal analysis techniques. These serve as building blocks to
understand the various speech coding methods and their particular im-
plementations. The presentations assume no prior knowledge of speech
processing and are designed to be accessible to anyone with a technical
background.
Chapter 2 provides a brief overview of speech production mechanisms
and examples of speech data. This chapter introduces the concept of

© 2000 CRC Press LLC


separating the speech signal into vocal tract and excitation components.
Chapter 3 begins with sampling theory and continues with basic digital
signal processing techniques that are applied in most speech analyses.
Linear Prediction (LP) is explained in Chapter 4. LP modeling of the
vocal tract is a primary processing step of many speech coders. Chapter
5 continues with the speech-specific processing algorithms that estimate
the pitch period, or fundamental frequency, of the excitation. Accurate
pitch estimation is critical to the performance of most of the newer low
bit-rate systems because much of the subsequent processing depends on
the pitch estimate. Human auditory processing is outlined in Chapter 6
to give a better understanding of speech perception.
Chapter 7 elaborates on scalar and vector quantization, pulse code
modulation, and waveform coding. Chapter 8 discusses the evaluation of
the quality of encoded/decoded speech. Chapter 9 begins the discussion
of vocoders by describing several simple types. Chapter 10 continues the
presentation with LP-based vocoders that employ analysis-by-synthesis
to estimate the excitation signal. In Chapter 11, the current leading ap-
proaches to low bit-rate coding are outlined. These methods model the
excitation as a mixture of harmonic and noise-like components. Chapter
12 explains how the perceptual considerations of Chapter 6 can be ap-
plied to improve coder performance. Appendix A lists Internet sites
that contain documentation, encoded/decoded speech examples, and
software implementations for several speech coding standards.

© 2000 CRC Press LLC

You might also like