Research, product development, and new applications of speech coding
have all advanced dramatically in the past decade. Research into new coding methods and enhancement of existing approaches has proceeded at a fast pace, fueled by the market demand for improved coders. Digital cellular and satellite telephony, video conferencing, voice messaging, and Internet voice communications are just a few of the prominent everyday applications that are driving the demand. The goal is higher quality speech at a lower transmission bandwidth. The need will continue to grow with the expansion of remote verbal communication. In all modern speech coders, the inherently analog speech signal is first digitized. This sampling process transforms the analog electrical variations from the recording microphone into a sequence of numbers. The sequence is processed by an encoder to produce the coded represen- tation. The coded representation is either transmitted to the decoder, or stored for future decoding. The decoder reconstructs an approximation of the original speech signal. As such, speech coding in general is a lossy compression. In the most simple example, conventional Pulse Code Modulation (PCM) (the method used for digital telephone transmission for many years) relies upon sampling the signal and quantizing it using a suffi- ciently large range of numbers so that the the error in the digital ap- proximation of the signal is not objectionable. This coding method strives for accurate representation of the time waveform. The amount of information needed to code speech signals can be fur- ther reduced by taking advantage of the fact that speech is generated by the human vocal system. The process of simulating constraints of the human vocal system to perform speech coding is called vocoding (from voice coding). Efficient vocoders achieve high speech intelligibil- ity at much lower bit rates than would be possible by coding the speech
waveform directly. In the majority of vocoders, the speech signal is seg- mented, and each segment is considered to be the output response of the vocal tract to an input excitation signal. The excitation is modeled as a periodic pulse train, random noise, or an appropriate combination of both. For every short-time segment of speech, the excitation para- meters and the parameters of the vocal tract model are determined and transmitted as the coded speech. The decoder relies on the implicit un- derstanding of the vocal tract and excitation models to reconstruct the speech. Some vocoders perform a frequency analysis. Manipulation of the fre- quency representation of the data enables easy implementation of many speech processing functions, including identification and elimination of perceptually unimportant signal components. The unimportant infor- mation can be removed, instead of wasting precious transmission data space by coding it. That saved transmission space can be reallocated to improve the speech quality of more perceptually crucial regions of the signal. Therefore, by coupling the effects of the human auditory system with those of the human vocal system, significant gains in the quality of reproduced speech can be realized for a given transmission bandwidth. Beyond the bit rate/quality tradeoff, a practical speech coder must limit the computational complexity of the algorithm to a reasonable level for the desired application. For speech coding applications aimed at real-time or conversational communication, the overall delay must remain acceptably small. The delay is the time lag from when the speech signal was spoken at the input to when it is heard at the output. The total delay is the sum of the transmission delay of the communications system, the computational delays of the encoder and decoder, and the algorithmic delay associated with the coding method. Speech coding can be summarized as the endeavor to reduce the trans- mission bandwidth (bit rate) of the coded speech through an efficient, minimal representation of the speech signal while maintaining an ac- ceptable level of perceived quality of the decoded speech. This book covers the basics of speech production, perception, and digital signal analysis techniques. These serve as building blocks to understand the various speech coding methods and their particular im- plementations. The presentations assume no prior knowledge of speech processing and are designed to be accessible to anyone with a technical background. Chapter 2 provides a brief overview of speech production mechanisms and examples of speech data. This chapter introduces the concept of
separating the speech signal into vocal tract and excitation components. Chapter 3 begins with sampling theory and continues with basic digital signal processing techniques that are applied in most speech analyses. Linear Prediction (LP) is explained in Chapter 4. LP modeling of the vocal tract is a primary processing step of many speech coders. Chapter 5 continues with the speech-specific processing algorithms that estimate the pitch period, or fundamental frequency, of the excitation. Accurate pitch estimation is critical to the performance of most of the newer low bit-rate systems because much of the subsequent processing depends on the pitch estimate. Human auditory processing is outlined in Chapter 6 to give a better understanding of speech perception. Chapter 7 elaborates on scalar and vector quantization, pulse code modulation, and waveform coding. Chapter 8 discusses the evaluation of the quality of encoded/decoded speech. Chapter 9 begins the discussion of vocoders by describing several simple types. Chapter 10 continues the presentation with LP-based vocoders that employ analysis-by-synthesis to estimate the excitation signal. In Chapter 11, the current leading ap- proaches to low bit-rate coding are outlined. These methods model the excitation as a mixture of harmonic and noise-like components. Chapter 12 explains how the perceptual considerations of Chapter 6 can be ap- plied to improve coder performance. Appendix A lists Internet sites that contain documentation, encoded/decoded speech examples, and software implementations for several speech coding standards.