Table of Contents
List of Figures
List of Tables
List of Abbreviations
Acknowledgement
Abstract
1 INTRODUCTION
  1.1 Motivation
  1.2 Overview of Speech Coding
  1.3 Applications of Speech Coders
  1.4 Objective of Present Work
  1.5 Report Organization
2 LITERATURE REVIEW
  2.1 Introduction
  2.2 Basic Issues in Speech Coding
  2.3 Speech Coding Techniques and Functionalities
  2.4 Speech Coding Standards
3 PRESENT WORK
  3.1 Structure of Speech Coders
  3.2 Classification of Speech Coders
    3.2.1 Classification by Bit-Rate
    3.2.2 Classification by Coding Techniques
  3.3 About Algorithms
  3.4 Pulse Code Modulation
    3.4.1 Modulation
    3.4.2 Demodulation
    3.4.3 Digitization
  3.5 Differential Pulse Code Modulation
  3.6 Other Popular Algorithms
4 RESULTS AND DISCUSSIONS
  4.1 Implementation Details
  4.2 Results
5 CONCLUSION
6 FUTURE SCOPE
REFERENCES
List of Figures
Figure 3.1  Block diagram of a speech coding system
Figure 3.2  Block diagram of a speech coder
Figure 3.3  System for delay measurement
Figure 3.4  Illustration of the components of coding delay
Figure 3.5  Sampling and quantization of a signal (red) for 4-bit PCM
Figure 4.1  MATLAB simulation of PCM
Figure 4.2  MATLAB simulation of DPCM
List of Tables
Table 2.1  Summary of Major Speech Coding Standards
Table 3.1  Classification of speech coders according to bit-rate
Table 4.1  Results for Quantization Bits = 16 and Sampling Frequency = 8 kHz
Table 4.2  Results for Quantization Bits = 8 and Sampling Frequency = 8 kHz
Table 4.3  Results for Quantization Bits = 16 and Sampling Frequency = 16 kHz
List of Abbreviations
3G      Third Generation
AbS     Analysis-by-Synthesis
ACELP   Algebraic Code-Excited Linear Prediction
ACR     Absolute Category Rating
ADPCM   Adaptive Differential Pulse Code Modulation
CDMA    Code Division Multiple Access
CELP    Code-Excited Linear Prediction
DMOS    Degradation Mean Opinion Score
DoD     U.S. Department of Defense
DPCM    Differential Pulse Code Modulation
DSVD    Digital Simultaneous Voice and Data
DTAD    Digital Telephone Answering Device
GSM     Groupe Speciale Mobile
ICASSP  International Conference on Acoustics, Speech, and Signal Processing
IDFT    Inverse Discrete Fourier Transform
IEC     International Electrotechnical Commission
IEEE    Institute of Electrical and Electronics Engineers
IP      Internet Protocol
ITU     International Telecommunications Union
ITU–R   ITU–Radiocommunication Sector
ITU–T   ITU–Telecommunications Standardization Sector
MOS     Mean Opinion Score
NCS     National Communications System
PC      Personal Computer
PCM     Pulse Code Modulation
POTS    Plain Old Telephone Service
PSTN    Public Switched Telephone Network
RAM     Random Access Memory
RC      Reflection Coefficient
RCR     Research and Development Center for Radio Systems of Japan
RMS     Root Mean Square
ROM     Read Only Memory
SNR     Signal to Noise Ratio
TDMA    Time Division Multiple Access
TI      Texas Instruments
TIA     Telecommunications Industry Association
TTS     Text to Speech
UMTS    Universal Mobile Telecommunications System
VBR     Variable Bit-rate
VoIP    Voice over Internet Protocol
VSELP   Vector Sum Excited Linear Prediction
We take this opportunity to remember and acknowledge the cooperation, goodwill, and support, both moral and technical, extended by the several individuals involved with this report. We shall always cherish our associations with them. We wish to accord our sincere thanks to Prof. Ram Naresh Sharma, HOD of the Electrical Engineering Department, for providing the opportunity to execute this project. We extend our kind thanks to our guide, Mr. Himesh Handa, for cooperating with us until the completion of this project. We would also like to acknowledge the support and infrastructure provided by the college to complete this work.
With the advancement of technology, the application of low bit-rate speech coders to civilian and military communications, as well as to computer-related voice applications, is progressing substantially. Today, speech coders have become essential components of telecommunications and the multimedia infrastructure. Central to this progress has been the development of new speech coders capable of producing high-quality speech at low data rates. Most of these coders incorporate mechanisms to represent the spectral properties of speech, provide for speech waveform matching, and "optimize" the coder's performance for the human ear. A number of these coders have already been adopted in national and international cellular telephony standards. Commercial systems that rely on efficient speech coding include cellular communication, voice over Internet protocol (VoIP), videoconferencing, electronic toys, archiving, and digital simultaneous voice and data (DSVD), as well as numerous PC-based games and multimedia applications. In mobile communication systems, service providers are continuously met with the challenge of accommodating more users within a limited allocated bandwidth; for this reason, manufacturers and service providers are continuously in search of low bit-rate speech coders that deliver toll-quality speech. The objective of this project is to study commonly used speech coding algorithms. The report starts with a description of these speech coders, then presents our implementation results, and finally gives concluding remarks followed by comments on future research in this area.
Speech coding is the process of obtaining a compact representation of voice signals for efficient transmission over band-limited wired and wireless channels and/or storage. In general, speech coding is a procedure for representing a digitized speech signal using as few bits as possible while maintaining a reasonable level of speech quality. A less common name with the same meaning is speech compression. Speech coding has matured to the point where it now constitutes an important application area of signal processing.
In the era of third-generation (3G) wireless personal communications standards, despite the emergence of broadband access network standard proposals, the most important mobile radio services are still based on voice communications. Even when the predicted surge of wireless data and Internet services becomes a reality, voice will remain the most natural means of human communication, although it may be delivered via the Internet, predominantly after compression. Due to the increasing demand for speech communication, speech coding technology has received increasing levels of interest from the research, standardization, and business communities. Advances in microelectronics and the vast availability of low-cost programmable processors and dedicated chips have enabled rapid technology transfer from research to product development; this encourages the research community to investigate alternative schemes for speech coding, with the objectives of overcoming deficiencies and limitations. The standardization community pursues the establishment of standard speech coding methods for various applications that will be widely accepted and implemented by the industry. The business communities capitalize on the ever-increasing demand and opportunities in the consumer, corporate, and network environments for speech processing products.
1.2 OVERVIEW OF SPEECH CODING
This section describes the structure, properties, and applications of speech coding technology. Speech coding is the art of creating a minimally redundant representation of the speech signal that can be efficiently transmitted or stored in digital media, and of decoding the signal with the best possible perceptual quality. Like any other continuous-time signal, speech may be represented digitally through the processes of sampling and quantization; speech is typically quantized using either 16-bit uniform or 8-bit companded quantization. Like many other signals, however, a sampled speech signal contains a great deal of information that is either redundant (nonzero mutual information between successive samples in the signal) or perceptually irrelevant (information that is not perceived by human listeners). Most telecommunications coders are lossy, meaning that the synthesized speech is perceptually similar to the original but may be physically dissimilar. Speech coding is performed using numerous steps or operations specified as an algorithm. An algorithm is any well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output. An algorithm is thus a sequence of computational steps that transform the input into the output. Many signal processing problems, including speech coding, can be formulated as a well-specified computational problem; hence, a particular coding scheme can be defined as an algorithm. In general, an algorithm is specified with a set of instructions providing the computational steps needed to perform a task. A computer or processor can then execute these instructions to complete the coding task. The instructions can also be translated into the structure of a digital circuit, carrying out the computation directly at the hardware level.
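The 8-bit companded quantization mentioned above can be illustrated with the continuous μ-law companding curve (actual G.711 codecs use a segmented approximation that differs in detail). The sketch below, written in Python/NumPy rather than the MATLAB used elsewhere in this report, compares uniform and companded 8-bit quantization; the function names and the μ = 255 constant follow common convention but are otherwise illustrative:

```python
import numpy as np

MU = 255.0  # mu-law parameter conventionally used in North American telephony

def mu_law_compress(x, mu=MU):
    """Compand a signal in [-1, 1] so small amplitudes get finer resolution."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    """Invert the companding curve."""
    return np.sign(y) * (np.power(1.0 + mu, np.abs(y)) - 1.0) / mu

def quantize(x, bits):
    """Uniform quantizer for signals in [-1, 1]."""
    step = 2.0 / (2 ** bits)
    return np.clip(np.round(x / step) * step, -1.0, 1.0 - step)

x = np.array([0.01, -0.02, 0.5, -0.8])   # mostly small-amplitude samples
uniform_8 = quantize(x, 8)                # plain 8-bit uniform quantization
companded_8 = mu_law_expand(quantize(mu_law_compress(x), 8))
```

For the small-amplitude samples, the companded quantizer's error is roughly an order of magnitude below the uniform quantizer's, which is why telephony codecs compand before coarse quantization.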
1.3 APPLICATIONS OF SPEECH CODERS
Speech coding has an important role in modern voice-enabled technology, particularly for digital speech communication, where quality and complexity have a direct impact on the marketability and cost of the underlying products or services. There are many speech coding standards designed to suit the needs of a given application. More recently, with the explosive growth of the Internet, the potential market of voice over Internet protocol (voice over IP, or VoIP) has lured many companies to develop products and services around the concept. Speech coding will play a central role in this revolution. Another, smaller-scale area of application is voice storage or digital recording, with some outstanding representatives being the digital telephone answering device (DTAD) and solid-state recorders. For these products to be competitive in the marketplace, their costs must be driven to a minimum. By compressing the digital speech signal before storage, longer-duration voice messages can be recorded for a given amount of memory, leading to improved cost effectiveness. Techniques developed for speech coding have also been applied to other areas such as speech synthesis, audio coding, speech recognition, and speaker recognition. Given the weighty position that speech coding occupies in modern technology, it will remain at the center of attention for years to come.
1.4 OBJECTIVE OF PRESENT WORK
The main objectives of the project can be divided into three goals:
• To study the basics of speech coding.
• To design PCM and DPCM speech coders in MATLAB that are capable of coding and decoding an input speech signal.
• To compare the PCM and DPCM coders in terms of speech quality, coding delay, error, etc.
1.5 REPORT ORGANIZATION
The report is divided into six chapters. Chapter 1 provides an introduction and an overview of the subjects covered, with references to various aspects, an overview, and applications of speech coding. Chapter 2 is a literature review in which speech coding issues, coding techniques, and standards are discussed. Chapter 3 deals with the present work: speech coding is explained in detail and the PCM and DPCM techniques are described. Chapter 4 contains the implementation results, in which the PCM and DPCM coders are compared. Chapter 5 presents the conclusions of our project work. Finally, Chapter 6 is concerned with the future scope of speech coding.
The history of audio and music compression began in the 1930s with research into pulse-code modulation (PCM) and PCM coding. Compression of digital audio was started in the 1960s by telephone companies concerned with the cost of transmission bandwidth. The 1990s saw improvements in these earlier algorithms and an increase in compression ratios at given audio quality levels. Speech compression is often referred to as speech coding, which is defined as a method for reducing the amount of information needed to represent a speech signal. Most forms of speech coding are based on a lossy algorithm. Lossy algorithms are considered acceptable when encoding speech because the loss of quality is often undetectable to the human ear.
Speech coding is fundamental to the operation of the public switched telephone network (PSTN), videoconferencing systems, digital cellular communications, and emerging voice over Internet protocol (VoIP) applications. The goal of speech coding is to represent speech in digital form with as few bits as possible while maintaining the intelligibility and quality required for the particular application. Interest in speech coding is motivated by the evolution to digital communications and the requirement to minimize bit rate and, hence, conserve bandwidth. There is always a tradeoff between lowering the bit rate and maintaining the delivered voice quality and intelligibility; however, depending on the application, many other constraints must also be considered, such as complexity, delay, and performance with bit errors or packet losses. It is possible today, and likely in the near future, that our day-to-day voice communications will involve multiple hops across heterogeneous networks. This is a considerable departure from the plain old telephone service (POTS) on the PSTN, and indeed, these future voice connections will differ greatly even from the digital cellular calls connected through the PSTN today. As the networks supporting our voice calls become less homogeneous and include more wireless links, many new challenges and opportunities emerge. There was almost exponential growth in speech coding standards in the 1990s for a wide range of networks and applications, including the PSTN, digital cellular, and multimedia streaming. In order to compare the various speech coding methods and standards, it is necessary to have methods for establishing the quality and intelligibility produced by a speech coder. It is a difficult task to find objective measures of speech quality, and often the only acceptable approach is to perform subjective listening tests. However, there have been some recent successes in developing objective quantities, experimental procedures, and mathematical expressions that have a good correlation with speech quality and intelligibility.
2.2 BASIC ISSUES IN SPEECH CODING
Speech and audio coding can be classified according to the bandwidth occupied by the input and the reproduced source. Narrowband or telephone-bandwidth speech occupies the band from 200 to 3400 Hz, while wideband speech is contained in the range of 50 Hz to 7 kHz. High-quality audio is generally taken to cover the range of 20 Hz to 20 kHz. Given a particular source, the classic tradeoff in lossy source compression is rate versus distortion: the higher the rate, the smaller the average distortion in the reproduced signal. Of course, since a higher bit rate implies a greater bandwidth requirement, the goal is always to minimize the rate required to satisfy the distortion constraint. For speech coding, we are interested in achieving a quality as close to that of the original speech as possible. Encompassed in the term quality are intelligibility, speaker identification, and naturalness. Absolute category rating (ACR) tests are subjective tests of speech quality and involve listeners assigning a category and rating for each speech utterance according to the classifications Excellent (5), Good (4), Fair (3), Poor (2), and Bad (1). The average for each utterance over all listeners is the Mean Opinion Score (MOS). Although important, the MOS values obtained by listening to isolated utterances do not capture the dynamics of conversational voice communications in the various network environments. It is intuitive that speech codecs should be tested within the environment, and while executing the tasks, for which they are designed. Thus, since we are interested in conversational (two-way) voice communications, a more realistic test would be conducted in this scenario. Recently, the perceptual evaluation of speech quality (PESQ) method was developed to provide an assessment of speech codec performance in conversational voice communications. PESQ has been standardized by the ITU-T as P.862 and can be used to generate MOS values for both narrowband and wideband speech.
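The ACR-to-MOS averaging described above is simple enough to sketch directly; this illustrative Python helper (the function name is ours, not from any standard) averages one utterance's ratings:

```python
def mean_opinion_score(ratings):
    """Average ACR ratings (1 = Bad ... 5 = Excellent) into a MOS."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ACR ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)

# e.g. ten listeners rating a single utterance
mos = mean_opinion_score([4, 4, 5, 3, 4, 4, 5, 4, 3, 4])  # 4.0 for this data
```

In a real test campaign the MOS would be averaged over many utterances, talkers, and listening conditions, but the per-utterance arithmetic is exactly this.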
2.3 SPEECH CODING TECHNIQUES AND FUNCTIONALITIES
The most common approaches to narrowband speech coding today center around two paradigms, namely, waveform-following coders and analysis-by-synthesis methods. Waveform-following coders attempt to reproduce the time-domain speech waveform as accurately as possible, while analysis-by-synthesis methods utilize the linear prediction model and a perceptual distortion measure to reproduce only those characteristics of the input speech that are determined to be most important. Another approach to speech coding breaks the speech into separate frequency bands, called subbands, and then codes these subbands separately, perhaps using a waveform coder or analysis-by-synthesis coding, for reconstruction and recombination at the receiver. Extending the resolution of the frequency-domain decomposition leads to transform coding, wherein a transform is performed on a frame of input speech and the resulting transform coefficients are quantized and transmitted; the speech is reconstructed from the inverse transform. A discussion of the class of speech coders called vocoders, or purely parametric coders, is not included, as they are outside the scope of this project and have a more limited range of applications today.
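The transform coding idea sketched above, transforming a frame and quantizing the coefficients, can be illustrated with a toy Python/NumPy example. This is not any standardized transform coder; the DFT, the frame size, and the coefficient quantizer are all illustrative choices:

```python
import numpy as np

def transform_encode(frame, bits=8):
    """Toy transform coder: DFT the frame, then coarsely quantize coefficients."""
    coeffs = np.fft.rfft(frame)
    scale = max(np.max(np.abs(coeffs)), 1e-12)   # normalize coefficients to [-1, 1]
    step = 2.0 / (2 ** bits)
    quantized = np.round(coeffs / scale / step) * step  # quantizes real and imag parts
    return quantized, scale

def transform_decode(quantized, scale, n):
    """Reconstruct the time-domain frame from the quantized coefficients."""
    return np.fft.irfft(quantized * scale, n=n)

n = 160                                        # one 20 ms frame at 8 kHz
t = np.arange(n) / 8000.0
frame = 0.5 * np.sin(2 * np.pi * 440.0 * t)    # a 440 Hz tone as test input
quantized, scale = transform_encode(frame)
reconstructed = transform_decode(quantized, scale, n)
```

A practical transform coder would additionally allocate different bit budgets to different coefficients based on perceptual importance; here every coefficient gets the same quantizer.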
2.4 SPEECH CODING STANDARDS
Standards exist because there are strong needs for common means of communication: it is in everyone's best interest to be able to develop and utilize products and services based on the same reference. Standards bodies are organizations responsible for overseeing the development of standards for a particular application. Brief descriptions of some well-known standards bodies are given here.
International Telecommunications Union (ITU): The Telecommunications Standardization Sector of the ITU (ITU-T) is responsible for creating speech coding standards for network telephony. This includes both wired and wireless networks.
Telecommunications Industry Association (TIA): The TIA is in charge of promulgating speech coding standards for specific applications. It is part of the American National Standards Institute (ANSI). The TIA has successfully developed standards for North American digital cellular telephony, including time division multiple access (TDMA) and code division multiple access (CDMA) systems.
European Telecommunications Standards Institute (ETSI): The ETSI has memberships from European countries and companies and is mainly an organization of equipment manufacturers. ETSI is organized by application; the most influential group in speech coding is the Groupe Speciale Mobile (GSM), which has several prominent standards under its belt.
United States Department of Defense (DoD): The DoD is involved with the creation of speech coding standards, known as U.S. Federal standards, mainly for military applications.
Research and Development Center for Radio Systems of Japan (RCR): Japan’s digital cellular standards are created by the RCR.
Table 2.1 Summary of Major Speech Coding Standards
The goal of all speech coding systems is to transmit speech with the highest possible quality using the least possible channel capacity. In general, there is a positive correlation between coder bit-rate efficiency and the algorithmic complexity required to achieve it: the more complex an algorithm, the greater its processing delay and implementation cost. A speech coder converts a digitized speech signal into a coded representation, which is usually transmitted in frames. A speech decoder receives coded frames and synthesizes reconstructed speech. Standards typically dictate the input–output relationships of both coder and decoder. The input–output relationship is specified using a reference implementation, but novel implementations are allowed, provided that input–output equivalence is maintained. Speech coders differ primarily in bit-rate (measured in bits per sample or bits per second), complexity (measured in operations per second), delay (measured in milliseconds between recording and playback), and the perceptual quality of the synthesized speech.
3.1 STRUCTURE OF SPEECH CODERS
Figure 3.1 shows the block diagram of a speech coding system. The continuous-time analog speech signal from a given source is digitized by a standard connection of
Figure 3.1 Block diagram of a speech coding system

filter (eliminates aliasing), sampler (discrete-time conversion), and analog-to-digital converter (uniform quantization is assumed). The output is a discrete-time speech signal whose sample values are also discretized. This signal is referred to as digital speech. Traditionally, most speech coding systems were designed to support telecommunication applications, with frequency content limited to between 300 and 3400 Hz. According to the Nyquist theorem, the sampling frequency must be at least twice the bandwidth of the continuous-time signal in order to avoid aliasing; a value of 8 kHz is commonly selected as the standard sampling frequency for speech signals. To convert the analog samples to a digital format using uniform quantization while maintaining toll quality [Jayant and Noll, 1984] (that is, digital speech roughly indistinguishable from the band-limited input), more than 8 bits/sample are necessary. The use of 16 bits/sample provides a quality that is considered high. Throughout this report, the following parameters are assumed for the digital speech signal:

Sampling frequency = 8 kHz
Number of bits per sample = 16

This gives rise to
Bit-rate = 8 kHz × 16 bits/sample = 128 kbps

The above bit-rate, also known as the input bit-rate, is what the source encoder attempts to reduce (Figure 3.1). The output of the source encoder represents the encoded digital speech and in general has a substantially lower bit-rate than the input. The linear prediction coding algorithm, for instance, has an output rate of 2.4 kbps, a reduction of more than 53 times with respect to the input. The encoded digital speech data is further processed by the channel encoder, which provides error protection to the bit-stream before transmission over the communication channel, where noise and interference can degrade the reliability of the transmitted data. Even though in Figure 3.1 the source encoder and channel encoder are separated, it is also possible to implement them jointly so that source and channel encoding are done in a single step. The channel decoder processes the error-protected data to recover the encoded data, which is then passed to the source decoder to generate the output digital speech signal at the original rate. This output digital speech signal is converted to continuous-time analog form through standard procedures: digital-to-analog conversion followed by anti-aliasing filtering. The input speech (a discrete-time signal having a bit-rate of 128 kbps) enters the encoder to produce the encoded bit-stream, or compressed speech data. The bit-rate of the bit-stream is normally much lower than that of the input speech.
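The bit-rate arithmetic above is easily verified; a minimal Python sketch (the function name is ours, purely for illustration):

```python
def bit_rate_kbps(sampling_khz, bits_per_sample):
    """Input bit-rate of digitized speech in kbit/s."""
    return sampling_khz * bits_per_sample

input_rate = bit_rate_kbps(8, 16)       # the 128 kbps reference rate
lpc_rate = 2.4                          # output rate of the LPC example above
compression = input_rate / lpc_rate     # a little over 53x, as stated
```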
Figure 3.2 Block diagram of a speech coder.

The decoder takes the encoded bit-stream as its input to produce the output speech signal, which is a discrete-time signal having the same rate as the input speech. As we will see later in this report, many diverse approaches can be used to design the encoder/decoder pair. Different methods provide differing speech quality and bit-rate, as well as implementation complexity. The encoder/decoder structure represented in Figure 3.2 is known as a speech coder, where the input speech is encoded to produce a
low-rate bit-stream. This bit-stream is input to the decoder, which constructs an approximation of the original signal.
Desirable Properties of a Speech Coder
The main goal of speech coding is either to maximize the perceived quality at a particular bit-rate, or to minimize the bit-rate for a particular perceptual quality. The appropriate bit-rate at which speech should be transmitted or stored depends on the cost of transmission or storage, the cost of coding (compressing) the digital speech signal, and the speech quality requirements. In almost all speech coders, the reconstructed signal differs from the original one. The bit-rate is reduced by representing the speech signal (or parameters of a speech production model) with reduced precision and by removing inherent redundancy from the signal, resulting therefore in a lossy coding scheme. Desirable properties of a speech coder include:
Low Bit-Rate: The lower the bit-rate of the encoded bit-stream, the less bandwidth is required for transmission, leading to a more efficient system. This requirement is in constant conflict with other good properties of the system, such as speech quality. In practice, a trade-off is found to satisfy the necessity of a given application.
High Speech Quality: The decoded speech should have a quality acceptable for the target application. There are many dimensions in quality perception, including intelligibility, naturalness, pleasantness, and speaker recognizability.
Robustness across Different Speakers / Languages: The underlying technique of the speech coder should be general enough to model different speakers (adult male, adult female, and children) and different languages adequately. Note that this is not a trivial task, since each voice signal has its unique characteristics.
Robustness in the Presence of Channel Errors: This is crucial for digital communication systems where channel errors will have a negative impact on speech quality.
Good Performance on Non-speech Signals (e.g., telephone signaling): In a typical telecommunication system, other signals might be present besides speech. Signaling tones such as dual-tone multi-frequency (DTMF) in keypad dialing, as well as music, are often encountered. Even though low bit-rate speech coders might not be able to reproduce all signals faithfully, they should not generate annoying artifacts when encountering these alternate signals.
Low Memory Size and Low Computational Complexity: In order for the speech coder to be practicable, costs associated with its implementation must be low; these include the amount of memory needed to support its operation, as well as computational demand. Speech coding researchers spend a great deal of effort to find out the most efficient realizations.
Low Coding Delay: In the process of speech encoding and decoding, delay is inevitably introduced: the time shift between the input speech at the encoder and the output speech at the decoder. An excessive delay creates problems with real-time two-way conversations, where the parties tend to "talk over" each other.
Coding Delay

Consider the delay measured using the topology shown in Figure 3.3. The delay obtained in this way is known as the coding delay, or one-way coding delay [Chen, 1995], which is given by the elapsed time from the instant a speech sample arrives at the encoder input to the instant when the same speech sample appears at the decoder output. The definition does not consider exterior factors, such as communication distance or equipment, which are not controllable by the algorithm designer.
Figure 3.3 System for delay measurement.
Figure 3.4 Illustration of the components of coding delay. Based on the definition, the coding delay can be decomposed into four major components (see Figure 3.4):
Encoder Buffering Delay: Many speech encoders require the collection of a certain number of samples before processing. For instance, typical linear prediction (LP)-based coders need to gather one frame of samples ranging from 160 to 240 samples, or 20 to 30 ms, before proceeding with the actual encoding process.
Encoder Processing Delay: The encoder consumes a certain amount of time to process the buffered data and construct the bit-stream. This delay can be shortened by increasing the computational power of the underlying platform and by utilizing efficient algorithms. The processing delay must be shorter than the buffering delay; otherwise the encoder will not be able to handle data from the next frame.
Transmission Delay: Once the encoder finishes processing one frame of input samples, the resultant bits representing the compressed bit-stream are transmitted to the decoder. Many transmission modes are possible and the choice depends on the particular system requirements.
Decoder Processing Delay: This is the time required to decode in order to produce one frame of synthetic speech. As for the case of the encoder processing delay, its upper limit is given by the encoder buffering delay, since a whole frame of synthetic speech data must be completed within this time frame in order to be ready for the next frame.
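The four components above add up to the one-way coding delay. A hedged Python sketch with hypothetical per-system numbers (only the buffering delay is fixed by the frame length; the other values are made up for illustration):

```python
def one_way_coding_delay_ms(frame_samples, fs_hz,
                            enc_proc_ms, trans_ms, dec_proc_ms):
    """Sum the four delay components of Figure 3.4; all times in milliseconds.

    The encoder buffering delay is fixed by the frame length; the text notes
    that both processing delays must stay below it for real-time operation.
    """
    buffering_ms = 1000.0 * frame_samples / fs_hz
    assert enc_proc_ms <= buffering_ms and dec_proc_ms <= buffering_ms
    return buffering_ms + enc_proc_ms + trans_ms + dec_proc_ms

# A 160-sample (20 ms) frame at 8 kHz with made-up processing/transmission times:
delay = one_way_coding_delay_ms(160, 8000,
                                enc_proc_ms=10.0, trans_ms=20.0, dec_proc_ms=5.0)
```

With these numbers the total is 20 + 10 + 20 + 5 = 55 ms, comfortably below the thresholds usually quoted for comfortable two-way conversation.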
3.2 CLASSIFICATION OF SPEECH CODERS
There is no single, clear-cut way to classify speech coders; this section presents some existing classification criteria. Speech coders can be classified according to their bit-rate or their coding technique; other approaches, such as single-mode versus multimode coders, also exist. The most popular and widely used classification criteria are:
• Bit-rate
• Coding technique
3.2.1 Classification by Bit-Rate

All speech coders are designed to reduce the reference bit-rate of 128 kbps toward lower values. Depending on the bit-rate of the encoded bit-stream, it is common to classify speech coders according to Table 3.1.

Table 3.1 Classification of speech coders according to bit-rate
A given method works well over a certain bit-rate range, but the quality of the decoded speech drops drastically if the bit-rate is decreased below a certain threshold. The minimum bit-rate that speech coders can achieve is limited by the information content of the speech signal. Judging from the recoverable message rate, from a linguistic perspective, for typical speech signals, it is reasonable to say that the minimum lies somewhere around 100 bps. Current coders can produce good quality at 2 kbps and above, suggesting that there is plenty of room for future improvement.

3.2.2 Classification by Coding Techniques

Waveform Coders: An attempt is made to preserve the original shape of the signal waveform, and hence the resultant coders can generally be applied to any signal source. These coders are better suited to high bit-rate coding, since performance drops sharply with decreasing bit-rate. In practice, these coders work best at bit-rates of 32 kbps and higher. Signal-to-noise ratio (SNR) can be utilized to measure the
quality of waveform coders. Some examples of this class include various kinds of pulse code modulation (PCM) and adaptive differential PCM (ADPCM).

Parametric Coders: Within the framework of parametric coders, the speech signal is assumed to be generated from a model, which is controlled by some parameters. During encoding, parameters of the model are estimated from the input speech signal, and these parameters are transmitted as the encoded bit-stream. This type of coder makes no attempt to preserve the original shape of the waveform, and hence SNR is a useless quality measure. The perceptual quality of the decoded speech is directly related to the accuracy and sophistication of the underlying model. Due to this limitation, the coder is signal specific, having poor performance for non-speech signals. There are several proposed models in the literature. The most successful, however, is based on linear prediction. In this approach, the human speech production mechanism is summarized using a time-varying filter, with the coefficients of the filter found using the linear prediction analysis procedure.

Hybrid Coders: As the name implies, a hybrid coder combines the strengths of a waveform coder with those of a parametric coder. Like a parametric coder, it relies on a speech production model; during encoding, parameters of the model are located. Additional parameters of the model are optimized in such a way that the decoded speech is as close as possible to the original waveform, with the closeness often measured by a perceptually weighted error signal. As in waveform coders, an attempt is made to match the original signal with the decoded signal in the time domain. This class dominates the medium bit-rate coders, with the code-excited linear prediction algorithm and its variants being the most outstanding representatives.
From a technical perspective, the difference between a hybrid coder and a parametric coder is that the former attempts to quantize or represent the excitation signal to the speech production model, which is transmitted as part of the encoded bit-stream. The latter achieves a low bit-rate by discarding all detailed information about the excitation signal; only coarse parameters are extracted. A hybrid coder tends to behave like a waveform coder at high bit-rates and like a parametric coder at low bit-rates, with fair to good quality at medium bit-rates.
3.3 ABOUT ALGORITHMS
A speech coder is generally specified as an algorithm, which is defined as a computational procedure that takes some input values and produces some output values. An algorithm can be implemented as software (i.e., a program that commands a processor) or as hardware (direct execution through digital circuitry). With the widespread availability of low-cost, high-performance digital signal processors (DSPs) and general-purpose microprocessors, many signal processing tasks that were once done with analog circuitry are now predominantly executed in the digital domain. The advantages of going digital are many: programmability, reliability, and the ability to handle very complex procedures, such as the operations involved in a speech coder, that could never have been realized in the analog world. In this section the various aspects of algorithmic implementation are explained.

The Reference Code

It is the trend for most standards bodies to come up with a reference source code for their standards, where code refers to the algorithm or program written in text form. The source code is written in some high-level programming language, with the C language being the most commonly used [Harbison and Steele, 1995]. In this reference code, the different components of the speech coding algorithm are implemented. Normally, there are two main functions, encode and decode, taking care of the operations of the encoder and decoder, respectively. The reference source code is very general and might not be optimized for speed or storage; therefore, it is an engineering task to adjust the code to suit a given platform. Since different processors have different strengths and weaknesses, the adjustment must be custom made; in many instances, this translates into assembly language programming. The task normally consists of changing certain parts of the algorithm so as to speed up the computational process or to reduce memory requirements.
Depending on the platform, the adjustment of the source code can be relatively easy or extremely hard; it may even be unrealizable if the available resources cannot cover the demands of the algorithm. A supercomputer, for instance, is a platform with abundant memory and computational power; minimal change is required to make an algorithm run in this environment. The personal computer (PC), on the other hand, has a moderate amount of memory and computational power, so adjustment is desirable to
speed up the algorithm, but memory might not be such a big concern. A cellular handset is an example where memory and computational power are limited; the code must be adjusted carefully so that the algorithm runs within these tight constraints. To verify that a given implementation is accurate, standards bodies often provide a set of test vectors: a given input test vector must produce the corresponding output vector, and any deviation is considered a failure to conform to the specification.
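The test-vector check described above can be sketched as follows. This is an illustrative fragment only: the toy encode function and the (input, expected output) pairs are hypothetical, not taken from any actual standard.

```python
def conforms(encode, test_vectors):
    """Return True only if the implementation reproduces every reference
    output exactly; any deviation is a conformance failure."""
    return all(encode(x) == expected for x, expected in test_vectors)

# Toy "codec": truncate 16-bit samples to 8 bits (purely illustrative).
def encode(samples):
    return [s >> 8 for s in samples]

# Hypothetical (input, expected output) pairs.
vectors = [([256, 512, -256], [1, 2, -1]),
           ([0, 255], [0, 0])]

print(conforms(encode, vectors))  # True
```

Because conformance is bit-exact, even a correct-sounding implementation fails if a single output value deviates from the reference.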
3.4 PULSE CODE MODULATION
Pulse-code modulation (PCM) is a digital representation of an analog signal in which the magnitude of the signal is sampled at regular, uniform intervals and then quantized to a series of symbols in a numeric (usually binary) code. PCM has been used in digital telephone systems and in 1980s-era electronic musical keyboards. Uncompressed PCM audio is not typically used in standard-definition consumer video applications such as DVD or DVR because the required bit rate is too high. However, the next-generation Blu-ray format, which has a capacity far superior to previous media, sometimes allows producers to include the full PCM soundtrack. The word pulse in the term pulse-code modulation refers to the "pulses" found on the transmission line. This is perhaps a natural consequence of the technique having evolved alongside two analog methods, pulse width modulation and pulse position modulation, in which the information to be encoded is represented by discrete signal pulses of varying width or position, respectively. Beyond that, PCM bears little resemblance to these other forms of signal encoding, except that all can be used in time division multiplexing and the binary numbers of the PCM codes are represented as electrical pulses. The device that performs the coding and decoding functions in a telephone circuit is called a codec.

3.4.1 Modulation

A sine wave (red curve) is sampled and quantized for PCM. The sine wave is sampled at regular intervals, shown as ticks on the x-axis. For each sample, one of the available values is chosen by some algorithm (in this case, the floor function is used). This produces a fully discrete representation of the input signal (shaded area) that can be easily encoded as digital data for storage or manipulation. For the sine wave
example in Fig. 3.5, we can verify that the quantized values at the sampling instants are 7, 9, 11, 12, 13, 14, 14, 15, 15, 15, 14, etc. Encoding these values as binary numbers results in the following set of nibbles: 0111, 1001, 1011, 1100, 1101, 1110, 1110, 1111, 1111, 1111, 1110, etc. These digital values can then be further processed or analyzed by a purpose-specific digital signal processor or a general-purpose CPU. Several PCM streams can also be multiplexed into a larger aggregate data stream, generally for transmission of multiple streams over a single physical link. This technique is called time-division multiplexing (TDM) and is widely used, notably in the modern public telephone system. The sampling and quantization of a sine wave are illustrated below:
Fig. 3.5 Sampling and quantization of a signal (red) for 4-bit PCM

There are many ways to implement a real device that performs this task. In real systems, such a device is commonly implemented on a single integrated circuit that lacks only the clock necessary for sampling, and is generally referred to as an analog-to-digital converter (ADC). These devices produce on their output a binary representation of the input whenever they are triggered by a clock signal, which is then read by a processor of some sort.

3.4.2 Demodulation

To produce output from the sampled data, the procedure of modulation is applied in reverse. After each sampling period has passed, the next value is read and the output of the system is shifted instantaneously (in an idealized system) to the new value. As a result of these instantaneous transitions, the discrete signal will have a
significant amount of inherent high-frequency energy, mostly harmonics of the sampling frequency. To smooth out the signal and remove these undesirable harmonics, the signal is passed through analog filters that suppress artifacts outside the expected frequency range (i.e., greater than fs/2, the maximum resolvable frequency). Some systems instead use digital filtering to remove some of these harmonics. In some systems, no explicit filtering is done at all; as it is impossible for any system to reproduce a signal with infinite bandwidth, inherent losses in the system compensate for the artifacts, or the system simply does not require much precision. The sampling theorem suggests that practical PCM devices, given a sampling frequency sufficiently greater than that of the input signal, can operate without introducing significant distortions within their designed frequency bands. The electronics involved in producing an accurate analog signal from the discrete data are similar to those used for generating the digital signal. These devices are digital-to-analog converters (DACs), and they operate similarly to ADCs: they produce on their output a voltage or current (depending on type) that represents the value presented on their inputs. This output is then generally filtered and amplified for use.

3.4.3 Digitization

In conventional PCM, the analog signal may be processed (e.g., by amplitude compression) before being digitized. Once the signal is digitized, the PCM signal is usually subjected to further processing (e.g., digital data compression). Some forms of PCM combine signal processing with coding. Older versions of these systems applied the processing in the analog domain as part of the A/D process; newer implementations do so in the digital domain. These simple techniques have been largely rendered obsolete by modern transform-based audio compression techniques.
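The modulation and demodulation steps of Sections 3.4.1 and 3.4.2 can be sketched as below. The floor-based mapping of the signal range onto the 16 levels and the sampling rate are assumptions made for illustration, so the code values will not match Fig. 3.5 exactly; a real system would follow the zero-order hold with an analog low-pass filter.

```python
import math

def quantize_4bit(x):
    # One plausible floor-based mapping of x in [-1, 1] onto levels 0..15
    # (an assumption; the exact mapping behind Fig. 3.5 is not specified).
    return min(15, math.floor(7.5 * (1.0 + x)))

def hold_4bit(codes, points_per_sample):
    # Idealized demodulation: map each code back to a mid-level value and
    # hold it for one sampling period (the instantaneous-transition staircase).
    out = []
    for c in codes:
        level = (c + 0.5) / 8.0 - 1.0
        out.extend([level] * points_per_sample)
    return out

N = 16  # samples per period of the sine (an assumption)
samples = [math.sin(2 * math.pi * k / N) for k in range(N)]
codes = [quantize_4bit(s) for s in samples]
nibbles = [format(c, "04b") for c in codes]
stair = hold_4bit(codes, points_per_sample=4)
print(codes[:5])  # [7, 10, 12, 14, 15] for this mapping and rate
```

The staircase output is what carries the high-frequency harmonics discussed above; smoothing it is the job of the analog reconstruction filter.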
DPCM encodes the PCM values as differences between the current and the predicted value. An algorithm predicts the next sample based on the previous samples, and the encoder stores only the difference between this prediction and the actual value. If the prediction is reasonable, fewer bits can be used to represent the same information. For audio, this type of encoding reduces the number of bits required per sample by about 25% compared to PCM.
Adaptive DPCM (ADPCM) is a variant of DPCM that varies the size of the quantization step, to allow further reduction of the required bandwidth for a given signal-to-noise ratio.
Delta modulation, another variant, uses one bit per sample.
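A minimal sketch of delta modulation, assuming a fixed step size and a signal already in sampled form; real delta modulators choose the step to trade off slope overload against granular noise.

```python
def delta_modulate(samples, step=0.1):
    """One bit per sample: emit 1 and step the running approximation up
    when the input is above it, otherwise emit 0 and step down."""
    bits, approx = [], 0.0
    for s in samples:
        bit = 1 if s > approx else 0
        approx += step if bit else -step
        bits.append(bit)
    return bits

def delta_demodulate(bits, step=0.1):
    # The decoder integrates the same +/- step decisions.
    out, approx = [], 0.0
    for bit in bits:
        approx += step if bit else -step
        out.append(approx)
    return out

bits = delta_modulate([0.05, 0.15, 0.25, 0.2])
print(bits)  # [1, 1, 1, 0]
```

Since only the step direction is transmitted, the bit-rate equals the sampling rate, which is why delta modulators typically oversample heavily.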
3.5 DIFFERENTIAL PULSE CODE MODULATION
Differential pulse code modulation (DPCM) is a procedure for converting an analog signal into a digital signal in which the analog signal is sampled and then the difference between the actual sample value and its predicted value (based on one or more previous samples) is quantized and encoded, forming a digital value. DPCM code words represent differences between samples, unlike PCM, where code words represent sample values. The basic concept of DPCM, coding a difference, rests on the fact that most source signals show significant correlation between successive samples; encoding exploits this redundancy in the sample values, which implies a lower bit rate. The realization of this concept is a technique in which the current sample value is predicted from previous samples (or a single sample) and the difference between the actual and predicted values is encoded (this difference can be interpreted as the prediction error). Because it is necessary to predict the sample value, DPCM is a form of predictive coding. DPCM compression depends on the prediction technique: well-designed prediction leads to good compression ratios, while in other cases DPCM can even mean expansion compared with regular PCM encoding. The various steps, such as sampling, quantization, and digitization, are similar to those in the PCM technique.
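The coding-a-difference idea can be sketched as follows, assuming the simplest predictor (the previous reconstructed sample) and a uniform quantizer; the report's MATLAB coder uses a first-order predictor, and this Python fragment is only an illustration.

```python
def dpcm_encode(samples, step=1):
    """Quantize the difference between each sample and its prediction.
    The encoder tracks the decoder's reconstruction so that quantization
    errors do not accumulate."""
    codes, recon = [], 0
    for s in samples:
        d = round((s - recon) / step)   # quantized prediction error
        codes.append(d)
        recon += d * step               # what the decoder will see
    return codes

def dpcm_decode(codes, step=1):
    out, recon = [], 0
    for d in codes:
        recon += d * step
        out.append(recon)
    return out

x = [10, 12, 13, 13, 11, 8]
codes = dpcm_encode(x)
print(codes)               # [10, 2, 1, 0, -2, -3] -- small differences
print(dpcm_decode(codes))  # [10, 12, 13, 13, 11, 8]
```

With step = 1 the chain is lossless here; a coarser step trades reconstruction error for rate, which is the compression mechanism described above.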
3.6 OTHER POPULAR ALGORITHMS
There are other waveform algorithms used in speech coding, such as A-law PCM, µ-law PCM, and ADPCM. An ADPCM algorithm can be used to map a series of 8-bit µ-law or A-law PCM samples into a series of 4-bit ADPCM samples. In this way, the capacity of the line is doubled. The technique is detailed in the G.726 standard. Some ADPCM techniques are also used in Voice over IP communications. Similarly, other types of coders are available, such as CELP and VSELP. Code-excited linear prediction (CELP) is a speech coding algorithm originally proposed by M. R. Schroeder and B. S. Atal in 1985. At the time, it provided significantly better quality than existing low bit-rate algorithms, such as RELP and LPC vocoders (e.g., FS-1015). Along with its variants, such as ACELP, RCELP, LD-CELP and VSELP, it is currently the most widely used speech coding algorithm; CELP is now a generic term for a class of algorithms rather than a particular codec. Vector sum excited linear prediction (VSELP) is a speech coding method used in several cellular standards. Variations of this codec have been used in several 2G cellular telephony standards, including IS-54, IS-136 (D-AMPS) and GSM (Half Rate speech). It was also used in the first version of RealAudio for audio over the Internet. The IS-54 VSELP standard was published by the Telecommunications Industry Association in 1989.
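As a sketch of the µ-law companding that precedes uniform 8-bit quantization in such systems, the standard continuous µ-law characteristic with µ = 255 is shown below; the actual G.711/G.726 bit-level mappings are more involved and are not reproduced here.

```python
import math

MU = 255.0  # standard mu-law parameter for 8-bit telephone PCM

def mu_law_compress(x):
    """Continuous mu-law characteristic for x in [-1, 1]: small amplitudes
    are expanded so a uniform quantizer spends more levels on them."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse characteristic, applied after decoding."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

x = 0.01
y = mu_law_compress(x)
assert y > x                               # quiet samples are boosted
assert abs(mu_law_expand(y) - x) < 1e-12   # round trip is essentially exact
```

Companding gives a roughly constant signal-to-quantization-noise ratio over a wide range of input levels, which is why it suits speech better than plain uniform PCM at 8 bits.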
RESULTS AND DISCUSSIONS
We studied the basics of speech coding systems and various coding techniques, along with their design procedures and application scopes. We then implemented PCM and DPCM coders in MATLAB 7 and compared them on criteria such as speech quality, error, and execution time while varying the bit-rate and sampling frequency.
4.1 IMPLEMENTATION DETAILS
For the MATLAB implementations:

• Input speech has been sampled at 8 kHz (for comparison with standard coders).
• For PCM, we have used a uniform quantizer with 2^16 = 65536 levels. The bit-rate is 8k × 16 = 128 kbps.
• For DPCM, we have used an adaptive first-order linear predictor with coefficient α = 0.45, again with a uniform quantizer with 2^16 = 65536 levels. Here also, the bit-rate is 128 kbps.
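A Python re-creation of this setup may clarify the structure (the report's coders were written in MATLAB). The test signal below is a sine rather than speech, so the numbers will not match Tables 4.1 to 4.3, but the arrangement mirrors the implementation described above: a 16-bit uniform quantizer for PCM, and the same quantizer wrapped around a first-order predictor with α = 0.45 for DPCM.

```python
import math

BITS = 16
LEVELS = 2 ** BITS            # 65536 uniform levels over [-1, 1)
STEP = 2.0 / LEVELS
ALPHA = 0.45                  # first-order predictor coefficient

def quantize(v):
    # Uniform mid-tread quantizer with clipping at the extreme codes.
    q = round(v / STEP)
    return max(-LEVELS // 2, min(LEVELS // 2 - 1, q)) * STEP

def pcm(samples):
    return [quantize(s) for s in samples]

def dpcm(samples):
    out, prev = [], 0.0
    for s in samples:
        pred = ALPHA * prev          # predict from last reconstruction
        e = quantize(s - pred)       # quantize the prediction error
        prev = pred + e              # decoder-side reconstruction
        out.append(prev)
    return out

fs = 8000                            # 8 kHz sampling, as in the report
x = [0.5 * math.sin(2 * math.pi * 440 * n / fs) for n in range(fs)]
err_pcm = max(abs(a - b) for a, b in zip(x, pcm(x)))
err_dpcm = max(abs(a - b) for a, b in zip(x, dpcm(x)))
print(err_pcm <= STEP / 2 + 1e-12, err_dpcm <= STEP / 2 + 1e-12)  # True True
```

For this smooth input neither coder clips, so both reconstruction errors stay within half a quantization step; differences of the kind reported in the tables emerge with real speech input and coarser quantizers.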
Table 4.1 Results for Quantization Bits = 16 and Sampling Frequency = 8 kHz

Criteria                    PCM         DPCM
Quantization Noise          0.0833      0.0978
Max. Value of Error         6.2498e-5   6.2498e-5
SNR (decibels)              98.0905     98.0905
Execution Time (sec.)       1.288820    1.322032
Fig. 4.1 MATLAB Simulation of PCM.
Fig. 4.2 MATLAB Simulation of DPCM.
Table 4.2 Results for Quantization Bits = 8 and Sampling Frequency = 8 kHz

Criteria                    PCM         DPCM
Quantization Noise          5.4610e+3   6.4123e+3
Max. Value of Error         2.0398      2.0398
SNR (decibels)              49.9257     49.9257
Execution Time (sec.)       1.285243    1.341308
Table 4.3 Results for Quantization Bits = 16 and Sampling Frequency = 16 kHz

Criteria                    PCM         DPCM
Quantization Noise          0.0833      0.0978
Max. Value of Error         3.1250e-5   3.1250e-5
SNR (decibels)              98.0905     98.0905
Execution Time (sec.)       3.786807    4.073008
Note: The execution time also depends upon the computer system on which the codes are tested. All these simulations were done on a Windows Vista based system with a 2.8 GHz processor and 2 GB of RAM.
CONCLUSION

Speech coding, or speech compression, is the application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation based on audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting model parameters in a compact bit stream. The objective of speech coding is to represent the speech signal with the minimum number of bits while maintaining perceptual quality. Coded speech signals offer lower sensitivity to channel noise; they are easier to error-protect, encrypt, multiplex, and packetize; and their lower bit rate allows efficient transmission over bandwidth-constrained channels. PCM coders produce better quality speech than most parametric coders, but parametric coders achieve a relatively higher compression ratio. Further, the error introduced by the DPCM algorithm is greater than that introduced by the PCM algorithm, but PCM correspondingly achieves a lower compression ratio. Thus the choice of coder is based on the application requirements. Hybrid coders, having the characteristics of both waveform coders and parametric coders, are being designed to provide a better compression ratio while maintaining reasonable speech quality. For this project, the implementation of speech coding based on PCM and DPCM was successfully done in MATLAB. The output results from the input speech are good and acceptable, and the targeted objective has been achieved.
FUTURE SCOPE

In recent years, there has been significant progress in the fundamental building blocks of source coding: flexible methods of time-frequency analysis, adaptive vector quantization, and noiseless coding. Compelling applications of these techniques to speech coding are relatively less mature. Present research is focused on meeting the critical need for high quality speech transmission over digital cellular channels at 4 kbps; research on properly coordinated source and channel coding is needed to realize a good solution to this problem. Although high-quality low-delay coding at 16 kbps has been achieved, low-delay coding at lower rates is still a challenging problem. Improving the performance of low-rate coders operating over noisy channels is also an open problem. Additionally, there is a demand for robust low-rate coders that can accommodate signals other than speech, such as music. Current research is also focused on the area of VoIP.
REFERENCES

 T. P. Barnwell III, K. Nayebi and C. H. Richardson, Speech Coding: A Computer Laboratory Textbook, John Wiley & Sons, Inc., 1996.
 Wai C. Chu, Speech Coding Algorithms: Foundation and Evolution of Standardized Coders, Wiley-Interscience.
 P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007.
 Mark Hasegawa-Johnson and Abeer Alwan, "Speech Coding: Fundamentals and Applications".
 Lawrence R. Rabiner and Ronald W. Schafer, "Introduction to Digital Speech Processing", Vol. 1, Nos. 1–2 (2007), pp. 1–194.
 DDVPC, CELP Speech Coding Standard, Technical Report FS-1016, U.S. Dept. of Defense Voice Processing Consortium, 1989.
 A. Das and A. Gersho, "Low-rate multimode multiband spectral coding of speech", Int. J. Speech Tech., 2(4): 317–327, 1999.
 J. H. Chung and R. W. Schafer, "Performance evaluation of analysis-by-synthesis homomorphic vocoders", Proceedings of IEEE ICASSP, vol. 2, pp. 117–120, March 1992.
 R. Goldberg and L. Riek, A Practical Handbook of Speech Coders, CRC Press, Boca Raton, FL, 2000.