Institute of Space Technology Islamabad

Communication Systems

Linear Prediction Coding Vocoders

Rabia Bashir
Institute of Space Technology, Islamabad
rabiatulips3@ymail.com

Abstract
This paper explains the principles of human speech production and their use in linear predictive coding vocoders (LPC vocoders). LPC vocoders are used to digitize and compress voice signals. The technique is a form of lossy compression, but it is acceptable for speech because some loss in a voice signal is tolerable. In this scheme the human voice is taken as input and parameters such as the excitation and the vocal tract parameters are computed; these parameters are then transmitted instead of the original voice signal. This is done by the analysis part of the linear predictive coder. On the other side, the synthesis part of the vocoder uses these parameters to generate a replica of the input signal. The user can replay this signal and compare it with the reference speech signal.

Key words
PCM, LPC, ACR, ADPCM, CELP, AMDF

1. Introduction
Speech compression, often termed speech coding, is defined as a scheme for reducing the amount of information needed to represent a speech signal. There are different types of speech compression algorithms that use a variety of techniques, and speech coding or compression is usually carried out with voice coders, or vocoders. Uncompressed speech is usually transmitted at 64 kb/s using 8 bits/sample and a sampling rate of 8 kHz, so any bit rate below 64 kb/s is considered compression. Most forms of speech coding are based on a lossy algorithm; lossy algorithms are acceptable for speech because the loss of quality is often undetectable to the human ear.

Speech coding algorithms exploit several facts about speech production. The frequency range of human speech is around 300 Hz to 3400 Hz. Not transmitting silence saves bandwidth and reduces the amount of information required to represent the signal. The high correlation between adjacent samples is another fact that can be used. Most forms of speech compression are achieved by modeling the process of speech production as a linear digital filter; compression is obtained by encoding the digital filter and its slowly changing parameters rather than the speech signal itself.

There are two types of voice coders: waveform-following coders and model-based coders. Waveform-following coders produce a lossless reproduction of the original signal, whereas model-based coders will never exactly reproduce the original speech signal, because they use a parametric model of speech production and encode and transmit the parameters of that model rather than the signal. Linear Predictive Coding (LPC) is one of the compression methods that models speech production: it "models speech as an autoregressive process and sends the parameters of the process instead of sending the speech itself" [1].

All vocoders, including LPC vocoders, have four main attributes:
1. Bit rate
2. Delay
3. Complexity
4. Quality
Any voice coder has to make trade-offs between these attributes. The bit rate is used to determine the degree of compression that a vocoder achieves.

The linear predictive coder transmits speech at a bit rate of 2.4 kb/s, an excellent rate of compression; by comparison, adaptive differential pulse code modulation (ADPCM) only reduces the bit rate by a factor of 2 to 4. The general delay standard for transmitted speech conversations is that any delay greater than 300 ms is considered unacceptable. Complexity affects both the cost and the power consumption of a vocoder; linear predictive coding is very complex because of its high compression rate and involves executing millions of instructions per second. Quality is a subjective attribute that depends on how the speech sounds to a given listener. The absolute category rating (ACR) test is one of the most common tests of speech quality: subjects are given pairs of sentences and asked to rate them as excellent, good, fair, poor, or bad. ADPCM at between 16 kb/s and 32 kb/s has a much higher quality of speech than LPC, which is an alternative method of speech compression; linear predictive coders sacrifice quality in order to achieve a low bit rate and as a result often sound synthetic.

2. Linear predictive coding Vocoders:
Linear predictive coding (LPC), also known as the analysis/synthesis method, was introduced in the sixties as an efficient and effective means of achieving synthetic speech and speech-signal communication. The efficiency of the method is due to the low bandwidth required and to the speed of the analysis algorithm for the encoded signals. The general algorithm for linear predictive coding involves an analysis or encoding part and a synthesis or decoding part. While encoding, LPC takes the speech signal in blocks or frames and determines the input signal and the coefficients of the filter that will be capable of reproducing the current block of speech; this information is quantized and transmitted. In decoding, LPC rebuilds the filter based on the coefficients received. The remainder of this paper presents the basic principles of linear prediction voice coders (vocoders).
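The autoregressive view quoted in the Introduction can be made concrete with a small example. The following is a minimal sketch, not taken from the paper: it drives a stable all-pole (autoregressive) filter with a periodic pulse excitation, so each output sample is a weighted combination of past outputs plus the excitation. The sampling rate matches the 8 kHz used throughout the paper; the pitch, formant frequencies, and pole radius are illustrative assumptions, not values specified by the paper.

# A minimal sketch of the autoregressive speech model: each output sample is a
# weighted sum of past outputs plus an excitation input. The pole locations are
# illustrative values chosen only to keep the all-pole filter stable.
import numpy as np

fs = 8000                      # sampling rate used throughout the paper
pitch_hz = 100                 # assumed pitch of the synthetic voiced sound
formants_hz = [500, 1500]      # assumed resonant (formant) frequencies

# Build a stable all-pole filter 1 / A(z) from complex-conjugate pole pairs.
poles = []
for f in formants_hz:
    r = 0.97                                   # pole radius < 1, so the filter is stable
    poles += [r * np.exp(1j * 2 * np.pi * f / fs),
              r * np.exp(-1j * 2 * np.pi * f / fs)]
a = np.real(np.poly(poles))                    # A(z) coefficients, a[0] == 1

# Quasi-periodic impulse excitation (the "pitch impulses").
n = np.arange(fs // 4)                         # 0.25 s of samples
excitation = ((n % (fs // pitch_hz)) == 0).astype(float)

# Autoregressive synthesis: y[n] = e[n] - sum(a[k] * y[n-k]).
y = np.zeros_like(excitation)
for i in range(len(y)):
    acc = excitation[i]
    for k in range(1, len(a)):
        if i - k >= 0:
            acc -= a[k] * y[i - k]
    y[i] = acc

print("synthesized", len(y), "samples; peak amplitude", round(float(np.max(np.abs(y))), 3))

In a real LPC vocoder the filter weights are not chosen by hand as above but are estimated from the speech itself, as described in Section 4.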

3. Voice model and Model-Based Vocoders:
Linear prediction coding (LPC) vocoders are model-based vocoders. The model is based on a good understanding of the human voice mechanism; the most successful speech and video encoders are based on models of the human physiology involved in speech generation and perception.

Human voice mechanism: Human speech is produced by the joint interaction of the lungs, the vocal cords, and the articulation tract, which consists of the mouth and nose cavities. Figure 1 below shows a cross section of the human speech organ.

Figure 1: The Human Speech Organ [2]

The lungs expel air out past the epiglottis, causing the vocal cords to vibrate. The vibrating vocal cords interrupt the air stream and produce a quasi-periodic pressure wave consisting of impulses. These pressure-wave impulses are commonly called the pitch impulses, and the frequency of the pressure signal is the pitch frequency or fundamental frequency. A typical impulse sequence produced by the vocal cords for a voiced sound is shown in Figure 2.

Figure 2: A typical impulse response [3]

This part of the mechanism defines the speech tone. Speech uttered at a constant pitch frequency sounds monotone; normally the pitch frequency of a speaker varies almost constantly, often from syllable to syllable. The variation of the pitch frequency is shown below in Figure 3.

Based on this physiological speech model, human voices can be divided into two sound categories:
1. Voiced sounds
2. Unvoiced sounds

Voiced sounds: these are sounds produced while the vocal cords are vibrating. Voiced sounds are usually vowels.

Figure 3: Variation of the pitch frequency [4]

Unvoiced sounds: these are produced when the vocal cords are not vibrating; the excitation comes directly from the air flow. Unvoiced sounds are usually non-vowel or consonant sounds and often have very random waveforms.

The pitch impulses stimulate the air in the vocal tract (the mouth and nose cavities). When the cavities in the vocal tract resonate under excitation, they radiate a sound wave, which is the speech signal. Both cavities form characteristic resonant frequencies, called formant frequencies. Changing the shape of the mouth cavity allows different sounds to be pronounced.

LPC models: Based on this human speech model, a voice encoding approach can be established. LPC synthesis tries to imitate human speech production. It has two key components: analysis or encoding and synthesis or decoding. The analysis part of LPC involves examining the speech signal and breaking it down into segments or blocks. Each segment is then examined to find the answers to these key questions:
• Is the segment voiced or unvoiced?
• What is the pitch of the segment?
• What parameters are needed to build a filter that models the current segment?
LPC analysis is usually done by a sender, who answers these questions and transmits the answers to a receiver. The receiver performs LPC synthesis, using the received answers to build a filter that is able to reproduce the original speech signal when the correct input is applied. Figure 4 demonstrates which parts of the receiver correspond to which parts of the human anatomy.
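To make this analysis/synthesis split concrete, here is a minimal structural sketch in Python. It is not code from the paper: the frame length matches the 180-sample segments used later, but the helper bodies are crude placeholders whose realistic versions are developed in Sections 4 and 5.

# A minimal structural sketch of the LPC analysis/synthesis split. The helper
# bodies below are placeholders so the sketch runs; they are not the real steps.
import numpy as np

FRAME = 180                       # samples per segment at 8 kHz, as used later in the paper

def is_voiced(frame):             # placeholder for Section 4.1.1 (energy test only)
    return float(np.mean(frame ** 2)) > 0.01

def estimate_pitch(frame):        # placeholder for Section 4.1.2
    return 80                     # assumed fixed pitch period, in samples

def fit_filter(frame):            # placeholder for Section 4.1.3
    return np.zeros(10), float(np.sqrt(np.mean(frame ** 2)))

def synthesize_frame(params, length):   # placeholder for Section 5
    voiced, pitch, coeffs, gain = params
    return gain * np.ones(length)

def analyze(speech):
    """Encoder side: answer the three key questions for every frame."""
    params = []
    for start in range(0, len(speech) - FRAME + 1, FRAME):
        frame = speech[start:start + FRAME]
        voiced = is_voiced(frame)                        # question 1
        pitch = estimate_pitch(frame) if voiced else 0   # question 2
        coeffs, gain = fit_filter(frame)                 # question 3
        params.append((voiced, pitch, coeffs, gain))
    return params                  # these answers, not the waveform, are transmitted

def synthesize(params):
    """Decoder side: rebuild each frame from the transmitted answers."""
    return np.concatenate([synthesize_frame(p, FRAME) for p in params])

t = np.arange(8000) / 8000.0
demo = 0.5 * np.sin(2 * np.pi * 120 * t)   # one second of a synthetic input signal
rebuilt = synthesize(analyze(demo))
print(len(demo), "input samples ->", len(rebuilt), "synthesized samples")

The important point is what crosses the channel: the list of per-frame answers, not the waveform itself.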

All voice coders tend to model two things: excitation and articulation. Excitation is the type of sound that is passed into the filter or vocal tract, and articulation is the transformation of the excitation signal into speech.

Figure 4: Human vs. voice coder speech production [5]

4. LPC Analysis/Encoding:
4.1. Input speech: Usually the input signal is sampled at a rate of 8000 samples per second. The input signal is then broken up into segments, which are analyzed and transmitted to the receiver separately.

4.1.1. Voiced/unvoiced determination: Determining whether a segment is voiced or unvoiced is important because the two kinds of sound have different waveforms. Voiced sounds are generally vowels and can be considered pulses similar to periodic waveforms; they have high average energy levels, which means they have very large amplitudes, and they also have distinct resonant or formant frequencies. Unvoiced sounds are usually non-vowel or consonant sounds and often have very random, noise-like waveforms. The difference between the two waveforms creates the need for two different input signals for the LPC filter in the synthesis or decoding: one input signal for voiced sounds and another for unvoiced sounds. The LPC encoder notifies the decoder whether a signal segment is voiced or unvoiced by sending a single bit.

There are two steps in the process of determining whether a speech segment is voiced or unvoiced (a small code sketch of this test is given at the end of this subsection). The first step is to look at the amplitude of the signal, i.e. the energy in the segment: if the amplitude levels are large, the segment is classified as voiced, and if they are small, the segment is considered unvoiced. This determination requires a predefined criterion for the range of amplitude values and energy levels associated with the two types of sound. For sounds that cannot be clearly classified on the basis of amplitude alone, a second step is used to make the final distinction between voiced and unvoiced. This step uses the fact that voiced speech segments have large amplitudes while unvoiced speech segments have high frequencies, which implies that the unvoiced speech waveform must cross the x-axis more often than the waveform of voiced speech. Thus the determination of voiced and unvoiced speech is finalized by counting the number of times the waveform crosses the x-axis and comparing that count with the normal range of values for unvoiced and voiced sounds.

Figure 5: Voiced sound
Figure 6: Unvoiced sound

The classification of neighboring segments is also considered, because it is undesirable to have an unvoiced frame in the middle of a group of voiced frames, or vice versa. Sound is not always produced according to the LPC model. One example is that segments of voiced speech with a lot of background noise are sometimes interpreted as unvoiced segments; this problem is often ignored in LPC. Another case of misinterpretation by the LPC model is a group of sounds known as nasal sounds: during their production the nose cavity destroys the concept of a linear tube, since the tube now has a branch, and the LPC model cannot accurately reproduce these sounds. Not only is it possible to have sounds that are not produced according to the model, it is also possible to have sounds that are produced according to the model but cannot be accurately classified as voiced or unvoiced, because they are a combination of the chaotic waveforms of unvoiced sounds and the periodic waveforms of voiced sounds. Another type of speech coding, called code excited linear prediction (CELP), handles the problem of such combined voiced/unvoiced sounds by using a standard codebook that contains typical problem-causing signals.
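The following is a minimal sketch of the two-step test described above, assuming illustrative energy and zero-crossing thresholds; a real coder would calibrate these and also smooth decisions across neighbouring frames, as noted above. The test signals are synthetic stand-ins for the waveforms of Figures 5 and 6.

# A minimal sketch of the two-step voiced/unvoiced test. The thresholds are
# illustrative assumptions, not values from the paper.
import numpy as np

def zero_crossings(frame):
    """Number of times the waveform crosses the x-axis."""
    return int(np.sum(np.signbit(frame[:-1]) != np.signbit(frame[1:])))

def classify_frame(frame, energy_thresh=0.01, zc_thresh=50):
    energy = float(np.mean(frame ** 2))          # step 1: amplitude / energy in the segment
    if energy > energy_thresh:
        return "voiced"
    if energy < 0.1 * energy_thresh:             # clearly small amplitude: unvoiced
        return "unvoiced"
    # step 2: ambiguous energy, so count x-axis crossings
    return "unvoiced" if zero_crossings(frame) > zc_thresh else "voiced"

fs = 8000
n = np.arange(180)
voiced_like = 0.8 * np.sin(2 * np.pi * 120 * n / fs)                    # large amplitude, low frequency
unvoiced_like = 0.05 * np.random.default_rng(0).standard_normal(180)    # small amplitude, noise-like
print(classify_frame(voiced_like), classify_frame(unvoiced_like))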

4.1.2. Pitch period estimation: The period of any wave, including a speech signal, is the time required for one wave cycle to completely pass a fixed position. For speech signals, the pitch period can be defined as the period of the vocal cord vibration that occurs during the production of voiced speech. The pitch period is therefore only needed for the decoding of voiced segments and is not required for unvoiced segments.

There are several different types of algorithms that can be used to estimate the pitch period. One type takes advantage of the fact that the autocorrelation Rxx(k) of a periodic function has a maximum when k is equal to the pitch period; such algorithms usually detect the maximum by checking the autocorrelation value against a threshold. One problem with algorithms that use autocorrelation is that the validity of their results is susceptible to interference; when interference occurs, the algorithm cannot guarantee accurate results.

Instead of the autocorrelation, another approach uses the average magnitude difference function (AMDF), defined as

AMDF(P) = (1/N) Σ_n |y_n - y_{n-P}|,

where the sum runs over the N samples of the segment. For voiced segments we can consider the set of speech samples {yn} for the current segment as a periodic sequence with period P0. This means that samples that are P0 apart should have similar values, so the AMDF function will have a minimum at P0, that is, when P is equal to the pitch period. The range of possible pitch periods is limited for humans, so there is an assumption that the pitch period lies between 2.5 and 19.5 milliseconds, and the AMDF is evaluated only for this limited range of possible pitch period values. An example of the AMDF function applied to the signals of Figures 5 and 6 is given below in Figures 7 and 8 for voiced and unvoiced sound respectively.

Figure 7: AMDF function for voiced sound
Figure 8: AMDF function for unvoiced sound
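Below is a minimal sketch of AMDF-based pitch estimation as described above: the AMDF is evaluated only over the assumed 2.5-19.5 ms range and the period giving its minimum is returned. The synthetic test signal and its 64-sample period are illustrative assumptions, not data from the paper.

# A minimal sketch of AMDF-based pitch period estimation.
import numpy as np

def amdf(frame, period):
    """Average magnitude difference between the frame and itself shifted by `period`."""
    return float(np.mean(np.abs(frame[period:] - frame[:-period])))

def estimate_pitch_period(frame, fs=8000):
    lo, hi = int(0.0025 * fs), int(0.0195 * fs)      # 2.5 ms .. 19.5 ms in samples
    periods = range(lo, min(hi, len(frame) - 1) + 1)
    values = [amdf(frame, p) for p in periods]
    return lo + int(np.argmin(values))               # period with the deepest AMDF dip

fs = 8000
n = np.arange(360)                                   # two 180-sample segments of context
true_period = 64                                     # 125 Hz pitch at 8 kHz sampling
voiced = np.sin(2 * np.pi * n / true_period) + 0.3 * np.sin(4 * np.pi * n / true_period)
print("estimated pitch period:", estimate_pitch_period(voiced, fs), "samples")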

If these AMDF waveforms are compared with the waveforms before the AMDF function is applied (for example the voiced signal in Figure 5), it can be seen that this function smoothes out the waveform. An advantage of the AMDF function is that it can also be used to determine whether a segment is voiced or unvoiced: when the AMDF function is applied to an unvoiced signal, the difference between the minimum and the average values is very small compared to voiced signals, and this difference can be used to make the voiced/unvoiced determination.

4.1.3. Vocal tract filter parameters: The filter that is used by the decoder to recreate the original input signal is based on a set of coefficients. These coefficients are extracted from the original signal during encoding and are transmitted to the receiver for use in decoding. Each speech segment has different filter coefficients or parameters that it uses to recreate the original sound; both the parameters themselves and the number of parameters differ from voiced to unvoiced segments. Voiced segments use 10 parameters to build the filter, while unvoiced segments use only 4. A filter with n parameters is referred to as an nth-order filter.

In order to provide the most accurate coefficients, the encoder finds the filter coefficients that best match the current segment by minimizing the mean squared error. The prediction error for sample n can be written as e_n = y_n - Σ_{i=1..M} a_i y_{n-i}, where {yn} is the set of speech samples for the current segment and {ai} is the set of coefficients; {ai} is chosen to minimize the average value of e_n^2 (the mean squared error E[e_n^2]) over all samples in the segment.
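The sketch below simply evaluates this prediction error for a synthetic segment under two hand-picked coefficient sets. It is illustrative only, with assumed rather than optimal coefficients, but it shows that better-matched coefficients give a smaller mean squared error, which is exactly the quantity the encoder minimizes.

# A minimal sketch of the prediction error e[n] = y[n] - sum_i a[i] * y[n-i]
# and its mean squared value over one segment. The coefficient sets are
# arbitrary illustrative values, not optimal ones.
import numpy as np

def mean_squared_prediction_error(y, a):
    m = len(a)
    err = []
    for n in range(m, len(y)):
        prediction = sum(a[i] * y[n - 1 - i] for i in range(m))
        err.append(y[n] - prediction)
    return float(np.mean(np.square(err)))

rng = np.random.default_rng(1)
y = np.sin(2 * np.pi * np.arange(180) / 40) + 0.05 * rng.standard_normal(180)
print("MSE with a = [0.5, 0.2]:", round(mean_squared_prediction_error(y, [0.5, 0.2]), 4))
print("MSE with a = [1.9, -1.0]:", round(mean_squared_prediction_error(y, [1.9, -1.0]), 4))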

The first step in minimizing the average mean squared error is to take the derivative, which produces a set of M equations. To solve these equations for the filter coefficients, each term E[y_{n-i} y_{n-j}] has to be estimated. There are two approaches that can be used for this estimation: autocorrelation and autocovariance. The autocorrelation approach is explained here.

Autocorrelation requires that several initial assumptions be made about the set or sequence of speech samples {yn} in the current segment: first, that {yn} is stationary, and second, that the {yn} sequence is zero outside of the current segment. In autocorrelation, each E[y_{n-i} y_{n-j}] is converted into an autocorrelation function of the form R_yy(|i-j|). The estimate of the autocorrelation function R_yy(k) can be expressed as R_yy(k) = (1/N) Σ_n y_n y_{n-k}, with the sum taken over the segment and samples outside the segment taken to be zero.

Using R_yy(k), the M equations obtained from taking the derivative of the mean squared error can be written in matrix form as RA = P, where A contains the filter coefficients. In order to determine the contents of A, the equation A = R^-1 P must be solved, which cannot be done without first computing R^-1. This is an easy computation if one notices that R is symmetric and, more importantly, that all of its diagonals consist of the same element. A matrix of this type is called a Toeplitz matrix and can be easily inverted.

The Levinson-Durbin (L-D) algorithm [6] is a recursive algorithm that is considered very computationally efficient, since it takes advantage of the properties of R when determining the filter coefficients. When applied to an Mth-order filter, the L-D algorithm also computes all filters of order less than M; that is, it determines all order-N filters for N = 1, ..., M-1. The algorithm is outlined in Figure 9, where the filter order is denoted with a superscript, so that {a_i^(j)} are the filter coefficients of a jth-order filter and the average mean squared error of a jth-order filter is denoted E_j instead of E[e_n^2].

Figure 9: Levinson-Durbin (L-D) Algorithm for solving Toeplitz matrices
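As a complement to Figure 9, here is a minimal sketch (not the paper's figure itself) of the autocorrelation estimate and the Levinson-Durbin recursion it feeds. The AR(2) test signal and its coefficients are illustrative assumptions; the recursion returns both the predictor coefficients {a_i} and the reflection coefficients {k_i} discussed next, all with magnitude below one for a stable filter.

# A minimal sketch of the autocorrelation estimate and the Levinson-Durbin recursion.
import numpy as np

def autocorrelation(y, max_lag):
    """Estimate R_yy(k), treating the segment as zero outside its boundaries."""
    n = len(y)
    return np.array([float(np.dot(y[:n - k], y[k:])) / n for k in range(max_lag + 1)])

def levinson_durbin(r, order):
    """Recursively solve the Toeplitz system R A = P for the predictor coefficients."""
    a = np.zeros(order)            # a[i-1] holds a_i
    reflection = []
    error = r[0]                   # E_0
    for j in range(1, order + 1):
        k = (r[j] - np.dot(a[:j - 1], r[j - 1:0:-1])) / error   # reflection coefficient k_j
        reflection.append(float(k))
        a_prev = a.copy()
        a[j - 1] = k
        a[:j - 1] = a_prev[:j - 1] - k * a_prev[:j - 1][::-1]    # update lower-order coefficients
        error *= (1.0 - k * k)                                   # E_j = (1 - k_j^2) E_{j-1}
    return a, np.array(reflection), float(error)

# Synthetic segment generated by a known 2nd-order autoregressive model.
rng = np.random.default_rng(2)
y = np.zeros(2000)
for n in range(2, len(y)):
    y[n] = 1.5 * y[n - 1] - 0.7 * y[n - 2] + rng.standard_normal()

r = autocorrelation(y, 2)
a, k, err = levinson_durbin(r, 2)
print("predictor coefficients:", np.round(a, 2))    # close to the true values 1.5 and -0.7
print("reflection coefficients:", np.round(k, 2))   # magnitudes below 1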

While computing the filter coefficients {ai}, a set of coefficients {ki}, known as reflection coefficients, is generated; k_n denotes the nth reflection coefficient, with 1 ≤ n ≤ 10 for voiced speech filters and 1 ≤ n ≤ 4 for unvoiced filters. These coefficients can be used to rebuild the set of filter coefficients {ai}, and they guarantee a stable filter if their magnitudes are strictly less than one. They are used to avoid potential problems in transmitting the filter coefficients.

4.1.4. Transmitting the parameters: In uncompressed form, speech is usually transmitted at 64,000 bits/second using 8 bits/sample and a sampling rate of 8 kHz. LPC reduces this rate to 2,400 bits/second by breaking the speech into segments and then sending, for each segment, the voiced/unvoiced information, the pitch period, and the coefficients of the filter that represents the vocal tract. The encoder sends a single bit to tell the decoder whether the current segment is voiced or unvoiced, and the pitch period is quantized with a log-companded quantizer to one of 60 possible values, so 6 bits are required to represent it. From the classification of the speech segment as voiced or unvoiced and from the pitch period of the segment, the input (excitation) signal used by the filter is determined on the receiver end.

A 10th-order filter is used for voiced speech segments, which means that 11 values are needed: 10 reflection coefficients and the gain. A 4th-order filter is used for unvoiced speech, which means that 5 values are needed: 4 reflection coefficients and the gain. The quantization of the filter coefficients for transmission can create a major problem, since errors in the filter coefficients can lead to instability in the vocal tract filter and create an inaccurate output signal. This potential problem is reduced by quantizing and transmitting the reflection coefficients generated by the Levinson-Durbin algorithm instead of the filter coefficients. The only remaining difficulty is that the transmission is more sensitive to errors in reflection coefficients whose magnitudes are close to one, and the first few reflection coefficients, k1 and k2, are the most likely to have magnitudes around one.

To reduce this sensitivity, non-uniform quantization of k1 and k2 is used. First, each of these coefficients is used to generate a new coefficient g_i through a nonlinear transformation, and the new coefficients g1 and g2 are quantized using a 5-bit uniform quantizer. The coefficients k3 and k4 are quantized using 5-bit uniform quantization. For voiced segments, k5 up to k8 are quantized using 4-bit uniform quantizers, while k9 uses a 3-bit quantizer and k10 a 2-bit uniform quantizer. For unvoiced segments, the bits that voiced segments use to represent the reflection coefficients k5 through k10 are used instead for error protection, which means that the same number of bits is used for the coefficients of voiced and unvoiced segments.

Once the reflection coefficients have been quantized, the gain G is the only value not yet quantized. The gain is calculated using the root mean squared (rms) value of the current segment and is quantized using a 5-bit log-companded quantizer.

The total number of bits required for each segment or frame is therefore 54, as detailed in Table 1:

1 bit    Voiced/unvoiced
6 bits   Pitch period
10 bits  k1 and k2 (5 each)
10 bits  k3 and k4 (5 each)
16 bits  k5, k6, k7 and k8 (4 each)
3 bits   k9
2 bits   k10
5 bits   Gain G
1 bit    Synchronization
54 bits  Total bits per frame

Table 1: Total bits in each speech segment

The input speech is sampled at a rate of 8000 samples per second, and the 8000 samples in each second of speech signal are broken into 180-sample segments. This means that there are approximately 44.4 frames per second, and therefore the bit rate is about 2400 bits per second. It also means that the bit rate does not decrease below 2,400 bits/second when many unvoiced segments are sent, since voiced and unvoiced frames occupy the same number of bits.
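The following small sketch simply re-adds the bit budget of Table 1 and converts it to a bit rate; the numbers are the paper's, and only the Python packaging around them is illustrative.

# A minimal sketch reproducing the frame bit budget of Table 1 and the resulting bit rate.
FRAME_BITS = {
    "voiced/unvoiced flag": 1,
    "pitch period": 6,
    "k1, k2 (5 bits each)": 10,
    "k3, k4 (5 bits each)": 10,
    "k5-k8 (4 bits each)": 16,
    "k9": 3,
    "k10": 2,
    "gain G": 5,
    "synchronization": 1,
}

SAMPLE_RATE = 8000          # samples per second
FRAME_LENGTH = 180          # samples per segment

bits_per_frame = sum(FRAME_BITS.values())
frames_per_second = SAMPLE_RATE / FRAME_LENGTH
print("bits per frame:", bits_per_frame)                       # 54
print("frames per second:", round(frames_per_second, 2))       # about 44.44
print("bit rate:", round(bits_per_frame * frames_per_second))  # 2400 bits/second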

Sample rate = 8000 samples/second
Samples per segment = 180 samples/segment
Segment rate = Sample rate / Samples per segment = (8000 samples/second) / (180 samples/segment) = 44.44... segments/second
Segment size = 54 bits/segment
Bit rate = Segment size * Segment rate = (54 bits/segment) * (44.44 segments/second) = 2400 bits/second

Figure 10: Verification of the bit rate of LPC speech segments

5. LPC Synthesis/Decoding:
The process of decoding a sequence of speech segments is the reverse of the encoding process. Each segment is decoded individually from the 54 bits of information transmitted by the encoder, and the sequence of reproduced sound segments is joined together to represent the entire input speech signal. In order to know what type of excitation signal to give to the LPC filter, the decoder needs to know what type of signal the segment contains; the segment is declared voiced or unvoiced based on the voiced/unvoiced determination bit, and the decoding or synthesis of the segment is based on this classification. LPC has only two possible excitation signals. For voiced segments a pulse is used as the excitation signal, where a pulse is defined as "an isolated disturbance that travels through an undisturbed medium" [7]. The pitch period of a voiced segment is used to determine whether the sample pulse needs to be truncated or extended; if the pulse needs to be extended, it is padded with zeros, since by the above definition a pulse travels through an undisturbed medium. For unvoiced segments, white noise produced by a pseudo-random number generator is used as the input to the filter. This combination of the voiced/unvoiced determination and the pitch period is all that is needed to produce the excitation signal.
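A minimal sketch of the two excitation signals just described is given below: a pitch-period pulse train padded with zeros for voiced segments, and pseudo-random white noise for unvoiced segments. The frame length matches the paper's 180-sample segments; the pitch period value and the random seed are illustrative assumptions.

# A minimal sketch of the decoder's two possible excitation signals.
import numpy as np

def make_excitation(voiced, pitch_period, length=180, seed=0):
    if voiced:
        e = np.zeros(length)
        e[::pitch_period] = 1.0          # isolated pulses every pitch period, zeros elsewhere
        return e
    return np.random.default_rng(seed).standard_normal(length)  # pseudo-random white noise

voiced_exc = make_excitation(True, pitch_period=64)
unvoiced_exc = make_excitation(False, pitch_period=0)
print("voiced excitation pulses:", int(np.sum(voiced_exc)),
      "| unvoiced excitation std:", round(float(np.std(unvoiced_exc)), 2))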

Each segment of speech has a different LPC filter, produced using the reflection coefficients and the gain received from the encoder: 10 reflection coefficients are used for voiced-segment filters and 4 for unvoiced segments. These reflection coefficients are used to generate the parameters that create the filter. The final step of decoding a segment of speech is to pass the excitation signal through the filter to produce the synthesized speech signal. Figure 11 shows a diagram of the LPC decoder.

Figure 11: LPC Decoder

6. LPC Applications
The most common use of speech compression is in standard telephone systems; in fact, a lot of the technology used in speech compression was developed by the phone companies. Table 1.2 shows the bit rates used by different phone systems.

Secure Telephony: 0.8-16 kb/s
Regional Cellular standards: 3.45-13 kb/s
Digital Cellular standards: 6.7-13 kb/s
International Telephone Network: 32 kb/s (can range from 5.3-64 kb/s)
North American Telephone Systems: 64 kb/s (uncompressed)

Table 1.2: Bit rates for different telephone standards

Within telephone systems, linear predictive coding is mainly applicable to secure telephony because of its low bit rate. Secure telephone systems require a low bit rate, since speech is first digitized, then encrypted and transmitted; these systems have the primary goal of decreasing the bit rate as much as possible while maintaining a level of speech quality that is understandable.

A second area in which linear predictive coding has been used is text-to-speech synthesis, in which speech has to be generated from text. Since LPC synthesis involves the generation of speech based on a model of the vocal tract, it provides a natural method for generating speech from text.

Further applications of LPC and other speech compression schemes are voice mail systems, telephone answering machines, and multimedia applications. Most multimedia applications involve one-way communication and involve storing the data. Linear predictive coding provides a favorable method of speech compression for multimedia applications, since its low bit rate results in the smallest storage space. In general, the method of speech compression used in a multimedia application depends on the desired speech quality and on the limitations of storage space for the application. An example of a multimedia application that involves speech is one that allows voice annotations about a text document to be saved with the document.

7. Conclusion
Linear Predictive Coding is an analysis/synthesis technique for lossy speech compression that attempts to model the human production of sound instead of transmitting an estimate of the sound wave itself. Linear predictive coding encoders work by breaking up a sound signal into segments and then sending information about each segment to the decoder. The encoder sends information on whether the segment is voiced or unvoiced, together with the pitch period for voiced segments, which is used to create an excitation signal in the decoder. The encoder also sends information about the vocal tract parameters, which are used to build a filter on the decoder side; when this filter is given the proper excitation signal, it produces a replica of the input signal at its output. Linear predictive coding achieves a bit rate of 2400 bits/second, far below the 64,000 bits/second of uncompressed speech, which makes it ideal for use in secure telephone systems, where the content and meaning of the speech are of more importance than its quality. The trade-off for LPC's low bit rate is that it has some difficulty with certain kinds of sounds.

References
1. B. P. Lathi, Modern Digital and Analog Communication Systems, "Vocoders and video compression," pp. 300-304.
2. http://www.iro.umontreal.ca/~mignotte/IFT3205/APPLETS/HumanSpeechProduction/HumanSpeechProduction.html
3. Same as above.
4. B. P. Lathi, Modern Digital and Analog Communication Systems, "Vocoders and video compression," pp. 300-304.
5. R. Sproat and J. Olive, "Text-to-Speech Synthesis," Digital Signal Processing Handbook (1999), pp. 9-11.
6. http://www.emptyloop.com/technotes/A%20tutorial%20on%20linear%20prediction%20and%20Levinson-Durbin.pdf
7. Richard Wolfson and Jay Pasachoff, Physics for Scientists and Engineers (1995), pp. 376-377.