
GSM – 10EC843 Unit - 4

Unit - 4
Speech Coding in GSM
Speech Coding Methods

Speech coding is the process of reducing the bit rate of digital speech for transmission or
storage, while maintaining a speech quality that is acceptable for the application.

Speech coding is an application of data compression of digital audio signals containing speech.
Speech coding uses speech-specific parameter estimation using audio signal processing
techniques to model the speech signal, combined with generic data compression algorithms to
represent the resulting modeled parameters in a compact bitstream.

Speech coding methods can be classified as:


• Waveform coding
• Source coding
• Hybrid coding

In Figure 1, the bit rate is plotted on a logarithmic axis vs. speech quality classes of poor to
excellent, corresponding to the five-point mean-opinion score (MOS) scale values of 1 to 5, as
defined by ITU. It may be noted that for low complexity and low delay, a bit rate of 32 to 64
kbps is required. This suggests the use of waveform codecs. However, for a low bit rate between
4 and 8 kbps, hybrid codecs should be used. These types of codecs tend to be complex with high
delay.

Fig 1: Bit rate vs QoS


-1 -
Shwetha M, Assistant Professor, Department of ECE, Sai Vidya Institute of Technology, Bengaluru – 560 064
m.shwetha@saividya.ac.in

Speech Codec Attributes

Speech quality as produced by a speech codec is a function of transmission bit rate, complexity,
delay, and bandwidth. Therefore, when considering speech codecs it is essential to consider all
these attributes. It should be noted that there is a strong interaction between all these attributes
and that they can be traded off against each other.

1. Transmission Bit Rate


• Since the speech codec shares the communications channel with other data, the peak bit rate should be as low as possible so as not to use a disproportionate share of the channel.
• Codecs below 64 kbps were developed to increase the capacity of equipment used for narrow-bandwidth links. For the most part, codecs are fixed-bit-rate codecs, meaning they operate at the same rate regardless of the input.
• In variable-bit-rate codecs, network loading and voice activity determine the instantaneous rate assigned to a particular voice channel.
• Any fixed-rate speech codec can be combined with a voice activity detector (VAD) and made into a simple two-state variable-bit-rate system.
• The lower rate could be either zero or some low rate needed to characterize slowly changing background noise.
• Either way, the bandwidth of the communications channel is used only for active speech.
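The two-state scheme above can be sketched as follows. The rates, the energy-based activity measure, and the threshold are illustrative assumptions, not values from any standard:

```python
# Sketch of a two-state variable-bit-rate scheme: a fixed-rate codec
# combined with a simple energy-based voice activity detector (VAD).
# Rates and threshold are illustrative only.

ACTIVE_RATE_BPS = 13000   # full rate used during active speech
SILENCE_RATE_BPS = 500    # low rate describing background noise

def frame_energy(samples):
    """Mean squared amplitude of one speech frame."""
    return sum(s * s for s in samples) / len(samples)

def select_rate(samples, threshold=0.01):
    """Return the bit rate for this frame based on a crude VAD decision."""
    return ACTIVE_RATE_BPS if frame_energy(samples) > threshold else SILENCE_RATE_BPS

print(select_rate([0.5, -0.4, 0.6]))   # loud frame -> 13000
print(select_rate([0.001, -0.002]))    # near-silent frame -> 500
```

A real VAD also applies hangover logic so that trailing low-energy syllables are not clipped.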

2. Delay

• The delay of a speech codec can have a great impact on its suitability for a particular application. When the one-way delay of a conversation exceeds 300 ms, the conversation becomes more like a half-duplex or push-to-talk experience than an ordinary conversation.
• The components of total system delay include frame size, look-ahead, multiplexing delay, processing delay for computations, and transmission delay.
• Most low-bit-rate speech codecs process one frame of speech data at a time. The speech parameters are updated and transmitted for every frame.

• In addition, to analyze the data properly it is sometimes necessary to analyze data beyond the frame boundary. Hence, before the speech can be analyzed, it is necessary to buffer a frame of data.
• The resulting delay is referred to as algorithmic delay.
• This delay component cannot be reduced by changing the implementation; all other delay components can be.
• The second major contribution to delay comes from the time taken by the encoder to analyze the speech and the decoder to reconstruct it. This part of the delay is referred to as processing delay.
• It depends on the speed of the hardware used to implement the codec.
• The sum of the algorithmic and processing delays is called the one-way codec delay.
• The third component of delay is due to transmission. It is the time taken for an entire frame of data to be transmitted from the encoder to the decoder.
• The total of the three delays is the one-way system delay.
• In addition, there is a frame interleaving delay, which adds approximately 20 ms to the transmission delay. Frame interleaving is necessary to combat channel fading.

3. Complexity

• Speech codecs are implemented in special-purpose hardware, such as digital signal processors (DSPs). Their attributes are described by computing speed in millions of instructions per second (MIPS), random-access memory (RAM) size, and read-only memory (ROM) size.
• For a speech codec, the system designer makes a choice about how much of these resources to allocate to the codec.
• Speech codecs requiring less than 15 MIPS are considered to be of low complexity; those requiring 30 MIPS or more are considered high complexity. More complexity results in higher cost and greater power consumption; for portable applications, greater power consumption means reduced time between battery recharges or using larger batteries, which means more expense and weight.

4. Quality
• Of all the attributes, quality has the most dimensions. Figure 1 is a typical plot that relates the performance of various coding schemes.
• In many applications there are large amounts of background noise (such as car noise, street noise, office noise, and so on).
• All of these affect the perceived quality of the codec.

LPAS (Linear prediction analysis-by-synthesis)

• LPAS methods provide efficient speech coding at rates between 4 and 16 kbps.
• In LPAS speech codecs, the speech is divided into frames of about 20 ms in length, for which the linear prediction (LP) coefficients are computed. The resulting LP filter predicts each sample from a set of previous samples.
• In analysis-by-synthesis speech codecs, the residual signal is quantized on a subframe-by-subframe basis (there are commonly 2 to 8 subframes per frame).
• The resulting quantized signal forms the excitation signal for the LP synthesis filter. For each subframe, a criterion is used to select the best excitation signal from a set of trial excitation signals.
• The criterion compares the original speech signal with trial reconstructed speech signals. Because of the synthesis implicit in the evaluation criterion, the method is called analysis-by-synthesis (ABS) coding.
• For lower bit rates, the most efficient representation is achieved by using vector quantization. For each subframe, the excitation signal is selected from a multitude of vectors that are stored in a codebook. The index of the best-matching vector is transmitted. At the receiver, this vector is retrieved from the same codebook. The resulting excitation signal is filtered through the synthesis filter to produce the reconstructed speech. Linear prediction ABS codecs using a codebook approach are commonly known as CELP codecs.
• Parametric codecs are traditionally used at low bit rates. A proper understanding of the speech signal and its perception is essential to obtain good speech quality with a parametric codec.


• Parametric coding is used for those aspects of the speech signal that are well understood, while the waveform-matching procedure is employed for those aspects that are not. Waveform-matching constraints are relaxed for those aspects that can be replaced by parametric models without degrading the quality of the reconstructed speech.
• Piecewise linear interpolation of the pitch does not degrade speech quality. The pitch period is determined once every 20 ms and linearly interpolated between the updates.
• The challenge is to generalize the LPAS method such that its matching accuracy becomes independent of the synthetic pitch-period contour used. This is done by determining a time warp of the speech signal such that its pitch-period contour matches the synthetic pitch-period contour.
• The time warps are determined by comparing a multitude of time-warped original signals with a synthesized signal. This coding scheme is called generalized ABS coding and is referred to as Relaxed CELP (RCELP).
• The generalization relaxes the waveform-matching constraints without affecting speech quality.
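The short-term LP prediction step at the heart of LPAS coding can be sketched as follows. The signal and the single predictor coefficient are illustrative; a real codec computes the coefficients from the speech itself:

```python
# Minimal sketch of short-term linear prediction: each sample is predicted
# as a weighted sum of previous samples, and the residual e[n] is what an
# LPAS codec would go on to quantize. The coefficient is illustrative.

def lp_residual(signal, coeffs):
    """Return the prediction residual e[n] = x[n] - sum_k a_k * x[n-1-k]."""
    p = len(coeffs)
    residual = []
    for n in range(len(signal)):
        pred = sum(coeffs[k] * signal[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        residual.append(signal[n] - pred)
    return residual

x = [1.0, 0.9, 0.81, 0.729]   # strongly correlated signal
e = lp_residual(x, [0.9])     # one-tap predictor with a1 = 0.9
print(e)                      # residual is near zero after the first sample
```

Because the residual has much lower variance than the signal, it can be encoded with fewer bits.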

ITU-T STANDARDS

Table 1 shows the different types of codecs that have been standardized by the ITU.
Table 1: Comparison of ITU speech codecs

1. Bit rate

For G.729, the nominal bit rate of the codec is 8 kbps, including the overhead for the codec-inherent error correction. The adaptive variable-bit-rate capability allows the codec to operate in circuit multiplication equipment. Thus, during periods of congestion, operation continues at 6.4
kbps with a graceful degradation of speech quality. In contrast, when bandwidth is available, the
bit rate increases to 9.6 kbps to improve the performance.

2. Delay

The requirement for the coder/decoder delay is important for two reasons. First, an excessive
delay affects the quality of a conversation. Second, echo control devices are required if the total
one-way delay exceeds 25 ms. The algorithmic delay was set at 16 ms as a requirement for G.729
and 5 ms as an objective. The corresponding limits for the total coder/decoder delay were set at
32 ms and 10 ms, respectively. The algorithmic delay includes the frame size delay plus any
other delay inherent in the algorithm, such as a possible look-ahead time. The frame size was
specified to be 10 ms, and G.729 has a 5-ms look-ahead time. Assuming an 8-ms processing delay
and an 8-ms transmission delay, the one-way system delay of G.729 is about 32 ms.
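The delay budget quoted above can be checked with a simple sum; the processing and transmission figures are the assumed values from the text:

```python
# Delay budget for G.729 using the figures quoted above.

frame_ms = 10          # frame size
look_ahead_ms = 5      # G.729 look-ahead
processing_ms = 8      # assumed encoder+decoder processing delay
transmission_ms = 8    # assumed transmission delay

algorithmic_ms = frame_ms + look_ahead_ms          # 15 ms
one_way_system_ms = algorithmic_ms + processing_ms + transmission_ms
print(one_way_system_ms)   # 31 ms, i.e. the "about 32 ms" figure above
```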

3. Complexity

The codec's complexity is measured in terms of MIPS and millions of operations per second
(MOPS). From a service viewpoint, power consumption is used as the complexity metric to
reflect what is important to the end user of a portable handset. G.723.1 is of lower complexity
than G.729. The ITU Study Group (SG) 14 requirements for the G.723.1 codec are 10 MIPS,
2000 words of RAM, and 10,000 words of ROM.

WAVEFORM CODING

In general, waveform codecs are designed to be signal-independent. They map the input
waveform of the encoder into a facsimile-like replica of it at the output of the decoder. Coding
efficiency is quite modest. However, the coding efficiency can be improved by exploiting some
statistical signal properties, if the codec parameters are optimized for the most likely categories
of input signals, while still maintaining good quality for other types of signals as well. Waveform
codecs can be further subdivided into time domain and frequency domain waveform codecs.


1. Time domain waveform coding

The best-known representation of the speech signal using time domain waveform coding is the A-law (in Europe) or µ-law (in North America) companded PCM at 64 kbps (Figure 2). Both use nonlinear companding characteristics to give a near-constant SNR over the total input dynamic range.
Fig 2: PCM
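The µ-law companding characteristic mentioned above can be sketched as follows; this shows the continuous compression curve only, with the 8-bit uniform quantization of the companded value omitted for brevity:

```python
import math

# Sketch of mu-law companding (mu = 255, as used in North American PCM).
# Small amplitudes are expanded before uniform quantization, giving a
# near-constant SNR across the input dynamic range.

MU = 255.0

def mu_law_compress(x):
    """Map a linear sample x in [-1, 1] to the companded domain."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse mapping back to the linear domain."""
    return math.copysign((math.exp(abs(y) * math.log1p(MU)) - 1) / MU, y)

x = 0.1
y = mu_law_compress(x)             # small amplitude is boosted
print(round(y, 3))                 # ~0.591
print(round(mu_law_expand(y), 6))  # recovers ~0.1
```

A-law companding follows the same idea with a piecewise linear-logarithmic curve.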

The ITU G.721, 32-kbps ADPCM codec is an example of a time domain waveform codec. More
flexible counterparts of the G.721 are the G.726 and G.727 codecs. The G.726 codec is a
variable-rate arrangement for bit rates between 16 and 40 kbps. This may be advantageous in
various networking applications to allow speech quality and bit rate to be adjusted on the basis of
the instantaneous requirement. The G.727 codec uses core bits and enhancement bits in its bitstream to allow the network to drop the enhancement bits under restricted channel capacity conditions, while benefitting from them when the network is lightly loaded.

In differential codecs, a linear combination of the last few samples is used to generate an estimate
of the current one; this occurs in the adaptive predictor. The resulting difference signal (i.e.,
the prediction residual) is computed and encoded by an adaptive quantizer with a lower
number of bits than the original signal, since it has a lower variance than the incoming signal.
For a sampling rate of 8000 samples per second, an 8-bit PCM sample is represented by a 4-bit
ADPCM sample, giving a transmission rate of 32 kbps.
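The differential principle can be sketched with a toy codec; unlike real ADPCM (e.g. G.726), both the predictor and the quantizer step size here are fixed, purely for illustration:

```python
# Toy differential codec in the spirit of ADPCM: predict the current
# sample from the previous reconstructed sample and quantize only the
# residual. A real ADPCM codec adapts both predictor and quantizer.

STEP = 0.05  # fixed quantizer step size (illustrative)

def encode(signal):
    """Return quantized residual codes; tracks the decoder's state to stay in sync."""
    prev, codes = 0.0, []
    for x in signal:
        residual = x - prev               # prediction error
        code = round(residual / STEP)     # uniform quantization
        codes.append(code)
        prev = prev + code * STEP         # decoder's reconstruction
    return codes

def decode(codes):
    prev, out = 0.0, []
    for code in codes:
        prev = prev + code * STEP
        out.append(prev)
    return out

x = [0.0, 0.1, 0.22, 0.3]
print(decode(encode(x)))   # close to x, within one quantizer step
```

Because the residual codes span a smaller range than raw samples, they need fewer bits each.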

2. Frequency domain waveform coding

In frequency domain waveform codecs, the input signal undergoes a more or less accurate short-
time spectral analysis. The signal is split into a number of frequency domain sub-bands. The
individual sub-band signals are then encoded by using different numbers of bits to fulfill the
quality requirements of that band based on its prominence. The various schemes differ in their
accuracies of spectral analysis and in the bit allocation principle (fixed, adaptive, semiadaptive).
Two well-known representatives of this class are sub-band coding (SBC) and adaptive transform
coding (ATC).
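The adaptive bit-allocation idea behind sub-band coding can be sketched as follows. The band energies, the bit budget, and the proportional rule are all illustrative; real SBC codecs use perceptually motivated log-energy allocation rules:

```python
# Sketch of adaptive bit allocation across sub-bands: bands with more
# signal energy (prominence) receive more bits. Values are illustrative.

def allocate_bits(band_energies, total_bits):
    """Proportional allocation; real SBC codecs use log-energy rules."""
    total_energy = sum(band_energies)
    alloc = [int(total_bits * e / total_energy) for e in band_energies]
    # hand out any bits lost to rounding, starting with the strongest band
    leftovers = total_bits - sum(alloc)
    for i in sorted(range(len(alloc)), key=lambda i: -band_energies[i])[:leftovers]:
        alloc[i] += 1
    return alloc

print(allocate_bits([8.0, 4.0, 2.0, 2.0], 32))   # strongest band gets most bits
```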

3. Vocoders

Vocoders are parametric digitizers that exploit certain properties of the human speech production
mechanism. Human speech is produced by emitting sound pressure waves that are radiated
primarily from the lips, although significant energy also emanates, in the case of some sounds, from the
nostrils, throat, and other areas. In human speech, the air compressed by the lungs excites the
vocal cords in two typical modes. When generating voiced sounds, the vocal cords vibrate and
generate quasi-periodic sounds. In the case of lower-energy unvoiced sounds, the vocal cords
do not participate in the voice production, and the source acts like a noise generator. The
excitation signal is then filtered through the vocal apparatus, which behaves like a spectral
shaping filter. This can be described adequately by an all-pole transfer function constituted
by the spectral shaping action of the glottis, vocal tract, lip radiation characteristics, and so
on.

In the case of vocoders, instead of producing a close replica of an input signal at the output, an
appropriate set of source parameters is generated to characterize the input signal sufficiently
close for a given period of time. The following steps are used in this process.

i. The speech signal is segmented into quasi-stationary segments of 5-20 ms.

ii. The speech segments are subjected to spectral analysis to produce the coefficients of the all-zero
analysis filter that minimize the prediction residual energy. This process is based on computing
the speech autocorrelation coefficients and then using either matrix inversion or an iterative
scheme.

iii. The corresponding source parameters are specified. The excitation parameters as well as the
filter coefficients are quantized and transmitted to the decoder, which synthesizes a replica of the
original signal by exciting the all-pole synthesis filter.
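Step ii can be sketched concretely: the autocorrelation coefficients are computed, then the Levinson-Durbin recursion (the "iterative scheme" mentioned above) solves for the LP coefficients. The test signal is an artificial first-order recursion, chosen so the answer is known:

```python
# Sketch of LP analysis: autocorrelation followed by the Levinson-Durbin
# recursion to obtain the coefficients of the all-zero analysis filter.

def autocorrelation(x, order):
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Iteratively solve the normal equations; returns LP coefficients a_1..a_p."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                    # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1 - k * k)               # prediction error shrinks each order
    return a[1:]

x = [0.9 ** n for n in range(50)]        # signal obeying x[n] = 0.9 * x[n-1]
coeffs = levinson_durbin(autocorrelation(x, 1), 1)
print([round(c, 2) for c in coeffs])     # -> [0.9]
```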

The quality of this type of scheme is determined by the accuracy of the source model rather than
the accuracy of the quantization of the parameters; the speech quality is limited by the fidelity of
the source model used. The main advantage of vocoders is their low bit rate, at the penalty of
relatively synthetic speech quality. Vocoders can be classified into frequency and time
domain subclasses; frequency domain vocoders are generally more effective than time
domain vocoders.

4. Hybrid Coding

Hybrid coding is an attractive trade-off between waveform coding and vocoders, both in terms of
speech quality and transmission bit rate, although generally at the price of higher complexity.
Such codecs are also referred to as ABS codecs.

Most recent international and regional speech coding standards belong to the class of LPAS
codecs. This class includes ITU G.723.1, G.728, and G.729, as well as all the current
digital cellular standards, including

• Europe: GSM full-rate, half-rate, and enhanced full-rate
• North America: Full-rate, half-rate, and enhanced full-rate for TDMA IS-136 and CDMA IS-95 systems
• Japan: PDC full-rate and half-rate

• In an LPAS coder, the decoded speech is produced by filtering the signal produced by the excitation generator through both a long-term and a short-term predictor synthesis filter.
• The excitation signal is found by minimizing the mean-squared error over a block of samples. The error signal is the difference between the original and decoded signals. It is weighted by filtering it through a weighting filter. Both short-term and long-term predictors are adapted over time.
• Since the analysis procedure (encoder) includes the synthesis procedure (decoder), the description of the encoder defines the decoder. The short-term synthesis filter models the short-term correlations (spectral envelope) in the speech signal. This is an all-pole filter.
• The predictor coefficients are determined from the speech signal using LP techniques. The coefficients of the short-term predictor are adapted in time, at rates varying from 30 to as high as 400 times per second.
• The long-term predictor filter models the long-term correlations (fine spectral structure) in the speech signal. Its parameters are a delay and a gain coefficient. For a periodic signal, the delay corresponds to the pitch period (or possibly an integral number of pitch periods).
• The delay is random for nonperiodic signals. Typically, the long-term predictor coefficients are adapted at rates varying from 100 to 200 times per second.
• An alternative structure for the pitch filter is the adaptive codebook. In this case, the long-term synthesis filter is replaced by a codebook that contains the previous excitation at different delays. The resulting vectors are searched, and the one that provides the best result is selected. In addition, an optimal scaling factor is determined for the selected vector. This representation simplifies the determination of the excitation for delays smaller than the length of the excitation frames.
• CELP coders use another approach to reduce the number of bits per sample. Both encoder and decoder store the same collection of code vectors, each of length L, in a codebook. The excitation for each frame is described completely by the index of an appropriate vector. This index is found by an exhaustive search over all possible codebook vectors, selecting the one that gives the smallest error between the original and decoded signals.


• To simplify the search, it is common to use a gain-shape codebook in which the gain is searched and quantized separately. Further simplifications are obtained by populating the codebook vectors with a multipulse structure.
• By using only a few nonzero unit pulses in each codebook vector, efficient search procedures are derived. This partitioning of the excitation space is referred to as an algebraic codebook, and the excitation method is known as ACELP.
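The exhaustive gain-shape codebook search described above can be sketched as follows. The tiny codebook and target vector are invented for illustration; a real coder searches hundreds of vectors in the perceptually weighted domain:

```python
# Sketch of the gain-shape codebook search in a CELP coder: for each
# codebook vector the optimal gain has a closed form, and the vector
# with the smallest squared error is selected.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(codebook, target):
    """Return (index, gain) minimizing ||target - gain * codeword||^2."""
    best = None
    for i, c in enumerate(codebook):
        gain = dot(target, c) / dot(c, c)              # optimal scaling factor
        err = dot(target, target) - gain * dot(target, c)
        if best is None or err < best[0]:
            best = (err, i, gain)
    return best[1], best[2]

codebook = [[1.0, 0.0, 0.0, 0.0],      # a few "shape" vectors
            [1.0, 1.0, 1.0, 1.0],
            [1.0, -1.0, 1.0, -1.0]]
target = [2.0, -2.1, 1.9, -2.0]
idx, gain = search(codebook, target)
print(idx)   # the alternating vector matches best -> 2
```

Only the winning index and the quantized gain are transmitted, which is what makes the scheme so bit-efficient.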

GSM VOCODERS

1. Full-Rate Vocoders

The GSM full-rate vocoder is referred to as a Linear Predictive Coding with Regular Pulse Excitation
(LPC-RPE) codec. Figure 3 shows its simplified block diagram.

• The encoder processes 20-ms blocks of speech. Each speech block contains 260 bits, so the output rate is 13 kbps.
• The encoder has three major parts: linear prediction analysis, long-term prediction, and excitation analysis.
• The linear prediction analysis uses an 8-tap filter characterized by 8 log-area ratios, each represented by 3, 4, 5, or 6 bits. In total, the 8 log-area ratios are represented by 36 bits.
• The long-term predictor estimates pitch and gain four times per frame, at 5-ms intervals. Each estimate provides a lag coefficient of 7 bits and a gain coefficient of 2 bits.
• The remaining 188 bits are derived from the regular pulse excitation analysis, which, like the long-term predictor, operates over 5-ms intervals, resulting in 47 bits per sub-block.
• The distribution of bits used in a GSM full-rate vocoder is given in Table 2.

Table 2: GSM Full-rate vocoder (13 kbps)
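The bit budget above can be verified with simple arithmetic:

```python
# Check of the GSM full-rate bit budget: 36 LAR bits plus four 5-ms
# sub-blocks of long-term prediction and RPE bits per 20-ms frame.

lar_bits = 36                 # 8 log-area ratios
ltp_bits = 4 * (7 + 2)        # lag (7 bits) + gain (2 bits), four times per frame
rpe_bits = 4 * 47             # regular pulse excitation, 47 bits per sub-block

frame_bits = lar_bits + ltp_bits + rpe_bits
frames_per_second = 50        # one 20-ms frame every 1/50 s
bit_rate = frame_bits * frames_per_second
print(frame_bits, bit_rate)   # 260 bits per frame, 13000 bps = 13 kbps
```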


Fig 3: GSM full-rate LPC-RPE vocoders

2. Half-Rate Vocoder
• The half-rate GSM vocoder, at 5.6 kbps, is a VSELP codec. This is a close relative of the CELP family; a slight difference is that VSELP uses more than one excitation codebook, each separately scaled by its respective excitation gain factor.
• The half-rate GSM vocoder operates in one of four different modes, depending on the degree of voicing detected in the speech.
• Table 3 provides a summary of the synthesis modes 0-3.
• The speech spectral envelope is encoded using 28 bits per 20-ms frame for the vector quantization of the LPC coefficients.
• The four different synthesis modes correspond to different excitation modes.
• Two bits per frame are assigned for excitation mode selection. The decision as to which excitation mode should be used is based on the long-term predictor (LTP) gain. The LTP gain is typically high for highly correlated voiced segments and low for noise-like uncorrelated unvoiced segments.


• In the unvoiced mode 0, speech is synthesized by superimposing the outputs of two 128-entry fixed codebooks to generate the excitation signal, which is then filtered through the synthesis filter and spectral postfilter. Excitation codebooks 1 and 2 each have a 7-bit address in each of the 5-ms subframes.
• In modes 1-3, where the input speech exhibits different grades of voicing, the excitation is generated by superimposing the output of a 512-entry fixed trained codebook onto the adaptive codebook.
• The fixed codebook in modes 1-3 requires a 9-bit address, giving a total of 36 coding bits for the 20-ms frame. The adaptive codebook delay, or long-term predictor delay (LTPD), is encoded using 8 bits to allow for 256 different LTP delay values.
• In order to minimize the variance of the LTP residual, the codec uses a range of integer and noninteger LTP delay positions.
• A further economy is achieved by encoding the LTPD in consecutive subframes differentially with respect to the previous subframe's delay. This is indicated by ΔLTPD in Table 3.
• Four encoding bits are used to allow for a maximum difference of positions with respect to the previous LTPD values.
• The energy of each subframe Es is normalized with respect to the frame energy EF, which is vector quantized using 5 bits per 5-ms subframe to allow for 32 possible combinations.
• Thus, a total of 20 bits per frame are assigned to encoding the gain-related information.
Table 3: GSM Half-rate Vocoders
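The voicing-based mode selection can be sketched as follows. The thresholds below are invented purely for illustration; the half-rate standard defines its own decision rules:

```python
# Sketch of the mode-selection idea in the half-rate codec: the LTP gain
# is high for strongly voiced (correlated) segments and low for unvoiced
# ones. The threshold values are illustrative assumptions only.

def select_mode(ltp_gain):
    """Map an LTP gain estimate to a synthesis mode 0 (unvoiced) .. 3 (voiced)."""
    if ltp_gain < 0.3:
        return 0        # unvoiced: two fixed codebooks
    elif ltp_gain < 0.5:
        return 1
    elif ltp_gain < 0.7:
        return 2
    return 3            # strongly voiced: adaptive + fixed codebook

print(select_mode(0.1), select_mode(0.9))   # -> 0 3
```

The two mode bits transmitted per frame tell the decoder which excitation structure to use.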

