Data Compression
• Lossless Compression
• Lossy Compression
Figure 1.2: The process of data compression
'D' are connected first, followed by 'C' and 'D'. The new parent nodes have frequencies 16 and 22, respectively, and are brought together in the next step. The resulting node and the remaining symbol 'A' are subordinated to the root node, which is created in a final step [2].
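As an illustration of this bottom-up construction, the following Python sketch builds a Huffman tree with a min-heap and then reads the codes off the tree. The symbol frequencies are hypothetical, chosen only to loosely mirror the merges described above; they are not the values from the figure.

    import heapq

    def huffman_codes(freqs):
        """Build a Huffman tree bottom-up: repeatedly merge the two
        lowest-frequency nodes, then read the codes off the tree."""
        # Heap entries: (frequency, tiebreaker, symbol or (left, right) subtree)
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # lowest-frequency node
            f2, _, right = heapq.heappop(heap)   # second-lowest node
            heapq.heappush(heap, (f1 + f2, count, (left, right)))
            count += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: recurse
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                # leaf: assign the code
                codes[node] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    # Hypothetical frequencies: 'B' and 'D' merge first (parent 16), then
    # 'C' and 'E' (parent 22); the two parents merge (38) and join 'A' at the root.
    print(huffman_codes({"A": 62, "B": 7, "C": 10, "D": 9, "E": 12}))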
Figure 1.3: Huffman Tree
Figure 1.4: Shannon-Fano Compression
advanced compression methods, but the advantage of RLE is that it is easy to implement and quick to execute, making it a good alternative to a complex compression algorithm. [3]
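As a minimal illustration of the technique, the sketch below encodes a string as (character, run-length) pairs and decodes it back; a production coder would pack the pairs into bytes rather than keep Python tuples.

    def rle_encode(data: str) -> list:
        """Run-length encode: collapse each run of repeated characters
        into a (character, run_length) pair."""
        runs = []
        for ch in data:
            if runs and runs[-1][0] == ch:
                runs[-1][1] += 1
            else:
                runs.append([ch, 1])
        return [(ch, n) for ch, n in runs]

    def rle_decode(runs: list) -> str:
        """Invert the encoding by expanding each run."""
        return "".join(ch * n for ch, n in runs)

    encoded = rle_encode("AAAABBBCCD")
    print(encoded)                      # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
    assert rle_decode(encoded) == "AAAABBBCCD"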
bits. In the second phase, the incremental compression algorithm stores the length of the prefix that the current symbol shares with the previous symbol and replaces that prefix with an integer value. This two-phase encoding technique can reduce the size of sorted data by roughly 50-80%. [2]
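A minimal sketch of the second phase described above, assuming plain Python strings: each word in a sorted list is replaced by the length of the prefix it shares with its predecessor plus the remaining suffix (the first phase's bit-level packing is omitted here).

    def front_encode(sorted_words):
        """Incremental (front) coding of a sorted list: each word is stored
        as (length of prefix shared with the previous word, remaining suffix)."""
        out, prev = [], ""
        for w in sorted_words:
            k = 0
            while k < min(len(prev), len(w)) and prev[k] == w[k]:
                k += 1
            out.append((k, w[k:]))
            prev = w
        return out

    def front_decode(pairs):
        """Rebuild each word from the previous word's prefix plus the suffix."""
        words, prev = [], ""
        for k, suffix in pairs:
            prev = prev[:k] + suffix
            words.append(prev)
        return words

    words = ["compress", "compression", "compressor", "computer"]
    enc = front_encode(words)
    print(enc)   # [(0, 'compress'), (8, 'ion'), (8, 'or'), (4, 'uter')]
    assert front_decode(enc) == words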
10. Adaptive Huffman coding: We use Huffman coding for data compression, but the limitation of Huffman coding is that the probability table must be sent along with the compressed information, because without the probability table decoding is not possible. Adaptive Huffman coding was developed to remove this disadvantage: the model is updated on the fly as symbols arrive, so no table needs to be transmitted. This adds only a few extra bytes to the output, and consequently it usually doesn't make much difference in the compression ratio. [2]
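The sketch below illustrates only the adaptive idea, not the actual FGK/Vitter adaptive Huffman algorithms: encoder and decoder start from identical uniform counts and apply the same update after every symbol, so no table is transmitted. It reuses the huffman_codes helper from the earlier Huffman sketch; rebuilding the tree per symbol is wasteful but keeps the example short.

    from collections import Counter

    def adaptive_encode(data, alphabet):
        """Both sides start from identical uniform counts and update them
        after every symbol, so no frequency table needs to be sent.
        (Real adaptive Huffman coders update the tree incrementally
        instead of rebuilding it from scratch each time.)"""
        counts = Counter({sym: 1 for sym in alphabet})
        bits = []
        for ch in data:
            codes = huffman_codes(counts)   # tree from the counts seen so far
            bits.append(codes[ch])
            counts[ch] += 1                 # the decoder makes the same update
        return "".join(bits)

    print(adaptive_encode("ABRACADABRA", alphabet="ABCDR"))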
2. Discrete Cosine Transform (DCT): A discrete cosine transform expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. The DCT is a lossy compression technique that is widely used in image and audio compression. DCTs convert data into the sum of a series of cosine waves oscillating at different frequencies. They are very similar to Fourier transforms, but the DCT uses only cosine functions and is more efficient, as fewer functions are needed to approximate a typical signal. [5]
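The following sketch implements the DCT-II sum X[k] = sum_n x[n] cos(pi/N (n + 0.5) k) directly (matching standard definitions up to a scale factor) and shows the energy-compaction property that makes the transform useful for lossy coding; the smooth test signal is an arbitrary example.

    import numpy as np

    def dct2(x):
        """DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
        N = len(x)
        n = np.arange(N)
        return np.array([np.sum(x * np.cos(np.pi / N * (n + 0.5) * k))
                         for k in range(N)])

    # A smooth signal concentrates its energy in a few low-frequency
    # coefficients, which is what makes the DCT useful for lossy coding:
    # small high-frequency coefficients can be quantized away or dropped.
    t = np.linspace(0, 1, 32, endpoint=False)
    x = np.cos(2 * np.pi * t) + 0.3 * np.cos(6 * np.pi * t)
    X = dct2(x)
    print(np.round(X, 2))   # energy concentrated in the first few coefficients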
Chapter 2
Speech Compression
2. Differential PCM: In a differential PCM system, the sampled input signal is stored in a predictor and passed through a differentiator; the differentiator compares the previous sample with the current sample and sends this difference to the quantizing and coding stage of PCM. Each sample is compared to a prediction, and the difference is called the prediction residual. Since the difference between input samples is smaller than a whole input sample, it can be coded with fewer bits.
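A minimal DPCM sketch along the lines described above: the predictor here is simply the previous reconstructed sample, and the quantizer is a uniform one with an assumed step size; real codecs use better predictors and adaptive quantizers.

    import numpy as np

    def dpcm_encode(samples, step=0.1):
        """Each sample is predicted by the previous reconstructed sample;
        only the quantized prediction residual is transmitted."""
        residuals, prediction = [], 0.0
        for s in samples:
            residual = s - prediction                 # difference signal
            q = int(round(residual / step))           # uniform quantizer
            residuals.append(q)
            prediction += q * step                    # track decoder's reconstruction
        return residuals

    def dpcm_decode(residuals, step=0.1):
        """Rebuild the signal by accumulating the dequantized residuals."""
        out, prediction = [], 0.0
        for q in residuals:
            prediction += q * step
            out.append(prediction)
        return out

    x = np.sin(np.linspace(0, np.pi, 20))
    codes = dpcm_encode(x)
    print(codes)  # small integers: residuals need fewer bits than raw samples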
Accordingly, the sampling rate is directly related to signal quality. Applying adaptive techniques to the delta modulation quantizer allows for continuous step-size adjustment. By adjusting the quantization step size, the coder is able to represent low-amplitude signals with greater accuracy.[7]
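A sketch of this adaptive step-size idea, with assumed growth/shrink factors: the coder transmits one bit per sample and enlarges the step on runs of identical bits (slope overload) while shrinking it when the bits alternate (granular noise).

    def adm_encode(samples, step=0.1, grow=1.5, shrink=0.5, min_step=0.01):
        """Adaptive delta modulation: transmit one bit per sample (is the
        signal above or below the running estimate?) and adapt the step,
        growing it on consecutive equal bits and shrinking it otherwise."""
        bits, estimate, prev_bit = [], 0.0, None
        for s in samples:
            bit = 1 if s >= estimate else 0
            if prev_bit is not None:
                step = step * grow if bit == prev_bit else max(step * shrink, min_step)
            estimate += step if bit else -step
            bits.append(bit)
            prev_bit = bit
        return bits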
5. Sub Band Coding: Sub-band coding is a waveform coding technique of the spectral-domain type. It is the process of decomposing the speech signal, where speech is usually divided into four or eight sub-bands by a bank of filters, and each sub-band is sampled at a band-pass Nyquist rate and encoded with different accuracy according to perceptual criteria. To reduce the number of samples, the sampling rate of the signal in each sub-band is reduced by decimation.
Sub-band coding can be used for coding speech at bit rates in the range of 9.6 Kbps to 32 Kbps. In this range speech quality is roughly equivalent to that of ADPCM at an equivalent bit rate. The advantage of sub-band coding is that each of the sub-bands can be encoded separately using different coders based on perceptual criteria.
A sub-band spectral analysis technique has been established that significantly reduces the complexity of computing the perceptual model.[7]
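A minimal two-band version of this decomposition, assuming NumPy/SciPy and simple FIR filters (a real coder would use quadrature mirror filters and more bands): the signal is split at a quarter of the sampling rate and each band is decimated by 2.

    import numpy as np
    from scipy.signal import firwin, lfilter

    def two_band_split(x, numtaps=63):
        """Split a signal into low and high bands with FIR filters, then
        decimate each band by 2 so the total sample count is unchanged.
        Each band could then be quantized with different accuracy."""
        lo = firwin(numtaps, 0.5)                    # low-pass at fs/4
        hi = firwin(numtaps, 0.5, pass_zero=False)   # high-pass at fs/4
        low_band = lfilter(lo, 1.0, x)[::2]          # filter, then decimate by 2
        high_band = lfilter(hi, 1.0, x)[::2]
        return low_band, high_band

    fs = 8000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 300 * t) + 0.2 * np.sin(2 * np.pi * 3000 * t)
    low, high = two_band_split(x)
    # Most speech energy lands in the low band, so it would get more bits.
    print(len(low), len(high))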
Figure 2.3: Block Diagram of Sub-band Coding
Parameter Coding:
1. Linear Predictive Coding: Linear predictive coding (LPC) is a prevalent, good-quality, low bit-rate speech analysis and compression technique for encoding a speech signal. The basic approach is to find a set of predictor coefficients that minimize the mean squared error over a short segment of the speech waveform. It has two main components: LPC analysis (encoding) and LPC synthesis (decoding). The goal of the LPC analysis is to estimate whether the speech signal is voiced or unvoiced, to find the pitch of each frame, and to determine the parameters needed to build the source-filter model. These parameters are transmitted to the receiver, which carries out LPC synthesis using the received parameters. LPC is an autoregressive method of speech coding, in which the speech signal at a given instant is represented by a linear combination of previous samples; linear prediction estimates the current sample by linearly combining the past few samples. The autocorrelation and covariance methods have mostly been used to determine the LP coefficients. Speech coding or compression is generally conducted with the use of voice coders, or vocoders.[7]
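A sketch of the autocorrelation method mentioned above: the normal equations are solved with the Levinson-Durbin recursion, and the resulting prediction residual has far less energy than the frame itself. The frame length, order, and toy sinusoidal "frame" are illustrative assumptions.

    import numpy as np

    def lpc_coefficients(frame, order=10):
        """Autocorrelation-method LPC: window the frame, compute its
        autocorrelation, and solve for the predictor coefficients with
        the Levinson-Durbin recursion."""
        w = frame * np.hamming(len(frame))
        r = np.correlate(w, w, mode="full")[len(w) - 1:]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            # Reflection coefficient from the current prediction error
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err
            prev = a.copy()
            for j in range(1, i):
                a[j] = prev[j] + k * prev[i - j]
            a[i] = k
            err *= 1.0 - k * k
        return a  # A(z) coefficients: residual e[n] = sum_j a[j] * s[n-j]

    # Toy "voiced" frame: a 200 Hz sinusoid at 8 kHz, 240 samples (30 ms)
    frame = np.sin(2 * np.pi * 200 * np.arange(240) / 8000)
    a = lpc_coefficients(frame, order=10)
    residual = np.convolve(frame, a)[:len(frame)]
    print(np.sum(frame ** 2), np.sum(residual ** 2))  # residual energy is far smaller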
are most frequently used during transition regions between voiced and unvoiced segments of the speech signal. In MELP speech coding the input speech signal parameters are estimated first and are then used to synthesize the speech signal at the output.[7]
Hybrid:
1. Code Excited Linear Prediction (CELP): CELP is a well-structured, closed-loop, analysis-by-synthesis hybrid coding technique which combines the advantages of both waveform and parametric techniques to provide a robust low bit-rate speech coder for narrowband and medium-band speech coding. The idea of CELP was conceived as an attempt to improve on the LPC coder. The most popular coding systems in the 4-8 Kbps bit-rate range use CELP (Rhutuja Jage et al., 2016). Searching an excitation codebook to provide a consistent excitation sequence during encoding is the key idea behind CELP's operation. This technique is widely used for toll-quality speech at 16 Kbps. Presently, CELP is used very effectively in MPEG-4 audio speech coding. [7]
The CS-ACELP coder processes input signals on a frame-by-frame and subframe-by-subframe basis. The algorithm exploits the vector quantization method; both the adaptive and fixed codebooks are vector quantized to form a conjugate structure. [7]
VOCODERS:
Vocoder systems are based on the analysis-synthesis technique and are used to imitate human speech. The vocoder was originally developed as a speech coder for telecommunications applications, with the purpose of coding speech for transmission. Vocoders are further classified as channel vocoders and formant vocoders.[7]
Figure 2.6: A Cross Section of the Human Head
much slower than computers or other electronic devices, and this is also true
with regard to speech. The lungs operate slowly, and the vocal tract changes
shape slowly, so the pitch and loudness of speech vary slowly. When speech is
captured by a microphone and is sampled, we find that adjacent samples are
similar, and even samples separated by 20 ms are strongly correlated. This
correlation is the basis of speech compression. The vocal cords can open and
close, and the opening between them is called the glottis. The movements of
the glottis and vocal tract give rise to different types of sound. [9]
The three main types are as follows:
Voiced sounds: These are the sounds we make when we talk. The
vocal cords vibrate, which opens and closes the glottis, thereby sending
pulses of air at varying pressures to the tract, where it is shaped into
sound waves. Varying the shape of the vocal cords and their tension
changes the rate of vibration of the glottis and therefore controls the
pitch of the sound. Recall that the ear is sensitive to sound frequencies from 16 Hz to about 20,000-22,000 Hz. The frequencies of the human
voice, on the other hand, are much more restricted and are generally in
the range of 500 Hz to about 2 kHz. This is equivalent to time periods
of 2 ms to 20 ms, and to a computer, such periods are very long. Thus,
voiced sounds have long-term periodicity, and this is the key to good
speech compression. Figure 2.7a is a typical example of the waveform of a voiced sound.[9]
Unvoiced sounds: These are sounds that are emitted and can be heard, but are not parts of speech. Such a sound is the result of holding the glottis open and forcing air through a constriction in the vocal tract. When an unvoiced sound is sampled, the samples show little correlation and are random or close to random. Figure 2.7b is a typical example of the waveform of an unvoiced sound. [9]
The shape of the vocal tract changes based upon the sounds that we intend to produce. The formant frequency can be defined as the frequency around which there is a high concentration of energy. Statistically, it has been observed that for every kHz there is approximately one formant frequency. Hence, we can observe a total of 3-4 formant frequencies in a human voice frequency range of 4 kHz.
Since the bandwidth of human speech is from 0 to 4 kHz, we sample speech signals at 8 kHz based on the Nyquist criterion to avoid aliasing.[10]
2.2.2 Speech Production Model
Depending on the content of the speech signal (voiced or unvoiced), the speech signal comprises a series of pulses (for voiced sounds) or random noise (for unvoiced sounds). This signal passes through the vocal tract. The vocal tract behaves as a spectral shaping filter, i.e., the frequency response of the vocal tract is imposed on the incoming speech signal. The shape and size of the vocal tract define the frequency response and hence the differences in the voices of people.[10]
Developing an accurate speech production model requires one to develop a filter-based model of the human speech production mechanism. It is presumed that the source of excitation and the vocal tract are independent of each other; therefore, they are both modeled separately. For modelling the vocal tract it is assumed that the vocal tract has fixed characteristics over a 10 ms period of time. Thus, once every 10 ms the vocal tract configuration changes, bringing about new vocal tract parameters (i.e., resonant/formant frequencies).[10]
To build an accurate model for speech production, it is essential to build a filter-based model. The model must precisely represent the following:[10]
S(z) = E(z) · G(z) · A · V(z) · R(z)

where:
S(z) = Speech at the Output of the Model
E(z) = Excitation Model
G(z) = Glottal Model
A = Gain Factor
V(z) = Vocal Tract Model
R(z) = Radiation Model
Excitation Model: The output of the excitation function of the model will vary depending on the type of speech produced. During voiced speech, the excitation consists of a series of impulses, each spaced at an interval of the pitch period. During unvoiced speech, the excitation is a white-noise/random-noise type signal.[10]
Glottal Model: The glottal model is used exclusively for the voiced speech component of human speech. The glottal flow helps distinguish speakers in speech recognition and speech synthesis systems.[10]
Gain Factor: The energy of the sound is dependent on the gain factor.
Generally, the energy for the voiced speech is many times greater than that
of the unvoiced speech.[10]
Vocal Tract Model: A chain of lossless tubes (short and cylindrical in shape) forms the basis of the vocal tract model (as shown in Figure 4 below), each tube with its own resonant frequency. The configuration of the lossless tubes is different for different people; the resonant frequencies depend on the shape of the tubes, and hence the voices of different people differ.[10]
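A toy synthesis following the S(z) model above, assuming SciPy: an impulse-train (voiced) or noise (unvoiced) excitation is passed through an all-pole filter standing in for V(z). The glottal and radiation factors are lumped into a single gain here, and the filter coefficients are made up rather than fitted to real speech.

    import numpy as np
    from scipy.signal import lfilter

    def synthesize(voiced, n=2000, fs=8000, pitch_hz=100, gain=1.0):
        """Toy source-filter synthesis: excitation E(z) shaped by an
        all-pole vocal-tract filter V(z) = 1/A(z). G(z) and R(z) are
        folded into the gain for brevity."""
        if voiced:
            e = np.zeros(n)
            e[::fs // pitch_hz] = 1.0        # impulses at the pitch period
        else:
            e = np.random.randn(n)           # white-noise excitation
        a = [1.0, -1.3, 0.9]                 # made-up, stable A(z) coefficients
        return lfilter([gain], a, e)         # in a codec, a changes every ~10 ms

    voiced_speech = synthesize(voiced=True)
    unvoiced_speech = synthesize(voiced=False)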
The vocal tract model described above is typically used in low bit-rate speech codecs, speech recognition systems, speaker authentication/identification systems, and speech synthesizers as well. It is essential to derive the coefficients of the vocal tract model for every frame of speech. The typical technique used for deriving the coefficients of the vocal tract model in speech codecs is Linear Predictive Coding (LPC). LPC vocoders can achieve a bit rate of 1.2 to 4.8 kbps and hence are categorized as low-quality, moderate-complexity, low bit-rate algorithms.[10]
2.3.2 CELP Systems
This section gives a brief description of three major CELP based standards.
The DoD 4.8 kb/s Speech Coding Standard
The advances in CELP based
speech coding led to the development of the U.S. Department of Defense
(DoD) 4.8 kb/s standard (Federal Standard 1016) [41]. The standard uses a
10th order synthesis filter computed using the autocorrelation method on a
frame size of 240 samples (30ms). The coefficients are quantized using a 34-
bit non-uniform scalar quantization of the LSPs. Each frame is divided into
4 subframes of 60 samples. The excitation is formed from a one-tap adaptive
codebook and a single stochastic codebook using a sequential search. The
stochastic codebook is sparse, ternary, and overlapped by -2 samples. The
adaptive codebook provides for the possibility of using non-integer delays.
The gains are quantized using scalar quantizers.
VSELP
Vector Sum Excited Linear Prediction (VSELP) is the 8
kb/s codec chosen by the Telecommunications Industry Association (TIA)
for the North American digital cellular speech coding standard [4]. VSELP
uses a 10th order synthesis filter and three codebooks: an adaptive codebook,
and two stochastic codebooks. The search of the codebooks is done using
an orthogonalization procedure based on the Gram-Schmidt algorithm. The
excitation codebooks each have 128 vectors obtained as binary linear com-
binations of seven basis vectors. The binary words representing the selected
codevector in each codebook specify the polarities of the linear combination
of basis vectors. Since only the basis vectors of each codebook must be fil-
tered, the search complexity is vastly reduced. The performance of VSELP
is characterized by MOS scores of about 3.7, which is considered to be close to toll quality.
LD-CELP
In 1988, the CCITT established a maximum
delay requirement of 5 ms for a new 16 kb/s speech coding standard. This
resulted in the selection of the LD-CELP algorithm as the CCITT standard
G.728 in 1992 [5]. Classical speech coders must buffer a large block of speech
for linear prediction analysis prior to further signal processing. The synthesis
filter in LD-CELP is based on backward prediction. In this method, the pa-
rameters of the filter are not derived from the original speech, but computed
based on previous reconstructed speech. As such, the synthesis filter can be
derived at both encoder and decoder, thus eliminating the need for quanti-
zation. The backward-adaptive L.P filter used in LD-CELP is 50th order.
The excitation is obtained from a product gain-shape codebook consisting
of a 7-bit shape codebook and a 3-bit backward-adaptive gain quantizer.
LD-CELP achieves toll quality at 16 kb/s with a 5 ms coding delay.
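All three coders rest on the same analysis-by-synthesis codebook search. The sketch below, with an assumed random codebook, a toy synthesis filter, and a fabricated target, shows the core operation: filter every candidate excitation through 1/A(z) and pick the codevector and gain minimizing the squared error against the target subframe. Real coders add an adaptive codebook, perceptual weighting, and fast search structures.

    import numpy as np
    from scipy.signal import lfilter

    def celp_search(target, codebook, a):
        """Analysis-by-synthesis search: pass every candidate excitation
        through the synthesis filter 1/A(z) and keep the codevector and
        gain that minimize the error against the target speech subframe."""
        best = (None, 0.0, np.inf)
        for idx, c in enumerate(codebook):
            y = lfilter([1.0], a, c)                 # synthesized candidate
            g = np.dot(target, y) / np.dot(y, y)     # optimal gain for this vector
            err = np.sum((target - g * y) ** 2)
            if err < best[2]:
                best = (idx, g, err)
        return best  # (codebook index, gain, squared error)

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((128, 40))        # 128 vectors, 40-sample subframe
    a = [1.0, -1.3, 0.9]                             # toy synthesis filter
    target = lfilter([1.0], a, codebook[17]) * 0.8   # fabricate a matching target
    print(celp_search(target, codebook, a))          # picks index 17, gain = 0.8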
Bibliography
[2] Ruchi Gupta, Mukesh Kumar, and Rohit Bathla. Data compression-
lossless and lossy techniques. International Journal of Application or
Innovation in Engineering & Management, 5(7):120–125, 2016.
[3] Neha Sharma, Jasmeet Kaur, and Navmeet Kaur. A review on various lossless text data compression techniques. Research Cell: An International Journal of Engineering Sciences, 12(2):58-63, 2014.
[10] Rhishikesh Agashe. Speech processing model in embedded media processing. https://www.einfochips.com/blog/speech-processing-model-in-embedded-media-processing/. Accessed 2021-03-15.