Data Compression
• Lossless Compression
• Lossy Compression
Figure 1.2: The process of data compression
'D' are connected first, followed by 'C' and 'D'. The new parent nodes have frequencies 16 and 22, respectively, and are brought together in the next step. The resulting node and the remaining symbol 'A' are subordinated to the root node, which is created in a final step [2].
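As an illustration of this bottom-up construction, the following Python sketch builds a Huffman tree with a min-heap and then reads the codes off the tree. The symbol frequencies are hypothetical, chosen only to loosely mirror the merges described above; they are not the values from the figure.

    import heapq

    def huffman_codes(freqs):
        """Build a Huffman tree bottom-up: repeatedly merge the two
        lowest-frequency nodes, then read the codes off the tree."""
        # Heap entries: (frequency, tiebreaker, symbol or (left, right) subtree)
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # lowest-frequency node
            f2, _, right = heapq.heappop(heap)   # second-lowest node
            heapq.heappush(heap, (f1 + f2, count, (left, right)))
            count += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: recurse
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                # leaf: assign the code
                codes[node] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    # Hypothetical frequencies: 'B' and 'D' merge first (parent 16), then
    # 'C' and 'E' (parent 22); the two parents merge (38) and join 'A' at the root.
    print(huffman_codes({"A": 62, "B": 7, "C": 10, "D": 9, "E": 12}))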
Figure 1.3: Huffman Tree
Figure 1.4: Shannon-Fano Compression
advanced compression methods, but the advantage of RLE is that it is easy to implement and quick to execute, making it a good alternative to a complex compression algorithm. [3]
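As a minimal illustration of the technique, the sketch below encodes a string as (character, run-length) pairs and decodes it back; a production coder would pack the pairs into bytes rather than keep Python tuples.

    def rle_encode(data: str) -> list:
        """Run-length encode: collapse each run of repeated characters
        into a (character, run_length) pair."""
        runs = []
        for ch in data:
            if runs and runs[-1][0] == ch:
                runs[-1][1] += 1
            else:
                runs.append([ch, 1])
        return [(ch, n) for ch, n in runs]

    def rle_decode(runs: list) -> str:
        """Invert the encoding by expanding each run."""
        return "".join(ch * n for ch, n in runs)

    encoded = rle_encode("AAAABBBCCD")
    print(encoded)                      # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
    assert rle_decode(encoded) == "AAAABBBCCD"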
bits. In the second phase, the incremental compression algorithm stores the length of the prefix that the current symbol shares with the previous symbol and replaces that prefix with an integer value. This two-phase encoding technique can reduce the size of sorted data by roughly 50-80%. [2]
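A minimal sketch of the second phase described above, assuming plain Python strings: each word in a sorted list is replaced by the length of the prefix it shares with its predecessor plus the remaining suffix (the first phase's bit-level packing is omitted here).

    def front_encode(sorted_words):
        """Incremental (front) coding of a sorted list: each word is stored
        as (length of prefix shared with the previous word, remaining suffix)."""
        out, prev = [], ""
        for w in sorted_words:
            k = 0
            while k < min(len(prev), len(w)) and prev[k] == w[k]:
                k += 1
            out.append((k, w[k:]))
            prev = w
        return out

    def front_decode(pairs):
        """Rebuild each word from the previous word's prefix plus the suffix."""
        words, prev = [], ""
        for k, suffix in pairs:
            prev = prev[:k] + suffix
            words.append(prev)
        return words

    words = ["compress", "compression", "compressor", "computer"]
    enc = front_encode(words)
    print(enc)   # [(0, 'compress'), (8, 'ion'), (8, 'or'), (4, 'uter')]
    assert front_decode(enc) == words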
10. Adaptive Huffman coding: We use Huffman coding for data compression, but the limitation of Huffman coding is that the probability table must be sent along with the compressed information, because without the probability table decoding is not possible. Adaptive Huffman coding was developed to remove this disadvantage: the model is updated on the fly as symbols arrive, so no table needs to be transmitted. This adds only a few extra bytes to the output, and consequently it usually doesn't make much difference in the compression ratio. [2]
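The sketch below illustrates only the adaptive idea, not the actual FGK/Vitter adaptive Huffman algorithms: encoder and decoder start from identical uniform counts and apply the same update after every symbol, so no table is transmitted. It reuses the huffman_codes helper from the earlier Huffman sketch; rebuilding the tree per symbol is wasteful but keeps the example short.

    from collections import Counter

    def adaptive_encode(data, alphabet):
        """Both sides start from identical uniform counts and update them
        after every symbol, so no frequency table needs to be sent.
        (Real adaptive Huffman coders update the tree incrementally
        instead of rebuilding it from scratch each time.)"""
        counts = Counter({sym: 1 for sym in alphabet})
        bits = []
        for ch in data:
            codes = huffman_codes(counts)   # tree from the counts seen so far
            bits.append(codes[ch])
            counts[ch] += 1                 # the decoder makes the same update
        return "".join(bits)

    print(adaptive_encode("ABRACADABRA", alphabet="ABCDR"))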
2. Discrete Cosine Transform (DCT): A discrete cosine transform expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies. The DCT is a lossy compression technique that is widely used in image and audio compression. DCTs convert data into the sum of a series of cosine waves oscillating at different frequencies. They are very similar to Fourier transforms, but the DCT uses only cosine functions and is more efficient, as fewer functions are needed to approximate a typical signal. [5]
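The following sketch implements the DCT-II sum X[k] = sum_n x[n] cos(pi/N (n + 0.5) k) directly (matching standard definitions up to a scale factor) and shows the energy-compaction property that makes the transform useful for lossy coding; the smooth test signal is an arbitrary example.

    import numpy as np

    def dct2(x):
        """DCT-II: X[k] = sum_n x[n] * cos(pi/N * (n + 0.5) * k)."""
        N = len(x)
        n = np.arange(N)
        return np.array([np.sum(x * np.cos(np.pi / N * (n + 0.5) * k))
                         for k in range(N)])

    # A smooth signal concentrates its energy in a few low-frequency
    # coefficients, which is what makes the DCT useful for lossy coding:
    # small high-frequency coefficients can be quantized away or dropped.
    t = np.linspace(0, 1, 32, endpoint=False)
    x = np.cos(2 * np.pi * t) + 0.3 * np.cos(6 * np.pi * t)
    X = dct2(x)
    print(np.round(X, 2))   # energy concentrated in the first few coefficients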
Chapter 2
Speech Compression
2. Differential PCM: In a differential PCM system, the sampled input signal is stored in a predictor and passed through a differentiator; the differentiator compares the previous sample with the current sample and sends this difference to the quantizing and coding stage of PCM. Each sample is compared to a prediction, and the difference is called the prediction residual. Since the difference between input samples is smaller than a whole input sample, it can be coded with fewer bits.
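A minimal DPCM sketch along the lines described above: the predictor here is simply the previous reconstructed sample, and the quantizer is a uniform one with an assumed step size; real codecs use better predictors and adaptive quantizers.

    import numpy as np

    def dpcm_encode(samples, step=0.1):
        """Each sample is predicted by the previous reconstructed sample;
        only the quantized prediction residual is transmitted."""
        residuals, prediction = [], 0.0
        for s in samples:
            residual = s - prediction                 # difference signal
            q = int(round(residual / step))           # uniform quantizer
            residuals.append(q)
            prediction += q * step                    # track decoder's reconstruction
        return residuals

    def dpcm_decode(residuals, step=0.1):
        """Rebuild the signal by accumulating the dequantized residuals."""
        out, prediction = [], 0.0
        for q in residuals:
            prediction += q * step
            out.append(prediction)
        return out

    x = np.sin(np.linspace(0, np.pi, 20))
    codes = dpcm_encode(x)
    print(codes)  # small integers: residuals need fewer bits than raw samples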
Accordingly, the sampling rate is directly related to signal quality. Applying adaptive techniques to the delta modulation quantizer allows for continuous step-size adjustment. By adjusting the quantization step size, the coder is able to represent low-amplitude signals with greater accuracy.[7]
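A sketch of this adaptive step-size idea, with assumed growth/shrink factors: the coder transmits one bit per sample and enlarges the step on runs of identical bits (slope overload) while shrinking it when the bits alternate (granular noise).

    def adm_encode(samples, step=0.1, grow=1.5, shrink=0.5, min_step=0.01):
        """Adaptive delta modulation: transmit one bit per sample (is the
        signal above or below the running estimate?) and adapt the step,
        growing it on consecutive equal bits and shrinking it otherwise."""
        bits, estimate, prev_bit = [], 0.0, None
        for s in samples:
            bit = 1 if s >= estimate else 0
            if prev_bit is not None:
                step = step * grow if bit == prev_bit else max(step * shrink, min_step)
            estimate += step if bit else -step
            bits.append(bit)
            prev_bit = bit
        return bits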
5. Sub Band Coding: Sub-band coding is a waveform coding technique of the spectral-domain type. It is the process of decomposing the speech signal, where speech is usually divided into four or eight sub-bands by a bank of filters, and each sub-band is sampled at a band-pass Nyquist rate and encoded with different accuracy according to perceptual criteria. To reduce the number of samples, the sampling rate of the signal in each sub-band is reduced by decimation.
Sub-band coding can be used for coding speech at bit rates in the range of 9.6 Kbps to 32 Kbps. In this range speech quality is roughly equivalent to that of ADPCM at an equivalent bit rate. The advantage of sub-band coding is that each of the sub-bands can be encoded separately using different coders based on perceptual criteria.
A sub-band spectral analysis technique has been established that significantly reduces the complexity of computing the perceptual model.[7]
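A minimal two-band version of this decomposition, assuming NumPy/SciPy and simple FIR filters (a real coder would use quadrature mirror filters and more bands): the signal is split at a quarter of the sampling rate and each band is decimated by 2.

    import numpy as np
    from scipy.signal import firwin, lfilter

    def two_band_split(x, numtaps=63):
        """Split a signal into low and high bands with FIR filters, then
        decimate each band by 2 so the total sample count is unchanged.
        Each band could then be quantized with different accuracy."""
        lo = firwin(numtaps, 0.5)                    # low-pass at fs/4
        hi = firwin(numtaps, 0.5, pass_zero=False)   # high-pass at fs/4
        low_band = lfilter(lo, 1.0, x)[::2]          # filter, then decimate by 2
        high_band = lfilter(hi, 1.0, x)[::2]
        return low_band, high_band

    fs = 8000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 300 * t) + 0.2 * np.sin(2 * np.pi * 3000 * t)
    low, high = two_band_split(x)
    # Most speech energy lands in the low band, so it would get more bits.
    print(len(low), len(high))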
Figure 2.3: Block Diagram of Sub-band Coding
Parameter Coding:
1. Linear Predictive Coding: Linear predictive coding (LPC) is a prevalent, good-quality, low bit-rate speech analysis and compression technique for encoding a speech signal. The basic approach is to find a set of predictor coefficients that minimize the mean squared error over a short segment of the speech waveform. It has two main components: LPC analysis (encoding) and LPC synthesis (decoding). The goal of the LPC analysis is to estimate whether the speech signal is voiced or unvoiced, to find the pitch of each frame, and to determine the parameters needed to build the source-filter model. These parameters are transmitted to the receiver, which carries out LPC synthesis using the received parameters. LPC is an autoregressive method of speech coding, in which the speech signal at a given instant is represented by a linear combination of previous samples; linear prediction estimates the current sample by linearly combining the past few samples. The autocorrelation and covariance methods have mostly been used to determine the LP coefficients. Speech coding or compression is generally conducted with the use of voice coders, or vocoders.[7]
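A sketch of the autocorrelation method mentioned above: the normal equations are solved with the Levinson-Durbin recursion, and the resulting prediction residual has far less energy than the frame itself. The frame length, order, and toy sinusoidal "frame" are illustrative assumptions.

    import numpy as np

    def lpc_coefficients(frame, order=10):
        """Autocorrelation-method LPC: window the frame, compute its
        autocorrelation, and solve for the predictor coefficients with
        the Levinson-Durbin recursion."""
        w = frame * np.hamming(len(frame))
        r = np.correlate(w, w, mode="full")[len(w) - 1:]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            # Reflection coefficient from the current prediction error
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err
            prev = a.copy()
            for j in range(1, i):
                a[j] = prev[j] + k * prev[i - j]
            a[i] = k
            err *= 1.0 - k * k
        return a  # A(z) coefficients: residual e[n] = sum_j a[j] * s[n-j]

    # Toy "voiced" frame: a 200 Hz sinusoid at 8 kHz, 240 samples (30 ms)
    frame = np.sin(2 * np.pi * 200 * np.arange(240) / 8000)
    a = lpc_coefficients(frame, order=10)
    residual = np.convolve(frame, a)[:len(frame)]
    print(np.sum(frame ** 2), np.sum(residual ** 2))  # residual energy is far smaller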
are most frequently used during transition regions between voiced and unvoiced segments of the speech signal. In MELP speech coding the input speech signal parameters are estimated first and are then used to synthesize the speech signal at the output.[7]
Hybrid:
1. Code Excited Linear Prediction (CELP): CELP is a well-structured, closed-loop, analysis-by-synthesis hybrid coding technique which combines the advantages of both waveform and parametric techniques to provide a robust low bit-rate speech coder for narrowband and medium-band speech coding. The idea of CELP was conceived as an attempt to improve on the LPC coder. The most popular coding systems in the 4-8 Kbps bit-rate range use CELP (Rhutuja Jage et al., 2016). Searching an excitation codebook to provide a consistent excitation sequence during encoding is the key idea behind CELP's operation. This technique is widely used for toll-quality speech at 16 Kbps. Presently, CELP is used very effectively in MPEG-4 audio speech coding. [7]
The CS-ACELP coder processes input signals on a frame-by-frame and subframe-by-subframe basis. The algorithm exploits the vector quantization method; both the adaptive and fixed codebooks are vector quantized to form a conjugate structure. [7]
VOCODERS:
Vocoder systems are based on the analysis-synthesis technique and are used to imitate human speech. The vocoder was originally developed as a speech coder for telecommunications applications, with the purpose of coding speech for transmission. Vocoders are further classified as channel vocoders and formant vocoders.[7]
Figure 2.6: A Cross Section of the Human Head
much slower than computers or other electronic devices, and this is also true
with regard to speech. The lungs operate slowly, and the vocal tract changes
shape slowly, so the pitch and loudness of speech vary slowly. When speech is
captured by a microphone and is sampled, we find that adjacent samples are
similar, and even samples separated by 20 ms are strongly correlated. This
correlation is the basis of speech compression. The vocal cords can open and
close, and the opening between them is called the glottis. The movements of
the glottis and vocal tract give rise to different types of sound. [9]
The three main types are as follows:
Voiced sounds: These are the sounds we make when we talk. The
vocal cords vibrate, which opens and closes the glottis, thereby sending
pulses of air at varying pressures to the tract, where it is shaped into
sound waves. Varying the shape of the vocal cords and their tension
changes the rate of vibration of the glottis and therefore controls the
pitch of the sound. Recall that the ear is sensitive to sound frequencies from 16 Hz to about 20,000-22,000 Hz. The frequencies of the human
voice, on the other hand, are much more restricted and are generally in
the range of 500 Hz to about 2 kHz. This is equivalent to time periods
of 2 ms to 20 ms, and to a computer, such periods are very long. Thus,
voiced sounds have long-term periodicity, and this is the key to good
speech compression. Figure 2.7a is a typical example of the waveform of a voiced sound.[9]
Unvoiced sounds: These are sounds that are emitted and can be heard, but are not parts of speech. Such a sound is the result of holding the glottis open and forcing air through a constriction in the vocal tract. When an unvoiced sound is sampled, the samples show little correlation and are random or close to random. Figure 2.7b is a typical example of the waveform of an unvoiced sound. [9]
The shape of the vocal tract changes based upon the sounds that we intend to produce. The formant frequency can be defined as the frequency around which there is a high concentration of energy. Statistically, it has been observed that for every kHz there is approximately one formant frequency. Hence, we can observe a total of 3-4 formant frequencies in a human voice frequency range of 4 kHz.
Since the bandwidth of human speech is from 0 to 4 kHz, we sample speech signals at 8 kHz based on the Nyquist criterion to avoid aliasing.[10]
2.2.2 Speech Production Model
Depending on the content of the speech signal (voiced or unvoiced), the speech signal comprises a series of pulses (for voiced sounds) or random noise (for unvoiced sounds). This signal passes through the vocal tract. The vocal tract behaves as a spectral shaping filter, i.e., the frequency response of the vocal tract is imposed on the incoming speech signal. The shape and size of the vocal tract define the frequency response and hence the differences in the voices of people.[10]
Developing an accurate speech production model requires one to develop a filter-based model of the human speech production mechanism. It is presumed that the source of excitation and the vocal tract are independent of each other; therefore, they are both modeled separately. For modelling the vocal tract it is assumed that the vocal tract has fixed characteristics over a 10 ms period of time. Thus, once every 10 ms the vocal tract configuration changes, bringing about new vocal tract parameters (i.e., resonant/formant frequencies).[10]
To build an accurate model for speech production, it is essential to build a filter-based model. The model must precisely represent the following:[10]
S(z) = E(z) · G(z) · A · V(z) · R(z)

where:
S(z) = Speech at the Output of the Model
E(z) = Excitation Model
G(z) = Glottal Model
A = Gain Factor
V(z) = Vocal Tract Model
R(z) = Radiation Model
Excitation Model: The output of the excitation function of the model will vary depending on the type of speech produced. During voiced speech, the excitation consists of a series of impulses, each spaced at an interval of the pitch period. During unvoiced speech, the excitation is a white-noise/random-noise type signal.[10]
Glottal Model: The glottal model is used exclusively for the voiced speech component of human speech. The glottal flow helps distinguish speakers in speech recognition and speech synthesis systems.[10]
Gain Factor: The energy of the sound is dependent on the gain factor.
Generally, the energy for the voiced speech is many times greater than that
of the unvoiced speech.[10]
Vocal Tract Model: A chain of lossless tubes (short and cylindrical in shape) forms the basis of the vocal tract model (as shown in Figure 4 below), each tube with its own resonant frequency. The configuration of the lossless tubes is different for different people; the resonant frequencies depend on the shape of the tubes, and hence the voices of different people differ.[10]
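A toy synthesis following the S(z) model above, assuming SciPy: an impulse-train (voiced) or noise (unvoiced) excitation is passed through an all-pole filter standing in for V(z). The glottal and radiation factors are lumped into a single gain here, and the filter coefficients are made up rather than fitted to real speech.

    import numpy as np
    from scipy.signal import lfilter

    def synthesize(voiced, n=2000, fs=8000, pitch_hz=100, gain=1.0):
        """Toy source-filter synthesis: excitation E(z) shaped by an
        all-pole vocal-tract filter V(z) = 1/A(z). G(z) and R(z) are
        folded into the gain for brevity."""
        if voiced:
            e = np.zeros(n)
            e[::fs // pitch_hz] = 1.0        # impulses at the pitch period
        else:
            e = np.random.randn(n)           # white-noise excitation
        a = [1.0, -1.3, 0.9]                 # made-up, stable A(z) coefficients
        return lfilter([gain], a, e)         # in a codec, a changes every ~10 ms

    voiced_speech = synthesize(voiced=True)
    unvoiced_speech = synthesize(voiced=False)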
The vocal tract model described above is typically used in low bit-rate speech codecs, speech recognition systems, speaker authentication/identification systems, and speech synthesizers as well. It is essential to derive the coefficients of the vocal tract model for every frame of speech. The typical technique used for deriving the coefficients of the vocal tract model in speech codecs is Linear Predictive Coding (LPC). LPC vocoders can achieve a bit rate of 1.2 to 4.8 kbps and hence are categorized as low-quality, moderate-complexity, low bit-rate algorithms.[10]
2.3.2 CELP Systems
This section gives a brief description of three major CELP based standards.
The DoD 4.8 kb/s Speech Coding Standard
The advances in CELP based
speech coding led to the development of the U.S. Department of Defense
(DoD) 4.8 kb/s standard (Federal Standard 1016) [41]. The standard uses a
10th order synthesis filter computed using the autocorrelation method on a
frame size of 240 samples (30ms). The coefficients are quantized using a 34-
bit non-uniform scalar quantization of the LSPs. Each frame is divided into
4 subframes of 60 samples. The excitation is formed from a one-tap adaptive
codebook and a single stochastic codebook using a sequential search. The
stochastic codebook is sparse, ternary, and overlapped by -2 samples. The
adaptive codebook provides for the possibility of using non-integer delays.
The gains are quantized using scalar quantizers.
VSELP
Vector Sum Excited Linear Prediction (VSELP) is the 8
kb/s codec chosen by the Telecommunications Industry Association (TIA)
for the North American digital cellular speech coding standard [4]. VSELP
uses a 10th order synthesis filter and three codebooks: an adaptive codebook,
and two stochastic codebooks. The search of the codebooks is done using
an orthogonalization procedure based on the Gram-Schmidt algorithm. The
excitation codebooks each have 128 vectors obtained as binary linear com-
binations of seven basis vectors. The binary words representing the selected
codevector in each codebook specify the polarities of the linear combination
of basis vectors. Since only the basis vectors of each codebook must be fil-
tered, the search complexity is vastly reduced. The performance of VSELP
is characterized by MOS scores of about 3.7, which is considered to be close to toll quality.
LD-CELP
In 1988, the CCITT established a maximum
delay requirement of 5 ms for a new 16 kb/s speech coding standard. This
resulted in the selection of the LD-CELP algorithm as the CCITT standard
G.728 in 1992 [5]. Classical speech coders must buffer a large block of speech
for linear prediction analysis prior to further signal processing. The synthesis
filter in LD-CELP is based on backward prediction. In this method, the pa-
rameters of the filter are not derived from the original speech, but computed
based on previous reconstructed speech. As such, the synthesis filter can be
derived at both encoder and decoder, thus eliminating the need for quanti-
zation. The backward-adaptive L.P filter used in LD-CELP is 50th order.
The excitation is obtained from a product gain-shape codebook consisting
of a 7-bit shape codebook and a 3-bit backward-adaptive gain quantizer.
LD-CELP achieves toll quality at 16 kb/s with a 5 ms coding delay.
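All three coders rest on the same analysis-by-synthesis codebook search. The sketch below, with an assumed random codebook, a toy synthesis filter, and a fabricated target, shows the core operation: filter every candidate excitation through 1/A(z) and pick the codevector and gain minimizing the squared error against the target subframe. Real coders add an adaptive codebook, perceptual weighting, and fast search structures.

    import numpy as np
    from scipy.signal import lfilter

    def celp_search(target, codebook, a):
        """Analysis-by-synthesis search: pass every candidate excitation
        through the synthesis filter 1/A(z) and keep the codevector and
        gain that minimize the error against the target speech subframe."""
        best = (None, 0.0, np.inf)
        for idx, c in enumerate(codebook):
            y = lfilter([1.0], a, c)                 # synthesized candidate
            g = np.dot(target, y) / np.dot(y, y)     # optimal gain for this vector
            err = np.sum((target - g * y) ** 2)
            if err < best[2]:
                best = (idx, g, err)
        return best  # (codebook index, gain, squared error)

    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((128, 40))        # 128 vectors, 40-sample subframe
    a = [1.0, -1.3, 0.9]                             # toy synthesis filter
    target = lfilter([1.0], a, codebook[17]) * 0.8   # fabricate a matching target
    print(celp_search(target, codebook, a))          # picks index 17, gain = 0.8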
Bibliography
[2] Ruchi Gupta, Mukesh Kumar, and Rohit Bathla. Data compression-
lossless and lossy techniques. International Journal of Application or
Innovation in Engineering & Management, 5(7):120–125, 2016.
[3] Neha Sharma, Jasmeet Kaur, and Navmeet Kaur. A review on various lossless text data compression techniques. Research Cell: An International Journal of Engineering Sciences, 12(2):58-63, 2014.
[10] Rhishikesh Agashe. Speech processing model in embedded media processing. https://www.einfochips.com/blog/speech-processing-model-in-embedded-media-processing/. Accessed 2021-03-15.