Techniques for Speech Compression and Codecs

Speech Compression and Codecs
By:
Dr. Ahmed A. Khalifa
Agenda
• Introduction
• Speech Signal Digitization
• Speech compression techniques
1. Waveform Compression Coding
2. Parametric Compression Coding
3. Hybrid Compression Coding
• Illustrative Worked Examples
Audio Frequencies
• Humans have limited capabilities for hearing and generating sounds
• The human ear can hear frequencies from 20 Hz to 20 kHz
• Humans can generate sounds over a much narrower range than they can hear
Introduction to Speech Compression and
Codecs
Purpose of speech compression:
• Reduce the number of bits required to represent speech signals (by reducing redundancy)
Why?
• To minimize the transmission bandwidth required (e.g., for voice transmission over mobile channels with limited capacity)
• To reduce storage costs (e.g., for speech recording)
Codecs (Cont.)
• VoIP tools (e.g., Skype, Google Talk and xLite) normally provide many voice codecs, which can be selected or updated manually or automatically
• Typical voice codecs used in VoIP include:
• ITU-T standards, e.g. 64 kb/s G.711 PCM, 8 kb/s G.729 and 5.3/6.3 kb/s G.723.1
• ETSI standards, e.g. Adaptive Multi-Rate (AMR)
• Open source codecs, e.g. Internet Low Bitrate Codec (iLBC)
• Proprietary codecs, e.g. Skype’s SILK codec (variable bit rates from 6 to 40 kb/s and variable sampling frequencies from narrowband to super-wideband)
Codecs (Cont.)
• Some codecs can only operate at a fixed bit rate
• Many advanced codecs can have variable bit rates which may be used
for adaptive VoIP applications to improve voice quality or QoE
• Some VoIP tools allow the speech codec to be changed during a VoIP session, making it possible to select the most suitable codec for a given network condition
Remember:
• Voice codecs or speech codecs are based on different speech
compression techniques which aim to:
• Remove redundancy from the speech signal to achieve compression
• Reduce transmission and storage costs
Codecs (Cont.)
• Speech compression codecs are compared with the 64 kb/s PCM codec as the
reference for all speech codecs
• Speech codecs with the lowest data rates (e.g., 2.4 or 1.2 kb/s vocoders) are used mainly in secure communications
• They achieve compression ratios of about 26.7 or 53.3 (compared to 64 kb/s PCM) and still maintain intelligibility, but with speech quality that is somewhat ‘mechanical’
• Most speech codecs operate in the range of 4.8 kb/s to 16 kb/s and are mainly used in bandwidth-limited mobile/wireless applications
• They have good speech quality and a reasonable compression ratio
• In general, the higher the speech bit rate, the higher the speech quality and the greater the bandwidth and storage requirements
• In practice, it is always a trade-off between bandwidth utilization and speech quality
Speech Signal Digitization
• The process of converting speech from an analog signal to a digital signal for digital processing and transmission
• Three main phases: sampling, quantization & coding
Speech Signal Digitization (Cont.)
• Sampling: periodic measurement of an analog signal, which changes a continuous-time signal into a discrete-time signal
• For a narrow-band speech signal:
• Bandwidth is 300 to 3400 Hz (treated as 0–4 kHz)
• Sampling rate is 8 kHz (i.e., 2 times the maximum signal bandwidth) in accordance with the sampling theorem
• Time difference between two consecutive samples is 0.125 milliseconds (1/8000 s = 0.125 ms)
• If the sampling rate is at least twice the highest signal frequency (4 kHz for narrow-band voice), the analogue signal can be fully recovered from the samples
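The sampling figures above can be checked with a few lines of Python. This is only an illustrative sketch; the function names are assumptions made for the example and do not come from the slides.

```python
def min_sampling_rate_hz(max_signal_freq_hz: float) -> float:
    """Lowest sampling rate allowed by the sampling theorem (2 x highest frequency)."""
    return 2.0 * max_signal_freq_hz


def sample_interval_ms(sampling_rate_hz: float) -> float:
    """Time between two consecutive samples, in milliseconds."""
    return 1000.0 / sampling_rate_hz


narrowband_limit_hz = 4000.0  # narrow-band speech is treated as 0-4 kHz
rate_hz = min_sampling_rate_hz(narrowband_limit_hz)
interval_ms = sample_interval_ms(rate_hz)
print(f"sampling rate = {rate_hz:.0f} Hz, sample interval = {interval_ms} ms")
```

Passing 7000.0 instead of 4000.0 gives the lowest rate permitted for wideband (0–7 kHz) speech under the same rule.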
- Waveform with sample times
• The trick with sampling is to take enough samples to provide an accurate reproduction of the original without sampling too much
• A low sampling rate may cause details of the original signal to be lost
• Sampling at too high a frequency results in unnecessary bandwidth consumption
• Quantisation: converts the signal from a continuous-amplitude into a discrete-amplitude signal
• Amplitude space is divided into 6 steps, so three-bit binary codes can be used
• Each sample is approximated by its closest available quantisation amplitude
• Each sample is coded into a binary bit stream through the coding process
• For example:
• 1st sample: quantised amplitude = 0 and coded bits = 100
• 2nd sample: quantised amplitude = 2 and coded bits = 010
- Quantizing the samples
• Quantization error: the difference between the quantized amplitude and the actual signal amplitude
• The more quantization steps (finer quantization), the lower the quantization error, but this requires more bits to represent the signal, so the transmission bandwidth will also be greater
• i.e., it is always a tradeoff between the desired quantization error and the transmission bandwidth used
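The tradeoff between quantization error and bit rate can be illustrated with a small experiment. The sketch below assumes a mid-rise uniform quantiser over the range [-1, 1] and a sine wave as the test signal; both are choices made for this example, not details taken from the slides.

```python
import math


def uniform_quantise(x: float, bits: int, x_max: float = 1.0) -> float:
    """Mid-rise uniform quantiser over [-x_max, x_max] with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 * x_max / levels
    index = math.floor(x / step)
    # Clamp to the available codes so out-of-range inputs saturate.
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step


def max_abs_error(bits: int, n: int = 1000) -> float:
    """Largest quantization error over one period of a test sine wave."""
    samples = [0.99 * math.sin(2 * math.pi * k / n) for k in range(n)]
    return max(abs(s - uniform_quantise(s, bits)) for s in samples)


for bits in (3, 5, 8):
    bit_rate = 8000 * bits  # 8 kHz sampling rate
    print(f"{bits} bits/sample: {bit_rate} bit/s, max error {max_abs_error(bits):.5f}")
```

Each extra bit halves the step size Δ (and so the worst-case error Δ/2) while adding 8000 bit/s to the data rate at an 8 kHz sampling rate.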
• Uniform quantization: quantization steps (value of Δ) are kept the same for all signal amplitudes
• Non-uniform quantization: different quantization steps are applied at different signal amplitudes
• Fact: speech has a non-uniform probability density function (PDF), with low-level speech signals being much more probable than high-level speech signals
• Applying uniform quantization will normally create a higher relative quantization error (or quantization noise) for low-level speech signals and hence lower speech quality
• Non-uniform quantization is normally used in speech compression coding, with the quantization step kept smaller for lower-level signals
• Coding: converts the discrete-amplitude signal into a series of binary bits (or bitstream) for transmission and storage
• Non-uniform quantisation is applied in Pulse Code Modulation (PCM), the simplest and most commonly used speech codec
• PCM exploits non-uniform quantisation by using a logarithmic companding method to provide fine quantisation for low-level speech and coarse quantisation for high-level speech signals
• For G.711 PCM, each sample is assigned a value based on eight bits
• This provides 256 possible values, which means that a signal can be represented with up to 256 quantisation levels
• The capacity of the channel or data rate:
8000 samples/s × 8 bits per sample = 64,000 bits per second
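Logarithmic companding can be sketched with the continuous μ-law formula, F(x) = sgn(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255. Note that this is the textbook curve, not the bit-exact segmented encoding that G.711 specifies; the helper names and the test amplitude 0.01 are assumptions made for the example.

```python
import math

MU = 255.0  # mu-law companding parameter


def compress(x: float) -> float:
    """Continuous mu-law compression of x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y: float) -> float:
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantise_uniform(v: float, bits: int = 8) -> float:
    """Mid-rise uniform quantiser over [-1, 1]."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = max(-levels // 2, min(levels // 2 - 1, math.floor(v / step)))
    return (index + 0.5) * step


def quantise_companded(x: float, bits: int = 8) -> float:
    """Compress, quantise uniformly, then expand: a non-uniform quantiser."""
    return expand(quantise_uniform(compress(x), bits))


low = 0.01  # a low-level sample (hypothetical value)
print("uniform error:  ", abs(quantise_uniform(low) - low))
print("companded error:", abs(quantise_companded(low) - low))
```

With the same 8 bits, the companded quantiser gives a smaller error for the low-level sample, at the cost of coarser steps near full scale.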
Speech waveform and Spectrum
• Speech waveform: time-domain representation of
digitized speech signal.
• Speech spectrum: representation of the speech
signal in the frequency-domain
Speech Compression and Coding
• Three basic speech compression techniques:
1. Waveform Compression Coding
2. Parametric Compression Coding
3. Hybrid Compression Coding
Waveform Compression Coding
• Mainly to:
– Remove redundancy in the speech waveform (remove waveform correlation between speech samples to achieve speech compression)
– Reconstruct the speech waveform at the decoder side as closely as possible to the original speech waveform (minimize the error between the reconstructed and the original speech waveforms)
• Simple, with low implementation complexity but low compression ratios
• Typical bit rate range: 16 kb/s to 64 kb/s
• At bit rates lower than 16 kb/s, the quantisation error is too high, which results in lower speech quality
• Codecs: PCM and ADPCM (Adaptive Differential PCM)
Waveform Compression Coding -
PCM
• Two PCM codecs:
• PCM μ-law, which is standardised for use in North America and Japan
• PCM A-law, for use in Europe and the rest of the world
• G.711 was standardised by ITU-T for PCM codecs in 1988
• Each sample is coded using 8 bits, which yields a PCM transmission rate of 64 kb/s at an 8 kHz sampling rate (8000 samples/s × 8 bits/sample = 64 kb/s)
ADPCM
• ADPCM was proposed by Jayant in 1974 at Bell Labs
• ADPCM was developed to further compress the PCM codec based on the correlation between adjacent speech samples
• ADPCM consists of an adaptive quantiser and an adaptive predictor
• At the encoder side:
• ADPCM first converts the 8-bit PCM signal (A-law or μ-law) to a 16-bit linear PCM signal
• The adaptive predictor predicts (estimates) the current speech sample from the previous N reconstructed speech samples $\tilde{s}(n-i)$:
$$\hat{s}(n) = \sum_{i=1}^{N} a_i \, \tilde{s}(n-i)$$
• $a_i$, i = 1, ..., N are the estimated predictor coefficients
• Typical N value = 6
ADPCM (Cont.)
[Figure: ADPCM encoder and ADPCM decoder block diagrams]
• e(n), the difference signal (prediction error), is calculated from the speech signal s(n) and the signal estimate ŝ(n):
$$e(n) = s(n) - \hat{s}(n)$$
• Because e(n) is smaller in amplitude than the PCM input signal, fewer coding bits are needed to represent an ADPCM sample
• The decoder at the receiver side uses the same prediction algorithm to reconstruct the speech sample
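The predict, subtract and quantise loop can be sketched as plain differential PCM. This is a deliberately simplified illustration and not the G.726 algorithm: it assumes a fixed first-order predictor (a1 = 0.9) and a fixed quantiser step (0.05), whereas ADPCM adapts both. All names and parameter values are assumptions made for the example.

```python
import math


def quantise_error(e: float, step: float, bits: int) -> int:
    """Map a prediction error e(n) to a signed integer code of the given width."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(lo, min(hi, round(e / step)))


def dpcm_encode(samples, a1=0.9, step=0.05, bits=4):
    """Encode samples as quantised prediction errors."""
    codes, prev_recon = [], 0.0
    for s in samples:
        prediction = a1 * prev_recon                     # s_hat(n) from reconstructed history
        code = quantise_error(s - prediction, step, bits)  # quantised e(n)
        codes.append(code)
        prev_recon = prediction + code * step            # same value the decoder will rebuild
    return codes


def dpcm_decode(codes, a1=0.9, step=0.05):
    """Rebuild samples using the same predictor as the encoder."""
    out, prev_recon = [], 0.0
    for code in codes:
        prev_recon = a1 * prev_recon + code * step
        out.append(prev_recon)
    return out


# 20 ms of a 200 Hz tone sampled at 8 kHz as a stand-in for speech.
signal = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
codes = dpcm_encode(signal)
decoded = dpcm_decode(codes)
worst = max(abs(a - b) for a, b in zip(signal, decoded))
print(f"4-bit codes, worst reconstruction error = {worst:.4f}")
```

The encoder predicts from its own reconstructed samples rather than from the original input, so encoder and decoder stay in step and the quantization error does not accumulate.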
ADPCM (Cont.)
Examples:
• If each ADPCM sample is coded into 4 bits, the ADPCM bit rate = 4 bits × 8 kHz = 32 kb/s
• One PCM channel (at 64 kb/s) can carry two ADPCM channels at 32 kb/s each
• If each ADPCM sample is coded into 2 bits, the ADPCM bit rate = 2 bits × 8 kHz = 16 kb/s
• One PCM channel can carry four ADPCM channels at 16 kb/s each
• ITU-T G.726 defines ADPCM bit rates of 40, 32, 24 and 16 kb/s, which correspond to 5, 4, 3 and 2 coding bits per ADPCM sample
• The higher the ADPCM bit rate, the higher the number of quantization levels, the lower the quantization error, and thus the better the voice quality
• Quality of 40 kb/s ADPCM is better than that of 32 kb/s
• Quality of 24 kb/s ADPCM is also better than that of 16 kb/s
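The relation between coding bits per sample, ADPCM bit rate and the number of ADPCM channels that fit into one 64 kb/s PCM channel can be tabulated as follows. The constant and function names are illustrative assumptions.

```python
SAMPLING_RATE_HZ = 8000   # narrowband sampling rate
PCM_RATE_BPS = 64000      # one 64 kb/s PCM channel


def adpcm_bit_rate_bps(bits_per_sample: int) -> int:
    """Bit rate of an ADPCM stream that uses the given number of bits per sample."""
    return SAMPLING_RATE_HZ * bits_per_sample


for bits in (5, 4, 3, 2):
    rate = adpcm_bit_rate_bps(bits)
    whole_channels = PCM_RATE_BPS // rate
    print(f"{bits} bits/sample -> {rate // 1000} kb/s, "
          f"{whole_channels} whole channel(s) per 64 kb/s PCM channel")
```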
Parametric Compression Coding
• Parametric compression sends only the relevant parameters of speech production to the receiver side, which reconstructs the speech from the speech production model
• The speech signal is assumed stationary (the shape of the vocal tract is stable) over a short period of time (e.g., 20 ms), called a speech segment
• For this speech segment, parameters are obtained via speech analysis at the encoder
• Parameters include: the vocal tract filter parameters, voiced/unvoiced decision, pitch period and gain (signal energy) parameters
• These parameters are then coded into a binary bitstream and sent to the transmission channel
• The decoder at the receiver side reconstructs the speech (carries out speech synthesis) based on the received parameters
• High implementation complexity and better compression ratio, but low quality, with a mechanical sound and reasonable intelligibility
• Codec: Linear Prediction Coding (LPC) vocoder (bit rate from 1.2 to 4.8 kb/s)
• Used in: secure wireless communications systems when transmission bandwidth is very limited
Parametric Compression Coding (Cont.)
- Speech generation mathematical model
• Speech signal is voiced or unvoiced
• Speech excitation signal x(n) is switched between:
1. Periodic pulse train signal (controlled by the pitch period T, for voiced speech)
2. Random noise signal (for unvoiced speech)
• Excitation signal is amplified by the gain (G, or energy of the signal) and then sent to the vocal tract filter (modelled by a linear prediction coding (LPC) filter)
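The speech generation model can be sketched in a few lines: choose the excitation by the voiced/unvoiced switch, scale it by the gain, and pass it through an all-pole filter s(n) = G·x(n) + Σ a_i·s(n − i). The single filter coefficient, the pitch period and the gain used below are hypothetical values chosen for illustration, not parameters of any real codec.

```python
import random


def excitation(voiced: bool, n_samples: int, pitch_period: int, seed: int = 0):
    """Periodic pulse train for voiced frames, random noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def synthesise(x, gain, lpc_coeffs):
    """All-pole vocal tract filter: s(n) = G*x(n) + sum_i a_i * s(n - i)."""
    s = []
    for n, xn in enumerate(x):
        acc = gain * xn
        for i, a in enumerate(lpc_coeffs, start=1):
            if n - i >= 0:
                acc += a * s[n - i]
        s.append(acc)
    return s


# One 20 ms frame (160 samples at 8 kHz), pitch period of 80 samples (100 Hz).
voiced_frame = synthesise(excitation(True, 160, pitch_period=80), gain=0.5, lpc_coeffs=[0.5])
unvoiced_frame = synthesise(excitation(False, 160, pitch_period=80), gain=0.5, lpc_coeffs=[0.5])
print(voiced_frame[:4])
```

A practical LPC filter would use around ten coefficients estimated from the speech frame; one coefficient is used here only to keep the output easy to follow.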
Parametric Compression Coding (Cont.) -
Linear Prediction Coding (LPC)
• Linear Prediction Coding (LPC) model (LPC vocoder (VOice enCODER))
• The continuous speech signal is segmented into 20 ms speech frames
• If we can:
• Detect whether a segment of speech is voiced or unvoiced (e.g., 20 ms of speech, which corresponds to 160 samples at 8 kHz sampling rate)
• Estimate its LPC filter parameters, pitch period (for voiced signal) and its gain (power) via speech signal analysis
• We can then:
• just encode and send these parameters to the channel/network, and then
• synthesize the speech based on the received parameters at the decoder
• This process is repeated for each speech frame
Parametric Compression Coding (Cont.)-
Linear Prediction Coding (LPC) Encoder
• Encoder key components:
• Pitch estimation (to estimate the pitch period of the speech segment)
• Voicing decision (to decide whether it is a voiced or unvoiced frame)
• Gain calculation (to calculate the power of the speech segment)
• LPC filter analysis (to estimate the LPC filter coefficients for this segment of speech)
• Parameters/coefficients are quantized, coded and packetized appropriately before they are sent to the channel
• Parameters and coded bits from the LPC encoder:
• Pitch period (T): for example, coded in 7 bits as in LPC-10 (together with voicing decision)
• Voiced/unvoiced decision: indicates whether it is a voiced or unvoiced segment
• Gain (G) or signal power: coded in 5 bits as in LPC-10
• Vocal tract model coefficients (LPC filter coefficients): normally 10th order, i.e. a1, a2, ..., a10, coded in 41 bits in LPC-10
Linear Prediction Coding (LPC) Decoder
• The packetised LPC bitstream is unpacked and sent to the relevant decoder components:
• LPC decoder to retrieve the LPC coefficients
• Pitch period decoder to retrieve the pitch period
• Pitch period is used to control the impulse train sequence period when in a voiced segment
• Gain decoder to retrieve the power of the speech segment
• Voicing detection bit is used to control the voiced/unvoiced switch
• Synthesizer synthesizes the speech according to the received parameters/coefficients
Linear Prediction Coding (LPC) Example
LPC-10
• Coded bits = 54 per speech frame
• Sampling rate = 8 kHz = 8000 samples per second
• 180 samples per frame
• Frame duration = Number of samples / Sampling rate = 180/8000 s = 22.5 ms
• Every 22.5 ms, 54 coded binary bits from the encoder are sent to the channel
• Encoder bit rate = Number of coded bits per frame / Frame duration = 54 bits / 22.5 ms = 2400 bit/s or 2.4 kb/s
• Compression ratio (when compared with 64 kb/s PCM) = 64/2.4 = 26.7
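The LPC-10 arithmetic above follows directly from the two formulas for frame duration and encoder bit rate. A short check in Python (function names are illustrative assumptions):

```python
def frame_duration_s(samples_per_frame: int, sampling_rate_hz: int) -> float:
    """Frame duration = number of samples / sampling rate."""
    return samples_per_frame / sampling_rate_hz


def encoder_bit_rate_bps(bits_per_frame: int, duration_s: float) -> float:
    """Encoder bit rate = coded bits per frame / frame duration."""
    return bits_per_frame / duration_s


duration = frame_duration_s(180, 8000)
rate = encoder_bit_rate_bps(54, duration)
ratio = 64000 / rate  # compared with 64 kb/s PCM
print(f"frame = {duration * 1000:.1f} ms, rate = {rate:.0f} bit/s, ratio = {ratio:.1f}")
```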
Hybrid Compression Coding
• Combines the features of both waveform-based and parametric-based coding
• Keeps the nature of parametric coding: vocal tract filter and pitch period analysis, and voiced/unvoiced decision
• Instead of using a periodic impulse train to represent the excitation signal for a voiced speech segment, it uses a waveform-like excitation signal for voiced, unvoiced or transition (containing both voiced and unvoiced) speech segments
• Example of waveform-based excitation signals:
– Code-Excited Linear Prediction (CELP), with a range of 4.8 kb/s to 16 kb/s, for mobile/wireless/satellite communications
• Codecs: G.729, G.723.1, AMR, iLBC and SILK
Narrowband (NB) to Fullband (FB)
Speech Audio Compression
• Narrow-Band (NB):
• Speech frequency range of 300 Hz to 3400 Hz
• Used in traditional digital telephony in the PSTN
• Wide-Band (WB):
• Speech (0–7 kHz, with sampling rate at 16 kHz)
• Used in VoIP applications
• Why?
• To provide high speech transmission quality (wideband speech has more high-frequency components and higher speech fidelity)
• Full-Band (FB):
• Considers the full human auditory bandwidth from 20 Hz to 20 kHz
• Why?
• To provide high quality, for speech, music and general audio.
• G.719 - used in teleconferencing and telepresence applications
[Table: Summary of NB, WB, SWB and FB speech/audio compression coding]
[Tables: Standardised narrowband to fullband speech/audio codecs]
Codec Selection and Performance
• The choice of codec has been based on quality and cost
• If money and bandwidth were not impediments, organizations would probably use G.711 (64 kb/s) exclusively
• G.711 behaves very well in the presence of network problems such as latency, packet loss, and jitter
Codec Selection and Performance
• An organization connected to the outside world via a T-1 link is limited to 1.544 Mb/s
• Using G.711, each call would consume one twenty-fourth of this capacity
• A codec such as G.729 uses less bandwidth
• Of course, at some point, increasing compression to conserve bandwidth does start to affect call quality
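A rough capacity comparison for the T-1 example can be sketched as below. It counts codec payload only and assumes no packet header or signalling overhead, so real VoIP call counts would be lower, particularly for low-rate codecs. The names are illustrative assumptions; the G.729 rate is the 8 kb/s figure quoted earlier.

```python
T1_LINK_BPS = 1_544_000  # T-1 line rate


def share_of_link(codec_bps: int, link_bps: int = T1_LINK_BPS) -> float:
    """Fraction of the link used by one call (codec payload only)."""
    return codec_bps / link_bps


def max_calls(codec_bps: int, link_bps: int = T1_LINK_BPS) -> int:
    """Whole calls that fit on the link, ignoring all packet overhead."""
    return link_bps // codec_bps


for name, rate in (("G.711", 64_000), ("G.729", 8_000)):
    print(f"{name}: {share_of_link(rate):.4f} of the link per call, "
          f"up to {max_calls(rate)} calls (payload only)")
```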
Illustrative Worked Examples
Question 1
• Determine the input and output data rates (in
kb/s) and hence the compression ratio for a
G.711 codec
• Assumptions:
• The input speech signal is first sampled at
8 kHz & that each sample is then
converted to 14-bit linear code before
being compressed into 8-bit non-linear
PCM by the G.711 codec
Solution 1
• For the input data:
• Input speech signal is sampled at 8 kHz, i.e. 8000 samples per second
• Each sample is coded using 14 bits
• Thus the input data rate is:
8000 × 14 = 112,000 bit/s = 112 kb/s
• For the output data:
• Each sample is coded using 8 bits
• Thus the output data rate is:
8000 × 8 = 64,000 bit/s = 64 kb/s
• The compression ratio for a G.711 codec is:
112/64 = 1.75
Question 2
• G.726 (40 kb/s) is the ITU-T standard codec based on ADPCM
• Assumptions:
• Codec’s input speech signal is 16-bit linear PCM and the sampling rate is 8 kHz
• Output of the G.726 ADPCM codec can operate at four possible data rates: 40 kb/s, 32 kb/s, 24 kb/s and 16 kb/s
• Explain how these rates are obtained
• Determine the compression ratios when compared with 64 kb/s PCM
Solution 2
• For 40 kb/s ADPCM, let’s assume the number of bits needed to code each quantized difference signal is x; then we have:
40 kb/s = 8000 (samples/s) × x (bits/sample)
x = 40 × 1000/8000 = 5 (bits)
• Thus, using 5 bits to code each quantized difference signal creates an ADPCM bit stream operating at 40 kb/s
• Similarly, for 32, 24 and 16 kb/s, the required number of bits for each quantized difference signal is 4, 3 and 2 bits, respectively
• The compression ratio for 40 kb/s ADPCM when compared with 64 kb/s PCM is 64/40 = 1.6
• For 32, 24 and 16 kb/s ADPCM, the compression ratios are 2, 2.67 and 4, respectively
Question 3
• For the G.723.1 codec, it is known that the transmission bit rate can be either 5.3 or 6.3 kb/s
1. What is the frame size for the G.723.1 codec?
2. How many speech samples are there within one speech frame?
3. Determine the number of parameter bits coded for the G.723.1 encoding
Solution 3
• For the G.723.1 codec, the frame size is 30 ms
• As G.723.1 is a narrowband codec, the sampling rate is 8 kHz
• Since Frame duration = Number of samples / Sampling rate, the number of speech samples in a speech frame is:
30 (ms) × 8000 (samples/s) = 240 (samples)
• Since Encoder bit rate = Number of coded bits per frame / Frame duration:
• For 5.3 kb/s G.723.1, the number of parameter bits is:
30 (ms) × 5.3 (kb/s) = 159 (bits)
• For 6.3 kb/s G.723.1, the number of parameter bits is:
30 (ms) × 6.3 (kb/s) = 189 (bits)
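The same two formulas can be rearranged to solve for samples per frame and bits per frame, as done in this solution. A minimal sketch follows; the function names are assumptions, and the results are rounded to whole samples and bits.

```python
def samples_per_frame(frame_ms: float, sampling_rate_hz: int) -> int:
    """Number of samples = frame duration x sampling rate."""
    return round(frame_ms * sampling_rate_hz / 1000)


def bits_per_frame(frame_ms: float, bit_rate_kbps: float) -> int:
    """Coded bits per frame = frame duration x bit rate (ms x kb/s = bits)."""
    return round(frame_ms * bit_rate_kbps)


FRAME_MS = 30  # G.723.1 frame size used in the solution
print("samples per frame:", samples_per_frame(FRAME_MS, 8000))
for rate_kbps in (5.3, 6.3):
    print(f"{rate_kbps} kb/s -> {bits_per_frame(FRAME_MS, rate_kbps)} bits per frame")
```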
Thank you!
Questions