ELEC9344: Speech & Audio Processing

Chapter 4

Speech Coding I
Speech Coding

➢ Speech coding is a technique used to compress a speech signal for digital storage or transmission purposes, and then decompress the stored or transmitted data, so as to allow a perceptually faithful reproduction of the original signal.
➢ Sophisticated speech coding algorithms exist, so that speech signals may be compressed at various bit rates.
➢ In general, the lower the bit rate, the lower the quality of the recovered speech.
➢ However, there is a constant quest to achieve better speech quality at lower bit rates.
Speech Coders

➢ There are basically three classes of speech coders:
  • Waveform coders
  • Source coders
  • Hybrid coders
➢ Waveform coders accept continuous analogue speech signals and encode them into digital signals prior to transmission. Decoders at the receiver reverse the encoding process to recover the speech signals. In the absence of transmission errors, the recovered speech waveforms closely resemble the original speech waveforms.
Waveform Encoding Techniques

➢ The waveform encoding techniques are:
  – PCM (Pulse Code Modulation)
  – DPCM (Differential Pulse Code Modulation)
  – ADPCM (Adaptive Differential Pulse Code Modulation)
  – DM (Delta Modulation)
  – ADM/CVSD (Adaptive Delta Modulation, or Continuously Variable-Slope Delta Modulation)
➢ These time-domain coding algorithms have their counterparts in the frequency domain:
  – SBC (Sub-Band Coding)
  – ATC (Adaptive Transform Coding)
PCM

➢ The simplest waveform coding method is linear pulse code modulation. The analogue signal is quantised linearly (uniform quantisation: all steps have equal size ∆V). No compression is involved.
➢ For a B-bit quantiser, it can be shown that the Signal to Quantisation Noise Ratio (SQNR) is given by:

SQNR = 6.02B + 1.76 dB

➢ The SQNR increases by approximately 6 dB for each additional bit.
➢ Linear quantisation assumes a uniform signal PDF.

[Figure: linear (uniform) quantisation of a speech waveform with quantisation steps ∆V]
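As a quick check of the 6 dB-per-bit rule, the following sketch (not from the notes; a full-scale sine is used as the test signal, for which the formula holds closely) quantises a signal uniformly and compares the measured SQNR with the formula:

```python
import numpy as np

def uniform_quantise(x, bits):
    """Mid-rise uniform quantiser for signals in [-1, 1)."""
    levels = 2 ** bits
    step = 2.0 / levels                      # step size delta-V
    idx = np.clip(np.floor(x / step), -levels // 2, levels // 2 - 1)
    return (idx + 0.5) * step

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)              # full-scale sine test signal

for bits in (8, 12, 16):
    y = uniform_quantise(x, bits)
    sqnr = 10 * np.log10(np.sum(x**2) / np.sum((x - y)**2))
    print(f"B={bits:2d}: measured {sqnr:5.1f} dB, formula {6.02*bits + 1.76:5.1f} dB")
```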
Non-Uniform PCM

➢ Speech signals are heavily concentrated at low amplitudes, hence it is a much better strategy to use a non-uniform quantiser in which the steps are densest at the low levels.
➢ The speech PDF is not uniform but approximately Gaussian, and low-level speech occurs more frequently than high-level speech. Therefore more quantisation noise is tolerable at high levels of speech than at low levels.

[Figure: non-linear quantisation of a speech waveform; the Gaussian speech PDF contrasted with a uniform PDF]
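Non-uniform quantisation is usually realised by companding: compress with a logarithmic curve, quantise uniformly, then expand. A minimal sketch of µ-law companding follows (using the continuous µ = 255 curve; the G.711 standard actually uses a piecewise-linear approximation of it):

```python
import numpy as np

MU = 255.0  # mu-law constant used in North American PCM

def mu_compress(x):
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_expand(y):
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def quantise(x, bits):
    step = 2.0 / 2**bits
    return np.clip(np.round(x / step) * step, -1.0, 1.0 - step)

# Low-level signals keep far more resolution after companding:
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(8000)          # quiet, speech-like amplitudes
direct    = quantise(x, 8)
companded = mu_expand(quantise(mu_compress(x), 8))
for name, y in (("uniform", direct), ("mu-law", companded)):
    snr = 10 * np.log10(np.sum(x**2) / np.sum((x - y)**2))
    print(f"{name:8s}: {snr:5.1f} dB")
```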
DPCM

➢ By taking advantage of the considerable amount of redundancy in the speech signal, the bit rate may be reduced.
➢ For example, the difference of the speech signal, e(n) = s(n) − s(n−1), has a lower variance (and hence lower power) than the signal itself.
➢ If the power is reduced, then we should be able to encode the signal with fewer bits and thereby reduce the required bit rate.

e(n) = s(n) − s(n−1)

E(z) = S(z)(1 − z^{-1})
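A quick numerical illustration of the variance reduction (a sketch only; an AR(1) process stands in for correlated speech samples):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
# AR(1) process s(n) = 0.95 s(n-1) + w(n) as a stand-in for speech
s = lfilter([1.0], [1.0, -0.95], rng.standard_normal(16000))

e = np.diff(s)                               # e(n) = s(n) - s(n-1)
gain_db = 10 * np.log10(np.var(s) / np.var(e))
print(f"variance reduction: {gain_db:.1f} dB")  # about 10 dB for a = 0.95
```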
DPCM

➢ The transmitter computes e(n) and this difference is transmitted through the channel to the receiver (see below), where it is integrated in the feedback loop and the original signal is restored.

[Figure: DPCM transmitter (differencer) and receiver (integrator) connected by the channel; input S(z), output Y(z)]

E(z) = S(z)(1 − z^{-1})

Y(z) = E(z)/(1 − z^{-1}) = S(z)
[Figure: DPCM with quantisation of the difference signal; the quantisation noise is modelled by additive noise Nq(z) between input S(z) and output Y(z)]

Y(z) = S(z) + Nq(z)/(1 − z^{-1})

➢ Thus the quantisation noise is integrated at the receiver, and this build-up of noise is found to be more disturbing than the same noise merely added to the original speech signal.
Differential quantisation with the quantiser inside a feedback loop at the transmitter

[Figure: DPCM transmitter with the quantiser inside the feedback loop; input S(z), output Y(z)]

E(z) = S(z) − C(z) + Nq(z)

E(z) = [S(z) + Nq(z)](1 − z^{-1})

C(z) = E(z) z^{-1} / (1 − z^{-1})

Y(z) = S(z) + Nq(z)

➢ The quantisation noise is now being differenced along with s(n), so at the receiver it is no longer integrated; it merely adds to the reconstructed speech.
➢ The next refinement is to recognise that simple differencing does not minimise the error of the output. Replace the simple differencer by a full pth-order linear predictor P(z).

[Figure: predictive quantisation with the pth-order predictor P(z) inside the feedback loop; input S(z), output Y(z)]

E(z) = [S(z) + Nq(z)](1 − P(z))

C(z) = E(z) P(z) / (1 − P(z))

Y(z) = S(z) + Nq(z)

➢ Using such a predictor, a reduction in variance of 6 dB relative to PCM has been reported (p = 2). A 6 dB reduction in variance means that one less bit can be used in quantising the residual than would be required if the original signal were sent.
ADPCM

➢ The compression ratio of DPCM can be further improved by adaptive prediction. Adaptive prediction in its simplest form is simply linear prediction.
➢ The speech data is segmented into frames of typically 10 to 20 ms and the predictor coefficients are transmitted along with the residual (see next slide).
➢ At the receiver, an inverse filter controlled by the predictor coefficients reconstructs the speech.
➢ It is reported that an S/N improvement of more than 12 dB can be achieved with a 10th-order linear predictor.
ADPCM

[Figure: ADPCM encoder/decoder block diagram; input S(z), output Y(z)]
Summary of Waveform Coders

➢ A-law or µ-law PCM: 8-bit compressed PCM, 8 kHz sampling rate; 64 kbps
➢ DPCM: 32 kbps
➢ ADPCM: 24–32 kbps
Source Coders

➢ Source coders such as Linear Predictive Coding (LPC) vocoders, which are based on the speech production model, operate at bit rates as low as 2 kbit/s. However, the synthetic quality of the vocoded speech is not broadly appropriate for commercial telephone applications.

➢ LPC in its basic form has mainly been used in secure military communications, where speech must be carried at very low bit rates. A US standard algorithm known as LPC-10 is widely used.
LPC-10 Algorithm

➢ The input speech is sampled at 8 kHz and digitised as 12-bit two's complement words. The incoming speech is partitioned into 180-sample frames (22.5 ms), resulting in a frame rate of 44.44 frames/s.
➢ A tenth-order linear predictor is used to compute the vocal tract filter parameters (reflection coefficients).
➢ Pitch and voicing are determined using the AMDF (Average Magnitude Difference Function) algorithm.
LPC-10 Algorithm

➢ The bit assignments used are: 5 bits each for k1 to k4, 4 bits each for k5 through k8, 3 bits for k9 and 2 bits for k10.
➢ In the transmitted bit stream, 41 bits are used for the reflection coefficients, 7 bits for pitch and voicing, and 5 bits for amplitude.
➢ One bit is used for synchronisation, giving a total of 54 bits per frame.
➢ Since there are 44.44 frames/s, this gives an overall bit rate of 2.4 kbit/s.
Limitation of LPC Vocoding

➢ The main limitation of LPC vocoding is the assumption that speech signals are either voiced or unvoiced; hence the source of excitation for the all-pole synthesis filter is either a train of pulses (for voiced speech) or random noise (for unvoiced speech).
➢ In fact there are more than two modes in which the vocal tract is excited, and often these modes are mixed.
➢ Even when the speech waveform is voiced, it is a gross simplification to assume that there is only one point of excitation in the entire pitch period.
Hybrid Coders

➢ Hybrid coders combine features from both source coders and waveform coders. Several hybrid coders employ an analysis-by-synthesis process in order to derive the coder parameters.
➢ A block diagram of an analysis-by-synthesis coding system is shown below.
➢ A local decoder is present inside the encoder, and the aim of the analysis-by-synthesis speech coding structure is to derive the codec parameters so that the difference between the input and synthesised signal is minimised according to some suitable error minimisation criterion.
General Model for the Analysis-by-Synthesis Coding Structure

[Figure: general analysis-by-synthesis coding structure, shown alongside the LPC vocoder]

➢ The coder is capable of producing high quality speech at bit rates around 10 kbit/s.
➢ In this new generation of analysis-by-synthesis models, no prior knowledge of a voiced/unvoiced decision or pitch period is needed. The excitation is modelled by a number of pulses, usually 4 per 5 ms, whose amplitudes and positions are determined by the perceptually weighted error between the original and synthesised speech.
➢ The multipulse approach assumes that both the pulse positions and amplitudes are initially unknown; they are then determined inside the minimisation loop, one pulse at a time.
Hybrid Coders……

➢ When the bit rate is reduced below 9.6 kbit/s, multi-pulse excited coders fail to maintain good speech quality.
➢ This is due to the large number of bits needed to encode the excitation pulses; the quality deteriorates when these pulses are coarsely quantised.
➢ Therefore, if the analysis-by-synthesis structure is to be used for producing good quality speech at bit rates below 8 kbit/s, more suitable approaches to the definition of the excitation signal must be sought.
➢ The Code-Excited Linear Prediction (CELP) coder provided an improved excitation signal.
Speech Coding Standards

➢ For speech coding to be useful in telecommunication applications, it has to be standardised (it must conform to the same algorithm and bit format).
➢ Speech-coding standards are established by various standards organisations, for example ITU-T (International Telecommunication Union, Telecommunication Standardisation Sector, formerly CCITT) and ETSI (European Telecommunications Standards Institute).
ITU Standards

➢ G.711/G.712: 64 kbit/s A/µ-law PCM
➢ G.726: 32 kbit/s ADPCM
➢ G.728: 16 kbit/s LD-CELP (Low Delay CELP)
➢ G.729: 8 kbit/s CS-ACELP (Conjugate Structure Algebraic CELP)
➢ The next standardisation issue involves a 4 kbit/s speech coding algorithm. The main terms of reference for the ITU 4 kbit/s speech coding standard are given below.
[Table: Terms of Reference for the ITU 4 kbit/s speech coding standard]
Low Bit-Rate Speech Coding

➢ It is important to develop low bit-rate speech coding systems for voice storage applications such as voice mail and voice-driven directory enquiries.
➢ Therefore, research on speech coding methods whose bit rate is less than 4 kbit/s but which still provide toll quality speech has recently become very active.
➢ The ultimate goal is to design a low-delay, low bit-rate codec (around 2.4 kbit/s, operating in real time) with good perceptual quality for digital mobile communications and other applications.
Analysis-by-Synthesis Speech Coding

➢ As the Code-Excited Linear Prediction (CELP) coder is based on the analysis-by-synthesis technique, the theory behind the analysis-by-synthesis method is explained first and then extended to CELP coding.

➢ The basic structure of the general model for analysis-by-synthesis predictive coding of speech is shown below.
General Model for Analysis-by-Synthesis Linear Predictive Coding

[Figure: general analysis-by-synthesis LPC model]

➢ The model consists of four main parts:
  – Excitation generator
  – Synthesis filter
  – Error weighting filter
  – Error minimisation
The Short-Term Predictor (Synthesis Filter)

➢ The synthesis filter is an all-pole time-varying filter for modelling the short-time spectral envelope of the speech waveform.
➢ It is often called a short-term prediction filter because its coefficients are computed by predicting a speech sample from a few (8 to 16) previous samples.
➢ The set of coefficients is called the Linear Predictive Coding (LPC) parameters.
➢ The spectral envelope of a speech segment of length L samples can be approximated by the transfer function of an all-pole digital filter of the form

H(z) = G / (1 − Σ_{k=1}^{p} a_k z^{-k})

➢ The coefficients a_k are computed using the method of linear prediction. The number of coefficients, p, is called the predictor order.
➢ The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of the p past speech samples, i.e.

ŝ(n) = Σ_{k=1}^{p} a_k s(n−k)
h
^
¾ Where s(n) is the speech sample and S (n) is the

ia ja
predicted speech sample at sampling instant n. The

a l ra
prediction error, ε (n) is defined as

tr ai
us ikb
Am
¾The prediction error signal, ε (n) is shown in
the diagram:
SW .
,A
N E
U or
ss
o fe
Pr

30
➢ It can be seen from the above figure that when a_k = α_k, then ε(n) = G·u(n), and the linear prediction filter is called an inverse filter. The transfer function of the inverse filter is given by

A(z) = 1 − Σ_{k=1}^{p} a_k z^{-k}

➢ For voiced speech this means that ε(n) would consist of a train of impulses; ε(n) would be small most of the time except at the beginning of each pitch period.
➢ An example of a signal s(n) and its prediction error for the vowel 'a' is shown below.

[Figure: waveform of the vowel 'a' and the corresponding prediction error signal]

➢ Because of the time-varying nature of speech, the prediction coefficients should be estimated from short segments of the speech signal (10–20 ms).

➢ The basic approach is to find the set of predictor coefficients a_k that minimises the mean-squared prediction error over a short segment of the speech waveform.
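A minimal sketch of this short-segment LPC analysis (autocorrelation method; the Hamming window and order p = 10 are typical choices, not mandated by the notes):

```python
import numpy as np

def lpc_autocorrelation(frame, p=10):
    """Return LPC coefficients a_1..a_p minimising the MSE over one frame."""
    w = frame * np.hamming(len(frame))
    r = np.array([np.dot(w[:len(w)-k], w[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

def inverse_filter(frame, a):
    """Prediction error eps(n) = s(n) - sum_k a_k s(n-k), for n >= p."""
    p = len(a)
    eps = frame[p:].copy()
    for k in range(1, p + 1):
        eps -= a[k - 1] * frame[p - k:len(frame) - k]
    return eps

# usage on one 20 ms frame at 8 kHz (160 samples of some speech array `speech`):
# a = lpc_autocorrelation(speech[:160]); eps = inverse_filter(speech[:160], a)
```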
General Model for Analysis-by-Synthesis Linear Predictive Coding

[Figure: general analysis-by-synthesis LPC model (repeated)]
The Error Weighting Filter

➢ The function of the perceptual error weighting filter W(z) in the above diagram is to reduce a large part of the perceived noise in the coder, which comes from the frequency regions where the signal level is low.
➢ The theory of auditory masking suggests that noise in the formant regions would be partially or totally masked by the speech signal.
➢ Therefore, to reduce the perceived noise, the error spectrum E(z) is shaped (see figure on the previous slide) so that the frequency components of the noise around the formant regions are allowed to have higher energy relative to the components in the inter-formant regions.
➢ The coefficients for the error weighting filter are derived from the LPC coefficients.
➢ The weighting filter W(z) can be expressed as

W(z) = A(z) / A(z/γ_D)

where

A(z/γ_D) = 1 − Σ_{k=1}^{p} a_k γ_D^k z^{-k}

➢ The value γ_D is determined by the degree to which one wishes to deemphasize the formant regions in the error spectrum E(z).
➢ Note that decreasing γ_D from 1 increases the bandwidth of the poles of W(z).
➢ If γ_D = 1, then W(z) = 1, which is equivalent to no weighting.
➢ A good choice is to use a value of γ_D between 0.8 and 0.9.
➢ In this coder γ_D is chosen as 0.9.
➢ Decreasing γ_D increases the bandwidth of the poles of A(z)/A(z/γ_D); the increase in bandwidth ∆f is given by

∆f = −(f_s / π) ln(γ_D)  Hz

where f_s is the sampling frequency.
➢ The figure below shows an example of the spectrum of 1/A(z) and A(z)/A(z/γ_D) for different values of γ_D.
➢ An increase in the bandwidth of the poles is noticeable when γ_D is changed from 0.9 to 0.8.

[Figure: spectrum of 1/A(z) and A(z)/A(z/γ) for different values of γ]
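The weighting filter is cheap to realise, since A(z/γ) simply scales the kth LPC coefficient by γ^k. A sketch (the 2nd-order coefficients used in the example are illustrative):

```python
import numpy as np
from scipy.signal import lfilter

def weighted_error(err, a, gamma=0.9):
    """Apply W(z) = A(z)/A(z/gamma) to an error signal.

    a: LPC coefficients a_1..a_p of A(z) = 1 - sum_k a_k z^-k.
    """
    num = np.concatenate(([1.0], -a))                               # A(z)
    den = np.concatenate(([1.0], -a * gamma ** np.arange(1, len(a) + 1)))
    return lfilter(num, den, err)

# illustrative 2nd-order example on a random error signal
a = np.array([1.3, -0.6])
e = np.random.randn(160)
ew = weighted_error(e, a, gamma=0.9)
```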
The Error Minimization

[Figure: analysis-by-synthesis loop showing the excitation generator, synthesis filter, error weighting filter and error minimisation]

➢ The excitation generator produces the excitation sequence, which is fed to the synthesis filter to produce the reconstructed speech.
The Error Minimization

➢ The most common error minimisation criterion is the mean squared error.
➢ In this model (see previous slide) a subjectively meaningful error minimisation criterion is used, where the error e(n) is passed through a perceptual weighting filter.
➢ The excitation sequence is optimised by minimising the perceptually weighted error between the original and synthesised speech.
The Encoder

➢ In the encoding procedure, the synthesis filter parameters (LPC coefficients) are determined from the speech samples (a frame of 20 ms of speech = 160 samples) outside the optimisation loop.
➢ The optimum excitation sequence for this synthesis filter is then determined by minimising the weighted error criterion.
➢ The excitation optimisation interval is usually in the range of 4 to 7.5 ms, which is less than the LPC coefficient update frame size.
➢ Thus the speech frame is divided into sub-frames (normally four 5 ms sub-frames = 4 × 40 samples), where the excitation is determined individually for each sub-frame.

➢ A new set of LPC coefficients is transmitted every 20 ms, although interpolation of the LPC parameters between adjacent frames can be used to obtain a different set of parameters for every sub-frame.
➢ This interpolation enables the updating of the filter parameters every 5 ms while transmitting them to the decoder only every 20 ms, i.e. without requiring the higher bit rate associated with shorter update frames.
➢ The set of predictor coefficients a_k cannot be used for interpolation, because the interpolated parameters would not guarantee a stable synthesis filter.
➢ The interpolation is therefore performed using a transformed set of parameters for which filter stability can easily be guaranteed, such as the Log-Area Ratios (LAR), Line Spectral Pairs (LSP) or Line Spectrum Frequencies (LSF).
➢ If F_N is the quantised LPC vector in the present frame and F_{N−1} is the quantised LPC vector from the past frame, then the interpolated LPC vector SF_k in sub-frame k is given by

SF_k = δ_k F_{N−1} + (1 − δ_k) F_N

where δ_k is a fraction between 0 and 1.

➢ The variable δ_k is gradually decreased with the sub-frame index.
➢ The figure in the next slide shows two frames and the corresponding sub-frames.
➢ As a specific example, a good choice of values of δ_k is 0.75, 0.5, 0.25 and 0, for k = 1, …, 4.
➢ Using these values, the interpolated LPC vectors in the four sub-frames are given by

SF_1 = 0.75 F_{N−1} + 0.25 F_N
SF_2 = 0.50 F_{N−1} + 0.50 F_N
SF_3 = 0.25 F_{N−1} + 0.75 F_N
SF_4 = F_N

[Figure: two adjacent frames and their sub-frames, showing the interpolation]

➢ Interpolation in this manner, however, increases the overall codec delay.
The Long-Term Predictor

➢ While the short-term predictor models the spectral envelope of the speech segment being analysed, the long-term predictor, or pitch predictor, is used to model the fine (harmonic) structure of the spectrum.
➢ After inverse filtering, the prediction error ε(n) still exhibits some periodicity related to the pitch period of the original speech when it is voiced.
➢ This periodicity is of the order of 20 to 147 samples (55 to 400 Hz).
The Long-Term Predictor…..

➢ Adding the pitch predictor to the inverse filter further removes redundancy in the residual signal and turns it into a noise-like process (see below).

[Figure: short-term inverse filter followed by the pitch predictor, producing the noise-like residual r(n)]
The Long-Term Predictor…..

➢ It is called a pitch predictor since it removes the pitch periodicity, or a long-term predictor since the predictor delay is between 20 and 147 samples.
➢ The output signal r(n) after the pitch predictor is very close to random Gaussian noise.
➢ The transfer function of the pitch predictor is given by

P(z) = Σ_{k=−m1}^{m2} b_k z^{−(α+k)}
The Long-Term Predictor……

➢ For m1 = m2 = 0 a one-tap predictor is obtained; for m1 = m2 = 1 a three-tap predictor results.
➢ The delay α usually represents the pitch period in samples (or a multiple of it), and the b_k represent the long-term predictor coefficients.
➢ The pitch predictor is not an essential requirement in medium bit-rate LPC coders such as multi-pulse excited LPC and regular-pulse excited LPC coders; however, including a pitch predictor improves their performance.
➢ The pitch predictor is essential in low bit-rate speech coders, such as those using the CELP technique.
➢ A one-tap pitch predictor is shown below and is given by

P(z) = b_0 z^{−α}

where ε(n), the input in the figure, is the residual after short-term prediction (inverse filtering).

[Figure: one-tap pitch prediction of the short-term residual ε(n), producing r(n)]

➢ For the one-tap predictor, the residual r(n) is given by

r(n) = ε(n) − b_0 ε(n−α)
Open-Loop Method

➢ The parameters α and b_0 are determined by minimising the mean squared residual error after short-term and long-term prediction over a period of N samples.
➢ The mean squared residual E is given by

E = Σ_{n=0}^{N−1} [ε(n) − b_0 ε(n−α)]²

➢ Setting ∂E/∂b_0 = 0 gives

b_0 = Σ_{n=0}^{N−1} ε(n) ε(n−α) / Σ_{n=0}^{N−1} ε²(n−α),   with b_0 ≤ 1
➢ Substituting b_0 into the mean squared residual equation gives

E = Σ_{n=0}^{N−1} ε²(n) − [Σ_{n=0}^{N−1} ε(n) ε(n−α)]² / Σ_{n=0}^{N−1} ε²(n−α)

➢ Minimising E requires maximising the normalised correlation term (the second term) in this equation.
➢ This term is computed for all possible values of α over its specified range, and the value of α which maximises this term (i.e. minimises E) is chosen.
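A direct implementation of this open-loop search over the allowed lag range (20 to 147 samples, matching the text; a sketch that searches over one whole residual buffer rather than frame-plus-history):

```python
import numpy as np

def open_loop_pitch(eps, lag_min=20, lag_max=147):
    """Return (alpha, b0) maximising the normalised correlation term."""
    best_alpha, best_score = lag_min, -np.inf
    for alpha in range(lag_min, lag_max + 1):
        x, y = eps[alpha:], eps[:-alpha]     # eps(n) and eps(n - alpha)
        num = np.dot(x, y) ** 2
        den = np.dot(y, y)
        if den > 0 and num / den > best_score:
            best_score, best_alpha = num / den, alpha
    y = eps[:-best_alpha]
    b0 = np.dot(eps[best_alpha:], y) / np.dot(y, y)
    return best_alpha, b0

# usage: alpha, b0 = open_loop_pitch(residual)   # residual = eps(n) buffer
```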
➢ The general model for analysis-by-synthesis linear predictive coding given previously may be modified to include a long-term (pitch) predictor.

[Figure: analysis-by-synthesis LPC model including both the short-term and the long-term predictor]
➢ It was shown above that the pitch predictor parameters b_0 and α can be calculated directly from the LPC residual signal ε(n).
➢ This is known as the open-loop method for calculating the pitch predictor parameters.
➢ However, a significant improvement is achieved when the pitch predictor parameters b_0 and α are optimised inside the analysis-by-synthesis loop. In this case, the computation of the parameters contributes directly to the weighted error minimisation procedure.
Closed-Loop Method

➢ A different configuration (see next slide) can be derived from the general model for analysis-by-synthesis LPC coding with a long-term predictor, which is useful for closed-loop analysis.
➢ The weighting filter W(z) is given by

W(z) = A(z) / A(z/γ)
[Figure: modified analysis-by-synthesis configuration used for closed-loop long-term predictor analysis]
➢ In this new configuration, the original and the synthesised speech are weighted separately before subtraction.

[Figure: closed-loop configuration; s(n) and the synthesised speech are each passed through W(z) before the error is formed]

➢ Assuming one-tap long-term prediction, the output of the pitch synthesis filter is given by

ê(n) = v(n) + b_0 ê(n−α)

where v(n) denotes the (yet undetermined) excitation.

➢ It is first assumed that no excitation has been determined, so the above equation reduces to

ê(n) = b_0 ê(n−α)
➢ The weighted synthesised speech ŝ_w(n) is given by

ŝ_w(n) = b_0 Σ_{i=0}^{n} ê(i−α) h(n−i) + ŝ_0(n)

where h(n) is the impulse response of the combined filter W(z)/A(z) and ŝ_0(n) is the zero-input response of the combined filter W(z)/A(z), i.e. the output of the combined filter due to its initial states.

➢ The weighted error between the original and synthesised speech is given by

e_w(n) = s_w(n) − ŝ_w(n)
➢ By combining the expression for ŝ_w(n) with e_w(n) = s_w(n) − ŝ_w(n) we obtain

e_w(n) = x(n) − b_0 y_α(n)

where x(n) = s_w(n) − ŝ_0(n) and y_α(n) = Σ_{i=0}^{n} ê(i−α) h(n−i).

➢ The mean squared weighted error is given by

E_w = Σ_{n=0}^{N−1} [x(n) − b_0 y_α(n)]²
➢ Setting ∂E_w/∂b_0 = 0 gives

b_0 = Σ_{n=0}^{N−1} x(n) y_α(n) / Σ_{n=0}^{N−1} y_α²(n)

➢ Substituting b_0 we obtain

E_w = Σ_{n=0}^{N−1} x²(n) − [Σ_{n=0}^{N−1} x(n) y_α(n)]² / Σ_{n=0}^{N−1} y_α²(n)

➢ A significant speech quality improvement over the open-loop solution is achieved when the long-term predictor parameters b_0 and α are computed inside the optimisation loop using the above equations.
➢ The disadvantage of the closed-loop solution is the extra computation needed to evaluate the convolution y_α(n) over the whole range of α.
➢ However, a fast procedure to compute this convolution y_α(n) for all possible delays is available.
➢ Normally α is calculated using the open-loop method and b_0 is calculated using the closed-loop method.
[Figure: a basic structure for high quality analysis-by-synthesis LPC coding]
Code-Excited Linear Prediction

➢ The residual signal after short-term and long-term prediction becomes noise-like, and it is assumed that this residual can be modelled by a zero-mean Gaussian process with a slowly varying power spectrum.
➢ As a result, the excitation frame may be vector quantised using a large stochastic codebook.
➢ In the CELP approach, a 5 ms (40 samples) excitation frame is modelled by a Gaussian vector chosen from a large Gaussian codebook by minimising the perceptually weighted error between the original and synthesised speech.
Code-Excited Linear Prediction ……..

➢ Usually a codebook of 1024 entries is needed, and the optimum innovation sequence is chosen by an exhaustive search.
➢ This exhaustive search of the excitation codebook greatly complicates the CELP implementation.
➢ The excitation generator in the previous slide(s) can now be replaced by a stochastic codebook with a gain β.
➢ This is shown in the next slide.
A basic structure for high quality analysis-by-synthesis LPC coding

[Figure: the speech synthesis section of the analysis-by-synthesis structure]
Code-Excited Linear Prediction ..

➢ The stochastic codebook is sometimes referred to as a fixed codebook, and the gain β is referred to as a fixed gain.
➢ The speech synthesis section (see previous slide) is redrawn in the diagram below to include the stochastic codebook.

[Figure: speech synthesis section with the stochastic codebook and gain β]
Code-Excited Linear Prediction ..

➢ Using P(z) = 1 − b_0 z^{−α}, the diagram in the previous slide can be modified as shown below.
➢ Here α is the pitch period.

[Figure: synthesis section with the long-term predictor redrawn as a chain of delay cells]
Code-Excited Linear Prediction ..

➢ It can be seen that each delay cell contains 40 samples and these values are updated every 5 ms (the sub-frame delay).
➢ The long-term predictor can now be replaced by an adaptive codebook, and the appropriate address selected from this codebook will correspond to the pitch delay α.
➢ The gain b_0 is now denoted by the adaptive gain g. The figure in the previous slide is now redrawn to include the adaptive codebook (see next slide).
Code-Excited Linear Prediction (continued)

➢ The figure in the next slide shows a schematic diagram of the CELP coder including the adaptive codebook.
➢ The address selected from the adaptive codebook (α), the corresponding gain (g), the address (k) selected from the stochastic codebook and the corresponding gain (β) are sent to the decoder.
➢ The decoder uses identical codebooks to determine the excitation signal at the input of the short-term predictor in order to produce the reconstructed speech.
Schematic Diagram of the CELP Synthesis Model

[Figure: CELP synthesis model with adaptive and stochastic codebooks, gains g and β, and the short-term synthesis filter]
Schematic Diagram of a CELP Coder

[Figure: CELP coder (analysis-by-synthesis encoder)]
Code-Excited Linear Prediction …

➢ The first stage in determining the excitation parameters is to calculate the pitch delay α and the adaptive gain g.
➢ Then the equations for the stochastic codebook gain β and the best excitation codeword (k) are derived.
➢ Using the figure in the previous slide, the weighted synthesised speech can be written as

ŝ_w(n) = g y_α(n) + β Σ_{i=0}^{n} c_k(i) h(n−i) + ŝ_0(n)

where y_α(n) is the adaptive codebook contribution filtered through W(z)/A(z) and c_k(n) is the kth stochastic codebook vector.
➢ The weighted error between the original and the synthesised speech is given by:

e_w(n) = s_w(n) − ŝ_w(n)

➢ By combining the above two equations we obtain:

e_w(n) = s_w(n) − ŝ_0(n) − g y_α(n) − β Σ_{i=0}^{n} c_k(i) h(n−i)

➢ Assume first that no stochastic excitation has been determined, i.e. c_k(n) = 0.
➢ Therefore the above equation is rewritten as

e_w(n) = x(n) − g y_α(n),   where x(n) = s_w(n) − ŝ_0(n)

➢ The mean squared weighted error is given by

E_w = Σ_{n=0}^{N−1} [x(n) − g y_α(n)]²

➢ Setting ∂E_w/∂g = 0 gives
g = Σ_{n=0}^{N−1} x(n) y_α(n) / Σ_{n=0}^{N−1} y_α²(n)

➢ Substituting g into the equation given in the previous slide results in

E_w = Σ_{n=0}^{N−1} x²(n) − [Σ_{n=0}^{N−1} x(n) y_α(n)]² / Σ_{n=0}^{N−1} y_α²(n)
➢ The stochastic codebook parameters β and k are derived in the following sequence. With g and α now fixed, the remaining weighted error is

e_w(n) = x′(n) − β w_k(n)

where x′(n) = x(n) − g y_α(n) and w_k(n) = Σ_{i=0}^{n} c_k(i) h(n−i).

➢ The mean squared weighted error is given by

E_w = Σ_{n=0}^{N−1} [x′(n) − β w_k(n)]²
➢ Setting ∂E_w/∂β = 0 gives

β = Σ_{n=0}^{N−1} x′(n) w_k(n) / Σ_{n=0}^{N−1} w_k²(n)

➢ E_w now becomes

E_w = Σ_{n=0}^{N−1} x′²(n) − [Σ_{n=0}^{N−1} x′(n) w_k(n)]² / Σ_{n=0}^{N−1} w_k²(n)
➢ Letting H denote the lower-triangular convolution matrix formed from the impulse response h(n), so that w_k = H c_k, leads to the following matrix form:

Σ_n x′(n) w_k(n) = ψᵀ c_k,   where ψ = Hᵀ x′

➢ Similarly,

Σ_n w_k²(n) = c_kᵀ Φ c_k,   where Φ = Hᵀ H
Schematic Diagram of a CELP Coder

[Figure: CELP coder (repeated)]
Code-Excited Linear Prediction ..

➢ The excitation codebook (see figure in the previous slide) contains L codewords (stochastic vectors) of length N samples (typically L = 1024 and N = 40 samples, corresponding to a 5 ms excitation frame).
➢ The excitation signal for a speech frame of length N is chosen by an exhaustive search of the codebook.
➢ This is a computationally demanding procedure which is difficult to implement in real time.
➢ There are alternative methods to simplify the codebook search procedure without affecting the quality of the output speech, such as:
  – CELP search procedure using the autocorrelation approach
  – Structured codebooks
  – Sparse excitation codebooks
  – Ternary codebooks
  – Algebraic codebooks
  – Overlapping codebooks
Code-Excited Linear Prediction

➢ It can be seen from the diagram that CELP coders do not distinguish between voiced and unvoiced frames.
➢ Input speech is divided into non-overlapping frames, typically 20 ms (160 samples) in duration, and ten LPC coefficients or Line Spectral Pairs (LSP) are calculated for each frame.
➢ A speech frame is divided into sub-frames (normally 4 sub-frames = 4 × 40 samples) and the parameters g, α, β and k given by the previous equations are calculated for each sub-frame.
Code-Excited Linear Prediction …

➢ These parameters, along with the LSPs, are quantised before being transmitted to the decoder.
➢ At the decoder, 40 samples are synthesised at a time.
➢ The fundamental frequency in speech can change more quickly than its spectral content, hence the parameters g, β, α and k are calculated more frequently than the LPC analysis.
➢ The bit allocation for a 6.8 kbit/s CELP coder is shown in the following table.

[Table: bit allocation for a 6.8 kbit/s CELP coder]
Implementation of a CELP Coder

➢ The table below shows the default analysis conditions that can be used in a CELP coder simulation.
➢ The implementation of the CELP coder is based on the calculation of the parameters described above; a block diagram of the excitation sequence determination is shown in the next slide. This procedure is repeated for every sub-frame.

[Table: default analysis conditions for a CELP coder simulation]
[Figure: block diagram of the excitation sequence determination in the CELP coder]
Speech Quality Measurement

➢ The original speech and the decoded speech may be compared in two ways:

(a) By calculating the overall signal-to-noise ratio as follows:

SNR = 10 log_{10} [ Σ_n s²(n) / Σ_n (s(n) − ŝ(n))² ]  dB

where s(n) is the original and ŝ(n) the decoded speech.
(b) By calculating the segmental signal-to-noise ratio as follows:

SNRseg = (1/K) Σ_{k=1}^{K} SNR_k  dB

where K is the number of frames and SNR_k is the S/N of the kth frame, calculated using the equation given in the previous slide.
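Both measures are straightforward to compute; a sketch (the 160-sample frame length and the skipping of silent frames are illustrative choices, not specified in the notes):

```python
import numpy as np

def snr_db(s, s_hat):
    """Overall signal-to-noise ratio in dB."""
    return 10 * np.log10(np.sum(s**2) / np.sum((s - s_hat)**2))

def segmental_snr_db(s, s_hat, frame_len=160):
    """Mean of per-frame SNRs (segmental SNR) in dB."""
    snrs = []
    for i in range(0, len(s) - frame_len + 1, frame_len):
        sf, ef = s[i:i+frame_len], s_hat[i:i+frame_len]
        if np.sum(sf**2) > 0:                # skip all-zero (silent) frames
            snrs.append(snr_db(sf, ef))
    return np.mean(snrs)
```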
