ELEC9344: Speech & Audio Processing

Chapter 4
Speech Coding I

Professor Ambikairajah, UNSW, Australia
Speech Coding

➢ Speech coding is a technique used to compress a speech signal for digital storage or transmission purposes and then decompress the stored or transmitted data, so as to allow a perceptually faithful reproduction of the original signal.
➢ Sophisticated speech coding algorithms exist, so that speech signals may be compressed at various bit rates.
➢ It is known that the lower the bit rate, the lower the quality of the reconstructed speech.
Speech Coders

➢ There are basically three classes of speech coders:
  • Waveform Coders
  • Source Coders
  • Hybrid Coders
➢ Waveform coders accept continuous analogue speech signals and encode them into digital signals prior to transmission. Decoders at the receivers reverse the encoding process to recover the speech.
Waveform Encoding Techniques

➢ The waveform encoding techniques are:
  – PCM (Pulse Code Modulation)
  – DPCM (Differential Pulse Code Modulation)
  – ADPCM (Adaptive Differential Pulse Code Modulation)
  – DM (Delta Modulation)
  – ADM (CVSD) [Adaptive Delta Modulation or Continuously Variable-Slope Delta Modulation]
➢ The coding algorithms outlined above operate in the time domain.
Linear PCM

➢ The simplest waveform coding method is linear pulse code modulation. The analogue signal is quantised linearly (uniform quantisation: all steps are of equal size ∆V). No compression is involved.
➢ For a B-bit uniform quantiser it can be shown that the Signal to Quantisation Noise Ratio (SQNR) is given by:

    SQNR = 6.02B + 1.76 dB

(Figure: speech waveform quantised with uniform steps ∆V.)
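The 6 dB-per-bit rule can be checked numerically. A minimal sketch, not from the slides, assuming a full-scale sinusoidal test signal (the case for which the 1.76 dB constant holds):

```python
import numpy as np

def sqnr_db(bits, n=100_000):
    """Empirical SQNR of a full-scale sinusoid through a uniform quantiser."""
    t = np.arange(n)
    x = np.sin(2 * np.pi * 0.1234 * t)        # full-scale test tone in [-1, 1]
    step = 2.0 / (2 ** bits)                  # quantisation step size (Delta V)
    xq = np.round(x / step) * step            # uniform (mid-tread) quantisation
    noise = x - xq
    return 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

for b in (8, 12, 16):
    print(b, round(sqnr_db(b), 2), round(6.02 * b + 1.76, 2))
```

The empirical figures track 6.02B + 1.76 dB closely, confirming the roughly 6 dB gained per extra bit.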
Nonuniform Quantisation

➢ Speech signal amplitudes are heavily concentrated at low levels, and hence it is a much better strategy to use a nonuniform quantiser in which the steps are densest at the low levels.
➢ The speech PDF is not uniform but approximately Gaussian, and lower-level speech occurs more frequently than higher-level speech. Therefore more quantisation noise is tolerable at high speech levels than at low levels.

(Figure: Gaussian speech PDF compared with a uniform PDF.)
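One standard way to realise such a nonuniform quantiser is logarithmic companding. A sketch of µ-law companding (the µ = 255 law used in 8-bit PCM; the helper function names are mine, not from the slides):

```python
import numpy as np

MU = 255.0  # mu-law parameter used in 8-bit companded PCM

def mulaw_compress(x):
    """mu-law companding: fine effective steps near zero, coarse at high level."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.array([0.01, 0.1, 0.5, 1.0])
y = mulaw_compress(x)
# Small amplitudes are boosted before uniform quantisation, so after
# expansion the quantiser steps are densest at low levels, as described.
```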
DPCM

➢ By taking advantage of the considerable amount of redundancy in the speech signal, the bit rate may be reduced.
➢ For example, if we take the difference of the speech signal, s(n) − s(n−1), this will have a lower variance (and hence lower power) than the signal itself.
➢ If the power is reduced, then we should be able to encode the signal with fewer bits and thereby reduce the bit rate required.

    e(n) = s(n) − s(n−1)
    E(z) = S(z)[1 − z⁻¹]
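The variance reduction from simple differencing can be illustrated on a correlated synthetic signal standing in for speech (an illustrative assumption; real speech behaves similarly because adjacent samples are highly correlated):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "speech-like" signal: strongly correlated AR(1) samples stand in
# for real speech, which is similarly redundant from sample to sample.
n, a = 50_000, 0.95
s = np.zeros(n)
for i in range(1, n):
    s[i] = a * s[i - 1] + rng.standard_normal()

e = s[1:] - s[:-1]            # DPCM difference signal e(n) = s(n) - s(n-1)
print(np.var(s), np.var(e))   # the difference signal has much lower variance
```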
DPCM

➢ The transmitter computes e(n), and this difference is transmitted through the channel to the receiver (see below), where it is integrated in the feedback loop and the original signal is restored.

(Block diagram: transmitter forms E(z) from S(z); the receiver integrates E(z) to give Y(z).)

    E(z) = S(z)(1 − z⁻¹)
    Y(z) = E(z) / (1 − z⁻¹) = S(z)
➢ DPCM with quantisation of the difference signal: the quantisation noise is modelled by additive noise Nq(z) at the quantiser.

(Block diagram: quantiser between differencer and channel, input S(z), output Y(z).)

    Y(z) = S(z) + Nq(z) / (1 − z⁻¹)

➢ The quantisation noise is therefore accumulated by the integrator at the receiver, so its effect on the output is worse than if the same noise had merely been added to the original speech signal.
Differential quantisation with the quantiser inside a feedback loop at the transmitter:

(Block diagram: quantiser inside the transmitter feedback loop, input S(z), output Y(z).)

    E(z) = S(z) − C(z) + Nq(z)
    C(z) = z⁻¹ E(z) / (1 − z⁻¹)

so that

    E(z) = [S(z) + Nq(z)](1 − z⁻¹)

At the receiver:

    Y(z) = E(z) / (1 − z⁻¹) = S(z) + Nq(z)

The quantisation noise now appears at the output directly, without being accumulated by the receiver's integrator.
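A small simulation of the quantiser-in-the-loop arrangement; the test signal and step size are arbitrary choices for illustration. Because the transmitter predicts from its own reconstruction, the receiver's output error stays within half a quantisation step instead of accumulating:

```python
import numpy as np

def dpcm_closed_loop(s, step):
    """DPCM with the quantiser inside the transmitter's feedback loop:
    the difference is taken against the transmitter's own reconstruction,
    so quantisation errors never accumulate at the receiver."""
    recon_tx, recon_rx = 0.0, 0.0
    y = np.zeros_like(s)
    for n, x in enumerate(s):
        d = x - recon_tx                 # difference vs local reconstruction
        e = step * np.round(d / step)    # quantised difference e(n)
        recon_tx += e                    # transmitter tracks the decoder
        recon_rx += e                    # receiver integrates e(n)
        y[n] = recon_rx
    return y

rng = np.random.default_rng(1)
s = np.cumsum(0.1 * rng.standard_normal(2000))  # slowly varying test signal
y = dpcm_closed_loop(s, step=0.05)
# Reconstruction error stays bounded by half a quantisation step throughout.
```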
➢ The next refinement is to recognise that simple differencing does not minimise the error of the output. The simple differencer is replaced by a full pth-order linear predictor P(z):

    E(z) = [S(z) + Nq(z)](1 − P(z))

➢ Using such a predictor, a further reduction in the variance of the transmitted error signal is obtained.
Adaptive Prediction

➢ The compression ratio of DPCM can be further improved by adaptive prediction. Adaptive prediction in its simplest form is simply linear prediction.
➢ The speech data is segmented into frames of typically 10 to 20 ms, and the predictor coefficients are transmitted along with the residual (see next slide).
➢ At the receiver, an inverse filter controlled by the predictor coefficients reconstructs the speech.
➢ A substantial S/N improvement over fixed prediction has been reported with an adaptive predictor.
ADPCM

(Block diagram: ADPCM coder with adaptive predictor, input S(z), output Y(z).)
Summary of Waveform Coders

➢ A-law or µ-law PCM: 8-bit compressed PCM, 8 kHz sampling rate; 64 kbps
➢ DPCM: 32 kbps
➢ ADPCM: 24-32 kbps
Source Coders

➢ Source coders such as Linear Predictive Coding (LPC) vocoders, which are based on the speech production model, operate at bit rates as low as 2 kbits/s. However, the synthetic quality of the vocoded speech is not broadly appropriate for commercial telephone applications.
➢ LPC in its basic form has been mainly used in secure military communications, where speech must be carried at very low bit rates.
LPC-10 Algorithm

➢ The input speech is sampled at 8 kHz and digitised as 12-bit two's complement words. The incoming speech is partitioned into 180-sample frames (22.5 ms), resulting in a frame rate of 44.44 frames/sec.
➢ A tenth-order linear predictor is used to compute the vocal tract filter parameters (reflection coefficients) by means of an LPC analysis algorithm.
LPC-10 Algorithm

➢ The bit assignments used are: 5 bits each for k1 to k4, 4 bits each for k5 through k8, 3 bits for k9 and 2 bits for k10.
➢ In the transmitted bit stream, 41 bits are used for the reflection coefficients, 7 bits for pitch and voicing, and 5 bits for amplitude.
➢ One bit is used for synchronisation, giving a total of 54 bits per frame.
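These bit assignments can be checked against the frame rate given earlier: at 54 bits per 22.5 ms frame the coder lands exactly on the classic 2.4 kbit/s rate.

```python
# Checking the LPC-10 numbers from the slides: 54 bits per frame at the
# 8000/180 frame rate gives the classic 2.4 kbit/s LPC-10 bit rate.
frame_rate = 8000 / 180              # frames/s (180-sample frames at 8 kHz)
bits_per_frame = 5 * 4 + 4 * 4 + 3 + 2   # reflection coefficients k1..k10 = 41
bits_per_frame += 7 + 5 + 1              # pitch/voicing, amplitude, sync
print(bits_per_frame, bits_per_frame * frame_rate)  # 54 bits, 2400 bits/s
```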
Limitation of LPC Vocoding

➢ The main limitation of LPC vocoding is the assumption that speech signals are either voiced or unvoiced; hence the source of excitation of the all-pole synthesis filter is either a train of pulses (for voiced speech) or random noise (for unvoiced speech).
➢ In fact there are more than two modes in which the vocal tract is excited, and often these modes are mixed.
➢ Even when the speech waveform is voiced, it is a gross oversimplification to model the excitation as a perfectly periodic pulse train.
Hybrid Coders

➢ Hybrid coders combine features from both source coders and waveform coders. Several hybrid coders employ an analysis-by-synthesis process in order to derive the coder parameters.
➢ A block diagram of an analysis-by-synthesis coding system is shown below.
➢ A local decoder is present inside the encoder, and the aim of the analysis-by-synthesis speech coding structure is to minimise the error between the original and the locally synthesised speech.
General Model for Analysis-by-Synthesis Coding Structure

(Block diagram: analysis-by-synthesis structure whose synthesis section resembles an LPC vocoder.)

➢ The coder is capable of producing high quality speech at bit rates around 10 kbits/s.
➢ In this new generation of analysis-by-synthesis models, no prior knowledge of a voiced/unvoiced decision or pitch period is needed.
➢ The multipulse approach assumes that both the pulse positions and amplitudes are initially unknown; they are then determined inside the minimisation loop, one pulse at a time.
Hybrid Coders (continued)

➢ When the bit rate is reduced below 9.6 kbit/s, multipulse-excited coders fail to maintain good speech quality.
➢ This is due to the large number of bits needed to encode the excitation pulses; the quality deteriorates when these pulses are coarsely quantised.
➢ Therefore, if the analysis-by-synthesis structure is to be used for producing good quality speech at bit rates below 8 kbits/s, more suitable approaches to the definition of the excitation signal must be sought.
Speech Coding Standards

➢ For speech coding to be useful in telecommunication applications, it has to be standardised (it must conform to the same algorithm and bit format).
➢ Speech-coding standards are established by various standards organisations: for example, ITU-T (International Telecommunication Union, Telecommunication Standardisation Sector).
ITU Standards

➢ G.711/712: 64 kbits/s A/µ-law PCM
➢ G.726: 32 kbits/s ADPCM
➢ G.728: 16 kbits/s LD-CELP (Low Delay CELP)
➢ G.729: 8 kbits/s CS-ACELP (Conjugate Structure Algebraic CELP)
➢ The next standardisation issue involves a 4 kbit/s speech coding algorithm.
➢ It is important to develop low bit-rate speech coding systems for voice storage applications such as voice mail and voice-driven directory enquiries.
➢ Therefore, research on speech coding methods whose bit rate is less than 4 kbits/s but which still provide toll-quality speech has recently become very active.
➢ The ultimate goal is to design a low-delay, low bit rate codec (around 2.4 kbits/s) operating in real time.
Analysis-by-Synthesis Speech Coding

➢ As the Code-Excited Linear Prediction (CELP) coder is based on the analysis-by-synthesis technique, first the theory behind the analysis-by-synthesis method is explained and then it is extended to CELP coding.
➢ The basic structure of the general model for analysis-by-synthesis predictive coding of speech is shown below.
General Model for Analysis-by-Synthesis Linear Predictive Coding

(Block diagram; the main components are:)
  – Synthesis filter
  – Error minimisation
The Short-Term Predictor (Synthesis Filter)

➢ The synthesis filter is an all-pole time-varying filter for modelling the short-time spectral envelope of the speech waveform.
➢ It is often called a short-term prediction filter because its coefficients are computed by predicting a speech sample from a few previous samples (8 to 16 samples).
➢ The set of coefficients is called the Linear Predictive Coding (LPC) coefficients.
➢ The spectral envelope of a speech segment of length L samples can be approximated by the transfer function of an all-pole digital filter of the form

    H(z) = G / (1 − Σ_{k=1..p} a_k z⁻ᵏ)

➢ The coefficients a_k are computed using the method of linear prediction. The number of coefficients p is called the predictor order.
➢ The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of past speech samples.
➢ Here s(n) is the speech sample and ŝ(n) is the predicted speech sample at sampling instant n, with

    ŝ(n) = Σ_{k=1..p} a_k s(n−k)

The prediction error ε(n) is defined as

    ε(n) = s(n) − ŝ(n)

➢ The prediction error signal ε(n) is shown in the diagram:

(Diagram: prediction error filter producing ε(n) from s(n).)
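The predictor coefficients a_k can be found by least squares on the autocorrelation values. A sketch using a synthetic AR(2) "formant" in place of real speech (the signal and the order p = 2 are illustrative assumptions):

```python
import numpy as np

def lpc(signal, p):
    """Linear prediction via the autocorrelation (least-squares) method.
    Returns a_1..a_p such that s(n) is approximated by sum_k a_k * s(n-k)."""
    r = np.array([signal[:len(signal) - k] @ signal[k:] for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz
    return np.linalg.solve(R, r[1:])

rng = np.random.default_rng(0)
# Synthetic voiced-like segment: one AR(2) resonance stands in for a formant.
n = 2000
s = np.zeros(n)
for i in range(2, n):
    s[i] = 1.6 * s[i - 1] - 0.9 * s[i - 2] + rng.standard_normal()

a = lpc(s, p=2)
eps = s[2:] - (a[0] * s[1:-1] + a[1] * s[:-2])   # prediction error eps(n)
print(a, np.var(eps) / np.var(s))                # a near [1.6, -0.9], ratio << 1
```

The residual variance is far below the signal variance, which is exactly why the residual can be coded with fewer bits.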
➢ It can be seen from the above figure that when a_k = α_k, then ε(n) = G u(n), and the linear prediction filter is called an inverse filter. The transfer function of the inverse filter is given by

    A(z) = 1 − Σ_{k=1..p} a_k z⁻ᵏ

➢ For voiced speech this means that ε(n) would consist of a train of impulses; ε(n) would be small most of the time except at the beginning of each pitch period.
➢ An example of a signal s(n) and its prediction error for the vowel 'a' is shown below.

(Figure: waveform s(n) and prediction error ε(n) for the vowel 'a'.)

➢ Because of the time-varying nature of speech, the prediction coefficients should be estimated from short segments (10-20 ms) of the speech waveform.
General Model for Analysis-by-Synthesis Linear Predictive Coding

(Block diagram, repeated from earlier.)
The Error Weighting Filter

➢ The function of the perceptual error weighting filter W(z) in the above diagram is to reduce a large part of the perceived noise in the coder, which comes from the frequency regions where the signal level is low.
➢ The theory of auditory masking suggests that noise in the formant regions would be partially or totally masked by the speech signal.
➢ Therefore, to reduce the perceived noise, the error spectrum is shaped by a weighting filter derived from the LPC coefficients.
➢ The weighting filter W(z) can be expressed as

    W(z) = A(z) / A(z/γ_D) = (1 − Σ_{k=1..p} a_k z⁻ᵏ) / (1 − Σ_{k=1..p} a_k γ_Dᵏ z⁻ᵏ)

where the parameter γ_D (0 ≤ γ_D ≤ 1) controls the bandwidth of the poles of W(z).
➢ If γ_D = 1 then W(z) = 1, which is equivalent to no weighting.
➢ A good choice is to use a value of γ_D between 0.8 and 0.9.
➢ In this coder γ_D is chosen as 0.9.
➢ Note that decreasing γ_D increases the bandwidth of the poles of A(z)/A(z/γ_D).
➢ The increase in the bandwidth, ∆ω, is given by

    ∆ω ≈ −2 ln(γ_D)  rad/sample
➢ The figure below shows an example of the spectrum of 1/A(z) and A(z)/A(z/γ_D) for different values of γ_D.
➢ An increase in the bandwidth of the poles is noticeable when γ_D is changed from 0.9 to 0.8.

(Figure: spectra of 1/A(z) and A(z)/A(z/γ) for different values of γ.)
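In coefficient terms, A(z/γ_D) is simply A(z) with each a_k scaled by γ_D^k. A sketch of how W(z) de-emphasises error energy near a formant (the 2nd-order LPC polynomial is an illustrative assumption):

```python
import numpy as np

def weighting_filter_coeffs(a, gamma=0.9):
    """W(z) = A(z)/A(z/gamma): the numerator keeps a_k, the denominator
    scales a_k by gamma**k, which pushes the poles inward (wider bandwidth)."""
    k = np.arange(1, len(a) + 1)
    return a, a * gamma ** k

def poly_mag(a_coeffs, w):
    """|1 - sum_k a_k e^{-jkw}| at normalised frequency w (rad/sample)."""
    k = np.arange(1, len(a_coeffs) + 1)
    return np.abs(1 - np.sum(a_coeffs * np.exp(-1j * k * w)))

a = np.array([1.6, -0.9])          # illustrative 2nd-order LPC polynomial
num_a, den_a = weighting_filter_coeffs(a, gamma=0.8)
w = np.linspace(0, np.pi, 256)
resp = [poly_mag(num_a, x) / poly_mag(den_a, x) for x in w]
# |W| dips well below 1 near the formant (pole) frequency and rises elsewhere,
# so less weight is given to error energy where the speech energy is high.
```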
The Error Minimisation

(Block diagram of the error minimisation loop.)
➢ The most common error minimisation criterion is the mean squared error.
➢ In the model (see previous slide) a subjectively meaningful error minimisation criterion is used, where the error e(n) is passed through a perceptual weighting filter.
➢ The excitation sequence is optimised by minimising the perceptually weighted error.
The Encoder

➢ In the encoding procedure, the synthesis filter parameters (LPC coefficients) are determined from the speech samples (a 20 ms frame of speech = 160 samples) outside the optimisation loop.
➢ The optimum excitation sequence for this synthesis filter is then determined by minimising the weighted error criterion.
➢ The excitation optimisation interval is usually a fraction of the frame length.
➢ Thus the speech frame is divided into sub-frames (normally four 5 ms sub-frames = 4 × 40 samples), where the excitation is determined individually for each sub-frame.
➢ A new set of LPC coefficients is transmitted every 20 ms, although interpolation of LPC parameters between adjacent frames can be used.
➢ This interpolation enables the updating of the filter parameters every 5 ms while transmitting them to the decoder every 20 ms, i.e. without requiring the higher bit rate associated with shorter update frames.
➢ The set of predictor coefficients a_k cannot be used for interpolation, because the interpolated parameters in this case do not guarantee a stable synthesis filter.
➢ The interpolation is therefore performed using a transformed set of parameters for which the filter stability can be guaranteed, such as the Line Spectral Frequencies (LSF).
➢ If F_N is the quantised LPC vector in the present frame and F_{N−1} is the quantised LPC vector from the past frame, then the interpolated LPC vector SF_k in sub-frame k is given by

    SF_k = δ_k F_{N−1} + (1 − δ_k) F_N

where δ_k is a fraction between 0 and 1.
➢ The variable δ_k is gradually decreased with the sub-frame index.
➢ As a specific example, a good choice of values of δ_k is 0.75, 0.5, 0.25 and 0, for k = 1, ..., 4.
➢ Using these values, the interpolated LPC coefficients in the four sub-frames are given by

    SF_1 = 0.75 F_{N−1} + 0.25 F_N
    SF_2 = 0.5 F_{N−1} + 0.5 F_N
    SF_3 = 0.25 F_{N−1} + 0.75 F_N
    SF_4 = F_N
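The interpolation rule above is a simple per-sub-frame weighted average. A sketch with hypothetical parameter vectors (in practice the vectors would be LSFs so that the synthesis filter stays stable):

```python
import numpy as np

def interpolate_lpc(F_prev, F_curr, deltas=(0.75, 0.5, 0.25, 0.0)):
    """Per-sub-frame interpolation SF_k = delta_k*F_prev + (1-delta_k)*F_curr,
    with delta_k decreasing over the four sub-frames as on the slides."""
    return [d * F_prev + (1 - d) * F_curr for d in deltas]

F_prev = np.array([0.2, 0.5, 1.1])   # hypothetical quantised parameter vectors
F_curr = np.array([0.3, 0.6, 1.0])
sub = interpolate_lpc(F_prev, F_curr)
# sub[3] equals F_curr exactly (delta_4 = 0); earlier sub-frames lean
# progressively from the past frame toward the current one.
```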
The Long-Term Predictor

➢ While the short-term predictor models the spectral envelope of the speech segment being analysed, the long-term predictor, or pitch predictor, is used to model the fine structure of the speech spectrum.
➢ After inverse filtering, the prediction error ε(n) still exhibits some periodicity related to the pitch period of the original speech when it is voiced.
➢ Adding the pitch predictor to the inverse filter further removes redundancy in the residual signal and turns it into a noise-like process (see below).

(Block diagram: inverse filter A(z) followed by the pitch predictor, producing the noise-like residual r(n).)
The Long-Term Predictor (continued)

➢ It is called a pitch predictor since it removes the pitch periodicity, or a long-term predictor since the predictor delay is between 20 and 147 samples.
➢ The output signal r(n) after the pitch predictor is very close to random Gaussian noise.
➢ The transfer function of the pitch predictor is given by

    P(z) = 1 − Σ_{k=−m1..m2} b_k z^−(α+k)
The Long-Term Predictor (continued)

➢ For m1 = m2 = 0 a one-tap predictor is obtained; for m1 = m2 = 1 a three-tap predictor results.
➢ The delay α usually represents the pitch period in samples (or a multiple of it), and the b_k represent the long-term predictor coefficients.
➢ The pitch predictor is not an essential requirement in medium bit rate LPC coders such as multipulse-excited LPC and regular-pulse-excited LPC.
➢ The pitch predictor is, however, essential in low bit rate speech coders, such as those using the CELP technique.
➢ A one-tap pitch predictor is shown below and is given by

    r(n) = ε(n) − b_0 ε(n − α)

where ε(n) is the residual after short-term prediction or inverse filtering.
Open-Loop Method

➢ The parameters α and b_0 are determined by minimising the mean squared residual error after short-term and long-term prediction over a period of N samples.
➢ The mean squared residual E is given by

    E = Σ_{n=0..N−1} [ε(n) − b_0 ε(n − α)]²

➢ Setting ∂E/∂b_0 = 0 gives

    b_0 = Σ_n ε(n) ε(n−α) / Σ_n ε²(n−α),    |b_0| ≤ 1
➢ Substituting b_0 into the mean squared residual equation gives

    E = Σ_n ε²(n) − [Σ_n ε(n) ε(n−α)]² / Σ_n ε²(n−α)

➢ Minimising E requires maximising the normalised correlation term (the second term) in this equation.
➢ This term is computed for all possible values of α, and the delay that maximises it is chosen.
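The open-loop search above amounts to scanning all candidate delays, keeping the one with the largest normalised correlation term, then solving for b_0. A sketch on a synthetic periodic residual (the 57-sample period and 0.8 gain are illustrative assumptions):

```python
import numpy as np

def open_loop_pitch(eps, lag_min=20, lag_max=147):
    """Open-loop pitch search: maximise the normalised correlation term over
    candidate delays, then compute the one-tap gain b0 (method on the slides)."""
    n = len(eps)
    best_lag, best_score = lag_min, -np.inf
    for a in range(lag_min, lag_max + 1):
        x, y = eps[a:n], eps[:n - a]          # eps(n) and eps(n - alpha)
        denom = y @ y
        if denom == 0:
            continue
        score = (x @ y) ** 2 / denom          # normalised correlation term
        if score > best_score:
            best_lag, best_score = a, score
    x, y = eps[best_lag:], eps[:len(eps) - best_lag]
    b0 = (x @ y) / (y @ y)
    return best_lag, b0

rng = np.random.default_rng(2)
period = 57                                    # hypothetical pitch period
e = rng.standard_normal(400)
eps = e + 0.8 * np.concatenate([np.zeros(period), e[:-period]])
lag, b0 = open_loop_pitch(eps)
print(lag, round(b0, 2))                       # recovers the 57-sample delay
```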
➢ The general model for analysis-by-synthesis linear predictive coding as given previously may be modified in order to include a long-term predictor or pitch predictor.

(Block diagram: analysis-by-synthesis model including both short-term and long-term predictors.)
➢ It was shown above that the pitch predictor parameters b_0 and α are calculated directly from the LPC residual signal ε(n).
➢ This is known as the open-loop method for calculating the pitch predictor parameters.
➢ However, a significant improvement is achieved when the pitch predictor parameters b_0 and α are optimised inside the analysis-by-synthesis loop. In this case, the computation of the parameters becomes part of the perceptually weighted error minimisation procedure.
Closed-Loop Method

➢ A different configuration (see next slide) can be derived from the general model for analysis-by-synthesis LPC coding with a long-term predictor that may be useful for closed-loop analysis.
➢ The weighting filter W(z) is given by

    W(z) = A(z) / A(z/γ_D)
(Block diagram: closed-loop configuration in which the original and synthesised speech are each weighted by W(z) before subtraction.)
➢ In this new configuration, the original and the synthesised speech are weighted separately before subtraction.
➢ Assuming one-tap long-term prediction, the output u(n) of the pitch synthesis filter is given by

    u(n) = c(n) + b_0 u(n − α)

where c(n) is the excitation input to the pitch synthesis filter.
➢ The weighted synthesised speech ŝ_w(n) is given by

    ŝ_w(n) = Σ_{k=0..n} u(k) h(n − k) + ŝ_0(n)

where h(n) is the impulse response of the combined filter W(z)/A(z) and ŝ_0(n) is the zero-input response of the combined filter W(z)/A(z), i.e. the output of the combined filter due to its initial states.
➢ The weighted error between the original and the synthesised speech is

    e_w(n) = s_w(n) − ŝ_w(n)
➢ By combining ŝ_w(n) = Σ_k u(k) h(n − k) + ŝ_0(n) and e_w(n) = s_w(n) − ŝ_w(n), and taking c(n) = 0 so that u(n) = b_0 u(n − α) within the sub-frame, we obtain

    e_w(n) = x(n) − b_0 y_α(n)

where x(n) = s_w(n) − ŝ_0(n) and y_α(n) = Σ_k u(k − α) h(n − k).
➢ The mean squared weighted error is given by

    E_w = Σ_{n=0..N−1} [x(n) − b_0 y_α(n)]²
➢ Setting ∂E_w/∂b_0 = 0 gives b_0 = Σ_n x(n) y_α(n) / Σ_n y_α²(n). Substituting b_0 we obtain

    E_w = Σ_n x²(n) − [Σ_n x(n) y_α(n)]² / Σ_n y_α²(n)

➢ As in the open-loop case, the delay α is chosen to maximise the second term.
➢ A significant speech quality improvement over the open-loop method is obtained.
➢ The disadvantage of the closed-loop solution is the extra computation needed to evaluate the convolution y_α(n) of the previous slide over the whole range of α.
➢ However, a fast procedure to compute this convolution y_α(n) for all the possible delays is available.
Code-Excited Linear Prediction

➢ The residual signal after short-term and long-term prediction becomes noise-like, and it is assumed that this residual can be modelled by a zero-mean Gaussian process with a slowly varying power spectrum.
➢ As a result, the excitation frame may be vector quantised using a large stochastic codebook.
➢ In the CELP approach, a 5 ms (40-sample) excitation frame is modelled by a Gaussian vector selected from such a codebook.
Code-Excited Linear Prediction (continued)

➢ Usually a codebook of 1024 entries is needed, and the optimum innovation sequence is chosen by an exhaustive search.
➢ This exhaustive search of the excitation codebook greatly complicates the CELP implementation.
➢ The excitation generator in the previous slide(s) can now be replaced by a stochastic codebook with a gain β.
A Basic Structure for High Quality Analysis-by-Synthesis LPC Coding

(Block diagram; the speech synthesis section is indicated.)
Code-Excited Linear Prediction (continued)

➢ The stochastic codebook is sometimes referred to as a fixed codebook, and the gain β is referred to as a fixed gain.
➢ The speech synthesis section (see previous slide) is redrawn in the diagram below to include the stochastic codebook.

(Block diagram: synthesis section with stochastic codebook and gain β.)
Code-Excited Linear Prediction (continued)

➢ Using P(z) = 1 − b_0 z^−α, the diagram in the previous slide can be modified as shown below.
➢ Here α is the pitch period.

(Modified block diagram: the pitch synthesis filter 1/P(z) drawn as a delay line with gain b_0.)
Code-Excited Linear Prediction (continued)

➢ It can be seen that each delay cell contains 40 samples, and these values are updated every 5 ms (the sub-frame delay).
➢ The long-term predictor can now be replaced by an adaptive codebook, where the address selected from this codebook corresponds to the pitch delay α.
➢ The gain b_0 is now denoted by the adaptive gain g.
Code-Excited Linear Prediction (continued)

➢ Figure 11 shows a schematic diagram of the CELP coder including the adaptive codebook.
➢ The address selected from the adaptive codebook (α), the corresponding gain (g), the address (k) selected from the stochastic codebook and the corresponding gain (β) are sent to the decoder.
➢ The decoder uses identical codebooks to determine the excitation signal at the input to the short-term synthesis filter, and thereby reconstructs the speech.
Schematic Diagram of the CELP Synthesis Model

(Block diagram.)
Schematic Diagram of a CELP Coder

(Block diagram.)
Code-Excited Linear Prediction (continued)

➢ The first stage in determining the excitation parameters is to calculate the pitch delay α and the adaptive gain g.
➢ Then the equations for the stochastic codebook gain β and the best excitation codeword (k) are derived.
➢ Using the figure in the previous slide, the weighted synthesised speech can be written as

    ŝ_w(n) = g y_α(n) + β Σ_m c_k(m) h(n − m) + ŝ_0(n)

where y_α(n) is the filtered adaptive codebook contribution and c_k(n) is the selected stochastic codeword.
➢ The weighted error between the original and the synthesised speech is

    e_w(n) = s_w(n) − ŝ_w(n)
➢ By combining the above two equations we obtain

    e_w(n) = s_w(n) − ŝ_0(n) − g y_α(n) − β Σ_m c_k(m) h(n − m)

➢ Assume first that no stochastic excitation has been determined, i.e. c_k(n) = 0. The above equation is then rewritten as

    e_w(n) = x(n) − g y_α(n),    x(n) = s_w(n) − ŝ_0(n)

➢ The mean squared weighted error is given by

    E_w = Σ_{n=0..N−1} [x(n) − g y_α(n)]²

➢ Setting ∂E_w/∂g = 0 gives

    g = Σ_n x(n) y_α(n) / Σ_n y_α²(n)
➢ Substituting g into the equation given in the previous slide results in

    E_w = Σ_n x²(n) − [Σ_n x(n) y_α(n)]² / Σ_n y_α²(n)
h
¾ The stochastic codebook parameters β and k are
ia ja
derived in the following sequence:
a l ra
tr ai
us ikb
Am
SW .
,A
N E
¾The mean squared weighted error is given by
U or
ss
o fe
Pr
75
➢ Setting ∂E_w/∂β = 0 gives

    β = Σ_n x′(n) y_k(n) / Σ_n y_k²(n)

➢ E_w now becomes

    E_w = Σ_n x′²(n) − [Σ_n x′(n) y_k(n)]² / Σ_n y_k²(n)
➢ Letting C_k = Σ_n x′(n) y_k(n) and G_k = Σ_n y_k²(n) leads to the compact form

    E_w = Σ_n x′²(n) − C_k² / G_k

so the best codeword index k is the one that maximises C_k² / G_k.
➢ Similarly, writing y_k as the product of the lower-triangular convolution matrix H of h(n) with the codeword c_k allows C_k and G_k to be expressed in matrix form.
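The C_k²/G_k criterion gives a direct recipe for the exhaustive codebook search. A toy-sized sketch (L = 64 rather than 1024, and a decaying exponential standing in for the W(z)/A(z) impulse response; both are illustrative assumptions):

```python
import numpy as np

def search_stochastic_codebook(x_prime, codebook, h):
    """Exhaustive CELP codebook search: filter each codeword c_k through h(n)
    to get y_k, pick k maximising C_k^2 / G_k, with optimal gain beta = C_k/G_k,
    as derived on the slides."""
    best = (-np.inf, None, None)
    for k, c in enumerate(codebook):
        y = np.convolve(c, h)[:len(x_prime)]  # y_k(n) = c_k(n) convolved with h(n)
        C = x_prime @ y                       # C_k
        G = y @ y                             # G_k
        if G > 0 and C * C / G > best[0]:
            best = (C * C / G, k, C / G)
    _, k, beta = best
    return k, beta

rng = np.random.default_rng(3)
L, N = 64, 40                                  # small codebook for illustration
codebook = rng.standard_normal((L, N))         # stochastic (Gaussian) codewords
h = 0.9 ** np.arange(N)                        # toy impulse response of W(z)/A(z)
target = 0.5 * np.convolve(codebook[17], h)[:N]  # target built from codeword 17
k, beta = search_stochastic_codebook(target, codebook, h)
print(k, round(beta, 2))                       # finds k = 17, beta = 0.5
```

Since the target was constructed from codeword 17 with gain 0.5, the search must recover exactly those values (by the Cauchy-Schwarz inequality no other codeword can score higher).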
Schematic Diagram of a CELP Coder

(Block diagram, repeated.)
Code-Excited Linear Prediction (continued)

➢ The excitation codebook (see figure in the previous slide) contains L code words (stochastic vectors) of length N samples (typically L = 1024 and N = 40 samples, corresponding to a 5 ms excitation frame).
➢ The excitation signal for a speech frame of length N is chosen by an exhaustive search of the codebook.
➢ There are alternative methods available to simplify the codebook search procedure without affecting the quality of the output speech, such as:
  – CELP search procedure using the autocorrelation approach
  – Structured codebooks
  – Sparse excitation codebooks
  – Ternary codebooks
  – Algebraic codebooks
  – Overlapping codebooks
Code-Excited Linear Prediction (continued)

➢ It can be seen from the diagram that CELP coders do not distinguish between voiced and unvoiced frames.
➢ Input speech is divided into non-overlapping frames, typically 20 ms (160 samples) in duration, and ten LPC coefficients or Line Spectrum Pairs (LSP) are calculated for each frame.
➢ A speech frame is divided into sub-frames (normally 4 sub-frames = 4 × 40 samples), and the excitation parameters are determined individually for each sub-frame.
Code-Excited Linear Prediction (continued)

➢ These parameters, along with the LSPs, are quantised before being transmitted to the decoder.
➢ At the decoder, 40 samples are synthesised at any one time.
➢ The fundamental frequency in speech can change more quickly than its spectral content, and hence the parameters g, β, α and k are updated every sub-frame during the analysis.
➢ The bit allocation for a 6.8 kbits/s CELP coder is shown in the following table.

(Table: bit allocation for the 6.8 kbits/s CELP coder; not reproduced here.)
Implementation of a CELP Coder

➢ The table below shows the default analysis conditions that can be used in a CELP coder simulation.
➢ The implementation of the CELP coder is based on the calculation of the parameters; a block diagram of the excitation determination sequence is shown in the next slide. This sequence is repeated for every sub-frame.
(Table of default analysis conditions and block diagram of the excitation determination sequence; not reproduced here.)
Speech Quality Measurement

➢ The original speech and the decoded speech may be compared in two ways:
(a) By calculating the overall signal to noise ratio as follows:

    SNR = 10 log10 [ Σ_n s²(n) / Σ_n (s(n) − ŝ(n))² ]  dB
(b) By calculating the segmental signal to noise ratio as follows:

    SNRseg = (1/K) Σ_{k=1..K} SNR_k  dB

where K is the number of frames and SNR_k is the S/N of the kth frame, calculated using the equation given in the previous slide.
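Both quality measures are straightforward to compute. A sketch using white noise in place of speech and additive noise in place of coding distortion (both illustrative assumptions):

```python
import numpy as np

def snr_db(s, s_hat):
    """Overall SNR between original s(n) and decoded s_hat(n)."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

def segmental_snr_db(s, s_hat, frame=160):
    """Segmental SNR: the mean of per-frame SNRs, which weights quiet frames
    more fairly than the single overall SNR does."""
    snrs = [snr_db(s[i:i + frame], s_hat[i:i + frame])
            for i in range(0, len(s) - frame + 1, frame)]
    return float(np.mean(snrs))

rng = np.random.default_rng(4)
s = rng.standard_normal(1600)                 # stand-in for original speech
s_hat = s + 0.1 * rng.standard_normal(1600)   # decoded speech with coding noise
print(round(snr_db(s, s_hat), 1), round(segmental_snr_db(s, s_hat), 1))
```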