ELEC9344: Speech & Audio Processing

Chapter 4

Speech Coding I
Speech Coding

➢ Speech coding is a technique used to compress a speech signal for digital storage or transmission purposes, and then decompress the stored or transmitted data, so as to allow a perceptually faithful reproduction of the original signal.
➢ Sophisticated speech coding algorithms exist, so that speech signals may be compressed at various bit rates.
➢ In general, the lower the bit rate, the lower the quality of the recovered speech.
➢ However, there is a constant quest to achieve better speech quality at lower bit rates.
Speech Coders

➢ There are basically three classes of speech coders:
  • Waveform coders
  • Source coders
  • Hybrid coders
➢ Waveform coders accept continuous analogue speech signals and encode them into digital signals prior to transmission. Decoders at the receiver reverse the encoding process to recover the speech signals. In the absence of transmission errors, the recovered speech waveforms closely resemble the original speech waveforms.
Waveform Encoding Techniques

➢ The waveform encoding techniques are:
  – PCM (Pulse Code Modulation)
  – DPCM (Differential Pulse Code Modulation)
  – ADPCM (Adaptive Differential Pulse Code Modulation)
  – DM (Delta Modulation)
  – ADM/CVSD (Adaptive Delta Modulation, or Continuously Variable-Slope Delta Modulation)
➢ These time-domain coding algorithms have their counterparts in the frequency domain:
  – SBC (Sub-Band Coding)
  – ATC (Adaptive Transform Coding)
PCM

➢ The simplest waveform coding method is linear pulse code modulation. The analogue signal is quantised linearly (uniform quantisation: all steps have equal size ∆V). No compression is involved.
➢ For a B-bit quantiser, it can be shown that the Signal to Quantisation Noise Ratio (SQNR) is given by:

SQNR = 6.02B + 1.76 dB

➢ The SQNR increases by approximately 6 dB for each additional bit.
➢ Linear quantisation assumes a uniform signal PDF.

[Figure: linear (uniform) quantisation of a speech waveform with quantisation steps ∆V]
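As a quick check of the 6 dB-per-bit rule, the following sketch (not from the notes; a full-scale sine is used as the test signal, for which the formula holds closely) quantises a signal uniformly and compares the measured SQNR with the formula:

```python
import numpy as np

def uniform_quantise(x, bits):
    """Mid-rise uniform quantiser for signals in [-1, 1)."""
    levels = 2 ** bits
    step = 2.0 / levels                      # step size delta-V
    idx = np.clip(np.floor(x / step), -levels // 2, levels // 2 - 1)
    return (idx + 0.5) * step

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)              # full-scale sine test signal

for bits in (8, 12, 16):
    y = uniform_quantise(x, bits)
    sqnr = 10 * np.log10(np.sum(x**2) / np.sum((x - y)**2))
    print(f"B={bits:2d}: measured {sqnr:5.1f} dB, formula {6.02*bits + 1.76:5.1f} dB")
```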
Non-Uniform PCM

➢ Speech signals are heavily concentrated at low amplitudes, hence it is a much better strategy to use a non-uniform quantiser in which the steps are densest at the low levels.
➢ The speech PDF is not uniform but approximately Gaussian, and low-level speech occurs more frequently than high-level speech. Therefore more quantisation noise is tolerable at high levels of speech than at low levels.

[Figure: non-linear quantisation of a speech waveform; the Gaussian speech PDF contrasted with a uniform PDF]
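Non-uniform quantisation is usually realised by companding: compress with a logarithmic curve, quantise uniformly, then expand. A minimal sketch of µ-law companding follows (using the continuous µ = 255 curve; the G.711 standard actually uses a piecewise-linear approximation of it):

```python
import numpy as np

MU = 255.0  # mu-law constant used in North American PCM

def mu_compress(x):
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_expand(y):
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def quantise(x, bits):
    step = 2.0 / 2**bits
    return np.clip(np.round(x / step) * step, -1.0, 1.0 - step)

# Low-level signals keep far more resolution after companding:
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(8000)          # quiet, speech-like amplitudes
direct    = quantise(x, 8)
companded = mu_expand(quantise(mu_compress(x), 8))
for name, y in (("uniform", direct), ("mu-law", companded)):
    snr = 10 * np.log10(np.sum(x**2) / np.sum((x - y)**2))
    print(f"{name:8s}: {snr:5.1f} dB")
```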
DPCM

➢ By taking advantage of the considerable amount of redundancy in the speech signal, the bit rate may be reduced.
➢ For example, the difference of the speech signal, e(n) = s(n) − s(n−1), has a lower variance (and hence lower power) than the signal itself.
➢ If the power is reduced, then we should be able to encode the signal with fewer bits and thereby reduce the required bit rate.

e(n) = s(n) − s(n−1)

E(z) = S(z)(1 − z^{-1})
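A quick numerical illustration of the variance reduction (a sketch only; an AR(1) process stands in for correlated speech samples):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
# AR(1) process s(n) = 0.95 s(n-1) + w(n) as a stand-in for speech
s = lfilter([1.0], [1.0, -0.95], rng.standard_normal(16000))

e = np.diff(s)                               # e(n) = s(n) - s(n-1)
gain_db = 10 * np.log10(np.var(s) / np.var(e))
print(f"variance reduction: {gain_db:.1f} dB")  # about 10 dB for a = 0.95
```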
DPCM

➢ The transmitter computes e(n) and this difference is transmitted through the channel to the receiver (see below), where it is integrated in the feedback loop and the original signal is restored.

[Figure: DPCM transmitter (differencer) and receiver (integrator) connected by the channel; input S(z), output Y(z)]

E(z) = S(z)(1 − z^{-1})

Y(z) = E(z)/(1 − z^{-1}) = S(z)
[Figure: DPCM with quantisation of the difference signal; the quantisation noise is modelled by additive noise Nq(z) between input S(z) and output Y(z)]

Y(z) = S(z) + Nq(z)/(1 − z^{-1})

➢ Thus the quantisation noise is integrated at the receiver, and this build-up of noise is found to be more disturbing than the same noise merely added to the original speech signal.
Differential quantisation with the quantiser inside a feedback loop at the transmitter

[Figure: DPCM transmitter with the quantiser inside the feedback loop; input S(z), output Y(z)]

E(z) = S(z) − C(z) + Nq(z)

E(z) = [S(z) + Nq(z)](1 − z^{-1})

C(z) = E(z) z^{-1} / (1 − z^{-1})

Y(z) = S(z) + Nq(z)

➢ The quantisation noise is now being differenced along with s(n), so at the receiver it is no longer integrated; it merely adds to the reconstructed speech.
➢ The next refinement is to recognise that simple differencing does not minimise the error of the output. Replace the simple differencer by a full pth-order linear predictor P(z).

[Figure: predictive quantisation with the pth-order predictor P(z) inside the feedback loop; input S(z), output Y(z)]

E(z) = [S(z) + Nq(z)](1 − P(z))

C(z) = E(z) P(z) / (1 − P(z))

Y(z) = S(z) + Nq(z)

➢ Using such a predictor, a reduction in variance of 6 dB relative to PCM has been reported (p = 2). A 6 dB reduction in variance means that one less bit can be used in quantising the residual than would be required if the original signal were sent.
ADPCM

➢ The compression ratio of DPCM can be further improved by adaptive prediction. Adaptive prediction in its simplest form is simply linear prediction.
➢ The speech data is segmented into frames of typically 10 to 20 ms and the predictor coefficients are transmitted along with the residual (see next slide).
➢ At the receiver, an inverse filter controlled by the predictor coefficients reconstructs the speech.
➢ It is reported that an S/N improvement of more than 12 dB can be achieved with a 10th-order linear predictor.
ADPCM

[Figure: ADPCM encoder/decoder block diagram; input S(z), output Y(z)]
Summary of Waveform Coders

➢ A-law or µ-law PCM: 8-bit compressed PCM, 8 kHz sampling rate; 64 kbps
➢ DPCM: 32 kbps
➢ ADPCM: 24–32 kbps
Source Coders

➢ Source coders such as Linear Predictive Coding (LPC) vocoders, which are based on the speech production model, operate at bit rates as low as 2 kbit/s. However, the synthetic quality of the vocoded speech is not broadly appropriate for commercial telephone applications.

➢ LPC in its basic form has mainly been used in secure military communications, where speech must be carried at very low bit rates. A US standard algorithm known as LPC-10 is widely used.
LPC-10 Algorithm

➢ The input speech is sampled at 8 kHz and digitised as 12-bit two's complement words. The incoming speech is partitioned into 180-sample frames (22.5 ms), resulting in a frame rate of 44.44 frames/s.
➢ A tenth-order linear predictor is used to compute the vocal tract filter parameters (reflection coefficients).
➢ Pitch and voicing are determined using the AMDF (Average Magnitude Difference Function) algorithm.
LPC-10 Algorithm

➢ The bit assignments used are: 5 bits each for k1 to k4, 4 bits each for k5 through k8, 3 bits for k9 and 2 bits for k10.
➢ In the transmitted bit stream, 41 bits are used for the reflection coefficients, 7 bits for pitch and voicing, and 5 bits for amplitude.
➢ One bit is used for synchronisation, giving a total of 54 bits per frame.
➢ Since there are 44.44 frames/s, this gives an overall bit rate of 2.4 kbit/s.
Limitation of LPC Vocoding

➢ The main limitation of LPC vocoding is the assumption that speech signals are either voiced or unvoiced; hence the source of excitation for the all-pole synthesis filter is either a train of pulses (for voiced speech) or random noise (for unvoiced speech).
➢ In fact there are more than two modes in which the vocal tract is excited, and often these modes are mixed.
➢ Even when the speech waveform is voiced, it is a gross simplification to assume that there is only one point of excitation in the entire pitch period.
Hybrid Coders

➢ Hybrid coders combine features from both source coders and waveform coders. Several hybrid coders employ an analysis-by-synthesis process in order to derive the coder parameters.
➢ A block diagram of an analysis-by-synthesis coding system is shown below.
➢ A local decoder is present inside the encoder, and the aim of the analysis-by-synthesis speech coding structure is to derive the codec parameters so that the difference between the input and synthesised signal is minimised according to some suitable error minimisation criterion.
General Model for the Analysis-by-Synthesis Coding Structure

[Figure: general analysis-by-synthesis coding structure, shown alongside the LPC vocoder]

➢ The coder is capable of producing high quality speech at bit rates around 10 kbit/s.
➢ In this new generation of analysis-by-synthesis models, no prior knowledge of a voiced/unvoiced decision or pitch period is needed. The excitation is modelled by a number of pulses, usually 4 per 5 ms, whose amplitudes and positions are determined by the perceptually weighted error between the original and synthesised speech.
➢ The multipulse approach assumes that both the pulse positions and amplitudes are initially unknown; they are then determined inside the minimisation loop, one pulse at a time.
Hybrid Coders……

➢ When the bit rate is reduced below 9.6 kbit/s, multi-pulse excited coders fail to maintain good speech quality.
➢ This is due to the large number of bits needed to encode the excitation pulses; the quality deteriorates when these pulses are coarsely quantised.
➢ Therefore, if the analysis-by-synthesis structure is to be used for producing good quality speech at bit rates below 8 kbit/s, more suitable approaches to the definition of the excitation signal must be sought.
➢ The Code-Excited Linear Prediction (CELP) coder provided an improved excitation signal.
Speech Coding Standards

➢ For speech coding to be useful in telecommunication applications, it has to be standardised (it must conform to the same algorithm and bit format).
➢ Speech-coding standards are established by various standards organisations, for example ITU-T (International Telecommunication Union, Telecommunication Standardisation Sector, formerly CCITT) and ETSI (European Telecommunications Standards Institute).
ITU Standards

➢ G.711/G.712: 64 kbit/s A/µ-law PCM
➢ G.726: 32 kbit/s ADPCM
➢ G.728: 16 kbit/s LD-CELP (Low Delay CELP)
➢ G.729: 8 kbit/s CS-ACELP (Conjugate Structure Algebraic CELP)
➢ The next standardisation issue involves a 4 kbit/s speech coding algorithm. The main terms of reference for the ITU 4 kbit/s speech coding standard are given below.
[Table: Terms of Reference for the ITU 4 kbit/s speech coding standard]
Low Bit-Rate Speech Coding

➢ It is important to develop low bit-rate speech coding systems for voice storage applications such as voice mail and voice-driven directory enquiries.
➢ Therefore, research on speech coding methods whose bit rate is less than 4 kbit/s but which still provide toll quality speech has recently become very active.
➢ The ultimate goal is to design a low-delay, low bit-rate codec (around 2.4 kbit/s, operating in real time) with good perceptual quality for digital mobile communications and other applications.
Analysis-by-Synthesis Speech Coding

➢ As the Code-Excited Linear Prediction (CELP) coder is based on the analysis-by-synthesis technique, the theory behind the analysis-by-synthesis method is explained first and then extended to CELP coding.

➢ The basic structure of the general model for analysis-by-synthesis predictive coding of speech is shown below.
General Model for Analysis-by-Synthesis Linear Predictive Coding

[Figure: general analysis-by-synthesis LPC model]

➢ The model consists of four main parts:
  – Excitation generator
  – Synthesis filter
  – Error weighting filter
  – Error minimisation
The Short-Term Predictor (Synthesis Filter)

➢ The synthesis filter is an all-pole time-varying filter for modelling the short-time spectral envelope of the speech waveform.
➢ It is often called a short-term prediction filter because its coefficients are computed by predicting a speech sample from a few (8 to 16) previous samples.
➢ The set of coefficients is called the Linear Predictive Coding (LPC) parameters.
➢ The spectral envelope of a speech segment of length L samples can be approximated by the transfer function of an all-pole digital filter of the form

H(z) = G / (1 − Σ_{k=1}^{p} a_k z^{-k})

➢ The coefficients a_k are computed using the method of linear prediction. The number of coefficients, p, is called the predictor order.
➢ The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of the p past speech samples, i.e.

ŝ(n) = Σ_{k=1}^{p} a_k s(n−k)
h
^
¾ Where s(n) is the speech sample and S (n) is the

ia ja
predicted speech sample at sampling instant n. The

a l ra
prediction error, ε (n) is defined as

tr ai
us ikb
Am
¾The prediction error signal, ε (n) is shown in
the diagram:
SW .
,A
N E
U or
ss
o fe
Pr

30
➢ It can be seen from the above figure that when a_k = α_k, then ε(n) = G·u(n), and the linear prediction filter is called an inverse filter. The transfer function of the inverse filter is given by

A(z) = 1 − Σ_{k=1}^{p} a_k z^{-k}

➢ For voiced speech this means that ε(n) would consist of a train of impulses; ε(n) would be small most of the time except at the beginning of each pitch period.
➢ An example of a signal s(n) and its prediction error for the vowel 'a' is shown below.

[Figure: waveform of the vowel 'a' and the corresponding prediction error signal]

➢ Because of the time-varying nature of speech, the prediction coefficients should be estimated from short segments of the speech signal (10–20 ms).

➢ The basic approach is to find the set of predictor coefficients a_k that minimises the mean-squared prediction error over a short segment of the speech waveform.
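A minimal sketch of this short-segment LPC analysis (autocorrelation method; the Hamming window and order p = 10 are typical choices, not mandated by the notes):

```python
import numpy as np

def lpc_autocorrelation(frame, p=10):
    """Return LPC coefficients a_1..a_p minimising the MSE over one frame."""
    w = frame * np.hamming(len(frame))
    r = np.array([np.dot(w[:len(w)-k], w[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

def inverse_filter(frame, a):
    """Prediction error eps(n) = s(n) - sum_k a_k s(n-k), for n >= p."""
    p = len(a)
    eps = frame[p:].copy()
    for k in range(1, p + 1):
        eps -= a[k - 1] * frame[p - k:len(frame) - k]
    return eps

# usage on one 20 ms frame at 8 kHz (160 samples of some speech array `speech`):
# a = lpc_autocorrelation(speech[:160]); eps = inverse_filter(speech[:160], a)
```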
General Model for Analysis-by-Synthesis Linear Predictive Coding

[Figure: general analysis-by-synthesis LPC model (repeated)]
The Error Weighting Filter

➢ The function of the perceptual error weighting filter W(z) in the above diagram is to reduce a large part of the perceived noise in the coder, which comes from the frequency regions where the signal level is low.
➢ The theory of auditory masking suggests that noise in the formant regions would be partially or totally masked by the speech signal.
➢ Therefore, to reduce the perceived noise, the error spectrum E(z) is shaped (see figure on the previous slide) so that the frequency components of the noise around the formant regions are allowed to have higher energy relative to the components in the inter-formant regions.
➢ The coefficients for the error weighting filter are derived from the LPC coefficients.
➢ The weighting filter W(z) can be expressed as

W(z) = A(z) / A(z/γ_D)

where

A(z/γ_D) = 1 − Σ_{k=1}^{p} a_k γ_D^k z^{-k}

➢ The value γ_D is determined by the degree to which one wishes to deemphasize the formant regions in the error spectrum E(z).
➢ Note that decreasing γ_D from 1 increases the bandwidth of the poles of W(z).
➢ If γ_D = 1, then W(z) = 1, which is equivalent to no weighting.
➢ A good choice is to use a value of γ_D between 0.8 and 0.9.
➢ In this coder γ_D is chosen as 0.9.
➢ Decreasing γ_D increases the bandwidth of the poles of A(z)/A(z/γ_D); the increase in bandwidth ∆f is given by

∆f = −(f_s / π) ln(γ_D)  Hz

where f_s is the sampling frequency.
➢ The figure below shows an example of the spectrum of 1/A(z) and A(z)/A(z/γ_D) for different values of γ_D.
➢ An increase in the bandwidth of the poles is noticeable when γ_D is changed from 0.9 to 0.8.

[Figure: spectrum of 1/A(z) and A(z)/A(z/γ) for different values of γ]
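The weighting filter is cheap to realise, since A(z/γ) simply scales the kth LPC coefficient by γ^k. A sketch (the 2nd-order coefficients used in the example are illustrative):

```python
import numpy as np
from scipy.signal import lfilter

def weighted_error(err, a, gamma=0.9):
    """Apply W(z) = A(z)/A(z/gamma) to an error signal.

    a: LPC coefficients a_1..a_p of A(z) = 1 - sum_k a_k z^-k.
    """
    num = np.concatenate(([1.0], -a))                               # A(z)
    den = np.concatenate(([1.0], -a * gamma ** np.arange(1, len(a) + 1)))
    return lfilter(num, den, err)

# illustrative 2nd-order example on a random error signal
a = np.array([1.3, -0.6])
e = np.random.randn(160)
ew = weighted_error(e, a, gamma=0.9)
```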
The Error Minimization

[Figure: analysis-by-synthesis loop showing the excitation generator, synthesis filter, error weighting filter and error minimisation]

➢ The excitation generator produces the excitation sequence, which is fed to the synthesis filter to produce the reconstructed speech.
The Error Minimization

➢ The most common error minimisation criterion is the mean squared error.
➢ In this model (see previous slide) a subjectively meaningful error minimisation criterion is used, where the error e(n) is passed through a perceptual weighting filter.
➢ The excitation sequence is optimised by minimising the perceptually weighted error between the original and synthesised speech.
The Encoder

➢ In the encoding procedure, the synthesis filter parameters (LPC coefficients) are determined from the speech samples (a frame of 20 ms of speech = 160 samples) outside the optimisation loop.
➢ The optimum excitation sequence for this synthesis filter is then determined by minimising the weighted error criterion.
➢ The excitation optimisation interval is usually in the range of 4 to 7.5 ms, which is less than the LPC coefficient update frame size.
➢ Thus the speech frame is divided into sub-frames (normally four 5 ms sub-frames = 4 × 40 samples), where the excitation is determined individually for each sub-frame.

➢ A new set of LPC coefficients is transmitted every 20 ms, although interpolation of the LPC parameters between adjacent frames can be used to obtain a different set of parameters for every sub-frame.
➢ This interpolation enables the updating of the filter parameters every 5 ms while transmitting them to the decoder only every 20 ms, i.e. without requiring the higher bit rate associated with shorter update frames.
➢ The set of predictor coefficients a_k cannot be used for interpolation, because the interpolated parameters would not guarantee a stable synthesis filter.
➢ The interpolation is therefore performed using a transformed set of parameters for which filter stability can easily be guaranteed, such as the Log-Area Ratios (LAR), Line Spectral Pairs (LSP) or Line Spectrum Frequencies (LSF).
➢ If F_N is the quantised LPC vector in the present frame and F_{N−1} is the quantised LPC vector from the past frame, then the interpolated LPC vector SF_k in sub-frame k is given by

SF_k = δ_k F_{N−1} + (1 − δ_k) F_N

where δ_k is a fraction between 0 and 1.

➢ The variable δ_k is gradually decreased with the sub-frame index.
➢ The figure in the next slide shows two frames and the corresponding sub-frames.
➢ As a specific example, a good choice of values of δ_k is 0.75, 0.5, 0.25 and 0, for k = 1, …, 4.
➢ Using these values, the interpolated LPC vectors in the four sub-frames are given by

SF_1 = 0.75 F_{N−1} + 0.25 F_N
SF_2 = 0.50 F_{N−1} + 0.50 F_N
SF_3 = 0.25 F_{N−1} + 0.75 F_N
SF_4 = F_N

[Figure: two adjacent frames and their sub-frames, showing the interpolation]

➢ Interpolation in this manner, however, increases the overall codec delay.
The Long-Term Predictor

➢ While the short-term predictor models the spectral envelope of the speech segment being analysed, the long-term predictor, or pitch predictor, is used to model the fine (harmonic) structure of the spectrum.
➢ After inverse filtering, the prediction error ε(n) still exhibits some periodicity related to the pitch period of the original speech when it is voiced.
➢ This periodicity is of the order of 20 to 147 samples (55 to 400 Hz).
The Long-Term Predictor…..

➢ Adding the pitch predictor to the inverse filter further removes redundancy in the residual signal and turns it into a noise-like process (see below).

[Figure: short-term inverse filter followed by the pitch predictor, producing the noise-like residual r(n)]
The Long-Term Predictor…..

➢ It is called a pitch predictor since it removes the pitch periodicity, or a long-term predictor since the predictor delay is between 20 and 147 samples.
➢ The output signal r(n) after the pitch predictor is very close to random Gaussian noise.
➢ The transfer function of the pitch predictor is given by

P(z) = Σ_{k=−m1}^{m2} b_k z^{−(α+k)}
The Long-Term Predictor……

➢ For m1 = m2 = 0 a one-tap predictor is obtained; for m1 = m2 = 1 a three-tap predictor results.
➢ The delay α usually represents the pitch period in samples (or a multiple of it), and the b_k represent the long-term predictor coefficients.
➢ The pitch predictor is not an essential requirement in medium bit-rate LPC coders such as multi-pulse excited LPC and regular-pulse excited LPC coders; however, including a pitch predictor improves their performance.
➢ The pitch predictor is essential in low bit-rate speech coders, such as those using the CELP technique.
➢ A one-tap pitch predictor is shown below and is given by

P(z) = b_0 z^{−α}

where ε(n), the input in the figure, is the residual after short-term prediction (inverse filtering).

[Figure: one-tap pitch prediction of the short-term residual ε(n), producing r(n)]

➢ For the one-tap predictor, the residual r(n) is given by

r(n) = ε(n) − b_0 ε(n−α)
Open-Loop Method

➢ The parameters α and b_0 are determined by minimising the mean squared residual error after short-term and long-term prediction over a period of N samples.
➢ The mean squared residual E is given by

E = Σ_{n=0}^{N−1} [ε(n) − b_0 ε(n−α)]²

➢ Setting ∂E/∂b_0 = 0 gives

b_0 = Σ_{n=0}^{N−1} ε(n) ε(n−α) / Σ_{n=0}^{N−1} ε²(n−α),   with b_0 ≤ 1
➢ Substituting b_0 into the mean squared residual equation gives

E = Σ_{n=0}^{N−1} ε²(n) − [Σ_{n=0}^{N−1} ε(n) ε(n−α)]² / Σ_{n=0}^{N−1} ε²(n−α)

➢ Minimising E requires maximising the normalised correlation term (the second term) in this equation.
➢ This term is computed for all possible values of α over its specified range, and the value of α which maximises this term (i.e. minimises E) is chosen.
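A direct implementation of this open-loop search over the allowed lag range (20 to 147 samples, matching the text; a sketch that searches over one whole residual buffer rather than frame-plus-history):

```python
import numpy as np

def open_loop_pitch(eps, lag_min=20, lag_max=147):
    """Return (alpha, b0) maximising the normalised correlation term."""
    best_alpha, best_score = lag_min, -np.inf
    for alpha in range(lag_min, lag_max + 1):
        x, y = eps[alpha:], eps[:-alpha]     # eps(n) and eps(n - alpha)
        num = np.dot(x, y) ** 2
        den = np.dot(y, y)
        if den > 0 and num / den > best_score:
            best_score, best_alpha = num / den, alpha
    y = eps[:-best_alpha]
    b0 = np.dot(eps[best_alpha:], y) / np.dot(y, y)
    return best_alpha, b0

# usage: alpha, b0 = open_loop_pitch(residual)   # residual = eps(n) buffer
```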
➢ The general model for analysis-by-synthesis linear predictive coding given previously may be modified to include a long-term (pitch) predictor.

[Figure: analysis-by-synthesis LPC model including both the short-term and the long-term predictor]
➢ It was shown above that the pitch predictor parameters b_0 and α can be calculated directly from the LPC residual signal ε(n).
➢ This is known as the open-loop method for calculating the pitch predictor parameters.
➢ However, a significant improvement is achieved when the pitch predictor parameters b_0 and α are optimised inside the analysis-by-synthesis loop. In this case, the computation of the parameters contributes directly to the weighted error minimisation procedure.
Closed-Loop Method

➢ A different configuration (see next slide) can be derived from the general model for analysis-by-synthesis LPC coding with a long-term predictor, which is useful for closed-loop analysis.
➢ The weighting filter W(z) is given by

W(z) = A(z) / A(z/γ)
[Figure: modified analysis-by-synthesis configuration used for closed-loop long-term predictor analysis]
➢ In this new configuration, the original and the synthesised speech are weighted separately before subtraction.

[Figure: closed-loop configuration; s(n) and the synthesised speech are each passed through W(z) before the error is formed]

➢ Assuming one-tap long-term prediction, the output of the pitch synthesis filter is given by

ê(n) = v(n) + b_0 ê(n−α)

where v(n) denotes the (yet undetermined) excitation.

➢ It is first assumed that no excitation has been determined, so the above equation reduces to

ê(n) = b_0 ê(n−α)
➢ The weighted synthesised speech ŝ_w(n) is given by

ŝ_w(n) = b_0 Σ_{i=0}^{n} ê(i−α) h(n−i) + ŝ_0(n)

where h(n) is the impulse response of the combined filter W(z)/A(z) and ŝ_0(n) is the zero-input response of the combined filter W(z)/A(z), i.e. the output of the combined filter due to its initial states.

➢ The weighted error between the original and synthesised speech is given by

e_w(n) = s_w(n) − ŝ_w(n)
➢ By combining the expression for ŝ_w(n) with e_w(n) = s_w(n) − ŝ_w(n) we obtain

e_w(n) = x(n) − b_0 y_α(n)

where x(n) = s_w(n) − ŝ_0(n) and y_α(n) = Σ_{i=0}^{n} ê(i−α) h(n−i).

➢ The mean squared weighted error is given by

E_w = Σ_{n=0}^{N−1} [x(n) − b_0 y_α(n)]²
➢ Setting ∂E_w/∂b_0 = 0 gives

b_0 = Σ_{n=0}^{N−1} x(n) y_α(n) / Σ_{n=0}^{N−1} y_α²(n)

➢ Substituting b_0 we obtain

E_w = Σ_{n=0}^{N−1} x²(n) − [Σ_{n=0}^{N−1} x(n) y_α(n)]² / Σ_{n=0}^{N−1} y_α²(n)

➢ A significant speech quality improvement over the open-loop solution is achieved when the long-term predictor parameters b_0 and α are computed inside the optimisation loop using the above equations.
➢ The disadvantage of the closed-loop solution is the extra computation needed to evaluate the convolution y_α(n) over the whole range of α.
➢ However, a fast procedure to compute this convolution y_α(n) for all possible delays is available.
➢ Normally α is calculated using the open-loop method and b_0 is calculated using the closed-loop method.
[Figure: a basic structure for high quality analysis-by-synthesis LPC coding]
Code-Excited Linear Prediction

➢ The residual signal after short-term and long-term prediction becomes noise-like, and it is assumed that this residual can be modelled by a zero-mean Gaussian process with a slowly varying power spectrum.
➢ As a result, the excitation frame may be vector quantised using a large stochastic codebook.
➢ In the CELP approach, a 5 ms (40 samples) excitation frame is modelled by a Gaussian vector chosen from a large Gaussian codebook by minimising the perceptually weighted error between the original and synthesised speech.
Code-Excited Linear Prediction ……..

➢ Usually a codebook of 1024 entries is needed, and the optimum innovation sequence is chosen by an exhaustive search.
➢ This exhaustive search of the excitation codebook greatly complicates the CELP implementation.
➢ The excitation generator in the previous slide(s) can now be replaced by a stochastic codebook with a gain β.
➢ This is shown in the next slide.
A basic structure for high quality analysis-by-synthesis LPC coding

[Figure: the speech synthesis section of the analysis-by-synthesis structure]
Code-Excited Linear Prediction ..

➢ The stochastic codebook is sometimes referred to as a fixed codebook, and the gain β is referred to as a fixed gain.
➢ The speech synthesis section (see previous slide) is redrawn in the diagram below to include the stochastic codebook.

[Figure: speech synthesis section with the stochastic codebook and gain β]
Code-Excited Linear Prediction ..

➢ Using P(z) = 1 − b_0 z^{−α}, the diagram in the previous slide can be modified as shown below.
➢ Here α is the pitch period.

[Figure: synthesis section with the long-term predictor redrawn as a chain of delay cells]
Code-Excited Linear Prediction ..

➢ It can be seen that each delay cell contains 40 samples and these values are updated every 5 ms (the sub-frame delay).
➢ The long-term predictor can now be replaced by an adaptive codebook, and the appropriate address selected from this codebook will correspond to the pitch delay α.
➢ The gain b_0 is now denoted by the adaptive gain g. The figure in the previous slide is now redrawn to include the adaptive codebook (see next slide).
Code-Excited Linear Prediction (continued)

➢ The figure in the next slide shows a schematic diagram of the CELP coder including the adaptive codebook.
➢ The address selected from the adaptive codebook (α), the corresponding gain (g), the address (k) selected from the stochastic codebook and the corresponding gain (β) are sent to the decoder.
➢ The decoder uses identical codebooks to determine the excitation signal at the input of the short-term predictor in order to produce the reconstructed speech.
Schematic Diagram of the CELP Synthesis Model

[Figure: CELP synthesis model with adaptive and stochastic codebooks, gains g and β, and the short-term synthesis filter]
Schematic Diagram of a CELP Coder

[Figure: CELP coder (analysis-by-synthesis encoder)]
Code-Excited Linear Prediction …

➢ The first stage in determining the excitation parameters is to calculate the pitch delay α and the adaptive gain g.
➢ Then the equations for the stochastic codebook gain β and the best excitation codeword (k) are derived.
➢ Using the figure in the previous slide, the weighted synthesised speech can be written as

ŝ_w(n) = g y_α(n) + β Σ_{i=0}^{n} c_k(i) h(n−i) + ŝ_0(n)

where y_α(n) is the adaptive codebook contribution filtered through W(z)/A(z) and c_k(n) is the kth stochastic codebook vector.
➢ The weighted error between the original and the synthesised speech is given by:

e_w(n) = s_w(n) − ŝ_w(n)

➢ By combining the above two equations we obtain:

e_w(n) = s_w(n) − ŝ_0(n) − g y_α(n) − β Σ_{i=0}^{n} c_k(i) h(n−i)

➢ Assume first that no stochastic excitation has been determined, i.e. c_k(n) = 0.
➢ Therefore the above equation is rewritten as

e_w(n) = x(n) − g y_α(n),   where x(n) = s_w(n) − ŝ_0(n)

➢ The mean squared weighted error is given by

E_w = Σ_{n=0}^{N−1} [x(n) − g y_α(n)]²

➢ Setting ∂E_w/∂g = 0 gives
g = Σ_{n=0}^{N−1} x(n) y_α(n) / Σ_{n=0}^{N−1} y_α²(n)

➢ Substituting g into the equation given in the previous slide results in

E_w = Σ_{n=0}^{N−1} x²(n) − [Σ_{n=0}^{N−1} x(n) y_α(n)]² / Σ_{n=0}^{N−1} y_α²(n)
➢ The stochastic codebook parameters β and k are derived in the following sequence. With g and α now fixed, the remaining weighted error is

e_w(n) = x′(n) − β w_k(n)

where x′(n) = x(n) − g y_α(n) and w_k(n) = Σ_{i=0}^{n} c_k(i) h(n−i).

➢ The mean squared weighted error is given by

E_w = Σ_{n=0}^{N−1} [x′(n) − β w_k(n)]²
➢ Setting ∂E_w/∂β = 0 gives

β = Σ_{n=0}^{N−1} x′(n) w_k(n) / Σ_{n=0}^{N−1} w_k²(n)

➢ E_w now becomes

E_w = Σ_{n=0}^{N−1} x′²(n) − [Σ_{n=0}^{N−1} x′(n) w_k(n)]² / Σ_{n=0}^{N−1} w_k²(n)
➢ Letting H denote the lower-triangular convolution matrix formed from the impulse response h(n), so that w_k = H c_k, leads to the following matrix form:

Σ_n x′(n) w_k(n) = ψᵀ c_k,   where ψ = Hᵀ x′

➢ Similarly,

Σ_n w_k²(n) = c_kᵀ Φ c_k,   where Φ = Hᵀ H
Schematic Diagram of a CELP Coder

[Figure: CELP coder (repeated)]
Code-Excited Linear Prediction ..

➢ The excitation codebook (see figure in the previous slide) contains L codewords (stochastic vectors) of length N samples (typically L = 1024 and N = 40 samples, corresponding to a 5 ms excitation frame).
➢ The excitation signal for a speech frame of length N is chosen by an exhaustive search of the codebook.
➢ This is a computationally demanding procedure which is difficult to implement in real time.
➢ There are alternative methods to simplify the codebook search procedure without affecting the quality of the output speech, such as:
  – CELP search procedure using the autocorrelation approach
  – Structured codebooks
  – Sparse excitation codebooks
  – Ternary codebooks
  – Algebraic codebooks
  – Overlapping codebooks
Code-Excited Linear Prediction

➢ It can be seen from the diagram that CELP coders do not distinguish between voiced and unvoiced frames.
➢ Input speech is divided into non-overlapping frames, typically 20 ms (160 samples) in duration, and ten LPC coefficients or Line Spectral Pairs (LSP) are calculated for each frame.
➢ A speech frame is divided into sub-frames (normally 4 sub-frames = 4 × 40 samples) and the parameters g, α, β and k given by the previous equations are calculated for each sub-frame.
Code-Excited Linear Prediction …

➢ These parameters, along with the LSPs, are quantised before being transmitted to the decoder.
➢ At the decoder, 40 samples are synthesised at a time.
➢ The fundamental frequency in speech can change more quickly than its spectral content, hence the parameters g, β, α and k are calculated more frequently than the LPC analysis.
➢ The bit allocation for a 6.8 kbit/s CELP coder is shown in the following table.

[Table: bit allocation for a 6.8 kbit/s CELP coder]
Implementation of a CELP Coder

➢ The table below shows the default analysis conditions that can be used in a CELP coder simulation.
➢ The implementation of the CELP coder is based on the calculation of the parameters described above; a block diagram of the excitation sequence determination is shown in the next slide. This procedure is repeated for every sub-frame.

[Table: default analysis conditions for a CELP coder simulation]
[Figure: block diagram of the excitation sequence determination in the CELP coder]
Speech Quality Measurement

➢ The original speech and the decoded speech may be compared in two ways:

(a) By calculating the overall signal-to-noise ratio as follows:

SNR = 10 log_{10} [ Σ_n s²(n) / Σ_n (s(n) − ŝ(n))² ]  dB

where s(n) is the original and ŝ(n) the decoded speech.
(b) By calculating the segmental signal-to-noise ratio as follows:

SNRseg = (1/K) Σ_{k=1}^{K} SNR_k  dB

where K is the number of frames and SNR_k is the S/N of the kth frame, calculated using the equation given in the previous slide.
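Both measures are straightforward to compute; a sketch (the 160-sample frame length and the skipping of silent frames are illustrative choices, not specified in the notes):

```python
import numpy as np

def snr_db(s, s_hat):
    """Overall signal-to-noise ratio in dB."""
    return 10 * np.log10(np.sum(s**2) / np.sum((s - s_hat)**2))

def segmental_snr_db(s, s_hat, frame_len=160):
    """Mean of per-frame SNRs (segmental SNR) in dB."""
    snrs = []
    for i in range(0, len(s) - frame_len + 1, frame_len):
        sf, ef = s[i:i+frame_len], s_hat[i:i+frame_len]
        if np.sum(sf**2) > 0:                # skip all-zero (silent) frames
            snrs.append(snr_db(sf, ef))
    return np.mean(snrs)
```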
