
Feature Extraction in Speech Coding and

Recognition
Report of PhD research internship in ASP Group, OGI-OHSU,
2001/2002

Petr Motlíček

March 19, 2003 — version 1.0


Contents

1 Introduction
  1.1 Speech recognition tasks
    1.1.1 SpeechDat-Car databases
    1.1.2 AURORA2-TIdigits

2 Static feature extraction
  2.1 Mel-frequency cepstral (MFCC) analysis (experiments)
  2.2 Application of an all-pole modeling with additional analysis in speech recognition (experiments)
    2.2.1 Spectral analysis, Critical band warping, equal-loudness, power law
    2.2.2 Linear prediction
      2.2.2.1 Autoregressive modeling
      2.2.2.2 Computation of gain factor G
      2.2.2.3 Properties of the model spectrum
      2.2.2.4 Definition of an all-pole model for a_0 ≠ 1
      2.2.2.5 Normalized minimum error
      2.2.2.6 Estimation of the order of all-pole model
      2.2.2.7 Cepstral analysis-basic PLP
      2.2.2.8 Experimental choice of a value of p
      2.2.2.9 Spectral sensitivity of PLP-cepstral coefficients
      2.2.2.10 Application of different values of root (Power law)
      2.2.2.11 Equal loudness and preemphasis in PLP
      2.2.2.12 Conclusion
    2.2.3 Line spectral frequencies
      2.2.3.1 Spectral sensitivity of PLP-line spectral frequencies
    2.2.4 Reflection coefficients
      2.2.4.1 Spectral sensitivity of PLP-reflection coefficients
      2.2.4.2 Spectral sensitivity of PLP-LPC
    2.2.5 Log-Area Ratios
    2.2.6 Selective Linear Prediction
    2.2.7 Discrete all-pole modeling
      2.2.7.0.1 I-S error minimization
      2.2.7.0.2 Parameters of the optimal all-pole model
      2.2.7.0.3 Iteratively solved set of nonlinear equations
      2.2.7.0.4 Potential approaches

3 Feature normalization
  3.1 Additive noise
  3.2 Convolutional distortion
  3.3 Spectral subtraction
  3.4 Spectral Mean Subtraction
  3.5 Wiener filtering
  3.6 Mean and variance normalization for robust speech recognition
    3.6.1 PLP-LPC, PLP-LSF, PLP-Refl, PLP-LAR with application of MVN
  3.7 Correlation index
  3.8 Compensation of the noise using linear and non-linear transformations
  3.9 Histogram based normalization of the features
    3.9.1 Training data as the target histogram
    3.9.2 Gaussian distribution as the target histogram
Chapter 1

Introduction

This document deals with the feature extraction techniques generally used in speech recognition
tasks. The most popular features used in speech recognition, such as Mel-filter bank cepstral
coefficients (MFCCs) and Perceptual Linear Prediction (PLP) coefficients, were taken as the baseline
features.
A significant amount of effort has been devoted to establishing speech feature extraction schemes
which enable robust and high-performance speech recognition in a range of operating environments.
Each scheme consists of several processing stages that will be subjected to theoretical analysis, as
well as to an evaluation of their contribution to the resulting performance of the whole system.
In many descriptions, feature extraction is considered as comprising three different stages:
• static feature extraction,
• feature normalization,
• inclusion of temporal information.
In our work we have concentrated on the first two stages. In all our speech recognition experiments,
delta and acceleration components (∆, ∆∆ coefficients) were applied as the temporal derivatives.

1.1 Speech recognition tasks


1.1.1 SpeechDat-Car databases
The feature extraction algorithms described later were in most cases tested on the SpeechDat-Car
(SDC) databases used for the Advanced DSR Front-End Evaluation: Italian SDC [?], Spanish SDC [?], and
Finnish SDC. From many previous experiments we have found that the recognition performance
obtained on the Italian SDC is most likely to predict the performance over the whole SDC set.
In all SDC databases, the speech recordings were taken from the close-talk microphone and from one of
the hands-free microphones. Data were recorded at 16 kHz, but downsampled to 8 kHz. The databases
contain various utterances of digits.
During the experiments, robustness was tested under three different training conditions. For each
of these three conditions, 70% of the files were used for training and 30% for testing.
• Well-matched condition (wm): All the files (close-talk and hands-free microphones) were
used for training and testing.
• Medium mis-matched condition (mm): Only recordings made with the hands-free micro-
phone were used for training and testing.
• Highly mis-matched condition (hm): Only close-talk microphone recordings were used for
training, whereas the hands-free files were taken for testing.

1.1.2 AURORA2-TIdigits
Furthermore, the Aurora 2 (noisy TIdigits) database, which is fully described in [?], was also used for
the evaluation in some of our experiments. The task was speaker-independent recognition
of digit sequences at a sampling frequency of 8 kHz. The database consists of clean speech as well as
speech with noise artificially added at several SNRs (20 dB, 15 dB, 10 dB, 5 dB, 0 dB). Four noises were
used in the training part: recordings inside a subway, babble, car noise, and recordings in an exhibition
hall. The conditions are divided into multi-condition training and clean training. Three different sets
of speech data were taken for the recognition:

• Set “a”: The noises are the same as for training.

• Set “b”: Different types of noises were used for this test (restaurant, street, airport, train
station), which should represent realistic scenarios for application in a mobile terminal.

• Set “c”: The same noises as in sets “a” and “b” were used, but different frequency charac-
teristics were applied (speech transmitted through a different channel).

For the SDC as well as the Aurora 2 experiments, the reference recognizer is based on the HTK software
package (version 2.2 from Entropic). In order to compare recognition results when applying different
feature extraction algorithms, the training and recognition parameters are well defined. There is no
restriction on the string length of the recognized numbers. The digits are recognized as whole-word HMMs
with 16 states per word (plus 2 dummy states at the beginning and end). The number of states has
been chosen with respect to the commonly used frame rate of 10 ms. The HMMs are simple left-right models
without skips over states. A diagonal covariance matrix is considered (only the variances of all
acoustic coefficients). A mixture of 3 Gaussians per state is computed.
A vector size of 15 coefficients plus delta and acceleration coefficients is defined. The vector size
may be changed. Two pause models are defined:

• “sil”: consists of 3 states (mixture of 6 Gaussians) with a special transition structure. It models
the pauses before and after the utterance.

• “sp”: used to model pauses between words. It consists of a single state which is tied with the
middle state of the first pause model.

Training is done in several steps by applying the embedded Baum-Welch reestimation scheme
(HTK tool HERest).

Chapter 2

Static feature extraction

The purpose of feature extraction (often referred to as signal modeling) is to transform
audio data into a space where observations from the same class will be grouped together and
observations of different classes will be pushed apart. For their derivation, psychological studies of
the human auditory and articulatory systems were used.
The short-time Fourier spectrum is usually computed as the first preprocessing block of feature
extraction. The length of the analyzed frames is 25 ms with a 10 ms time shift. The frames are weighted
by a Hamming window, which provides spectral analysis with a flatter pass band and significantly less
stop-band ripple. This property, together with the fact that the Hamming window is normalized so that
the energy of the signal is unchanged by the operation, plays an important role in obtaining
smoothly varying parametric estimates.
In most of our experiments the absolute energy of the spectrum of a given frame is considered
as one of the features. There are several possibilities for computing it from the spectrum; they will be
mentioned later.
The other part of the spectral measurement can in general be considered as measurement at
specific frequencies, which corresponds to the standard behavior of the hair cells in the cochlea of the
human auditory system.
Most feature extraction methods use cepstral analysis to extract the vocal tract component from
the speech signal. Many algorithms have been proposed to compute the cepstrum. The most successful
methods also include attributes of the psychological processes of human hearing in the analysis.

2.1 Mel-frequency cepstral (MFCC) analysis (experiments)

Nowadays MFCC analysis is considered the standard method for feature extraction in speech
recognition tasks. It uses a bank of Mel filters (Fig. 2.2) modeling the hair-cell spacing along the basilar
membrane of the ear.
Fig. 2.1 illustrates classical MFCC feature extraction. The spectral trajectories after each particular
processing stage for a voiced frame of speech are shown there.

Experiments:
The 64 Hz to 4 kHz frequency range is applied for computing the Mel-warped spectrum. The standard
Mel-scale warping function is used:

F_{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right).   (2.1)

Applying the short-time Fourier transform to the input speech, an input power spectrum is obtained:

P(\omega) = \left|\mathrm{STFT}\{w(n) \cdot s(n)\}\right|^2,   (2.2)

where s(n) is the input speech and w(n) is the weighting window. P(\omega) is transformed into 23 spectral
subbands equidistant on the Mel frequency scale. The natural logarithm is applied to the outputs of the
Mel filter bank. 15 cepstral coefficients are obtained by applying the DCT to the log-energies f_j of the
23 spectral subbands:

c_i = \sum_{j=1}^{23} f_j \cos\left(\frac{\pi i}{23}(j - 0.5)\right), \quad 0 \le i \le 14.   (2.3)
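As an illustration, the warping of Eq. 2.1 and the DCT of Eq. 2.3 translate into a few lines of Python
with NumPy. This is a minimal sketch under our own naming (mel_warp, mfcc_from_log_energies); it
assumes the Mel filter-bank log-energies are already available, and it is not the evaluation code used
in the experiments.

    import numpy as np

    def mel_warp(f_hz):
        # Eq. 2.1: Mel-scale warping function.
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mfcc_from_log_energies(log_e, n_ceps=15):
        # Eq. 2.3: DCT of the Mel filter-bank log-energies f_j
        # (23 bands in the text), giving cepstra c_0 ... c_{n_ceps-1}.
        n_bands = len(log_e)
        j = np.arange(1, n_bands + 1)          # band index j = 1 ... 23
        return np.array([np.sum(log_e * np.cos(np.pi * i * (j - 0.5) / n_bands))
                         for i in range(n_ceps)])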

SDC-Accuracy [%]        Italian                Finnish                Spanish
test                    hm     mm     wm       hm     mm     wm       hm     mm     wm
MFCC                    37.17  85.18  93.83    37.53  66.69  92.01    37.47  75.37  84.52

Aurora 2-WER [%]        noisy train            clean train
test                    "A"    "B"    "C"      "A"    "B"    "C"
MFCC                    12.67  13.82  14.56    45.53  50.12  36.12

Table 2.1: Word recognition results.

[Figure: processing chain FFT → Power → MFB → Log → DCT → static MFCCs, with the spectra after
each stage.]

Figure 2.1: MFCC analysis with frequency trajectories of voiced speech for F_sampl = 8 kHz.

2.2 Application of an all-pole modeling with additional analysis in speech recognition (experiments)
2.2.1 Spectral analysis, Critical band warping, equal-loudness, power law
Perceptual Linear Prediction (PLP) combines several engineering approximations of psychology of
human hearing processes. Critical band analysis simulated by an auditory-based warping of the
frequency axis is derived from the frequency sensitivity of human hearing. In the original approach, the
Bark-scale warping function is employed:

F_{Bark}(f) = 6 \ln\left[\frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1}\,\right]   (2.4)
Mel-scale filter bank analysis uses triangular-shaped windows (Fig. 2.2); in PLP analysis the
window shape is designed to simulate critical band masking curves (Fig. 2.3). Both allocate more
filters to the lower frequencies, where hearing is more sensitive.

In order to compensate for the unequal sensitivity of human hearing at different frequencies, the next
processing stage in PLP analysis simulates the equal-loudness curve:

E(\omega) = \frac{(\omega^2 + 56.8 \cdot 10^6)\,\omega^4}{(\omega^2 + 6.3 \cdot 10^6)^2\,(\omega^2 + 0.38 \cdot 10^9)}.   (2.5)

In MFCC analysis, preemphasis is applied in the time domain using a first-order high-pass filter:

H(z) = 1 - \alpha z^{-1}.   (2.6)

Equal-loudness (EQL) compensation is applied to the power spectrum P(\omega_m) computed using Eq. 2.2.
The next processing stage, called the intensity-loudness power law, models the non-linear relation between
the intensity of sound and its perceived loudness. In PLP analysis a cubic-root compression of the
critical band energies is applied. This type of compensation is related to the logarithm applied to
the Mel-filter bank channels in MFCC analysis. The power spectrum after application of the power law
is denoted P_p(\omega_m), m \in \langle 1, N \rangle, where N is the number of discrete frequencies (let us
keep the same notation P(\omega_m) for P_p(\omega_m)).
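The three perceptual operations above (Eqs. 2.4-2.5 plus the power law) reduce to a few one-line
functions. The following Python fragment is a sketch under our own naming; omega is the angular
frequency in rad/s, and the power-law root defaults to the cubic root used in standard PLP.

    import numpy as np

    def bark_warp(f_hz):
        # Eq. 2.4: Bark-scale warping function.
        x = f_hz / 600.0
        return 6.0 * np.log(x + np.sqrt(x * x + 1.0))

    def equal_loudness(omega):
        # Eq. 2.5: simulated 40 dB equal-loudness curve (omega in rad/s).
        w2 = omega ** 2
        return ((w2 + 56.8e6) * omega ** 4) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))

    def power_law(P, root=0.33):
        # Intensity-loudness power law: compression of critical-band energies.
        return np.asarray(P) ** root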
Figure 2.2: Bank of Mel filters for F_sampl = 8 kHz.

Figure 2.3: Bank of Bark filters for F_sampl = 8 kHz.

2.2.2 Linear prediction


Speech can be modeled as being produced by a periodic or noise-like source driving a tube; such a
tube then represents the vocal tract. It can be shown that basing the analysis (in a very general way) on
such a production model leads to a spectral estimate that is both succinct and smooth, and for which
the nature of the smoothness has a number of desirable properties. Obviously, the real vocal tract is
complicated and needs to be modeled by a nonuniform tube consisting of multiple shorter concatenated
tubes of differing cross-sectional areas but having the same length. This can be viewed as an
approximation to a continuous vocal tract shape. The resulting tube has a set of resonances
similar to those of an actual vocal tract. Experiments in speech perception suggested
the fundamental importance of the formants for human listeners. In other words, we can
assume that we need a model that can represent a sufficient number of resonances [?].
Each formant can be represented by a pole-only transfer function of the form:

\hat{H}_i(z) = \frac{1}{1 + b_i z^{-1} + c_i z^{-2}},   (2.7)

[Figure: PLP processing chain FFT → Power → frequency warping → equal loudness → power law →
IDFT → all-pole modelling → spectrum processing → post-processing into CEPs, RCs, LSFs, LARs, with
the spectra after each stage.]

Figure 2.4: PLP analysis with frequency trajectories of voiced speech for F_sampl = 8 kHz.

where for the moment the filter gain is ignored. In order to reasonably cover natural speech,
a cascade of several such filters is needed. This cascade approach has been used in many
synthesis applications. Multiplying through all of these sections, a direct-form implementation of the
spectrum is obtained:

\hat{H}(z) = \frac{1}{1 + \sum_{k=1}^{p} a_k z^{-k}},   (2.8)

where a_k are the coefficients of the resulting pth-order polynomial.

2.2.2.1 Autoregressive modeling


P(\omega_m) is approximated by the spectrum of an all-pole model using the autocorrelation method. It can
be shown that the inverse Fourier transform (IDFT) of a power spectrum P(\omega) yields the sequence
of autocorrelation coefficients.
Some spectrum processing is needed before applying the IDFT to make P(\omega) conjugate symmetric. If
this property of P(\omega_m) is satisfied, the resulting autocorrelation coefficients are real (projection of
P(\omega_m) onto the cosine basis only):

R_i = \frac{1}{N} \sum_{m=0}^{N-1} P(\omega_m) \cos(i \omega_m).   (2.9)

In the all-pole model we assume the frequency transfer function given by Eq. 2.8, but now with the
gain factor G:

\hat{H}(z) = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}},   (2.10)

where

A(z) = 1 + \sum_{k=1}^{p} a_k z^{-k}.   (2.11)

a_k are the predictor coefficients, and p is the order of the model. The problem is to determine these
three factors describing the all-pole model. The model power spectrum is then given by:

\hat{P}(\omega) = \frac{G^2}{\left|1 + \sum_{k=1}^{p} a_k e^{-jk\omega}\right|^2}.   (2.12)

Linear prediction (LP) uses the error measure E_{LP} between P(\omega_m) and \hat{P}(\omega_m) for discrete
spectra:

E_{LP} = \frac{G^2}{N} \sum_{m=1}^{N} \frac{P(\omega_m)}{\hat{P}(\omega_m)}
       = \frac{1}{N} \sum_{m=1}^{N} P(\omega_m) \left|A(e^{j\omega_m})\right|^2.   (2.13)

E_{LP} can be interpreted as the total energy of the "error signal" obtained by passing the
hypothetical signal s_n through the inverse filter A(z); this can be proved using Parseval's theorem.
An important fact is that E_{LP} is defined to be independent of G, as can be seen from Eq. 2.13. The gain
factor will be determined later from energy considerations.
The parameters {a_k} are determined by minimizing E_{LP} in Eq. 2.13 with respect to each of the
parameters. Hence we have:

\frac{\partial E_{LP}}{\partial a_i} = 0, \quad 1 \le i \le p.   (2.14)

It can be shown that:

\frac{\partial E_{LP}}{\partial a_i} = 2 \left[ R_i + \sum_{k=1}^{p} a_k R_{i-k} \right],   (2.15)

where R_i is determined by Eq. 2.9. Therefore,

\sum_{k=1}^{p} a_k R_{i-k} = -R_i, \quad 1 \le i \le p.   (2.16)

The minimum error is obtained by substituting Eqs. 2.9 and 2.16 into Eq. 2.13:

E_{LP_p} = R_0 + \sum_{k=1}^{p} a_k R_k,   (2.17)

where E_{LP_p} reflects the dependence of E_{LP} on the order of the all-pole model. Combining Eqs. 2.16
and 2.17 we get:

\sum_{k=1}^{p} a_k R_{i-k} = \begin{cases} -R_i, & 1 \le i \le p, \\ E_{LP_p}, & i = 0. \end{cases}   (2.18)
k=1
The problem is reduced to the task of solving a set of p equations with p unknowns. There exist
several standard methods for determining the unknowns (performing the necessary computations). From
Eq. 2.16, with respect to the covariance method of deriving the all-pole model, it can be noted that the
matrix of coefficients in each case is a covariance matrix that is symmetric and positive semidefinite
(in practice definite). Therefore it can be solved by the square-root method. A very efficient method of
determining {a_k} from Eq. 2.16 was proposed by Levinson, noting that the p × p autocorrelation matrix
is symmetric with equal elements along each diagonal (a Toeplitz matrix), while the column vector on the
right side of the following equation is taken to be a general column vector:

\begin{bmatrix}
R_0 & R_1 & R_2 & \dots & R_{p-1} \\
R_1 & R_0 & R_1 & \dots & R_{p-2} \\
R_2 & R_1 & R_0 & \dots & R_{p-3} \\
\vdots & \vdots & \vdots & & \vdots \\
R_{p-1} & R_{p-2} & R_{p-3} & \dots & R_0
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} -R_1 \\ -R_2 \\ -R_3 \\ \vdots \\ -R_p \end{bmatrix}   (2.19)

Another method, proposed by Durbin, greatly increases the efficiency of the previous algorithm by using
the fact that the right-side column vector contains the same elements found in the autocorrelation matrix.
Durbin's recursive algorithm is specified as follows:

E_{LP_0} = R_0,   (2.20)

k_i = -\frac{R_i + \sum_{j=1}^{i-1} a_j^{(i-1)} R_{i-j}}{E_{LP_{i-1}}},   (2.21)

a_i^{(i)} = k_i, \quad a_j^{(i)} = a_j^{(i-1)} + k_i a_{i-j}^{(i-1)}, \quad 1 \le j \le i-1,   (2.22)

E_{LP_i} = (1 - k_i^2) E_{LP_{i-1}}.   (2.23)

Eqs. 2.20 - 2.23 are solved recursively for i = 1, 2, . . . , p, and the final solution is:

a_j = a_j^{(p)}, \quad 1 \le j \le p.   (2.24)

An interesting property of Eq. 2.19 is that the solution is not affected if all the autocorrelation
coefficients are scaled by a constant, e.g. normalized by dividing by R_0. The minimum total
error E_{LP_i} in Eq. 2.23 decreases (or does not change) as the order of the predictor increases. E_{LP_i}
is always positive, since it is a squared error:

0 \le E_{LP_i} \le E_{LP_{i-1}}, \quad E_{LP_0} = R_0.   (2.25)
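A direct transcription of Eqs. 2.20-2.24 into Python/NumPy might look as follows. This is a sketch for
illustration (the routine name durbin and the returned triple are our choice, and R is assumed to be a
NumPy array of autocorrelations R_0 ... R_p), not the code used in the experiments.

    import numpy as np

    def durbin(R, p):
        # Durbin's recursion (Eqs. 2.20-2.24) for the system of Eq. 2.16.
        # Returns (a, k, E): predictor coefficients a_1..a_p, reflection
        # coefficients k_1..k_p, and the minimum error E_LPp (= G^2, Eq. 2.35).
        a = np.zeros(p + 1)
        a[0] = 1.0
        k = np.zeros(p + 1)
        E = R[0]                                             # Eq. 2.20
        for i in range(1, p + 1):
            k[i] = -(R[i] + np.dot(a[1:i], R[i-1:0:-1])) / E  # Eq. 2.21
            prev = a.copy()
            a[i] = k[i]                                       # Eq. 2.22
            for j in range(1, i):
                a[j] = prev[j] + k[i] * prev[i - j]
            E = (1.0 - k[i] ** 2) * E                         # Eq. 2.23
        return a[1:], k[1:], E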

2.2.2.2 Computation of gain factor G


In the all-pole model we assume that the signal s_n is given as a linear combination of past values and
some input u_n:

s_n = -\sum_{k=1}^{p} a_k s_{n-k} + G u_n.   (2.26)

Parameter-estimation methods such as the autocorrelation and covariance methods are derived using an
intuitive least-squares approach, assuming that s_n is a deterministic signal. In the method of least
squares, the assumption is made that the input u_n is totally unknown. Therefore the signal can be
predicted only approximately from a linearly weighted summation of past samples (\tilde{s}_n approximates
s_n):

\tilde{s}_n = -\sum_{k=1}^{p} a_k s_{n-k},   (2.27)

and the error between the actual value s_n and the predicted value \tilde{s}_n is given by:

e_n = s_n - \tilde{s}_n = s_n + \sum_{k=1}^{p} a_k s_{n-k}.   (2.28)

Since we assume in the least-squares approach that the input is unknown, the sense of the computation
is quite uncertain. However, Eq. 2.28 can be rewritten in the form:

s_n = -\sum_{k=1}^{p} a_k s_{n-k} + e_n.   (2.29)

The only input that will result in the signal s_n as output is the one where G u_n = e_n (the input signal
is proportional to the error signal). However, if we require that for any input u_n the energy in the
output must equal that of the original signal s_n, then we can determine the total energy in the input
signal.

Since the filter H(z) is fixed, the total energy in the input signal G u_n must equal the total energy
in the error signal, which is given by E_{LP_p} in Eq. 2.17.
For the specification of G, let the input to the all-pole filter H(z) be an impulse (unit sample) at
n = 0, i.e. u_n = \delta_{n0}. Then the output of H(z) is its impulse response h_n, where:

h_n = -\sum_{k=1}^{p} a_k h_{n-k} + \delta_{n0}.   (2.30)

If we determine the autocorrelation \hat{R}_i of the impulse response h_n, we can find an interesting
relationship to the autocorrelation R_i of the signal s_n:

\sum_{k=1}^{p} a_k \hat{R}_{i-k} = -\hat{R}_i, \quad 1 \le |i| \le \infty,   (2.31)

\hat{R}_0 = -\sum_{k=1}^{p} a_k \hat{R}_k + G^2.   (2.32)
k=1
To satisfy the condition of equality of the total energies of h_n and s_n, we must have:

\hat{R}_0 = R_0,   (2.33)

because the zeroth autocorrelation coefficient is equal to the total energy in the signal. From Eqs. 2.33,
2.16 and 2.31 we can write:

\hat{R}_i = R_i, \quad 0 \le i \le p.   (2.34)

This says that the first p + 1 autocorrelation coefficients of the impulse response of H(z) are equal
to the corresponding autocorrelation coefficients of the signal. The all-pole modeling problem can
then be reformulated: we want to find a filter of the form H(z) in Eq. 2.10 whose first p + 1
values of the autocorrelation of its impulse response are equal to the first p + 1 values of the signal
autocorrelation. Now the determination of G is very easy. With respect to Eqs. 2.32, 2.17 and 2.34
the gain is equal to:

G^2 = E_{LP_p} = R_0 + \sum_{k=1}^{p} a_k R_k = \sum_{k=0}^{p} a_k R_k,   (2.35)

where G^2 is the total energy in the input G\delta_{n0}.

2.2.2.3 Properties of the model spectrum


The poles of the all-pole model can be found by computing the roots of A(z). Since the coefficients
{a_k} are real, some of the roots (possibly none) are real and the rest occur in complex-conjugate pairs.
Because P(\omega) is a positive definite spectrum, the poles are guaranteed to be inside the unit circle.
As the order of the all-pole model increases, so does the range over which R_i and \hat{R}_i are equal,
resulting in a better fit of \hat{P}(\omega) to P(\omega). If p \to \infty, \hat{R}_i becomes identical to R_i
for all i, so that \hat{P}(\omega) = P(\omega).
From Eq. 2.13, with respect to the fact that E_{LP_p} = G^2:

\frac{1}{N} \sum_{m=1}^{N} \frac{P(\omega_m)}{\hat{P}(\omega_m)} = 1.   (2.36)

The property in Eq. 2.36 is satisfied for all values of p (even for the case p \to \infty).
If P(\omega) is an all-pole spectrum with p_0 poles, Eq. 2.36 becomes an identity, too. Then:

\hat{P}(\omega) = P(\omega), \quad p \ge p_0.   (2.37)
Another property of \hat{P}(\omega) is that its slope vanishes at the band edges:

\frac{\partial \hat{P}(\omega)}{\partial \omega} = 0, \quad \omega = 0, \pi,   (2.38)

which can be seen by rewriting Eq. 2.12 as:

\hat{P}(\omega) = \frac{G^2}{b_0 + 2 \sum_{k=1}^{p} b_k \cos(k\omega)},   (2.39)

where:

b_k = \sum_{n=0}^{p-|k|} a_n a_{n+|k|}, \quad a_0 = 1, \quad 0 \le k \le p,   (2.40)

and taking \partial \hat{P}(\omega) / \partial \omega. The b_k are the autocorrelation coefficients of the
impulse response of the inverse filter A(z).

2.2.2.4 Definition of an all-pole model for a_0 ≠ 1

We can rewrite Eq. 2.12 into a form without the gain factor:

\hat{P}(\omega) = \frac{k}{\left|\sum_{k=0}^{p} d_k e^{-jk\omega}\right|^2}.   (2.41)

Here G is incorporated in the coefficients of the denominator, i.e. d_0 is not restricted to 1. The
constant k is unknown, but depends on {d_k}. We will try to derive the coefficients {d_k} (d_0 ≠ 1) of
such a filter from the ordinary {a_k} coefficients (a_0 = 1). Let both sides of Eq. 2.35 be divided by
G^2:

G^2 = \sum_{k=0}^{p} a_k R_k \quad \Big| \cdot \frac{1}{G^2},   (2.42)

1 = \sum_{k=0}^{p} \frac{a_k}{G^2} R_k.   (2.43)

Then the set of equations in Eq. 2.19 becomes:

\begin{bmatrix}
R_0 & R_1 & R_2 & \dots & R_p \\
R_1 & R_0 & R_1 & \dots & R_{p-1} \\
R_2 & R_1 & R_0 & \dots & R_{p-2} \\
R_3 & R_2 & R_1 & \dots & R_{p-3} \\
\vdots & \vdots & \vdots & & \vdots \\
R_p & R_{p-1} & R_{p-2} & \dots & R_0
\end{bmatrix}
\begin{bmatrix} a_0/G^2 \\ a_1/G^2 \\ a_2/G^2 \\ a_3/G^2 \\ \vdots \\ a_p/G^2 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},   (2.44)
where d_k = a_k / G^2. For such a set of linear equations we can again use Levinson's recursive procedure
(Toeplitz matrix). Comparing Eqs. 2.12 and 2.41, it is clear that k = d_0. The relationship between
{a_k} and {d_k} is very simple:

a_k = \frac{d_k}{d_0},   (2.45)

G = \frac{1}{\sqrt{d_0}}.   (2.46)
Although this way of solving the all-pole modeling task may seem odd, we will use the set of
equations given by Eq. 2.44 later. If we rewrite the matrix Eq. 2.44 in the standard form, we have:

\sum_{k=0}^{p} d_k R_{i-k} = 0, \quad 1 \le i \le p   (2.47)

and

\sum_{k=0}^{p} a_k R_k = \frac{1}{d_0}.   (2.48)
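For illustration, the Toeplitz system of Eq. 2.44 can be solved directly, for example with SciPy's
solve_toeplitz, after which a_k and G are recovered via Eqs. 2.45-2.46. A minimal sketch under our own
naming, assuming SciPy is available:

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def allpole_d(R, p):
        # Solve Eq. 2.44 for d_k = a_k / G^2, then recover {a_k} and G
        # via Eqs. 2.45-2.46.  R holds autocorrelations R_0 ... R_p.
        rhs = np.zeros(p + 1)
        rhs[0] = 1.0                        # right-hand side [1, 0, ..., 0]
        d = solve_toeplitz(R[:p + 1], rhs)  # symmetric Toeplitz matrix of Eq. 2.44
        a = d / d[0]                        # Eq. 2.45 (a_0 comes out as 1)
        G = 1.0 / np.sqrt(d[0])             # Eq. 2.46
        return a[1:], G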

2.2.2.5 Normalized minimum error


There is an interesting aspect of the normalized minimum total squared error V_p, defined as the
ratio of the energy in the minimum error sequence e_n to the energy in the speech signal:

V_p = \frac{E_p}{R_0} = \frac{G^2}{R_0}.   (2.49)

From the previous equation it follows that V_p is a monotonically decreasing function of p, with
0 < V_p \le 1. In the case p = 0, V_0 = 1, and as p \to \infty, V_p approaches a minimum value
V_{min} = V_\infty. It can be shown that:

V_p = \frac{e^{\hat{c}_0}}{\hat{R}_0} = \frac{e^{\hat{c}_0}}{R_0},   (2.50)

where:

\hat{c}_0 = \frac{1}{N} \sum_{m=0}^{N-1} \log \hat{P}(\omega_m)   (2.51)

is the zeroth cepstral coefficient (inverse Fourier transform of the logarithm of the power spectrum).
In fact, \hat{P} is the power spectrum of the estimated all-pole model, so we should have used the
integral form of the equation (\hat{P} is continuous). However, for the computation of V_{min} the
discrete form is sufficient.
If p \to \infty, \hat{P}(\omega) is equal to P(\omega) and we obtain an expression for the normalized error:

V_{min} = V_\infty = \frac{e^{c_0}}{R_0},   (2.52)

where c_0 is the zeroth cepstral coefficient of the signal, and R_0 is the energy in the signal.
[Figure: normalized error V_p versus number of poles p; the voiced-frame curve lies below the unvoiced
ones, with asymptotes V_min = 0.875 (voiced) and V_min = 0.92 (unvoiced).]

Figure 2.5: Normalized error curves for unvoiced and voiced frames of speech without/with application
of preemphasis.

If we rewrite Eq. 2.52 as a function of P(\omega_m):

V_{min} = \frac{\exp\left[\frac{1}{N}\sum_{m=0}^{N-1} \log P(\omega_m)\right]}
               {\frac{1}{N}\sum_{m=0}^{N-1} P(\omega_m)}.   (2.53)
Figure 2.6: Spectral smoothing of a warped spectrum (solid line) by an all-pole model (dashed line) with
p = 14.

The numerator can be seen as the geometric mean of P(\omega_m), whereas the denominator represents the
arithmetic mean. Such a ratio is equal to one if all the data are equal, and its value decreases as the
spread of the data increases. Observing Eqs. 2.49 and 2.53 we can see that V_{min} depends only on the
shape of the signal spectrum, while V_p is completely determined by the shape of the approximating
spectrum (the all-pole model). This fact is important in interpreting the properties of the V_p curve for
the spectra of different sounds. As can be seen from Fig. 2.5, the error curves for voiced frames are
much lower than for unvoiced ones. Hence V_p can be suggested as a possible parameter for the detection
of voicing (although V_p depends only on the shape of the spectrum and has nothing to do with the fact of
voicing itself). It is easy to prove that if the spectrum is flat then V_{min} = 1 and the error curve is
the highest possible. This means that we are not able to approximate a flat spectrum by an all-pole
model. On the other hand, if all the energy is concentrated in certain regions of the spectrum and the
rest is zero, then V_{min} = 0, and the error curve is the lowest possible. In general, voiced frames
have most of the energy concentrated in one region at low frequencies, resulting in low error curves.
Unvoiced frames have the energy more spread out across the spectrum.
Any distortion applied to the input signal (such as preemphasis employing Eq. 2.6) that influences the
shape of the spectrum can largely affect the error curves, as can be seen in Fig. 2.5. Problems can
arise when telephone speech, where much of the low-frequency energy has been filtered out with a sharp
reduction of dynamic range, is used as the input of the voicing detector.
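Eq. 2.53 is simply the ratio of the geometric to the arithmetic mean of the power spectrum samples
(a spectral flatness measure), so a sketch of the voicing-related indicator is a one-liner (our naming,
assuming strictly positive spectral samples):

    import numpy as np

    def v_min(P):
        # Eq. 2.53: geometric over arithmetic mean of P(omega_m).
        # Returns 1 for a perfectly flat spectrum; approaches 0 as the
        # energy concentrates in a few spectral regions.
        P = np.asarray(P, dtype=float)
        return np.exp(np.mean(np.log(P))) / np.mean(P)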

2.2.2.6 Estimation of the order of all-pole model

Again from Fig. 2.5 we can estimate the value of p such that \hat{P}(\omega) approximates the envelope of
P(\omega) optimally (in the case of LP modeling of the speech spectrum, the formant structure should be
evident). The error curve starts at value 1 and monotonically decreases to its own V_{min} as
p \to \infty. In most error curves we can observe a "knee", the value of p at which the curve starts to
slope very slowly toward its asymptote (for the curves in Fig. 2.5, p = 8 for the voiced and p = 14 for
the unvoiced spectrum). Those values of p are optimal for approximating the signal spectrum. A lower
value of p results in a grosser approximation of the spectral envelope, whereas a larger value of p will
add detailed spectral information on top of the spectral envelope.

2.2.2.7 Cepstral analysis-basic PLP

A linear system approximating the envelope of a given signal spectrum (in our case the perceptually
warped spectrum) is fully described by the set of linear prediction coefficients {a_k} (LPC) together
with the additional information about the total energy represented by the gain factor. Such a linear
system can also be described by different types of coefficients. Once we have estimated the LPCs, we are
not restricted to this one type of representation of the linear system. This is a great advantage of PLP
over MFCC.
The derivation of cepstral coefficients from a given set of LPCs is simple, because there exists a direct
transformation from LPCs to cepstral coefficients. For its derivation, we first apply the logarithm to
\hat{H}(z):

\log \hat{H}(z) = \log\left[\frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}}\right].   (2.54)

If A(z) (the denominator in Eq. 2.54) is of pth order, all its roots are inside the unit circle (a
property of the all-pole model) and A(z \to \infty) = 1, we can expand \log(\hat{H}(z)) in the series:

\log \hat{H}(z) = c_0 + c_1 z^{-1} + c_2 z^{-2} + \dots = \sum_{i=0}^{\infty} c_i z^{-i},   (2.55)

where {c_i} are the cepstral coefficients of the LPC model. If we substitute \hat{H}(z) and differentiate
both sides of Eq. 2.55 to get rid of the logarithm, we obtain:

-\sum_{k=1}^{p} k a_k z^{-k} = \left[\sum_{i=1}^{\infty} i c_i z^{-i}\right] \cdot
\left[\sum_{k=0}^{p} a_k z^{-k}\right].   (2.56)

With respect to a_0 = 1 and by comparing the terms of the same power of z on the left and right sides of
Eq. 2.56, we can easily derive the cepstral coefficients of the LPC model:

c_0 = 0,
c_1 = -a_1,
c_k = -a_k - \sum_{i=1}^{k-1} \frac{i}{k} c_i a_{k-i}, \quad 2 \le k \le p,   (2.57)
c_k = -\sum_{i=1}^{p} \frac{k-i}{k} c_{k-i} a_i, \quad k = p+1, p+2, \dots

Cepstral coefficients derived in the previous equations are related to the spectral envelope established
by linear prediction analysis, and so they differ in general from cepstral coefficients computed directly
from the power spectrum (as in MFCC analysis).
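The recursion of Eq. 2.57 translates directly into code. The following Python sketch (our naming; a is
given as [a_1, ..., a_p] with a_0 = 1 implied, and c_0 is taken as 0 as in Eq. 2.57) computes an
arbitrary number of cepstral coefficients:

    import numpy as np

    def lpc_to_cepstrum(a, n_ceps):
        # Eq. 2.57: cepstral coefficients c_1..c_{n_ceps} of the all-pole
        # model from the LPCs a = [a_1, ..., a_p].
        p = len(a)
        c = np.zeros(n_ceps + 1)               # c[0] = 0 by Eq. 2.57
        for k in range(1, n_ceps + 1):
            acc = -a[k - 1] if k <= p else 0.0
            for i in range(max(1, k - p), k):  # only a_1..a_p contribute
                acc -= (i / k) * c[i] * a[k - i - 1]
            c[k] = acc
        return c[1:]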

Experiments:
We have made many experiments with PLP-cepstrum features on the AURORA2 and SDC tasks. In
Fig. 2.4 the processing scheme for PLP feature extraction is given. Clearly there are more algorithm
parameters than in MFCC analysis, so more experiments have to be done in order to find optimal
parameters:

– PLP7: Mel warping 129 → 23 frequency bands, Fstart = 0 Hz, c0 = mean(log(P(ωm))), p = 14.
Because of the all-pole model property (Eq. 2.38), it seems reasonable to repeat the first and last
frequency bands of the warped spectrum before its symmetrization, as can be seen in Fig. 2.7. c1 . . . c13
cepstral coefficients computed. root (Power law) = 0.33, EQL not applied.
– PLP5: Similar to PLP7, c0 = G, Fstart = 64 Hz.
– PLP6: Similar to PLP5, p = 20.
– PLP8: Similar to PLP5, no repetition of the first and the last freq. band (just symmetrization of the
spectrum by a flipping operation).
– PLP9: Similar to PLP5, c0 = log(G).
– PLP10: Similar to PLP7, p = 12.

In another experiment related to PLP7, where Fstart = 64 Hz, the recognition performance was almost
the same.

Aurora 2-WER [%]   noisy train            clean train
test               "A"    "B"    "C"      "A"    "B"    "C"
PLP7               12.29  12.75  13.84    44.30  49.11  35.76

Aurora 2-WER [%]   noisy train
test               "A"    "B"    "C"
PLP5               13.62  13.18  15.76
PLP6               13.88  13.74  16.13
PLP8               13.66  13.47  16.22
PLP9               12.32  12.66  13.79
PLP10              14.00  15.12  16.14

Aurora 2-WER [%]   noisy train
test               "A"    "B"    "C"
PLP0               13.89  13.73  17.18
PLP1               13.76  14.16  17.00
PLP2               14.62  15.88  18.97
PLP3               13.38  14.94  18.25

Table 2.2: Word recognition results.

2.2.2.8 Experimental choice of a value of p

Experiments:
– PLP0: Bark warping 129 → 15 frequency bands, root (Power law) = 0.33, EQL not applied, c0 = G,
spectral processing (repetition and symmetrization) as in PLP7, c1 . . . c13 standard cepstral
coefficients, p = 16.
– PLP1: Same as PLP0, p = 14.
– PLP2: Same as PLP0, p = 10.
– PLP3: Same as PLP0, p = 12.

2.2.2.9 Spectral sensitivity of PLP-cepstral coefficients


An interesting property of cepstral coefficients in general can be observed when their spectral
sensitivity is computed:

\frac{\partial S}{\partial c_i} = \lim_{\Delta c_i \to 0} \left|\frac{\Delta S}{\Delta c_i}\right|,   (2.58)

where \Delta S is the spectral deviation due to the change \Delta c_i in the ith cepstral coefficient.
Using the mean absolute log spectral measure to determine the spectral deviation yields the spectral
sensitivity shown in Fig. 2.8. Each curve has been obtained by computing the spectral sensitivity (using
16 points of the power spectrum) as one of the 14 cepstral coefficients was varied over the range
(0.5 - 1.5) \times c_{i,original} while the remaining 13 cepstral coefficients (we did not take into
account c_0) were kept unchanged. Across various types of speech frames, these sensitivity curves have
the same shape.
Spectral sensitivity is computed using the norm-based spectral distortion measure d_{SP}^q:

d_{SP}^q = \sqrt[q]{\frac{1}{N} \sum_{m=0}^{N-1} \left|10 \log_{10}
\left[\frac{P_{orig}(\omega_m)}{P_{ch}(\omega_m)}\right]\right|^q} \quad [\mathrm{dB}]   (2.59)

where q is usually equal to 2. The term rms log spectral measure is used when the log_{10} is replaced
by the natural logarithm. The mean absolute log spectral measure is obtained by setting q = 1. For
the limiting case as q approaches infinity, the term peak log spectral difference is used. P_{orig}(\omega_m)
is in our case the original spectrum obtained with the original values of c_i, whereas P_{ch}(\omega_m) is
the power spectrum related to the varied c_i.
Fig. 2.8 shows the spectral sensitivity of the different cepstral coefficients. The original cepstral
values have been changed from the original value c_{i,orig} to the value (0.5 - 1.5) \times c_{i,orig}.
The symmetric shape of the spectral sensitivity curve comes from the property of the inverse Fourier
transform, which is employed in the relationship between spectrum and cepstrum, and from the fact that
q = 2. It is also clear from the following equation:

\sum_n (c_{orig} - c_{ch})^2 = \left((P_{orig} - P_{ch})\,Tr\right)\left((P_{orig} - P_{ch})\,Tr\right)^T
= \sum_n (P_{orig} - P_{ch})^2,   (2.60)

where

• Tr is the transformation matrix (in our case the inverse DCT, an n × n matrix), with P · Tr = c,

• P_orig is the vector of original power spectra (1 × n),

• P_ch is the vector of power spectra obtained from the modified cepstrum c_ch (1 × n),

• c_orig and c_ch are 1 × n vectors.

Eq. 2.60 explains the symmetry of the spectral distortion curve: the spectral distortion of the power
spectra is not sensitive to the direction in which c_orig is varied.
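A sketch of the distortion measure of Eq. 2.59, used to produce the sensitivity curves (our naming; both
spectra are assumed strictly positive):

    import numpy as np

    def d_sp(P_orig, P_ch, q=2):
        # Eq. 2.59: norm-based log spectral distortion in dB between the
        # original spectrum and the one rebuilt from a perturbed coefficient.
        diff = 10.0 * np.log10(np.asarray(P_orig, float) / np.asarray(P_ch, float))
        return np.mean(np.abs(diff) ** q) ** (1.0 / q)

Sweeping one c_i over (0.5 - 1.5) × c_{i,orig}, rebuilding P_ch via the inverse DCT, and evaluating d_sp
at each point reproduces one curve of Fig. 2.8.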
Figure 2.7: Repetition of the first and the last frequency band and spectrum symmetrization to get
better matching of an all-pole model to the signal spectrum.

Figure 2.8: Spectral sensitivity curves for the cepstral coefficients of a 14th order PLP analysis. Only
the sensitivity of c1 . . . c14 is computed.

2.2.2.10 Application of different values of root (Power law)

Experiments:
Different values of root have been examined in the PLP used for the Aurora 2 project. In the Aurora 2
experiments, a noise estimation with noise subtraction algorithm was employed (the system which
performed best with the MFCC type of feature extraction (sub fea10 frame orig nocoder)), with
Fstart = 64 Hz of the MFB, DC offset in the spectral domain, a temporal LDA filter (last data-driven
filter coefficients), and a voice activity detector (VAD) (submitted version). We did not use mean and
variance normalization (MVN) or the final coding/decoding algorithm (no TRAPS, no feature-net).
PLP part of the code: MFB warping (23 bands), 15 output cepstral coefficients, c0 = log(G), no EQL
transformation:

– PLP4 nocoder: root = 0.33.
– PLP5 nocoder: root = 0.2.
– PLP7 nocoder: root = 0.4.

For comparison with PLP feature extraction, we performed MFCC analysis with an untouched noise
estimation and suppression system (sub fea10 frame orig nocoder); the same TLDA and VAD were used
(no MVN and no final coding/decoding):

– subfea 10 frame orig noonline nocoder: Aurora 2 evaluation, MFCC analysis employed.

SDC-Accuracy [%]                        Italian
test                                    hm     mm     wm
PLP4 nocoder                            73.7   93.01  96.92
PLP5 nocoder                            73.81  92.77  96.99
PLP7 nocoder                            73.7   93.13  96.96
subfea 10 frame orig noonline nocoder   77.24  92.77  96.94

Aurora 2-WER [%]                        noisy train            clean train
test                                    "A"    "B"    "C"      "A"    "B"    "C"
PLP4 nocoder                            9.13   9.65   11.36    18.29  17.73  20.18
PLP5 nocoder                            9.19   9.70   11.44    18.57  17.81  20.45
PLP7 nocoder                            9.15   9.60   11.29    18.05  17.43  19.86
subfea 10 frame orig noonline nocoder   9.33   9.66   11.62    18.21  17.45  20.17

Table 2.3: Word recognition results.

2.2.2.11 Equal loudness and preemphasis in PLP


The power spectrum processed by the critical band analysis is preemphasized by the simulated equal-
loudness curve. Its function is to adapt the power spectrum to the non-equal sensitivity of human
hearing at different frequencies and to simulate the sensitivity of hearing at about the 40 dB level.
The approximation is given in Eq. 2.5, which represents the transfer function of a filter with asymptotes
of 12 dB/oct between 0 and 400 Hz, 0 dB/oct between 400 Hz and 1200 Hz, 6 dB/oct between 1200 and
3100 Hz, and 0 dB/oct between 3100 Hz and F_sampling/2. This approximation is valid for frequencies
up to 5000 Hz.
Instead of Eq. 2.5 we can apply the simple FIR filter given by Eq. 2.6 with α ≈ 0.95.

Experiments:
In the first experiments we have used the same experimental setup as for PLP4 nocoder described in
sect. 2.2.2.10, but now with application of the EQL approximation:
Figure 2.9: EQL curve given by Eq. 2.5 (solid line) and Eq. 2.6 (dashed line).

– PLP6 nocoder: Application of 40 dB EQL approximation computed for central frequencies of each
Mel filter bank (Fstart = 0Hz) using Eq. 2.5.
– PLP8 nocoder: Application of 40 dB EQL approximation computed for central frequencies of each
Mel filter bank (Fstart = 64Hz) using Eq. 2.5.
– PLP9 nocoder: Application of preemphasis using Eq. 2.6 for central frequencies of each Mel filter
bank (Fstart = 64Hz).

SDC-Accuracy [%]                        Italian                Finnish                Spanish
test                                    hm     mm     wm       hm     mm     wm       hm     mm     wm
PLP6 nocoder                            78.5   92.65  97.15    85.55  83.04  96.35    84.21  93.26  96.87
subfea 10 frame orig noonline nocoder   77.24  92.77  96.94    85.3   85.16  96.27    83.19  92.63  96.90

SDC-Accuracy [%]   Italian
test               hm     mm     wm
PLP8 nocoder       76.4   92.89  97.02
PLP9 nocoder       74.78  92.49  97.06

Aurora 2-WER [%]                        noisy train            clean train
test                                    "A"    "B"    "C"      "A"    "B"    "C"
PLP6 nocoder                            9.17   9.28   11.25    18.04  17.17  19.49
PLP8 nocoder                            9.26   9.40   11.33    18.71  17.46  20.63
PLP9 nocoder                            9.26   9.66   11.37    18.39  17.61  20.27
subfea 10 frame orig noonline nocoder   9.33   9.66   11.62    18.21  17.45  20.17

Table 2.4: Word recognition results.

We have also tried to employ EQL in experiments with the basic PLP analysis described in sect. 2.2.2.7.
The experiments were the same as PLP7:

– PLP20: Application of the 40 dB EQL approximation computed for the central frequencies of each Mel
filter bank (Fstart = 0 Hz for all SDC databases) using Eq. 2.5.

2.2.2.12 Conclusion
Experiments with PLP-based feature extraction were run on the AURORA 2 DB. The purpose of this work
was to compare the results with MFCC-based feature extraction and to find the advantages and drawbacks
of each. Because PLP analysis has more parameters to be set, we first experimented with raw PLP feature
extraction and later incorporated the optimized code into the full AURORA 2 code, where the section
related to MFCC analysis was replaced by PLP.
SDC-Accuracy [%]   Italian                Finnish                Spanish
test               hm     mm     wm       hm     mm     wm       hm     mm     wm
PLP7               38.14  85.18  94.26    41.2   65.73  91.86    38.92  73.01  87.15
PLP20              36.3   84.5   93.47    33.85  58.69  91.27    45.80  68.92  88.27

Aurora 2-WER [%]   noisy train            clean train
test               "A"    "B"    "C"      "A"    "B"    "C"
PLP7               12.29  12.75  13.84    44.30  49.11  35.76
PLP20              12.07  12.56  14.11    44.35  47.9   33.78

Table 2.5: Word recognition results.

Some of the parameters of PLP analysis were estimated experimentally as well as analytically; some of
them only experimentally.
One of the most important parameters in PLP analysis is the order of the all-pole model used to
approximate the power spectrum of the input speech. Analytically (sect. 2.2.2.6), on a randomly selected
unvoiced frame (an unvoiced frame needs a higher order all-pole model to approximate its power
spectrum), we obtained p = 14, which corresponds to the results mentioned in Tab. 2.2.
The intensity-loudness power law is employed in PLP to approximate the power law of hearing and
to simulate the non-linear relation between the intensity of sound and its perceived loudness. This
operation also reduces the spectral amplitude variation of the warped power spectrum, so that the
approximating all-pole model can be used with a relatively low model order (essentially, the all-pole
model can better fit the given spectrum). In standard PLP analysis the power root constant is 0.33.
Increasing it towards the square root (0.5) almost does not affect the performance, but the performance
starts degrading with smaller values of root.
Interesting observations come from the EQL block used in PLP to approximate the non-equal sensitivity
of hearing at different frequencies. From our experiments it follows that EQL does not play an
important role in PLP analysis (see results in Tab. 2.5). However, greater robustness of PLP-based
feature extraction (with EQL preemphasis) is obtained in the full AURORA 2 feature extraction approach
(Tab. 2.4), where the noise suppression algorithm has been employed. There, better performance is
mainly obtained in the high-mismatch experiments, but some improvement can also be noticed for
well-matched conditions.
Applying EQL preemphasis with the somewhat optimized parameters of this PLP-based feature extraction
approach, the overall performance on the AURORA 2 speech recognition task is higher than for the
MFCC-based feature extraction approach, where all preprocessing blocks were left untouched.

2.2.3 Line spectral frequencies


One of the most popular parametric spectral representations uses the line spectral frequencies (LSFs),
also known as line spectrum pairs (LSPs). LSFs were proposed for use in speech compression.
The LPC coefficients {a_k} are known to be inappropriate for quantization because of their large
dynamic range; after quantization, the all-pole model can become unstable. Hence, different sets of
parameters representing the same spectral information were proposed (reflection coefficients, log-area
ratios, LSPs, . . . ).
The LSP representation is rather artificial. In this spectral representation, two polynomials are given:

P(z) = A(z) + z^{-(p+1)} A(z^{-1}),   (2.61)

Q(z) = A(z) - z^{-(p+1)} A(z^{-1}).   (2.62)

It follows that:

A(z) = \frac{1}{2}\left[P(z) + Q(z)\right],   (2.63)

where A(z) is an inverse all-pole model given by Eq. 2.11 which minimizes the residual energy. There
are several important properties of P (z) and Q(z):

• All zeros of P (z) and Q(z) are on the unit circle.

• Zeros of P (z) and Q(z) are interlaced with each other.

• Minimum phase property of A(z) is easily preserved after quantization of the zeros of P (z) and
Q(z).

• The LSF coefficients allow interpretation in terms of formant frequencies. If two neighboring
LSFs are close in frequency, it is likely that they correspond to a narrow bandwidth spectral
resonance in that frequency region. Otherwise, they usually contribute to the overall tilt of the
spectrum.

• Shifting the line spectral frequencies has a localized spectral effect – quantization errors in an
LSF will primarily affect the region of the spectrum around that frequency.

The first two properties are useful for finding the zeros of P(z) and Q(z). The third property ensures
the stability of the synthesis filter. Straightforward computation of the LSFs is not efficient, due to
the extraction of the complex roots of a high-order polynomial. However, methods applying a discrete
cosine transform or Chebyshev polynomials have been proposed.
We have been observing the behavior of LSFs in terms of their use in PLP-based feature extraction
for speech recognition. Therefore, the spectrum was first warped, and the other PLP processing
operations were applied.
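For illustration, the LSFs can be obtained from {a_k} by forming P(z) and Q(z) of Eqs. 2.61-2.62 and
rooting them numerically. This brute-force sketch (our naming) is inefficient compared with the
Chebyshev-based methods mentioned above, but it shows the structure:

    import numpy as np

    def lpc_to_lsf(a):
        # LSFs from A(z) = 1 + a_1 z^-1 + ... + a_p z^-p via the roots of
        # P(z) and Q(z) (Eqs. 2.61-2.62); returns p frequencies in (0, pi).
        A = np.concatenate(([1.0], a))
        # coefficients of P(z) and Q(z), ascending powers of z^-1
        P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
        Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
        lsf = []
        for poly in (P, Q):
            # roots lie on the unit circle; keep angles strictly inside (0, pi),
            # which drops the trivial roots at z = 1 and z = -1
            angles = np.angle(np.roots(poly))
            lsf.extend(w for w in angles if 0.0 < w < np.pi)
        return np.sort(np.array(lsf))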

Figure 2.10: Trajectories of LSFs for a spectrum analyzed by PLP (23 critical bands, F_sampl = 8 kHz).

Figure 2.11: Poles of 1/A(z), 1/P(z), 1/Q(z) for a 10th order A(z) in the z-plane.

Figure 2.12: Frequency responses of |1/A(z)| (solid curve), |1/P(z)| (dashed curve) and |1/Q(z)|
(dash-dotted curve). Vertical lines are LSFs (in the range 0-4000 Hz).

2.2.3.1 Spectral sensitivity of PLP-line spectral frequencies


The spectral sensitivity of the LSFs is computed using the method described in sect. 2.2.2.9. Here, each
LSF coefficient was varied over the range 0 to π, whereas the rest were kept unchanged. The results are
shown in Fig. 2.13. In order to mark zero spectral sensitivity for all LSFs, one more value (the very
first) was added to Fig. 2.13.
Figure 2.13: Spectral sensitivity of LSFs of a 14th order PLP analysis for the normalized frequency
range 0 − π.

2.2.4 Reflection coefficients


Reflection coefficients (denoted k_i for i = 1 . . . p, where p is the order of the all-pole model) are a
by-product of the Levinson-Durbin algorithm (see Eq. 2.21), but can also be recursively computed from
the filter coefficients {a_k}. The recursion is initialized with a_k^{(p)} = a_k for 1 \le k \le p. The
reflection coefficients are then computed from:

k_i = a_i^{(i)}, \quad
a_j^{(i-1)} = \frac{a_j^{(i)} - a_i^{(i)} a_{i-j}^{(i)}}{1 - k_i^2}, \quad 1 \le j \le i-1,   (2.64)

where the index i starts from p and decrements at each iteration until i = 1. The coefficients k_i
correspond to the gain factors in the lattice structure implementation of the LP analysis filter A(z)
(see Fig. 2.14). The lattice and transversal structures yield the same output, except in the time-varying
case, the memory/initial conditions of the filters being the cause of this difference. The LP analysis
filter is guaranteed to be minimum phase when |k_i| < 1 for i = 1, . . . p. Another advantage is that
changing the order of the filter does not affect the coefficients already computed; i.e.,
k_i^{(p)} = k_i^{(q)} for i = 1, . . . p, where k_i^{(p)} and k_i^{(q)} are the reflection coefficients
for pth and qth order predictors, respectively, and p \le q.
Figure 2.14: Lattice structure of an all-pole LPC filter. The signals f(i)[n] and b(i)[n] are the ith
order forward and backward prediction errors, respectively. Reflection coefficients k(i) refer to {k_i}
in the text.
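The backward (step-down) recursion of Eq. 2.64 in Python, as a sketch (our naming; a is given as
[a_1, ..., a_p], and |k_i| < 1 is assumed so the divisions are well defined):

    import numpy as np

    def lpc_to_reflection(a):
        # Eq. 2.64: reflection coefficients k_1..k_p from LPCs a_1..a_p,
        # stepping the model order down from p to 1.
        cur = np.asarray(a, dtype=float).copy()   # a^(p): cur[j-1] = a_j^(i)
        p = len(cur)
        k = np.zeros(p)
        for i in range(p, 0, -1):
            k[i - 1] = cur[i - 1]                 # k_i = a_i^(i)
            if i > 1:
                # vectorized step-down: a_j^(i-1) for j = 1 .. i-1
                cur = (cur[:i - 1] - k[i - 1] * cur[i - 2::-1]) / (1.0 - k[i - 1] ** 2)
        return k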

2.2.4.1 Spectral sensitivity of PLP-reflection coefficients


Reflection coefficients have poor linear quantization properties. The sensitivity curves, as seen in
Fig. 2.15, all have the same general ∪ shape. This is consistent with the fact that reflection
coefficients perform poorly when linearly quantized, especially as the magnitude of the reflection
coefficients approaches unity.
Figure 2.15: Spectral sensitivity for the reflection coefficients {k_i} of a 14th order PLP analysis
over the range [-1, +1].

2.2.4.2 Spectral sensitivity of PLP-LPC


The spectral sensitivity curves of the LPCs {a_k} have essentially the same shape as those for the
cepstral coefficients {c_k}. The curves in Fig. 2.16 were derived in the same way as described in
sect. 2.2.2.9.

2.2.5 Log-Area Ratios


Since the quantized coefficient sets that have the largest spectral deviation contribute the most to
perception, a quantization scheme that minimizes the maximum spectral deviation is desirable. The
log-area ratios (LARs) are computed from the reflection coefficients {k_i}:

g_i = \log \frac{1 + k_i}{1 - k_i}, \quad 1 \le i \le p,   (2.65)

and are a non-linear transformation whose spectral sensitivity curves are approximately flat (the shape
is very similar to the spectral sensitivity shapes of the cepstral coefficients). The inverse
transformation is:

k_i = \frac{e^{g_i} - 1}{e^{g_i} + 1}, \quad 1 \le i \le p.   (2.66)
Figure 2.16: Spectral sensitivity of the LPCs of a 14th order PLP analysis. Only the sensitivity of
a1 . . . a14 is computed.

The inverse sine transformation, given by:

g_i = \sin^{-1} k_i,   (2.67)

also has good linear quantization properties.
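Both transformations (Eqs. 2.65-2.67) are elementwise one-liners; a sketch with our own naming:

    import numpy as np

    def rc_to_lar(k):
        # Eq. 2.65: log-area ratios from reflection coefficients (|k_i| < 1).
        k = np.asarray(k, dtype=float)
        return np.log((1.0 + k) / (1.0 - k))

    def lar_to_rc(g):
        # Eq. 2.66: inverse transformation back to reflection coefficients.
        g = np.asarray(g, dtype=float)
        return (np.exp(g) - 1.0) / (np.exp(g) + 1.0)

    def rc_to_arcsine(k):
        # Eq. 2.67: inverse sine transformation.
        return np.arcsin(np.asarray(k, dtype=float))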

2.2.6 Selective Linear Prediction


LP analysis can be considered as a method of spectral modeling where we usually suppose that the model
spectrum spans the same frequency range as the signal spectrum (or, in PLP analysis, the warped power
spectrum). The LP modeling problem can be generalized to the case where we wish to fit only a selected
portion of a given spectrum, or we wish to model different parts of the spectrum using different models.
In general we have a spectrum P(\omega_m), 0 \le \omega_m \le \omega_b, and the task is to match the
spectrum in a region \omega_\alpha \le \omega_m \le \omega_\beta by an all-pole model having the power
frequency response \hat{P}(\omega) given by Eq. 2.12. In order to compute the parameters of
\hat{P}(\omega), we simply map the given region onto the unit circle such that \omega_\alpha \to 0 and
\omega_\beta \to \pi. Then the procedure for estimating the LPCs described in sect. 2.2.2 is applied;
a minimal sketch is given below.
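The sketch assumes the durbin() routine from sect. 2.2.2.1: the selected spectral band is symmetrized,
its autocorrelations are taken by IDFT (Eq. 2.9), and the usual recursion is run. The names and the
band-selection convention are ours.

    import numpy as np

    def selective_lp(P, m_lo, m_hi, p):
        # Fit an all-pole model only to P(omega_m), m_lo <= m <= m_hi,
        # by mapping that region onto (0, pi) and running the standard
        # autocorrelation method (omega_alpha -> 0, omega_beta -> pi).
        band = np.asarray(P, dtype=float)[m_lo:m_hi + 1]
        # make the band even-symmetric so the IDFT is real (Eq. 2.9)
        sym = np.concatenate((band, band[-2:0:-1]))
        R = np.real(np.fft.ifft(sym))[:p + 1]   # autocorrelations R_0..R_p
        a, k, E = durbin(R, p)                  # LPCs; E = G^2 (Eq. 2.35)
        return a, np.sqrt(E)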

[Figure: FFT → frequency warping → equal loudness → power law; the warped spectrum is then processed in
two branches (lower and upper bands), each with IDFT → all-pole modelling → spectrum processing →
cepstrum, feeding the recognizer either directly or after spectrum concatenation and a single frequency
response → cepstrum step.]

Figure 2.17: Scheme of the selective LPC approach incorporated into PLP analysis.

Selective all-pole modeling is usually applied in terms of modeling different parts of the signal
spectrum differently. This means the whole signal spectrum is still modeled, but not with one unique
all-pole model.
An important note: since we assume the availability of the discrete signal spectrum P(\omega_m), any
desired frequency shaping or filtering can be applied directly to the signal spectrum before LP analysis
is performed.
Figure 2.18: Application of selective LP in PLP analysis: original warped spectrum (solid line), one
all-pole model approximation (no selective LP) (dashed line).

Figure 2.19: Application of selective LP in PLP analysis: original warped spectrum (solid line), two
all-pole model approximation (dashed line, p1 = 12, p2 = 6). The all-pole models are concatenated in the
19th band.

Figure 2.20: Application of selective LP in PLP analysis: original warped spectrum (solid line), lower
all-pole model approximation (dashed line, p1 = 12) for the 1st - 18th frequency bands.

Figs. 2.18-2.21 show power spectra obtained after application of ordinary LP analysis and selective
LP analysis computed for perceptually warped spectra. In most of the experiments with PLP, the
warped spectrum consists of 25 frequency bands (including the repetition of the first and last bands).
The application of all-pole modeling to a voiced frame (8 kHz sampled speech) is shown in Fig. 2.18. In
the selective LP approach the spectrum is divided into a lower and an upper part (in our case the
1st - 18th bands form the lower part and the 19th - 23rd bands the upper part of the spectrum, before
repetition of the side bands). The lower and upper spectra are each processed independently as in
standard all-pole modeling, and two different all-pole models are obtained (Figs. 2.20 and 2.21). After
concatenation of these two all-pole models, the power spectrum shown in Fig. 2.19 is obtained.
There can be several reasons to use selective linear prediction in speech recognition or coding
instead of the classical approach. For instance, in speech recognition the main region of interest is the
Figure 2.21: Application of selective LP in PLP analysis: original warped spectrum (solid line), upper
all-pole model approximation (dashed line, p2 = 6) for the 19th - 25th frequency bands.

0 - 5 kHz region (it is well known that the first two formants of all vowels lie approximately in
the frequency range up to 3 kHz). The spectrum at the upper frequencies is important mainly for the
recognition of fricatives, in which case the total energy in that region might be sufficient. In LP
analysis the spectral matching process performs uniformly over the whole frequency range, which is
not desirable in this case. The all-pole assumption is also less applicable for many speech sounds at
frequencies greater than 5 kHz. Therefore, instead of modeling the whole spectrum, we use selective LP
to model the lower part by a lower-order all-pole spectrum. Then we can fit a very low order all-pole
spectrum to the upper frequency region.
An interesting problem is to attempt the same analysis in the time domain (of course not in the PLP
case). Many down-sampling operations with sharp filtering would have to be performed, and the problems
would increase if we wanted to choose arbitrarily divided frequency regions.

Experiments:
Many experiments have been done with the application of selective LP analysis in PLP feature extraction. The preprocessing operations applied before all-pole modeling have not been changed, so the perceptually warped power spectrum is still processed with the equal-loudness (EQL) curve and the power law. In standard PLP analysis, such a spectrum is made conjugate symmetric and the autocorrelation coefficients are computed using Eq. 2.9. In the selective LP approach:

– The spectrum is split into an upper and a lower part. The first frequency band of the lower part and the last frequency band of the upper part of the spectrum are repeated, respectively. These two parts are modified to be symmetric, and two sets of real autocorrelation coefficients are obtained.
– The lower and upper parts of the spectrum are modeled independently by all-pole modeling, and two all-pole models P̂lower(ω) and P̂upper(ω) are obtained (see the code sketch below).
– In order to get features for speech recognition, we experimented with two possibilities:
∗ Compute LPCs (and then cepstrum or LSF, . . . ) for both all-pole models independently, and concatenate the features to obtain one feature set.
∗ Concatenate P̂lower(ω) and P̂upper(ω) and use the discrete-filter least-squares method to get the parameters of one all-pole model. One set of features can then easily be derived from the resulting all-pole model. It is important to note that P̂lower(ω) and P̂upper(ω) are continuous spectra (spectra with high frequency resolution), so their approximation is accurate.

In selective all-pole modeling there are more parameters that have to be chosen, such as the orders of the all-pole models and the frequency band where the warped power spectrum is split into the lower and upper part. Only cepstral coefficients have been taken for the speech recognition experiments. A minimal code sketch of the split-and-model step follows.
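The following is a minimal numpy sketch of the split-and-model step (illustrative, not the actual front-end code): all function names are assumptions of this sketch, and the split band and model orders are the free parameters discussed above.

    import numpy as np

    def levinson(r, p):
        """Levinson-Durbin recursion: monic predictor a (a[0] = 1) and
        residual energy err from autocorrelation lags r[0..p]."""
        a = np.zeros(p + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, p + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i] += k * a[i - 1:0:-1]
            a[i] = k
            err *= 1.0 - k * k
        return a, err

    def warped_autocorr(P_half, n_lags):
        """Autocorrelation of a one-sided discrete power spectrum: mirror it
        to a symmetric spectrum and take the real IDFT (cf. Eq. 2.9)."""
        P_sym = np.concatenate([P_half, P_half[-2:0:-1]])
        return np.fft.ifft(P_sym).real[:n_lags + 1]

    def selective_lp(P_warped, split, p_lower=12, p_upper=6):
        """Model the lower and upper parts of a warped power spectrum by two
        independent all-pole models (selective LP)."""
        lower = P_warped[:split + 1]   # lower part, side band repeated by mirroring
        upper = P_warped[split:]       # upper part
        a_lo, g_lo = levinson(warped_autocorr(lower, p_lower), p_lower)
        a_up, g_up = levinson(warped_autocorr(upper, p_upper), p_upper)
        return (a_lo, g_lo), (a_up, g_up)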

– PLP12: Up to the spectrum processing block, the same algorithm as in the PLP7 experiment. Then the spectrum is split into a lower and an upper part (23 bands divided into 15 + 8 bands). Two all-pole models are computed with plower = 12, pupper = 5. Numbers of cepstral coefficients Nc: Nclower = 9, Ncupper = 5. These two cepstral streams are linearly merged to create the final stream. The 0th cepstral coefficient is the log of the mean of the warped power spectrum, as in PLP7.

– PLP13: Up to the spectrum processing block, the same algorithm as in the PLP7 experiment. Then the spectrum is split into a lower and an upper part (23 bands divided into 15 + 8 bands), plower = 10, pupper = 6. The lower and upper frequency responses of the all-pole models are concatenated together, the LPCs of the resulting frequency response are computed (p = 14), and the LPCs are converted into 14 cepstral coefficients. The 0th cepstral coefficient is the log of the mean of the warped power spectrum, as in PLP7.
– PLP14: Similar to PLP12, plower = 8, pupper = 4.
– PLP30: Similar to PLP12, 23 bands are divided into 18 + 5 bands. Nclower = 10, Ncupper = 4.
– PLP31: Similar to PLP12, 23 bands are divided into 18 + 5 bands. Nclower = 10, Ncupper = 4, plower = 10, pupper = 4.
– PLP32: Similar to PLP12, 23 bands are divided into 15 + 8 bands. Nclower = 10, Ncupper = 4, plower = 10, pupper = 5.
– PLP33: Similar to PLP30, with a slightly different repetition of the side frequency bands of the upper spectrum.
– PLP34: Similar to PLP13, plower = 12, pupper = 5. Interpolation between the last spectral sample of the lower spectrum and the first sample of the upper spectrum for a better transition (d f(end) = (d f(end-1)+d s(1))/2).

SDC-Accuracy [%]    Italian
test        hm      mm      wm
PLP12       44.38   81.3    93.49
PLP13       35.38   81.22   93.23
PLP14       44.3    82.82   92.94
PLP30       45.56   83.78   93.55
PLP31       43.88   83.22   93.97
PLP32       45.35   82.58   92.88
PLP33       41.99   83.14   94.00
PLP34       34.7    81.82   93.00

SDC-Accuracy [%]    Italian                 Finnish                 Spanish
test        hm      mm      wm      hm      mm      wm      hm      mm      wm
PLP30       45.56   83.78   93.55   48.62   63.54   90.18   42.23   75.79   87.55
PLP7        38.14   85.18   94.26   41.2    65.73   91.86   38.92   73.01   87.15

Table 2.6: Word recognition results.

2.2.7 Discrete all-pole modeling


The goal of linear prediction (used in PLP analysis) is to predict, according to an error criterion, the present value of a signal based on its previous p values. In standard LP, the error criterion used is a least squares distance measure between the actual and predicted values. In the frequency domain, the LP error criterion for discrete spectra is given in Eq. 2.13. If we suppose that the gain factor in P̂(ω) is incorporated in the filter coefficients, as described in sect. 2.2.2.4, the LP error criterion is:

E_{LP} = \frac{1}{N} \sum_{m=1}^{N} \frac{P(\omega_m)}{\hat{P}(\omega_m)},    (2.68)

and P̂(ω) is then given by Eq. 2.41. Generally, the frequencies ω_m, which include both positive and negative frequencies, can be arbitrary and do not have to be equally spaced. Note that the predictor coefficients are denoted {d_k} to distinguish them from the standard LP coefficients, which do not have G incorporated. The minimization of E_{LP} with respect to the {d_k} results in the well-known set of linear equations 2.47 and 2.48. In these equations R_k is the autocorrelation of the discrete signal spectrum

P(ω_m). When minimizing E_{LP}, we are matching the autocorrelation of the continuous LP envelope P̂(ω) to the autocorrelation of the given discrete spectrum:

\hat{R}_{LP_i} = R_i, \quad 0 \le i \le p,    (2.69)

where

\hat{R}_{LP_i} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat{P}(\omega) e^{j\omega i} \, d\omega.    (2.70)


Figure 2.22: The LP envelopes (p = 10) computed for different numbers of discrete frequencies of the signal spectrum. Thick solid line: LP envelope computed for 1024 frequency points of the original spectrum. Thin solid lines: LP envelopes for 640, 512, 256, 128 and 64 frequency points. Dashed line: LP envelope for 32 frequency points.
A typical behavior of LP spectral analysis of a discrete voiced spectrum is shown in Fig. 2.22. The spectral envelope of the “continuous” signal spectrum (here not a warped signal spectrum), shown as the thick solid line, does not match the LP envelopes of the discrete spectra (which differ in the number of discrete frequency points). It has been shown that for discrete spectra the LP error measure given in Eq. 2.68 contains an error cancellation property.
Let us define R_org to be the autocorrelation corresponding to the original all-pole filter with spectrum P(ω). Their relation is given by the inverse Fourier transform:

R_{org_i} = \frac{1}{2\pi} \int_{-\pi}^{\pi} P(\omega) e^{j\omega i} \, d\omega    (2.71)

and

P(\omega) = \sum_{l=-\infty}^{\infty} R_{org_l} e^{-j\omega l}.    (2.72)
The autocorrelation R corresponding to the discrete samples of the LP envelope is defined in general as:

R_i = \frac{1}{N} \sum_{m=0}^{N-1} P(\omega_m) e^{j\omega_m i},    (2.73)

and in the case of a symmetric shape of P(ω_m) is given in Eq. 2.9. By substituting Eq. 2.72 into 2.73 we obtain:

R_i = \frac{1}{N} \sum_{m=0}^{N-1} \sum_{l=-\infty}^{\infty} R_{org_l} e^{-j\omega_m (l-i)}.    (2.74)
This equation shows why LP analysis cannot recover the original envelope from the discrete spectral samples: aliasing occurs in the autocorrelation domain whenever a spectral envelope is sampled at a discrete set of frequencies. If we consider the periodic excitation case with the frequencies spaced equally at ω_m = 2π(m − 1)/N, then Eq. 2.74 reduces to:

R_i = \sum_{l=-\infty}^{\infty} R_{org_{(i-lN)}}, \quad \text{for all } i.    (2.75)
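This aliasing relation can be checked numerically; the minimal numpy sketch below (not from the report) samples a simple one-pole power spectrum at N = 30 frequencies and verifies that the resulting autocorrelation equals the densely sampled one aliased with period N. The pole at 0.9 and the grid sizes are arbitrary choices for the demonstration.

    import numpy as np

    Nfft, N = 3000, 30
    w = 2 * np.pi * np.arange(Nfft) / Nfft
    P = 1.0 / np.abs(1 - 0.9 * np.exp(-1j * w)) ** 2   # one-pole power spectrum
    R_org = np.fft.ifft(P).real                        # dense grid, ~ R_org (Eq. 2.71)
    Pm = P[::Nfft // N]                                # keep N equally spaced samples
    R = np.fft.ifft(Pm).real                           # Eq. 2.73
    R_alias = R_org.reshape(-1, N).sum(axis=0)         # sum_l R_org[i + lN] (Eq. 2.75)
    print(np.allclose(R, R_alias))                     # True: sampling aliases R_org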


Figure 2.23: The autocorrelation sequences and impulse response sequence of a 10th order LP model:
a) Autocorrelation R_{org_i} for lags 0 ≤ i ≤ 120.
b) Autocorrelation R_i, 0 ≤ i ≤ 120, for N = 30 spectral samples, corresponding to the discrete spectrum.
c) Autocorrelation R̂_{LP_i}, 0 ≤ i ≤ 120, corresponding to the LP envelope.
d) Discrete frequency sampled impulse response ĥ_i, −60 ≤ i ≤ 60.

Some examples of autocorrelation sequences are shown in Fig. 2.23. Note that R_i is obtained by aliasing R_{org_i}. According to Eq. 2.69, LP matches the autocorrelation of the continuous model spectrum to that of the given spectrum. Therefore:

\hat{R}_{LP_i} = R_i = \sum_{l=-\infty}^{\infty} R_{org_{(i-lN)}}, \quad 0 \le i \le p    (2.76)
\ne R_{org_i}, \quad 0 \le i \le p.

Since the autocorrelation corresponding to the LP envelope will always equal an aliased version of R_org (in the discrete spectrum case), the LP envelope will not equal the original envelope. LP analysis produces a unique all-pole model for a given set of autocorrelations, which means that the original all-pole model is not a possible solution of Eq. 2.47. The LP criterion given in Eq. 2.68 does not take into account the aliasing that has occurred in the discrete spectrum, and matches the autocorrelation of the continuous all-pole model to the autocorrelation of the given signal.
It is well known that LP analysis fails in the case of modeling high-pitched speech, where the spectral harmonics are widely separated (problems in formant tracking): the peaks of the LP spectral estimates are highly biased towards the pitch harmonics. It has been shown that these drawbacks of LP are inherent to its error criterion given in Eq. 2.68. This problem of high-pitched speech modeling is related to the autocorrelation aliasing. As the pitch increases, we have fewer and fewer

harmonics (spectral samples) and the autocorrelation aliasing becomes more and more severe, which
leads to worse LP models.
The previous examples show that LP is the wrong approach to envelope estimation for discrete spectra, since it does not account for the aliasing caused by spectral sampling. Several methods have been proposed to improve the LP estimate. One of them, called discrete all-pole modeling (DAP), is in general superior to standard LP analysis. DAP uses a different error measure, the Itakura-Saito (I-S) measure, which was originally defined for continuous spectra. After its adaptation to the discrete case we obtain:

E_{IS} = \frac{1}{N} \sum_{m=1}^{N} \Big[ \frac{P(\omega_m)}{\hat{P}(\omega_m)} - \ln \frac{P(\omega_m)}{\hat{P}(\omega_m)} - 1 \Big],    (2.77)

where P(ω_m) is the given discrete spectrum defined at N frequencies, and P̂(ω_m) is the all-pole model spectrum (Eq. 2.41) defined at the same frequencies. E_{IS} is always nonnegative and is equal to zero only when P(ω_m) = P̂(ω_m) for all ω_m (P(ω_m) = P̂(ω_m) gives a minimum for E_{IS}, but not necessarily for E_{LP}).
It is important to note that the continuous form of this error measure, used as part of a maximum likelihood approach to linear prediction, produces the same result as standard LP for continuous spectra (the optimal all-pole model is the same as the one produced by LP). Hence, by using the I-S error measure, we do not lose any of the advantages or performance of LP in unvoiced segments of speech, where LP behaves very well.
A spectral flatness interpretation of this discrete (I-S) error measure makes it a very reasonable choice for the problem of fitting an envelope to a set of discrete spectral values. Minimizing the error in Eq. 2.77 is equivalent to maximizing the spectral flatness of the error spectrum P(ω_m)/P̂(ω_m), where the spectral flatness is defined as the geometric mean of the spectral samples divided by their arithmetic mean. In our case it means that the optimal model is the one which makes the residual (error) spectrum as flat as possible.
If E_{IS} is small, the I-S error approximates the mean-squared distance between the log spectra:

E_{dB} = 6.142 \sqrt{E_{IS}}
       = \sqrt{ \frac{1}{N} \sum_{m=1}^{N} \big[ 10\log_{10} P(\omega_m) - 10\log_{10} \hat{P}(\omega_m) \big]^2 }, \quad \text{for small } E_{IS}.    (2.78)

2.2.7.0.1 I-S error minimization


In order to minimize the error measure in Eq. 2.77 with P̂(ω) expressed as:

\hat{P}(\omega) = \frac{1}{\sum_{k=0}^{p} g_k \cos \omega k},    (2.79)

we set ∂E_{IS}/∂g_i = 0 for i = 0, . . . , p. The {g_k} can be shown to be equal to:

g_0 = \sum_{k=0}^{p} d_k^2    (2.80)

g_i = 2 \sum_{k=0}^{p-i} d_k d_{k+i}, \quad 1 \le i \le p    (2.81)

(compare with Eq. 2.39 and 2.40), where {d_k} is a set of coefficients with d_0 ≠ 1 and with G incorporated in the coefficients. Note that g_0 is equal to the zero-lag autocorrelation. The result yields a set of standard correlation matching conditions, given by:

\hat{R}_i = R_i, \quad 0 \le i \le p,    (2.82)

where R_i is the autocorrelation corresponding to the given discrete spectrum (given in Eq. 2.73) and R̂_i is the autocorrelation corresponding to the all-pole model sampled at the same discrete frequencies as the given spectrum:

\hat{R}_i = \frac{1}{N} \sum_{m=0}^{N-1} \hat{P}(\omega_m) \cos(i\omega_m).    (2.83)
Eq. 2.82 looks similar to the usual LP autocorrelation matching condition (Eq. 2.69). However, the difference is that R̂_{LP} is the autocorrelation of the continuous all-pole spectrum P̂(ω), whereas here R̂ is the autocorrelation of a discrete sampling of the all-pole spectrum.
DAP requires matching the given aliased autocorrelation to the autocorrelation of the all-pole model aliased in the same manner (Eq. 2.82). It is this improved correlation matching condition, which incorporates the autocorrelation aliasing, that makes DAP more suitable than LP for modeling voiced frames and discrete spectra in general.

2.2.7.0.2 Parameters of the optimal all-pole model


In order to obtain the all-pole model we set ∂E_{IS}/∂d_i = 0, i = 0, . . . , p. This yields the following set of equations relating the predictor coefficients {d_k} to the autocorrelations of the given discrete spectrum and of the sampled all-pole model:

2 \sum_{k=0}^{p} d_k \big[ R_{i-k} - \hat{R}_{i-k} \big] = 0, \quad 0 \le i \le p.    (2.84)

This equation can be rewritten in matrix form:

R d = \hat{R} d,    (2.85)

where d is the column vector of predictor coefficients, and R and R̂ are symmetric Toeplitz matrices. The next step is therefore to solve a set of p + 1 nonlinear equations in p + 1 unknowns. The equations are nonlinear because R̂ is a function of d.
To simplify the solution of the nonlinear problem given in Eq. 2.84, we use the following property of sampled all-pole filters:

\sum_{k=0}^{p} d_k \hat{R}_{i-k} = \hat{h}_{-i}, \quad 0 \le i \le p,    (2.86)

where ĥ_{-i} is the (time-reversed) impulse response of the discrete frequency sampled all-pole model, given by:

\hat{h}_{-i} = \frac{1}{N} \sum_{m=1}^{N} \frac{e^{-j\omega_m i}}{D(\omega_m)},    (2.87)

where:

\hat{P}(\omega_m) = |\hat{H}(\omega_m)|^2 = \frac{1}{|D(\omega_m)|^2} = \frac{1}{\big| \sum_{k=0}^{p} d_k e^{-j\omega_m k} \big|^2}.    (2.88)
We now prove the property given in Eq. 2.86. We start with the identity:

\hat{H}(\omega_m) D(\omega_m) = 1.    (2.89)

Multiplying by Ĥ*(ω_m) and substituting Eq. 2.88, we get:

\hat{P}(\omega_m) D(\omega_m) = \hat{H}^*(\omega_m),    (2.90)

and by expanding D(ω_m):

\sum_{k=0}^{p} d_k \hat{P}(\omega_m) e^{-j\omega_m k} = \hat{H}^*(\omega_m).    (2.91)

Multiplying both sides by e^{jω_m i}, averaging over all ω_m and applying the definition of R̂_i from Eq. 2.83, we obtain:

\sum_{k=0}^{p} d_k \hat{R}_{i-k} = \frac{1}{N} \sum_{m=0}^{N-1} \hat{H}(\omega_m) e^{-j\omega_m i} = \hat{h}_{-i}, \quad \text{for all } i.    (2.92)

Substituting Eq. 2.86 into the minimization condition in Eq. 2.84, we obtain the following set of equations that relate the all-pole predictor coefficients to the given autocorrelation sequence:

\sum_{k=0}^{p} d_k R_{i-k} = \hat{h}_{-i}, \quad 0 \le i \le p.    (2.93)

In matrix form, Eq. 2.93 becomes:

R d = \hat{h},    (2.94)

where ĥ is a column vector with elements ĥ_{-i}, 0 ≤ i ≤ p.


In the case of a continuous spectrum we have ĥ_i = 0 for i < 0 (i.e. ĥ_{-i} = 0 for i > 0), and this set of equations reduces to that of regular linear prediction given in Eq. 2.47 and 2.48. However, LP assumes ĥ_i = 0 for i < 0 for both discrete and continuous spectra, whereas in the discrete spectrum case ĥ_i is in general nonzero for i < 0 (a property of the discrete Fourier series).
Since DAP coincides with standard LP in the continuous spectrum case, while LP does not reduce to DAP in the discrete spectrum case, LP is a special case of DAP modeling in which the number of spectral samples N → ∞.

2.2.7.0.3 Iteratively solved set of nonlinear equations


The goal of DAP is to find the parameters of an all-pole model that fits the given discrete spectrum. This means solving the set of equations in Eq. 2.93, which are nonlinear (ĥ_i depends on the values of the all-pole coefficients {d_k}). They can be solved using a straightforward algorithm that involves two steps repeated iteratively (a minimal code sketch follows the list):

• Given an estimate of the predictor, evaluate ĥ_{-i} using Eq. 2.87.

• Given the new estimate of ĥ_{-i}, solve the now “linear” set of Eqs. 2.93 for a new estimate of the predictor.
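The following numpy/scipy sketch (illustrative, not the actual experimental code) implements this two-step iteration under stated assumptions: the discrete spectrum P_m is given at N equally spaced frequencies covering the full circle, and the iteration is initialized from standard LP with the gain folded into the coefficients (sect. 2.2.2.4).

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def dap(P_m, p, n_iter=5):
        """Discrete all-pole modeling: iterate Eq. 2.87 and Eq. 2.93."""
        N = len(P_m)
        wm = 2 * np.pi * np.arange(N) / N
        # Autocorrelation of the given discrete spectrum (Eq. 2.73).
        R = np.array([np.mean(P_m * np.cos(i * wm)) for i in range(p + 1)])
        # Initialization: standard LP, with the gain folded into d so that
        # the model spectrum is 1/|D|^2 (sect. 2.2.2.4).
        a = np.concatenate(([1.0], solve_toeplitz(R[:p], -R[1:p + 1])))
        d = a / np.sqrt(np.dot(a, R))
        E = np.exp(-1j * np.outer(wm, np.arange(p + 1)))   # e^{-j w_m k}
        for _ in range(n_iter):
            Dw = E @ d                                     # D(w_m), Eq. 2.88
            h = (E.T @ (1.0 / Dw)).real / N                # h[i] = h_hat_{-i}, Eq. 2.87
            d = solve_toeplitz(R, h)                       # solve R d = h_hat, Eq. 2.94
        return d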

Experiments:

We have tried to incorporate DAP into the standard PLP-based analysis. First the warped power spectrum is computed (with application of the power law and a possible equal-loudness approximation). After the spectrum processing, the IDFT is applied and the autocorrelation sequence is obtained. From this autocorrelation sequence, the {d_k} coefficients are computed using standard LP analysis. A new set of {d_k} is then obtained iteratively with respect to each new estimate of ĥ_{-i}. This iterative approach is computationally expensive, so we have used the smallest possible number of iterations.

– PLP38: Application of discrete all-pole modeling. First, standard LP analysis is done. For this all-pole model, the impulse response is computed. Then, iteratively, the set of non-linear equations is solved (number of iterations is 2).
– PLP39: Similar to PLP38, number of iterations is 5.
– PLP40: Similar to PLP38, number of iterations is 1.

SDC-Accuracy [%]    Italian
test        hm      mm      wm
PLP38       38.4    84.58   93.26
PLP39       37.69   80.54   93.30

SDC-Accuracy [%]    Italian                 Finnish                 Spanish
test        hm      mm      wm      hm      mm      wm      hm      mm      wm
PLP40       39.27   85.06   93.20   37.56   72.3    90.26   41.2    71.63   88.39

Table 2.7: Word recognition results.

2.2.7.0.4 Potential approaches


As shown above, the all-pole spectrum resulting from standard (non-DAP) all-pole modeling does not yield the spectrum we desire. This is in general caused by the inequality between the autocorrelation of the periodic signal and the autocorrelation of the impulse response of an all-pole model, which from Eq. 2.16 should result in a different solution.
One proposed approach to computing an accurate all-pole model exploits the fact that the autocorrelation of an all-zero spectrum with p0 zeros is equal to zero for lags |k| > p0. The approach is then as follows:

• Compute the inverse of the discrete (line) spectrum:

P^{-1}(\omega_m) = \frac{1}{P(\omega_m)}.    (2.95)

• Compute the IDFT of P^{-1}(ω_m) using Eq. 2.73, which yields the autocorrelation coefficients.

• Compute the all-zero spectrum P^{-1}(ω) for a large number of frequencies.

• Invert the result: P(ω) = [P^{-1}(ω)]^{-1}.

However, the mentioned algorithm assumes that the original P(ω_m) is an all-pole spectrum given by a finite number of equally spaced points on it. Therefore, its application to more general cases can be expected to run into problems. On the other hand, if the initial discrete spectrum is all-pole, the LP analysis ensures that the solution will be unique (only one correct solution). The only restriction is that the number of harmonics in the spectrum must be at least equal to the number of poles. The previous algorithm can be applied in cases where, for example, a harmonic all-pole but noisy spectrum is given (e.g. as a result of quantization).
An interesting note is that if an Analysis-by-Synthesis (AbS) approach is applied to estimate an all-pole model, the AbS model spectrum will be identical to the desired discrete spectrum (which is not true for LP).

Chapter 3

Feature normalization

The variability in the acoustic signal is caused by several sources. There is a “useful” variability that is necessary to discriminate between different speech units (e.g. phonemes). However, there are also “harmful” sources of variability involved which are irrelevant for the speech recognition process, for example varying transmission channels, different speakers, different speaking styles or accents, channel noise, etc. We focus here on the variability caused by different types of noise. Hence we assume that the speech is corrupted by unknown additive and/or convolutional noise.

3.1 Additive noise


Let y[n] be a discrete noisy sequence:

y[n] = x[n] + b[n], (3.1)

where x[n] is the desired signal and b[n] is the unwanted background noise. For the next steps of the derivations we assume x[n] and b[n] to be wide-sense stationary, uncorrelated random processes; P_x(ω) and P_b(ω) are their respective power spectral density functions. Since x[n] and b[n] are uncorrelated, we can write:

P_y(\omega) = P_x(\omega) + P_b(\omega).    (3.2)
In speech processing we apply Short-Time Fourier Transform (STFT) analysis, so that we work with short segments given by:

y_{pL}[n] = w[pL - n](x[n] + b[n]),    (3.3)

where L is the frame length and p is an integer index; in the frequency domain this is expressed as:

Y(pL, \omega) = X(pL, \omega) + B(pL, \omega).    (3.4)

Y(pL, ω), X(pL, ω) and B(pL, ω) are the STFTs of y[n], x[n] and b[n], respectively, computed at the frame interval L. The STFT magnitude squared of y[n] is given by:

|Y(pL, \omega)|^2 = |X(pL, \omega)|^2 + |B(pL, \omega)|^2 + X^*(pL, \omega)B(pL, \omega) + X(pL, \omega)B^*(pL, \omega).    (3.5)

Obviously, the goal is to obtain an estimate of |X(pL, ω)|^2.

3.2 Convolutional distortion


Here we assume that a desired signal x[n] is passed through a linear time-invariant distortion g[n], resulting in the sequence:

y[n] = x[n] ∗ g[n].    (3.6)

We want to recover x[n] from y[n] without any a priori knowledge of g[n] (sometimes referred to as blind deconvolution). In short-time analysis we assume that the window w[n] is long and smooth relative to the distortion g[n], so that a short-time segment of y[n] can be written as:

y_{pL}[m] = w[pL - m](x[m] ∗ g[m])
          ≈ (x[m]w[pL - m]) ∗ g[m].    (3.7)

The STFT of the degraded signal is:

Y(pL, \omega) = \sum_{m=-\infty}^{\infty} w[pL - m](x[m] ∗ g[m]) e^{-j\omega m}
             ≈ \sum_{m=-\infty}^{\infty} \big( (w[pL - m]x[m]) ∗ g[m] \big) e^{-j\omega m}    (3.8)
             = X(pL, \omega) G(\omega),

where G(ω) is the Fourier transform of the distortion.

3.3 Spectral subtraction


Spectral subtraction is a method for recovering a desired sequence x[n] from the noisy sequence y[n] in Eq. 3.1. The assumption is that we are given an estimate of the noise power spectrum, denoted by P̂_b(ω), that is usually obtained by averaging over multiple frames of a known noise segment. We also assume that x[n] and b[n] are uncorrelated. Then an estimate of the desired short-time squared spectral magnitude is suggested by Eq. 3.2:

|\hat{X}(pL, \omega)|^2 = |Y(pL, \omega)|^2 - \hat{P}_b(\omega), \quad \text{if } |Y(pL, \omega)|^2 - \hat{P}_b(\omega) \ge 0    (3.9)
                        = 0, \quad \text{otherwise}.
Spectral subtraction can be viewed as a filtering operation where high signal-to-noise ratio (SNR) regions of the measured spectrum are attenuated less than low SNR regions. Such a formulation can be given in terms of an “instantaneous” SNR defined as:

R(pL, \omega) = \frac{|X(pL, \omega)|^2}{\hat{P}_b(\omega)},    (3.10)

resulting in the spectral magnitude estimate:

|\hat{X}(pL, \omega)|^2 = |Y(pL, \omega)|^2 - \hat{P}_b(\omega)
                        = |Y(pL, \omega)|^2 \Big[ 1 - \frac{\hat{P}_b(\omega)}{|Y(pL, \omega)|^2} \Big]    (3.11)
                        ≈ |Y(pL, \omega)|^2 \Big[ 1 + \frac{1}{R(pL, \omega)} \Big]^{-1},

where the approximation |Y(pL, ω)|² ≈ |X(pL, ω)|² + P̂_b(ω) has been used. The approximate time-varying suppression filter (applied to the STFT measurement) is:

H_s(pL, \omega) = \Big[ 1 + \frac{1}{R(pL, \omega)} \Big]^{-1}.    (3.12)
We again note that the filter attenuates low SNR signals more than high SNR signals. Another important property of noise suppression by spectral subtraction, as well as of other STFT-based suppression techniques, is that the attenuation characteristics change with the length of the analysis window. A limitation of spectral subtraction is the artifact of “musicality” that results from the rapid coming and going of sine waves over successive frames. Many smoothing techniques have been proposed to reduce such annoying fluctuations.
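As an illustration, a minimal numpy sketch of Eq. 3.9 follows (not from the report); Y2 and noise_frames are hypothetical arrays of squared STFT magnitudes, one row per frame.

    import numpy as np

    def spectral_subtraction(Y2, Pb_hat):
        """Eq. 3.9: subtract the noise power estimate, half-wave rectified."""
        return np.maximum(Y2 - Pb_hat, 0.0)

    # Noise power estimate: average |Y|^2 over frames of a known noise segment.
    # Pb_hat = noise_frames.mean(axis=0)
    # X2_hat = spectral_subtraction(Y2, Pb_hat)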

3.4 Spectral Mean Subtraction
Now we consider the problem of recovering a desired signal x[n] from the convolution y[n] = x[n] ∗ g[n]. With respect to Eq. 3.8, we apply the nonlinear logarithm operator to the STFT of y[n] to obtain:

\log\big[Y(pL, \omega)\big] ≈ \log\big[X(pL, \omega)\big] + \log\big[G(\omega)\big].    (3.13)

The distortion g[n] is time-invariant; therefore the STFT views log[G(ω)] at each frequency as fixed along the time index variable p. If we assume that the mean of the speech component log[X(pL, ω)] is zero in the time dimension, then the convolutional distortion g[n] can be removed while the speech contribution is kept unaffected. This can be achieved in the quefrency domain by computing cepstra along each STFT time trajectory:

\hat{y}[n, \omega] ≈ F_p^{-1}\big( \log[X(pL, \omega)] \big) + F_p^{-1}\big( \log[G(\omega)] \big)
                   = \hat{x}[n, \omega] + \hat{g}[n, \omega]    (3.14)
                   = \hat{x}[n, \omega] + \hat{g}[0, \omega]\delta[n],
where F_p^{-1} denotes the inverse Fourier transform of sequences along the time dimension p. Applying a cepstral lifter, we get:

\hat{x}[n, \omega] ≈ l[n]\hat{y}[n, \omega],    (3.15)

where l[n] = 0 at n = 0 and unity elsewhere. This method is called cepstral mean subtraction, because the 0th value of the cepstrum equals the mean of log[Y(pL, ω)] for each ω (along the time dimension). Despite the fact that this technique is limited by the strictness of the zero-mean speech contribution assumption, it has significant advantages in feature extraction for speech recognition. Since the mean is in practice computed over a finite number of frames, we can think of cepstral mean subtraction as a high-pass, non-causal filtering operation.
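In practice the liftering of Eq. 3.15 amounts to subtracting the temporal mean of each cepstral trajectory. A minimal sketch, assuming C is a hypothetical (frames × coefficients) matrix of cepstral features:

    import numpy as np

    def cepstral_mean_subtraction(C):
        """Subtract the per-coefficient temporal mean; this removes the
        time-invariant log[G(w)] term of Eq. 3.13."""
        return C - C.mean(axis=0, keepdims=True)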

3.5 Wiener filtering


Wiener filtering is an alternative to spectral subtraction for recovering a desired sequence x[n] corrupted by additive noise b[n]. The Wiener filter is based on the least-squares principle, i.e. it finds the filter which minimizes the error between the actual output and the desired output of the filter, as shown schematically in Fig. 3.1.
[Diagram: x[n] + b[n] → Filter h[n] → x̂[n]; the error (x̂[n] − x[n])² is squared and minimized against the reference x[n].]
Figure 3.1: The Wiener least squares filter.

The goal is to find a linear filter h[n] such that the sequence x̂[n] = y[n] ∗ h[n] minimizes the expected value of (x̂[n] − x[n])². Recalling that the signals x[n] and b[n] are uncorrelated and stationary, the frequency-domain solution to this stochastic optimization problem is given by:

H_s(\omega) = \frac{P_x(\omega)}{P_x(\omega) + P_b(\omega)},    (3.16)

which is referred to as the Wiener filter. When x[n] and b[n] meet the conditions under which the Wiener filter is derived (mutually uncorrelated and stationary), the filter provides noise suppression without considerable distortion in the signal (object) estimate and background residual. The required power spectra P_x(ω) and P_b(ω) can be estimated by averaging over multiple frames when sample functions of x[n] and b[n] are provided. Typically, however, the desired signal and background are non-stationary in the sense that their power spectra change over time, i.e. they can be expressed as time-varying functions P_x(n, ω) and P_b(n, ω). Therefore, ideally, each frame of the STFT is processed by a different Wiener filter. For the simplifying case of a stationary background, we can express the time-varying Wiener filter as:

H_s(pL, \omega) = \frac{\hat{P}_x(pL, \omega)}{\hat{P}_x(pL, \omega) + \hat{P}_b(\omega)},    (3.17)

where P̂_x(pL, ω) is an estimate of the time-varying power spectrum of x[n], P_x(n, ω), on each frame, and P̂_b(ω) is an estimate of the power spectrum of the stationary background. The time-varying Wiener filter can again be written as:

H_s(pL, \omega) = \Big[ 1 + \frac{1}{R(pL, \omega)} \Big]^{-1},    (3.18)
with the signal-to-noise ratio R(pL, ω) = P̂_x(pL, ω)/P̂_b(ω). When the suppression curves for spectral subtraction and Wiener filtering are computed, it can be shown that the attenuation of low SNR regions relative to high SNR regions is stronger for the Wiener filter. A second important difference from spectral subtraction is that the Wiener filter does not invoke an absolute thresholding.
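A minimal per-frame sketch of Eq. 3.17 (illustrative, not the report's code); a crude P̂_x can itself come from spectral subtraction of |Y|².

    import numpy as np

    def wiener_gain(Px_hat, Pb_hat, floor=1e-12):
        """Eq. 3.17: per-frame Wiener suppression filter H_s(pL, w)."""
        return Px_hat / np.maximum(Px_hat + Pb_hat, floor)

    # Example use on complex STFT frames Y with Y2 = |Y|^2:
    # Px_hat = np.maximum(Y2 - Pb_hat, 0.0)     # crude signal power estimate
    # X_hat = wiener_gain(Px_hat, Pb_hat) * Y   # suppressed STFT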

3.6 Mean and variance normalization for robust speech recognition


High recognition performance is related to a good match between training and testing conditions. Typically, however, the additive and convolutive noises differ between the training and testing data sets, which causes a huge performance degradation. Mean and variance normalization (MVN) is one of the feature adaptation techniques that try to adapt the trained models to the testing conditions (i.e. attempt to make the training and testing set distributions similar).
In our experiments we were dealing with on-line mean and variance normalization, which estimates a local mean and variance of the data. The assumption for our MVN is that the speech is corrupted by unknown additive as well as convolutive noise:
P_y(\omega) = P_x(\omega)P_g(\omega) + P_b(\omega),    (3.19)

where P_y(ω) is the warped power spectrum (critical band energy) of the degraded speech, P_g(ω) is the power spectrum of the channel, P_x(ω) is the warped power spectrum of the desired signal and P_b(ω) is the warped power spectrum of the noise. Applying the nonlinear logarithm:

P_y(\omega) = P_x(\omega)P_g(\omega)\Big( 1 + \frac{P_b(\omega)}{P_x(\omega)P_g(\omega)} \Big)    (3.20)

\log(P_y(\omega)) = \log(P_x(\omega)) + \log(P_g(\omega)) + \log\Big( 1 + \frac{P_b(\omega)}{P_x(\omega)P_g(\omega)} \Big).    (3.21)
A steady or slowly varying additive component log(P_g(ω)) is introduced by the frequency response of the channel. In the speech parts, the value of log(P_x(ω)) dominates in log(P_y(ω)), while in the silent parts (low-energy regions) the value of log(P_g(ω)) dominates. The variance of the cepstral trajectories is thereby reduced. The smooth frequency response of the channel affects, most of the time, the lower cepstral coefficients by shifting their mean.
By applying cepstral mean subtraction alone we are not able to compensate the reduced variance of the cepstral trajectories. MVN should preferably be applied to uncorrelated data (cepstral coefficients), otherwise it can cause a degradation of performance.

In feature extraction it is required to do MVN on-line (we cannot afford to compute the mean and variance from one long sentence). Therefore, the estimates of the mean and variance have to be time-dependent. Let Ψ̄_t be the normalized cepstral features, where t denotes time. Then the mean m̄_t and variance σ̄²_t are computed as:

\bar{m}_t = \bar{m}_{t-1} + \alpha \big( \bar{C}_{t-1} - \bar{m}_{t-1} \big)    (3.22)

\bar{\sigma}^2_t = \bar{\sigma}^2_{t-1} + \alpha \big( (\bar{C}_{t-1} - \bar{m}_{t-1})^2 - \bar{\sigma}^2_{t-1} \big),    (3.23)

where α is the updating factor that needs to be chosen appropriately; in our experiments α = 0.02 has been found to work well. The re-estimation of the normalized cepstral features is done as follows:

\bar{\Psi}_t = \frac{\bar{C}_t - \bar{m}_t}{\bar{\sigma}_t + ø}.    (3.24)
The additive parameter ø is set to 1 and prevents the denominator in Eq. 3.24 from becoming very small, which may happen in long silent regions, where the variance estimate tends to 0. In such regions the normalized cepstral features would contain only noise, which would be amplified, and as a result the recognition performance would significantly degrade.
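A minimal sketch of the recursions in Eqs. 3.22–3.24, assuming C is a hypothetical (frames × coefficients) cepstral matrix and m0/v0 are the global mean and variance used for initialization (their training is discussed in the experiments below):

    import numpy as np

    def online_mvn(C, m0, v0, alpha=0.02, offset=1.0):
        """On-line MVN: recursive mean/variance updates (Eqs. 3.22, 3.23)
        and normalization with an additive offset (Eq. 3.24)."""
        m = np.asarray(m0, dtype=float).copy()
        v = np.asarray(v0, dtype=float).copy()
        out = np.empty_like(C, dtype=float)
        for t in range(len(C)):
            if t > 0:
                d = C[t - 1] - m           # uses C_{t-1} and m_{t-1}
                v += alpha * (d * d - v)   # Eq. 3.23
                m += alpha * d             # Eq. 3.22
            out[t] = (C[t] - m) / (np.sqrt(v) + offset)  # Eq. 3.24
        return out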

Experiments:
We have done many experiments related to mean and variance normalization applied either on MFCC or on PLP coefficients. The first important issue that highly affects speech recognition performance is the global mean and variance, which have to be computed prior to the application of MVN.
We have tried to compute the global mean and variance over the whole training data set:

– MFCC ONLN1: For TI-digits (noisy training) and the Italian DB, the global mean and variance were trained on both training data sets together. For the SP and FIN DBs, the global mean and variance were trained on each training part of the MFCC features separately. The MFCC analysis is taken from the MFCC experiment.
– PLP7 ONLN1: The global mean and variance were trained similarly as in MFCC ONLN1 (obviously on PLP7). MVN was applied on the PLP-cepstrum (similar to experiment PLP7).

SDC-Accuracy [%]    Italian                 Finnish                 Spanish
test          hm      mm      wm      hm      mm      wm      hm      mm      wm
MFCC ONLN1    73.83   74.35   92.00   59.12   59.37   88.91   73.68   86.57   92.01
PLP7 ONLN1    56.9    84.26   94.21   62.61   86.18   93.73   74.02   81.14   89.76
MFCC MEAN     52.34   82.78   93.20   53.18   86.53   92.76   69.47   78.30   89.08
PLP7 MEAN     51.21   82.22   93.62   52.58   85.84   93.10   69.35   77.59   89.49

Aurora 2-WER [%]    noisy train             clean train
test          “A”     “B”     “C”     “A”     “B”     “C”
MFCC ONLN1    10.38   9.63    10.55   18.58   18.74   17.34
PLP7 ONLN1    11.29   10.41   11.10   21.83   19.62   22.64
MFCC MEAN     12.65   11.7    12.82   25.46   22.64   26.0
PLP7 MEAN     11.90   11.16   12.04   26.41   23.47   27.98

Table 3.1: Word recognition results.

Baseline MVN experiments show that even though PLP7 gives better recognition performance overall, after the application of on-line normalization the improvement is not as significant as in the case of MFCC. Therefore, we have tried to employ mean normalization and variance normalization separately. In the case of mean normalization only, Eq. 3.24 changes to:

\bar{\Psi}_t = \bar{C}_t - \bar{m}_t.    (3.25)

In the case of variance normalization only:

\bar{\Psi}_t = \frac{\bar{C}_t}{\bar{\sigma}_t + ø}.    (3.26)

– MFCC MEAN: Only mean normalization, given by Eq. 3.25. The rest is similar to MFCC ONLN1.
– PLP7 MEAN: Only mean normalization, given by Eq. 3.25. The rest is similar to PLP7 ONLN1.

MVN has also been applied on top of PLP6 nocoder and on top of subfea 10 frame orig noonline nocoder (see sect. 2.2.2.11):

– PLP6 onln nocoder new2: PLP analysis similar to PLP6 nocoder. MVN applied (global mean and variance trained on the IT DB and TIdigits (noisy) training data sets of PLP6 nocoder; features at the end of the front-end, no voice activity detection).
– subfea 10 frame orig nocoder new: MFCC analysis similar to subfea 10 frame orig noonline nocoder. MVN applied (global mean and variance trained on the IT DB and TIdigits (noisy) training data sets of subfea 10 frame orig noonline nocoder; features at the end of the front-end, no voice activity detection).

SDC-Accuracy [%]    Italian
test                                     hm      mm      wm
subfea 10 frame orig nocoder new         80.66   92.81   96.94
PLP6 onln nocoder new2                   79.03   92.09   96.94
subfea 10 frame orig nocoder new4        82.89   92.57   96.75
PLP6 onln nocoder new5                   83.57   91.23   97.0

Aurora 2-WER [%]    noisy train             clean train
test                                     “A”     “B”     “C”     “A”     “B”     “C”
subfea 10 frame orig nocoder new         9.07    9.44    10.22   16.75   16.09   17.5
PLP6 onln nocoder new2                   9.49    9.38    10.66   18.77   17.19   19.3
subfea 10 frame orig nocoder new4        9.15    9.52    10.31   16.9    16.22   17.83
PLP6 onln nocoder new5                   9.05    9.01    10.16   17.57   16.05   18.13

Table 3.2: Word recognition results.

The results of the MFCC experiments and the PLP analysis experiments show that in general PLP analysis gives better recognition performance. However, when MVN is then used, the improvement is not as significant in the PLP experiments.
When only mean normalization is used (MFCC MEAN and PLP7 MEAN), its behavior in MFCC and PLP analysis is similar. The difference comes from variance normalization, which works better in the MFCC experiments. The all-pole model employed in PLP analysis smooths the power spectrum of the signal in general; therefore, the global variance is typically smaller than in MFCC analysis. In our experiments we have been dealing with different values of the global variance. The results show that in the case of PLP analysis, noisy data should be taken for the training of the global variance, because of its smaller value. Then the recognition improvement with the application of MVN is similar to that achieved in MFCC analysis:

– PLP6 onln nocoder new5: PLP processing similar to PLP6 nocoder. The global mean and variance are trained on front-end features (so no VAD) with noise suppression switched off (noisy data used). The TIdigits (noisy) and IT training DBs were used. In feature extraction, noise suppression was obviously switched on.
– subfea 10 frame orig nocoder new4: MFCC analysis similar to subfea 10 frame orig noonline nocoder. The global mean and variance were trained on clean data of front-end features (noise suppression switched on). The TIdigits (clean+noisy) and IT training DBs were used. It seems better to use clean training data for the training of the global mean and variance in the MFCC experiments.

[Block diagram: speech → framing → windowing → power spectrum → DC-offset removal → noise suppression (with noise estimation) → Mel warping. MFCC analysis branch: log → DCT → LDA. PLP analysis branch: spectrum processing → power law → equal loudness → exp → IDFT → Levinson-Durbin → PLP-cepstrum. Followed by VAD, down-sampling, feature compression, bit-stream framing and error protection towards the channel (coder).]
Figure 3.2: Terminal feature extraction block diagram with either MFCC analysis or PLP analysis
employed (feature extraction only for 0 − 4kHz).

3.6.1 PLP-LPC, PLP-LSF, PLP-Refl, PLP-LAR with application of MVN


Experiments with feature representations other than the cepstrum in PLP analysis have also been performed. The first experiments were done with PLP-LSF. The immediate conclusion for these features was that, as long as we do not want to touch the speech recognizer (tune the thresholds), we have to apply MVN on top of the PLP-LSF features; otherwise problems with system pruning appear. Since MVN is required, a decorrelation transformation (PCA) has to be applied, because, as mentioned in section 3.6, MVN should be used on decorrelated features.
On the other hand, the problems with system pruning did not appear with the other types of features.
Experiments:
The experiments were run on all SDC databases:

– Italian database:
∗ LSF1: LSF coefficients computed from the all-pole model (the all-pole model is obtained in exactly the same way as in the PLP7 experiment).
∗ LSF1 PCA: On top of LSF1, a PCA transformation was applied (covariance matrix computed on the TIdigits noisy training DB).
∗ LSF1 ONLN: On top of LSF1 PCA, MVN was applied (global mean and variance trained on the IT training DB).
∗ REFL1: Reflection coefficients computed from the all-pole model (obtained in exactly the same way as in the PLP7 experiment).
∗ REFL1 PCA: Application of PCA on top of REFL1 (covariance matrix computed from the IT training DB).
∗ REFL1 ONLN: Application of MVN on top of REFL1 PCA (global mean and variance trained on the IT training DB).
∗ LAR1: Log-area-ratio coefficients computed from the all-pole model (obtained in exactly the same way as in the PLP7 experiment).

[Block diagram (decoder side): channel → bitstream decoding → feature decompression → up-sampling → lowpass filter → IDCT/DCT; frame dropping driven by the silence probability (threshold, median filter, up-sampling); delta and MVN computation before the recognizer.]
Figure 3.3: Server feature extraction block diagram.

∗ LAR1 PCA: Application of PCA on top of LAR1 (covariance matrix computed from the IT training DB).
∗ LAR1 ONLN: Application of MVN on top of LAR1 PCA (global mean and variance trained on the IT training DB).
∗ PLP16: Linear predictive coefficients (all-pole model) obtained in exactly the same way as in the PLP7 experiment.
∗ PLP16 PCA:
∗ PLP16 ONLN:
– Spanish and Finnish database:
∗ LSF1: LSF coefficients computed from the all-pole model (obtained in exactly the same way as in the PLP7 experiment), Fstart = 64Hz.
∗ LSF1 PCA: On top of LSF1, a PCA transformation was applied (covariance matrix computed on the SP training DB and the FIN training DB, separately).
∗ LSF1 ONLN: On top of LSF1 PCA, MVN was applied (global mean and variance trained on the SP training DB and the FIN training DB, separately).
∗ REFL1: Reflection coefficients computed from the all-pole model (obtained in exactly the same way as in the PLP7 experiment), Fstart = 0Hz.
∗ REFL1 PCA: Application of PCA on top of REFL1 (covariance matrix computed on the SP training DB and the FIN training DB, separately).
∗ REFL1 ONLN: Application of MVN on top of REFL1 PCA (global mean and variance trained on the SP training DB and the FIN training DB, separately).
∗ LAR1: Log-area-ratio coefficients computed from the all-pole model (obtained in exactly the same way as in the PLP7 experiment), Fstart = 64Hz.
∗ LAR1 PCA: Application of PCA on top of LAR1 (covariance matrix computed on the SP training DB and the FIN training DB, separately).
∗ LAR1 ONLN: Application of MVN on top of LAR1 PCA (global mean and variance trained on the SP training DB and the FIN training DB, separately).
∗ PLP16:

3.7 Correlation index


One of the important properties of features is their correlation. The reason why cepstral coefficients are so popular is that they are largely uncorrelated. However, we would like to quantify the correlation of several types of features. The correlation between features can be quantified as follows. Let C be the matrix of correlation coefficients between feature i and feature j. Then:

C_{i,j} = \frac{\Sigma_{i,j}}{\sqrt{\Sigma_{i,i}\Sigma_{j,j}}},    (3.27)

SDC-Accuracy [%]    Italian                 Finnish                 Spanish
test          hm      mm      wm      hm      mm      wm      hm      mm      wm
LSF1 ONLN     64.72   86.9    94.09   61.31   86.39   94.8    70.68   82.21   90.83
REFL1         32.81   82.62   92.48   37.56   58.48   90.15   29.32   58.33   85.38
REFL1 PCA     39.95   82.18   93.95   32.9    51.37   85.09   29.05   68.15   82.1
REFL1 ONLN    56.88   69.12   89.16   56.25   84.13   93.32   62.83   77.06   88.52
LAR1          32.73   83.10   92.85   40.85   55.2    91.41   36.12   69.87   85.08
LAR1 PCA      43.75   83.62   93.41   48.8    61.76   89.88   32.09   66.67   83.24
LAR1 ONLN     61.18   86.78   93.64   61.52   85.84   94.48   65.95   80.45   88.88
PLP16         29.92   76.47   91.15   34.95   55.4    88.25   30.26   66.1    84.86
PLP16 PCA     35.20   74.83   90.50   40.60   57.25   90.18   31.07   64.49   82.10
PLP16 ONLN    60.29   82.50   93.26   59.82   83.65   93.78   65.35   78.98   89.75

Table 3.3: Word recognition results.

where Σ is the feature covariance matrix. The norm of C (called the correlation index) is then a measure of the correlation between the features. If ||C|| = 1, the features are perfectly uncorrelated; as ||C|| increases, the features are more correlated. Low correlation between features results in a better (more unique) description of the speech characteristics. Moreover, the covariance matrix of the features can then be approximated by a diagonal matrix, which means a substantial reduction in the number of model parameters.
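A minimal sketch of Eq. 3.27 follows (not the report's code; since the report does not specify which matrix norm is used, the Frobenius norm below is only an illustrative assumption):

    import numpy as np

    def correlation_index(F):
        """F: (frames x features) matrix. Builds the correlation matrix C
        of Eq. 3.27 and returns a matrix norm of it."""
        Sigma = np.cov(F, rowvar=False)    # feature covariance matrix
        d = np.sqrt(np.diag(Sigma))
        C = Sigma / np.outer(d, d)         # C_ij = Sigma_ij / sqrt(S_ii S_jj)
        return np.linalg.norm(C)           # Frobenius norm (assumed)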
Figs. 3.4 and 3.5 show ||C|| for the whole training data set of the Italian database (2951 files) and for several different types of features. In order to evaluate the correlation index of the whole data set, we have chosen three possibilities (see Tab. 3.4):

• C1 - mean value of the correlation index over all sentences (mean value in the histogram).

• C2 - mean value of the correlation index over sentences with correlation index in the range 1 − 50.

• C3 - the center value of the highest histogram bin (the mode).


Figure 3.4: Histogram of correlation index of sentences of IT training DB for experiments: a) MFCC,
b) MFCC ONLN1, c) PLP7, d) PLP7 ONLN1.


Figure 3.5: Histogram of correlation index of sentences of IT training DB for experiments: a) LSF1,
b) LSF1 PCA, c) REFL1, d) REFL1 PCA.

3.8 Compensation of the noise using linear and non-linear transformations
Since one of the side effects of additive noise is a shift of the mean of the probability distributions of the features, mean normalization (cepstral mean normalization) partially compensates the mismatch caused by the noise. Combining mean normalization with variance normalization (MVN) improves the compensation of the mismatch with respect to mean normalization alone, so the robustness of the features to noise is higher. However, these methods have the limitation that they cannot compensate the non-linear effects caused by the noise. One can argue that on-line MVN, where the mean and variance are updated every speech frame, is able to compensate the non-linear effect of the noise, but the updating factor α in Eq. 3.23 is very small.
In order to compensate the non-linear effects, the histogram equalization technique can be used. This technique is commonly applied in image processing. The goal is to provide a transformation that converts the probability distribution of the noisy speech into a reference probability distribution corresponding to the clean speech. In other words, the probability distribution of the testing data is converted into the distribution corresponding to the training data.

3.9 Histogram based normalization of the features


Histogram equalization (HEQ) is a novel feature mapping approach that is robust to channel mismatch, additive noise and, in part, to channel distortion. The acoustic signal contains a lot of variability, which is on one hand necessary to discriminate between different speech units (e.g. phonemes). On the other hand, there are also variations in the speech signal that are irrelevant to the speech recognition task (different speakers, styles, accents, different communication channels). In the general view this can be regarded as a mismatch between training and testing conditions that causes a degradation of the performance.
In order to minimize the mismatch between the training and testing conditions, either the acoustic vectors or the acoustic models may be transformed. In the case of transforming the acoustic vectors, the operation is called normalization. There are different ways to classify normalization algorithms: according to whether the normalization scheme is derived from a physical model, we can distinguish between model-based and data-distribution-based normalization.

Corr. index [-] C1 C2 C3
MFCC 25.75 17.55 9.97
MFCC ONLN1 17.16 14.05 7.47
PLP7 30.25 20.96 9.98
PLP7 ONLN1 23.72 18.82 7.49
LSF1 262.7 31.44 29.85
LSF1 PCA 17.06 16.19 9.97
LSF1 ONLN 12.39 12.2 7.43
REFL1 40.16 20.77 12.49
REFL1 PCA 17.97 17.14 9.98
REFL1 ONLN 13.71 13.45 7.48
LAR1 40.24 20.77 9.99
LAR1 PCA 18.11 17.25 9.97
LAR1 ONLN 13.30 13.06 7.40

Table 3.4: Evaluation of ||C|| over the whole Italian training data set.

In the first case the normalization is based on some model for speech production, transmission,
or perception. A small number of model parameters are estimated on the test data, and used, with
respect to the given model, to normalize the acoustic vectors. Channel and environment normalization
techniques (e.g. cepstral mean normalization) as well as noise suppression techniques that rely on an
accurate SNR estimation belong to this category.
In distribution-based normalization the acoustic vectors are transformed into a domain better suited to speech recognition. The transformation parameters are obtained from the distributions of the training and testing data. The goal of this approach is to transform the test vectors such that their distribution matches the distribution of the training data.
The simplifying assumption behind the application of histogram normalization is that each feature space dimension can be normalized independently of the others. Under this assumption, histogram equalization can account for any type of non-linear distortion of each feature space dimension (scaling, shifting), but it cannot rotate the feature space.

Experiments:

3.9.1 Training data as the target histogram


In our experiments we have attempted several approaches to histogram equalization. The most straightforward approach is to compute (for each dimension independently) the distribution p(x) (histogram) of the training data. Because the testing data are unknown, we can work, for example, with a small amount of testing data (one unseen sentence) and compute its distribution p(x̃). For both distributions we can derive the cumulative histograms P(y) = \int_{-\infty}^{y} p(x) dx and P̃(ỹ) = \int_{-\infty}^{\tilde{y}} p(x̃) dx̃. Then the test data distribution is transformed to the training data distribution: each test value ỹ_t is replaced by the value y_t that corresponds to the same point in the cumulative training data distribution, so that P̃(ỹ_t) = P(y_t). On one hand, such an operation is effortless in terms of computational complexity, because it can be implemented by a simple lookup table. On the other hand, we have to compute the transfer lookup table for each testing sentence, which highly degrades the efficiency of the approach. A minimal sketch of the mapping is given below; the experiments follow it.
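The following numpy sketch (illustrative, not the experimental code) applies the quantile mapping for one feature dimension; it uses empirical quantiles instead of an explicit Nb-bin lookup table, which is equivalent in the limit of many bins.

    import numpy as np

    def heq_map(test_vals, train_vals):
        """Replace each test value with the training value at the same
        cumulative probability, so that P~(y~_t) = P(y_t)."""
        ranks = np.argsort(np.argsort(test_vals))   # rank of each test value
        cdf = (ranks + 0.5) / len(test_vals)        # empirical test CDF
        return np.quantile(train_vals, cdf)         # invert the training CDF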

– MFCC HIST1: HEQ-based normalization, where each speech sentence is equalized independently. The normalized data were MFCCs. As the target histogram for IT-SDC: the Italian training DB. Number of bins for the histogram computation: Nb = 300. For SP-SDC the Spanish training DB was used as the target histogram, and similarly for FIN-SDC.


Figure 3.6: Mismatch between the training and testing data sets for the Italian DB (100 sentences). Histograms are plotted for MFCCs c0 − c4 (dot-dashed line: test data, solid line: train data).

– MFCC HIST3: Similar to MFCC HIST1, Nb = 500.

– PLP HIST9: HEQ similar to MFCC HIST1; instead of MFCCs, PLP cepstral coefficients (PLP7) were used as the normalized features. The target distributions are the same as in MFCC HIST1 for IT-SDC, SP-SDC and FIN-SDC.

SDC-Accuracy [%]    Italian                 Finnish                 Spanish
test          hm      mm      wm      hm      mm      wm      hm      mm      wm
MFCC HIST1    67.27   70.32   91.24   69.08   70.79   82.63   67.34   75.17   85.66
PLP7 HIST9    64.12   66.36   91.49   70.74   70.11   82.5    69.71   75.65   85.92

SDC-Accuracy [%]    Italian
test          hm      mm      wm
MFCC HIST3    65.64   70.4    91.25

Table 3.5: Word recognition results.

Conclusion:
In our experiments we took into account many parameters appearing in the HEQ algorithm:

– The number of bins Nb should not be less than 300; however, increasing it further does not bring better performance (this is caused by the insufficient amount of data, since only one sentence is used for the computation of the source distribution).
– HEQ behaved best when applied to untransformed data (MFCCs, PLP). The performance rapidly decreased in experiments where the MFCCs were first processed by cepstral mean subtraction.
– Contrary to some previous experiments, the performance was highest when HEQ was applied to uncorrelated data (e.g. cepstral coefficients). Transformation of log-energies did not bring good results.

3.9.2 Gaussian distribution as the target histogram


In further experiments, the considered reference (target) probability density function is a Gaussian distribution with zero mean and unit variance. The compensation method is applied to
−−−> cepstral coefficients

Figure 3.7: Mismatch between the clean and noisy (SNR = 10 dB) data sets for the TI-digits DB (100 files). Histograms are plotted for MFCCs c0 − c4 (dot-dashed line: clean data, solid line: noisy data).

training as well as testing data. The great advantage of this approach is that we compute the transfer lookup table only once, at the beginning of the normalization algorithm. The transfer function is computed between the training data distribution and the Gaussian distribution with zero mean and unit variance (independently for each feature stream, of course). A minimal sketch of such a fixed Gaussian-target mapping is given below; the experiments follow it.
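The sketch below (illustrative, not the experimental code) builds the fixed per-stream lookup as an interpolated function mapping training quantiles onto standard-normal quantiles; scipy's norm.ppf supplies the inverse Gaussian CDF.

    import numpy as np
    from scipy.stats import norm

    def gaussianize_map(train_vals):
        """Build a fixed mapping of one feature stream onto N(0, 1); the same
        table is then applied to both training and testing data."""
        probs = (np.arange(len(train_vals)) + 0.5) / len(train_vals)
        anchors = np.sort(train_vals)                   # training CDF support
        targets = norm.ppf(probs)                       # Gaussian quantiles
        return lambda x: np.interp(x, anchors, targets) # apply by interpolation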

– MFCC HIST5: HEQ with a Gaussian reference (target) distribution. The data processed by on-line mean normalization were used for HEQ (MFCC MEAN). Both the training and testing data were transformed the same way (one lookup table). The whole training data set is used for the computation of the source distribution. The range of the Gaussian distribution is −3, +3 (values out of the range are considered to be 0). Number of bins Nb = 1000.
– MFCC HIST6: Same as MFCC HIST5. HEQ normalization applied on MFCC ONLN1.
– MFCC HIST7: Same as MFCC HIST5. HEQ normalization applied on log energies (23 bands). Then mean normalization and DCT transformation (23 × 15) were used.
– MFCC HIST9: Same as MFCC HIST5. For the computation of the source distribution only training data recorded with the hands-free microphone (*.it1 files for IT-SDC) were used.
– MFCC HIST10: Same as MFCC HIST5. For the computation of the source distribution only training data recorded with the close-talk microphone (*.it0 files for IT-SDC) were used.
– MFCC HIST12: Same as MFCC HIST5. HEQ normalization applied on MFCC.
– MFCC HIST16: Same as MFCC HIST9. MVN applied on top of the HEQ-normalized data.
– MFCC HIST18: Same as MFCC HIST5. Range of the Gaussian distribution is −4, +4.
– MFCC HIST20: Same as MFCC HIST18. For IT-SDC two lookup tables were computed separately: for close-talk microphone recorded files (*.it0) and for hands-free microphone recorded files (*.it1).
– MFCC HIST22: Same as MFCC HIST18. HEQ applied on the cepstral coefficients (MFCC MEAN) except c0 (energy). The reason is that the c0 stream is more or less bi-modal, and the application of HEQ with a unit Gaussian distribution can cause degradation. For the c0 stream MVN normalization has been used.
– PLP HIST1: Same as MFCC HIST5. Instead of MFCCs, the PLP-cepstrum processed by mean normalization (PLP7 MEAN) has been used.
– PLP HIST3: Same as PLP HIST1. PLP7 used as the source distribution.
– PLP HIST4: Same as PLP HIST1. PLP7 ONLN1 used as the source distribution.


Figure 3.8: Distribution of clean data (log energies after Mel filter bank application) versus the same data corrupted by noise (TI-digits) with SNR = 10 dB.

– PLP HIST5: Same as PLP HIST1. The hands-free microphone recorded training data were used as the source distribution.
– PLP HIST8: Same as PLP HIST1. HEQ applied on the cepstral coefficients (PLP7 MEAN) except c0 (energy); as above, the c0 stream is more or less bi-modal, and the application of HEQ with a unit Gaussian distribution can cause degradation, so for the c0 stream MVN normalization has been used. Range of the Gaussian distribution is −4, +4.


Figure 3.9: Distribution of clean data (first log energy band) versus the same data corrupted by noise (TI-digits) with SNR = 10 dB. The solid curve shows the transformation obtained by histogram equalization, the solid line represents the mean normalization transformation, the dashed line the MVN transformation, and the dotted line no compensation.

SDC-Accuracy [%]    Italian                 Finnish                 Spanish
test          hm      mm      wm      hm      mm      wm      hm      mm      wm
MFCC HIST22   67.9    87.69   94.62   66.54   87.02   95.1    76.39   86.2    91.22
PLP7 HIST8    59.32   87.22   94.06   66.50   86.66   95.06   75.52   86.0    90.81

SDC-Accuracy [%]    Italian
test          hm      mm      wm
MFCC HIST5    55.83   87.42   94.04
MFCC HIST6    69.32   76.15   93.42
MFCC HIST7    40.29   83.5    91.33
MFCC HIST9    57.85   86.3    94.52
MFCC HIST10   54.91   86.62   93.66
MFCC HIST12   38.45   77.27   91.99
MFCC HIST16   68.24   76.15   93.65
MFCC HIST18   56.75   87.57   93.86
MFCC HIST20   63.33   86.3    94.73
MFCC HIST22   67.9    87.69   94.62
PLP HIST1     51.55   86.38   93.18
PLP HIST3     37.56   77.27   89.94
PLP HIST4     56.54   87.57   93.75
PLP HIST5     55.62   86.98   92.99
PLP HIST8     59.32   87.22   94.06

Table 3.6: Word recognition results.


Figure 3.10: Histograms of 15 cepstral coefficients c0 − c14 (MFCCs from the MFCC feature extraction).

−−−−−−−−> distribution of cepstral coefficients

Figure 3.11: Cumulative histograms of 15 cepstral coefficients c0 − c14 (PLP-cepstrum from the PLP7 feature extraction). Solid lines represent cumulative histograms over the IT-SDC training data set (reference histograms). Dashed lines represent cumulative histograms over one IT-SDC sentence.
