Report of PhD research internship in ASP Group, OGI-OHSU,
2001/2002
Petr Motlı́ček
1 Introduction
  1.1 Speech recognition tasks
    1.1.1 SpeechDat-Car databases
    1.1.2 AURORA2-TIdigits
3 Feature normalization
  3.1 Additive noise
  3.2 Convolutional distortion
  3.3 Spectral subtraction
  3.4 Spectral mean subtraction
  3.5 Wiener filtering
  3.6 Mean and variance normalization for robust speech recognition
    3.6.1 PLP-LPC, PLP-LSF, PLP-Refl, PLP-LAR with application of MVN
  3.7 Correlation index
  3.8 Compensation of the noise using linear and non-linear transformations
  3.9 Histogram based normalization of the features
    3.9.1 Training data as the target histogram
    3.9.2 Gaussian distribution as the target histogram
Chapter 1
Introduction
This document deals with feature extraction techniques generally used in speech recognition tasks. The most popular features used in speech recognition, such as Mel-filter bank cepstral coefficients (MFCCs) and Perceptual Linear Prediction (PLP) coefficients, were taken as the baseline features.
A significant amount of effort has been devoted to establishing speech feature extraction schemes which enable robust and high-performance speech recognition in a range of operating environments. Each scheme consists of several processing stages that will be subjected to theoretical analysis as well as to an evaluation of their contribution to the resulting performance of the whole system.
In many descriptions, feature extraction is considered as comprising three different stages:
• static feature extraction
• feature normalization
• inclusion of temporal information.
In our work we have concentrated on the first two stages. In all our speech recognition experiments, delta and acceleration components (∆, ∆∆ coefficients) were applied as the temporal derivatives.
1.1.2 AURORA2-TIdigits
Furthermore, the Aurora 2 (noisy TIdigits) database, which is fully described in [?], was used for the evaluation in some of our experiments too. The experimental task was speaker-independent recognition of digit sequences with a sampling frequency of 8 kHz. The database consists of clean speech as well as speech with noise artificially added at several SNRs (20 dB, 15 dB, 10 dB, 5 dB, 0 dB). Four noises were used in the training part: recording inside a subway, babble, car noise, and recording in an exhibition hall.
The conditions are divided into multi-condition training and clean training. Three different sets of speech data were taken for the recognition:
• Set “a”: The same four noise types as in the training part were used.
• Set “b”: Different types of noises were used for this test (restaurant, street, airport, train station), which should represent realistic scenarios for application in a mobile terminal.
• Set “c”: The same noises as in tests “a” and “b” were used, but different frequency characteristics were applied (speech transmitted through a different channel).
For the SDC as well as the Aurora 2 experiments, the reference recognizer is based on the HTK software package (version 2.2 from Entropic). In order to compare recognition results when applying different feature extraction algorithms, the training and recognition parameters are well defined. There is no restriction on the string length of the recognized numbers. The digits are recognized as whole-word HMMs with 16 states per word (plus 2 dummy states at the beginning and end). The number of states has been chosen with respect to the commonly used frame rate of 10 ms. The HMMs are simple left-right models without skips over states. A diagonal covariance matrix is considered (only the variances of all acoustic coefficients). A mixture of 3 Gaussians per state is computed.
A vector size of 15 coefficients plus delta and acceleration coefficients is defined. The vector size may be changed. Two pause models are defined:
• “sil”: consists of 3 states (mixture of 6 Gaussians) with a special transition structure. It models the pauses before and after the utterance.
• “sp”: used to model pauses between words. It consists of a single state which is tied with the middle state of the first pause model.
Training is done in several steps by applying the embedded Baum-Welch reestimation scheme (HTK tool HERest).
Chapter 2
The purpose of feature extraction (often referred to as signal modeling) is to transform audio data into a space where observations from the same class are grouped together and observations from different classes are pushed apart. Psychoacoustic studies of the human auditory and articulatory systems were used in the derivation of these features.
The short-time Fourier spectrum is usually examined as the first preprocessing block of feature extraction. The length of the analyzed frames is 25 ms with a 10 ms time shift. The frames are weighted by a Hamming window, which provides spectral analysis with a flatter pass band and significantly less stop-band ripple. This property, together with the fact that the Hamming window can be normalized so that the energy of the signal is unchanged by the operation, plays an important role in obtaining smoothly varying parametric estimates.
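The framing and windowing described above can be sketched as follows (a minimal illustration; the function name and the 8 kHz default are ours, matching the databases used here):

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window."""
    flen = int(fs * frame_ms / 1000)    # 200 samples at 8 kHz
    fshift = int(fs * shift_ms / 1000)  # 80 samples at 8 kHz
    win = np.hamming(flen)
    n_frames = 1 + (len(x) - flen) // fshift
    return np.stack([x[i * fshift : i * fshift + flen] * win
                     for i in range(n_frames)])
```

Each row of the result is one windowed frame, ready for the short-time Fourier transform.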
In most of our experiments the absolute energy of the spectrum of a given frame is considered as one of the features. There are several possibilities to compute it from the spectrum; they will be mentioned later.
The other part of the spectral measurement can in general be considered as measurement at specific frequencies, which corresponds to the standard behavior of the hair cells in the cochlea of the human auditory system.
Most feature extraction methods use cepstral analysis to extract the vocal tract component from the speech signal. Many algorithms have been proposed to compute the cepstrum. The most successful methods also include attributes of the psychoacoustic processes of human hearing in the analysis.
Nowadays MFCC analysis is considered the standard method for feature extraction in speech recognition tasks. It uses a bank of Mel filters (Fig. 2.2) modeling the hair-cell spacing along the basilar membrane of the ear.
Fig. 2.1 illustrates classical MFCC feature extraction. The spectral trajectories behind each particular processing stage for a voiced frame of speech are shown there.
Experiments:
The 64 Hz to 4 kHz frequency range is applied for computing the Mel-warped spectrum. The standard Mel-scale warping function is used:
\[
F_{Mel}(f) = 2595 \,\log_{10}\!\Bigl(1 + \frac{f}{700}\Bigr). \tag{2.1}
\]
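Eq. 2.1 and its inverse are easy to implement directly; a small sketch (the function names are ours):

```python
import numpy as np

def hz_to_mel(f):
    """Mel warping of Eq. 2.1: F_Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse warping, used to place the filter-bank centre frequencies."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

The inverse mapping is what places the 23 filter centres equidistantly on the Mel scale back onto the linear frequency axis.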
Applying the short-time Fourier transform to the input speech, the input power spectrum is obtained:
\[
P(\omega) = \Bigl|\sum_{n} s(n)\, w(n)\, e^{-j\omega n}\Bigr|^{2}, \tag{2.2}
\]
where s(n) is the input speech and w(n) is the weighting window. P(ω) is transformed into 23 spectral subbands equidistant on the Mel frequency scale. The natural logarithm is applied to the outputs of the Mel filter bank. 15 cepstral coefficients are obtained by applying the DCT to the log-energies f_j of the 23 spectral subbands:
\[
c_i = \sum_{j=1}^{23} f_j \cos\Bigl(\frac{\pi i}{23}\,(j - 0.5)\Bigr), \qquad 0 \le i \le 14. \tag{2.3}
\]
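The cepstral computation of Eq. 2.3 can be sketched directly (a minimal illustration; the function name is ours, and the 23-band/15-coefficient defaults follow the text):

```python
import numpy as np

def cepstrum_from_fbank(log_energies, n_ceps=15):
    """DCT of Eq. 2.3: c_i = sum_{j=1}^{23} f_j cos(pi*i/23 * (j - 0.5)).

    log_energies: the 23 log Mel filter-bank outputs f_1 .. f_23.
    Returns c_0 .. c_14 (n_ceps coefficients).
    """
    nb = len(log_energies)          # 23 bands
    j = np.arange(1, nb + 1)        # j = 1 .. 23
    return np.array([np.sum(log_energies * np.cos(np.pi * i / nb * (j - 0.5)))
                     for i in range(n_ceps)])
```

Note that for a flat filter-bank output the DCT concentrates everything into c_0, as expected of a decorrelating transform.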
Figure 2.1: MFCC analysis with frequency trajectories of voiced speech for F_sampl = 8 kHz.
In order to compensate for the unequal sensitivity of human hearing at different frequencies, the next processing stage in PLP analysis simulates an equal-loudness curve (Eq. 2.5).
In MFCC analysis, preemphasis is applied in the time domain using a first-order high-pass filter:
\[
H(z) = 1 - \alpha z^{-1}. \tag{2.6}
\]
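Eq. 2.6 amounts to one subtraction per sample; a minimal sketch (α = 0.97 is a typical value, not one specified in the text):

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """First-order high-pass filter of Eq. 2.6: y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))
```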
Equal loudness (EQL) compensation is applied to the power spectrum P(ω_m) computed using Eq. 2.2.
The next processing stage, called the intensity-loudness power law, models the non-linear relation between the intensity of sound and its perceived loudness. In PLP analysis a cubic-root compression of the critical band energies is applied. This type of compression is related to the logarithm applied to the Mel filter bank channels in MFCC analysis. The power spectrum after application of the power law is denoted as P_p(ω_m), m ∈ ⟨1, N⟩, where N is the number of discrete frequencies (let us keep the same notation P(ω_m) instead of P_p(ω_m)).
Figure 2.4: PLP analysis with frequency trajectories of voiced speech for F_sampl = 8 kHz.
where for the moment the filter gain is ignored. In order to reasonably cover natural speech, a cascade of several such filters would be needed. This cascade approach has been used in many synthesis applications. Multiplying through all of these sections, a direct-form implementation of the spectrum is obtained:
\[
\hat{H}(z) = \frac{1}{1 + \sum_{k=1}^{p} a_k z^{-k}}. \tag{2.8}
\]
In the all-pole model we assume the frequency transfer function given in Eq. 2.8, but now with the gain factor G:
\[
\hat{H}(z) = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}}, \tag{2.10}
\]
where
\[
A(z) = 1 + \sum_{k=1}^{p} a_k z^{-k}. \tag{2.11}
\]
a_k are the predictor coefficients, and p is the order of the model. The problem is to determine these three quantities ({a_k}, G, and p) describing the all-pole model. The model power spectrum is then given by:
\[
\hat{P}(\omega) = \frac{G^2}{\bigl|1 + \sum_{k=1}^{p} a_k e^{-jk\omega}\bigr|^{2}}. \tag{2.12}
\]
Linear prediction (LP) uses an error measure E_LP between P(ω_m) and P̂(ω_m) for discrete spectra:
\[
E_{LP} = \frac{G^2}{N} \sum_{m=1}^{N} \frac{P(\omega_m)}{\hat{P}(\omega_m)}
       = \frac{1}{N} \sum_{m=1}^{N} P(\omega_m)\,\bigl|A(e^{j\omega_m})\bigr|^{2}. \tag{2.13}
\]
E_LP can be interpreted as the total energy of the “error signal” that is obtained by passing the hypothetical signal s_n through the inverse filter A(z); this can be proved using Parseval's theorem. An important fact is that E_LP is defined to be independent of G, as can be seen from Eq. 2.13. The gain factor will be determined later from energy considerations.
The parameters {ak } are determined by minimizing ELP in Eq. 2.13 with respect to each of the
parameters. Hence we have:
\[
\frac{\partial E_{LP}}{\partial a_i} = 0, \qquad 1 \le i \le p. \tag{2.14}
\]
It can be shown that:
\[
\frac{\partial E_{LP}}{\partial a_i} = 2\Bigl[ R_i + \sum_{k=1}^{p} a_k R_{i-k} \Bigr], \tag{2.15}
\]
where Ri is determined by Eq. 2.9. Therefore,
\[
\sum_{k=1}^{p} a_k R_{i-k} = -R_i, \qquad 1 \le i \le p. \tag{2.16}
\]
The minimum error is obtained by substituting Eq. 2.9 and 2.16 in Eq. 2.13:
\[
E_{LP_p} = R_0 + \sum_{k=1}^{p} a_k R_k, \tag{2.17}
\]
where E_LP_p reflects the dependence of E_LP on the order of the all-pole model. Coupling Eqs. 2.16 and 2.17 (with a_0 = 1) we get:
\[
\sum_{k=0}^{p} a_k R_{i-k} =
\begin{cases}
0, & 1 \le i \le p,\\
E_{LP_p}, & i = 0.
\end{cases} \tag{2.18}
\]
The problem is reduced to the task of solving a set of p equations with p unknowns. There exist several standard methods for determining the unknowns (performing the necessary computations). From Eq. 2.16, with respect to the covariance method of deriving the all-pole model, it can be noted that the matrix of coefficients in each case is a covariance matrix that is symmetric and positive semidefinite (in practice definite). Therefore it can be solved by the square-root method. A very efficient method of determining {a_k} from Eq. 2.16 was proposed by Levinson, noting that the p × p autocorrelation matrix is symmetric and the elements along each diagonal are the same (a Toeplitz matrix), while the column vector on the right side of the following equation is taken to be a general column vector:
\[
\begin{bmatrix}
R_0 & R_1 & R_2 & \cdots & R_{p-1}\\
R_1 & R_0 & R_1 & \cdots & R_{p-2}\\
R_2 & R_1 & R_0 & \cdots & R_{p-3}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
R_{p-1} & R_{p-2} & R_{p-3} & \cdots & R_0
\end{bmatrix}
\begin{bmatrix} a_1\\ a_2\\ a_3\\ \vdots\\ a_p \end{bmatrix}
=
\begin{bmatrix} -R_1\\ -R_2\\ -R_3\\ \vdots\\ -R_p \end{bmatrix}. \tag{2.19}
\]
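Since the matrix in Eq. 2.19 is Toeplitz, the system can be solved without ever forming it explicitly; a sketch using SciPy's Toeplitz solver (the function name is ours):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_from_autocorr(R, p):
    """Solve the normal equations of Eq. 2.19 for the predictor {a_k}.

    R: autocorrelation sequence R_0 .. R_p (at least p + 1 values).
    Returns [a_1, ..., a_p] satisfying sum_k a_k R_{i-k} = -R_i.
    """
    R = np.asarray(R, dtype=float)
    # The matrix is Toeplitz with first column/row R_0 .. R_{p-1};
    # the right-hand side is -R_1 .. -R_p.
    return solve_toeplitz((R[:p], R[:p]), -R[1:p + 1])
```

Exploiting the Toeplitz structure reduces the cost from O(p³) to O(p²), which is exactly the observation behind the Levinson and Durbin algorithms discussed next.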
Another method, proposed by Durbin, which greatly increases the efficiency of the previous algorithm, uses the fact that the right-side column vector contains the same elements found in the autocorrelation matrix. Durbin's recursive algorithm is specified as follows:
\[
E_{LP_0} = R_0, \tag{2.20}
\]
\[
k_i = -\frac{R_i + \sum_{j=1}^{i-1} a_j^{(i-1)} R_{i-j}}{E_{LP_{i-1}}}, \tag{2.21}
\]
\[
a_i^{(i)} = k_i, \qquad a_j^{(i)} = a_j^{(i-1)} + k_i\, a_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1, \tag{2.22}
\]
\[
E_{LP_i} = (1 - k_i^2)\, E_{LP_{i-1}}. \tag{2.23}
\]
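Durbin's recursion, together with the error update E_LP,i = (1 − k_i²) E_LP,i−1, can be sketched as follows (the function name is ours); the result can be checked against the normal equations of Eq. 2.16:

```python
import numpy as np

def levinson_durbin(R, p):
    """Durbin's recursion for the normal equations.

    Returns the predictor a_1 .. a_p, the reflection coefficients k_1 .. k_p,
    and the final prediction error E_LP,p.
    """
    R = np.asarray(R, dtype=float)
    a = np.zeros(p + 1)             # a[j] holds a_j; a[0] is unused
    k = np.zeros(p)
    E = R[0]                        # E_LP,0 = R_0
    for i in range(1, p + 1):
        acc = R[i] + np.dot(a[1:i], R[i - 1:0:-1])   # R_i + sum_j a_j R_{i-j}
        ki = -acc / E                                # reflection coefficient k_i
        a_new = a.copy()
        a_new[i] = ki                                # order update of {a_j}
        for j in range(1, i):
            a_new[j] = a[j] + ki * a[i - j]
        a, k[i - 1] = a_new, ki
        E = (1.0 - ki * ki) * E                      # error update
    return a[1:], k, E
```

The returned error also satisfies E_LP,p = R_0 + Σ a_k R_k, i.e. Eq. 2.17.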
An interesting property of Eq. 2.19 is that the solution is not affected if all the autocorrelation coefficients are scaled by a constant, e.g. normalized by dividing by R_0. The minimum total error E_LP_i in Eq. 2.23 decreases (or does not change) as the order of the predictor increases. E_LP_i is always positive, since it is a squared error.
The Autocorrelation and Covariance methods are derived using an intuitive least-squares approach, assuming that s_n is a deterministic signal. In the method of least squares, the assumption is made that the input u_n is totally unknown. Therefore the signal can be predicted only approximately from a linearly weighted summation of past samples (s̃_n approximating s_n):
\[
\tilde{s}_n = -\sum_{k=1}^{p} a_k s_{n-k}, \tag{2.27}
\]
and the error between the actual value s_n and the predicted value s̃_n is given by:
\[
e_n = s_n - \tilde{s}_n = s_n + \sum_{k=1}^{p} a_k s_{n-k}. \tag{2.28}
\]
Since we assume in the least-squares approach that the input is unknown, the meaning of this computation is somewhat uncertain. However, let Eq. 2.28 be rewritten in the form:
\[
s_n = -\sum_{k=1}^{p} a_k s_{n-k} + e_n. \tag{2.29}
\]
The only input that will result in the signal s_n as output is the one where Gu_n = e_n (the input signal is proportional to the error signal). However, if we require for any input u_n that the energy in the output equals that of the original signal s_n, then we can determine the total energy in the input signal.
Since the filter H(z) is fixed, the total energy in the input signal Gu_n must equal the total energy in the error signal, which is given by E_LP_p in Eq. 2.17.
For the specification of G, let the input to the all-pole filter H(z) be an impulse (unit sample) at n = 0, i.e. u_n = δ_n0. Then the output of H(z) is its impulse response h_n, where:
\[
h_n = -\sum_{k=1}^{p} a_k h_{n-k} + \delta_{n0}. \tag{2.30}
\]
If we determine the autocorrelation R̂_i of the impulse response h_n, we can find an interesting relationship to the autocorrelation R_i of the signal s_n:
\[
\sum_{k=1}^{p} a_k \hat{R}_{i-k} = -\hat{R}_i, \qquad 1 \le |i| \le \infty, \tag{2.31}
\]
\[
\sum_{k=1}^{p} a_k \hat{R}_k - G^2 = -\hat{R}_0. \tag{2.32}
\]
To satisfy the condition that the total energies of h_n and s_n be equal, we must have:
\[
\hat{R}_0 = R_0, \tag{2.33}
\]
because the zeroth autocorrelation coefficient is equal to the total energy in the signal. From Eqs. 2.33, 2.16 and 2.31 we can write:
\[
\hat{R}_i = R_i, \qquad 0 \le i \le p. \tag{2.34}
\]
This says that the first p + 1 autocorrelation coefficients of the impulse response of H(z) are equal to the corresponding autocorrelation coefficients of the signal. The all-pole modeling problem can then be reformulated: we want to find a filter of the form H(z) in Eq. 2.10 whose first p + 1 values of the autocorrelation of its impulse response are equal to the first p + 1 values of the signal autocorrelation. Now the determination of G is very easy. With respect to Eqs. 2.32, 2.17 and 2.34, the gain is equal to:
\[
G^2 = E_{LP_p} = R_0 + \sum_{k=1}^{p} a_k R_k = \sum_{k=0}^{p} a_k R_k, \tag{2.35}
\]
\[
\frac{1}{N} \sum_{m=1}^{N} \frac{P(\omega_m)}{\hat{P}(\omega_m)} = 1. \tag{2.36}
\]
The property of Eq. 2.36 is satisfied for all values of p (even for the case p → ∞). If P(ω) is an all-pole spectrum with p₀ poles, Eq. 2.36 becomes an identity too; then P̂(ω) = P(ω) for p ≥ p₀.
Another property of P̂(ω) is that its slope vanishes at the band edges:
\[
\frac{\partial \hat{P}(\omega)}{\partial \omega} = 0, \qquad \omega = 0, \pi, \tag{2.38}
\]
which can be seen by rewriting Eq. 2.12 as:
\[
\hat{P}(\omega) = \frac{G^2}{b_0 + 2\sum_{k=1}^{p} b_k \cos(k\omega)}, \tag{2.39}
\]
where:
\[
b_k = \sum_{n=0}^{p-|k|} a_n a_{n+|k|}, \qquad a_0 = 1, \qquad 0 \le k \le p, \tag{2.40}
\]
and taking ∂P̂(ω)/∂ω. The b_k are the autocorrelation coefficients of the impulse response of the inverse filter A(z).
Here G is incorporated in the coefficients of the denominator, i.e. d_0 is not restricted to 1. The constant k is unknown but dependent on {d_k}. We will try to derive the coefficients {d_k} (d_0 ≠ 1) of such a filter from the ordinary {a_k} coefficients (a_0 = 1). Let both sides of Eq. 2.35 be divided by G²:
\[
G^2 = \sum_{k=0}^{p} a_k R_k \quad \Bigl|\; \cdot \tfrac{1}{G^2}, \tag{2.42}
\]
\[
1 = \sum_{k=0}^{p} \frac{a_k}{G^2}\, R_k. \tag{2.43}
\]
\[
a_k = \frac{d_k}{d_0}, \tag{2.45}
\]
\[
G = \frac{1}{\sqrt{d_0}}. \tag{2.46}
\]
Although this way of solving the all-pole modeling task may seem unusual, we will use the set of equations given by Eq. 2.44 later. If we rewrite the matrix Eq. 2.44 in the standard form we have:
\[
\sum_{k=0}^{p} d_k R_{i-k} = 0, \qquad 1 \le i \le p, \tag{2.47}
\]
and
\[
\sum_{k=0}^{p} a_k R_k = \frac{1}{d_0}. \tag{2.48}
\]
Figure 2.5: Normalized error curves for unvoiced and voiced frames of speech without/with application of preemphasis.
Figure 2.6: Spectral smoothing of a warped spectrum (solid line) by an all-pole model (dashed line) with p = 14.
The numerator can be seen as a geometric mean of P, whereas the denominator represents an arithmetic mean. Such a ratio is equal to one if all the data are equal, and its value decreases as the spread of the data increases. Observing Eqs. 2.49 and 2.53 we can see that V_min depends only on the shape of the signal spectrum, whereas V_p is completely related to the shape of the approximating spectrum (the all-pole model). This fact is important in interpreting the properties of the V_p curve for the spectra of different sounds. As can be seen from Fig. 2.5, the error curves for voiced frames are much lower than for unvoiced ones. Hence V_p can be suggested as a possible parameter for the detection of voicing (although V_p depends only on the shape of the spectrum and has nothing to do with the fact of voicing itself). It is easy to prove that if the spectrum is flat then V_min = 1 and the error curve is the highest possible. This means that we are not able to approximate a flat spectrum by an all-pole model. On the other hand, if all the energy is concentrated in certain regions of the spectrum and the rest is zero, then V_min = 0, and the error curve is the lowest possible. In general, voiced frames have most of the energy concentrated in one region at low frequencies, resulting in low error curves. Unvoiced frames have the energy more spread out across the spectrum.
Applying any distortion to the input signal that influences the shape of the spectrum (such as preemphasis employing Eq. 2.6) can largely affect the error curves, as can be seen in Fig. 2.5. Problems arise when telephone speech, where much of the low-frequency energy has been filtered out with a sharp reduction of dynamic range, is used as the input of the voicing detector. Again from Fig. 2.5 we can estimate the value of p such that P̂(ω) approximates the envelope of P(ω) optimally (in the case of LP modeling of the speech spectrum, the formant structure should be evident). The error curve starts at the value 1 and monotonically decreases to its own V_min as p → ∞. Mostly we can observe in the error curve a “knee”, that is, the value of p at which the curve starts to slope very slowly toward its asymptote (for the curves in Fig. 2.5, p = 8 and p = 14 for the voiced and unvoiced spectra, respectively). Those values of p are optimal for approximating the signal spectrum. A lower value of p results in a coarser approximation of the spectral envelope, whereas a larger value of p will add detailed spectral information on top of the spectral envelope.
The linear system approximating the envelope of a given signal spectrum (in our case a perceptually warped spectrum) is fully described by the set of linear prediction coefficients {a(k)} (LPC), with additional information about the total energy represented by the gain factor. Such a linear system can equally be described by different types of coefficients. Once we have estimated the LPC, we are not restricted to this one type of representation of the linear system. This is a great advantage of PLP over MFCC.
The derivation of cepstral coefficients from a given set of LPC is simple because there exists a direct transformation from LPC to cepstral coefficients. For its derivation, first we apply the logarithm to Ĥ(z):
\[
\log\bigl(\hat{H}(z)\bigr) = \log\Bigl[\frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}}\Bigr]. \tag{2.54}
\]
If A(z) (the denominator of Eq. 2.54) is of pth order, all its poles are inside the unit circle (a property of the all-pole model), and A(z → ∞) = 1, we can apply the Taylor expansion of log(Ĥ(z)):
\[
\log\bigl(\hat{H}(z)\bigr) = c_0 + c_1 z^{-1} + c_2 z^{-2} + \cdots = \sum_{i=0}^{\infty} c_i z^{-i}, \tag{2.55}
\]
where {c_i} are the cepstral coefficients of the LPC. If we substitute Ĥ(z) and differentiate both sides of Eq. 2.55 to get rid of the logarithm, we obtain:
\[
-\sum_{k=1}^{p} k a_k z^{-k} = \Bigl[\sum_{i=1}^{\infty} i c_i z^{-i}\Bigr] \cdot \Bigl[\sum_{k=0}^{p} a_k z^{-k}\Bigr]. \tag{2.56}
\]
With respect to a_0 = 1 and by comparing the terms with the same power of z on the left and right sides of Eq. 2.56, we can easily derive the cepstral coefficients of the LPC:
\[
\begin{aligned}
c_0 &= 0,\\
c_1 &= -a_1,\\
c_k &= -a_k - \sum_{i=1}^{k-1} \frac{i}{k}\, c_i a_{k-i}, \qquad 2 \le k \le p,\\
c_k &= -\sum_{i=1}^{p} \frac{k-i}{k}\, c_{k-i} a_i, \qquad k = p+1, p+2, \ldots
\end{aligned} \tag{2.57}
\]
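The recursion of Eq. 2.57 translates directly into code; a sketch (the function name is ours; the gain is ignored, so c_0 = 0):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """LPC-to-cepstrum recursion of Eq. 2.57 (gain ignored, so c_0 = 0).

    a: predictor coefficients a_1 .. a_p (with a_0 = 1 implied).
    """
    p = len(a)
    c = np.zeros(n_ceps + 1)        # c[0] .. c[n_ceps]
    for k in range(1, n_ceps + 1):
        if k <= p:
            s = a[k - 1]            # -c_k = a_k + sum (i/k) c_i a_{k-i}
            for i in range(1, k):
                s += (i / k) * c[i] * a[k - i - 1]
        else:
            s = 0.0                 # -c_k = sum ((k-i)/k) c_{k-i} a_i
            for i in range(1, p + 1):
                s += ((k - i) / k) * c[k - i] * a[i - 1]
        c[k] = -s
    return c
```

For p = 1 the recursion reproduces the Taylor series of −log(1 + a₁z⁻¹), i.e. c_k = −(−a₁)ᵏ/k · (−1)ᵏ⁺¹, which is a quick sanity check.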
The cepstral coefficients derived in the previous equations are related to the spectral envelope established by linear prediction analysis, and so they differ in general from cepstral coefficients computed directly from the power spectrum (such as in MFCC analysis).
Experiments:
We have made many experiments with the PLP cepstrum as features in the AURORA2 and SDC tasks. In Fig. 2.4 the processing scheme for PLP feature extraction is given. It is clear that there are more algorithm parameters than in MFCC analysis, so more experiments have to be done in order to find the optimal parameters:
– PLP7: Mel warping 129 → 23 frequency bands, F_start = 0 Hz, c_0 = mean(log(P(ω_m))), p = 14. Because of the all-pole model property (Eq. 2.38), it seems reasonable to repeat the first and last frequency bands of the warped spectrum before its symmetrization, as can be seen in Fig. 2.7. The cepstral coefficients c_1 … c_13 are computed; root (power law) = 0.33, EQL not applied.
– PLP5: Similar to PLP7, c_0 = G, F_start = 64 Hz.
– PLP6: Similar to PLP5, p = 20.
– PLP8: Similar to PLP5, no repetition of the first and last frequency bands (just symmetrization of the spectrum by a flipping operation).
– PLP9: Similar to PLP5, c_0 = log(G).
– PLP10: Similar to PLP7, p = 12.
In another experiment related to PLP7, where F_start = 64 Hz, the recognition performance was almost the same.
Aurora 2 – WER [%]    noisy train                clean train
test                  “A”     “B”     “C”        “A”     “B”     “C”
PLP7                  12.29   12.75   13.84      44.30   49.11   35.76
where q is usually equal to 2. The term rms log spectral measure is used when the log₁₀ is replaced by the natural logarithm. The mean absolute log spectral measure is obtained by setting q = 1. For the limiting case as q approaches infinity, the term peak log spectral difference is used. P_orig(ω_m) is in our case the original spectrum obtained with the original values of c_i, whereas P_ch(ω_m) is the power spectrum related to the varied c_i.
Fig. 2.8 shows the spectral sensitivity of the different cepstral coefficients. The original cepstral values have been changed from the original value c_i,orig to the values (0.5 – 1.5) × c_i,orig. The symmetric shape of the spectral sensitivity curve comes from the property of the inverse Fourier transform, which is employed in the relationship between spectrum and cepstrum, and from the fact that q = 2. It is also clear from the following equation:
\[
\sum_{n} (c_{orig} - c_{ch})^2
= \bigl((P_{orig} - P_{ch})\,T_r\bigr)\bigl((P_{orig} - P_{ch})\,T_r\bigr)^{T}
= \sum_{n} (P_{orig} - P_{ch})^2, \tag{2.60}
\]
where
• P_ch is the vector of power spectra obtained from the modified cepstrum c_ch (vector 1 × n).
In general, Eq. 2.60 explains the symmetry of the spectral distortion curve: the spectral distortion of power spectra is not sensitive to the direction of the variation of c_orig.
Figure 2.7: Repetition of the first and the last frequency band and spectrum symmetrization to get
better matching of an all-pole model to the signal spectrum.
Figure 2.8: Spectral sensitivity curves for the cepstral coefficients of a 14th order PLP analysis. Only the sensitivity of c_1 … c_14 is computed.
2.2.2.10 Application of different value of root (Power law)
Experiments:
Different values of the root have been examined in the PLP used for the Aurora 2 project. In the Aurora 2 experiments a noise estimation with noise subtraction algorithm has been employed (the system which performed best with the MFCC type of feature extraction, sub fea10 frame orig nocoder), with F_start = 64 Hz of the MFB, DC offset in the spectral domain, a temporal LDA filter (last data-driven filter coefficients), and a voice activity detector (VAD) (submitted version). We did not use mean and variance normalization (MVN) or the final coding/decoding algorithm (no TRAPS, no feature-net).
PLP part of the code: MFB warping (23 bands), 15 output cepstral coefficients, c_0 = log(G), no EQL transformation:
For comparison with PLP feature extraction we performed MFCC analysis with an untouched noise estimation and suppression system (sub fea10 frame orig nocoder); the same TLDA and VAD were used (no MVN and no final coding/decoding):
– subfea 10 frame orig noonline nocoder: Aurora 2 evaluation, MFCC analysis employed.
Experiments:
In the first experiments we used the same experimental setup as in the case of PLP4 nocoder described in sect. 2.2.2.10, but obviously with the application of the EQL approximation:
Figure 2.9: EQL curve given by Eq. 2.5 (solid line) and Eq. 2.6 (dashed line).
– PLP6 nocoder: Application of 40 dB EQL approximation computed for central frequencies of each
Mel filter bank (Fstart = 0Hz) using Eq. 2.5.
– PLP8 nocoder: Application of 40 dB EQL approximation computed for central frequencies of each
Mel filter bank (Fstart = 64Hz) using Eq. 2.5.
– PLP9 nocoder: Application of preemphasis using Eq. 2.6 for central frequencies of each Mel filter
bank (Fstart = 64Hz).
We have tried to employ EQL in experiments with basic PLP analysis, as described in sect. 2.2.2.7. The
experiments were the same as PLP7 :
– PLP20: Application of 40 dB EQL approximation computed for central frequencies of each Mel
filter bank (Fstart = 0Hz for all SDC-db) using Eq. 2.5.
2.2.2.12 Conclusion
Experiments with PLP-based feature extraction were run on the AURORA 2 DB. The purpose of this work was to compare the results with MFCC-based feature extraction and to attempt to find the advantages and drawbacks of each. Due to the fact that PLP analysis needs more parameters to be set, we first experimented with raw PLP feature extraction and later tried to incorporate the optimized code into the full AURORA 2 code, where the section related to MFCC analysis was replaced by PLP.
SDC – Accuracy [%]    Italian                  Finnish                  Spanish
test                  hm      mm      wm       hm      mm      wm       hm      mm      wm
PLP7                  38.14   85.18   94.26    41.2    65.73   91.86    38.92   73.01   87.15
PLP20                 36.3    84.5    93.47    33.85   58.69   91.27    45.80   68.92   88.27
Some of the parameters of PLP analysis were estimated experimentally as well as analytically; some of them only experimentally.
One of the most important parameters in PLP analysis is the order of the all-pole model used to approximate the power spectrum of the input speech. Analytically (sect. 2.2.2.6, on a randomly selected unvoiced frame, since an unvoiced frame needs a higher order of the all-pole model to approximate its power spectrum) we obtained p = 14, which corresponds to the results mentioned in Tab. 2.2.
The intensity-loudness power law is employed in PLP to approximate the power law of hearing and to simulate the non-linear relation between the intensity of sound and its perceived loudness. This operation also reduces the spectral amplitude variation of the warped power spectrum, so that the approximating all-pole model can be used with a relatively low model order (it essentially means that the all-pole model can better fit the given spectrum). In standard PLP analysis the power root constant is 0.33. Its increase towards the square root (0.5) almost does not affect the performance, but the performance starts degrading with smaller values of the root.
Interesting observations come from the EQL block used in PLP to approximate the non-equal sensitivity of hearing at different frequencies. From our experiments it follows that EQL does not play an important role in PLP analysis (see results in Tab. 2.5). However, greater robustness of PLP-based feature extraction (with EQL preemphasis) is obtained in the full AURORA 2 feature extraction approach (Tab. 2.4), where the noise suppression algorithm has been employed. Better performance is then mainly obtained in the high-mismatch experiments, but some improvement can also be noticed for the well-matched conditions.
Applying EQL preemphasis with somewhat optimized parameters of such a PLP-based feature extraction approach, the overall performance for the AURORA 2 speech recognition task is higher than for the MFCC-based feature extraction approach, where all preprocessing blocks were untouched.
where A(z) is an inverse all-pole model given by Eq. 2.11 which minimizes the residual energy. There
are several important properties of P (z) and Q(z):
• The minimum-phase property of A(z) is easily preserved after quantization of the zeros of P(z) and Q(z).
• The LSF coefficients allow interpretation in terms of formant frequencies. If two neighboring
LSFs are close in frequency, it is likely that they correspond to a narrow bandwidth spectral
resonance in that frequency region. Otherwise, they usually contribute to the overall tilt of the
spectrum.
• Shifting the line spectral frequencies has a localized spectral effect – quantization errors in an
LSF will primarily affect the region of the spectrum around that frequency.
The first two properties are useful for finding the zeros of P(z) and Q(z). The third property ensures the stability of the synthesis filter. Straightforward computation of the LSFs is not efficient due to the extraction of the complex roots of a high-order polynomial. However, methods applying a discrete cosine transform or Chebyshev polynomials have been proposed.
We have been observing the behavior of LSFs in terms of their use in PLP-based feature extraction for speech recognition. Therefore, the spectrum was first warped, and the other PLP processing operations were then applied.
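For illustration, the LSFs can be obtained by forming the symmetric and antisymmetric polynomials P(z) = A(z) + z^{-(p+1)}A(z^{-1}) and Q(z) = A(z) - z^{-(p+1)}A(z^{-1}) and taking the angles of their unit-circle roots. The sketch below uses plain polynomial root finding for clarity rather than the efficient Chebyshev method mentioned above (the function name is ours):

```python
import numpy as np

def lsf_from_lpc(a):
    """Line spectral frequencies of A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.

    Forms P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1)
    and returns the sorted angles of their unit-circle roots in (0, pi),
    excluding the trivial roots at z = +/-1.
    """
    A = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    ang = np.concatenate((np.angle(np.roots(P)), np.angle(np.roots(Q))))
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])
```

For a stable minimum-phase A(z) the p returned frequencies interlace between P(z) and Q(z), which is what makes LSFs attractive for quantization.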
Figure 2.10: Trajectories of LSFs for a spectrum analyzed by PLP (23 critical bands, F_sampl = 8 kHz).
Figure 2.11: Poles of A(z), P(z), Q(z) for 10th order of A(z) in “z” plane.
Figure 2.12: Frequency responses of |1/A(z)| (solid curve), |1/P(z)| (dashed curve) and |1/Q(z)| (dash-dotted curve). Vertical lines are the LSFs (in the range 0–4000 Hz).
Figure 2.13: Spectral sensitivity of LSFs of 14th order PLP analysis for the normalized frequency range 0 - π.
where the index i starts from p and decrements at each iteration until i = 1. The coefficients k_i correspond to the gain factors in the lattice structure implementation of the LP analysis filter A(z) (see Fig. 2.14). The lattice and transversal structures yield the same output, except in the time-varying case, the difference being caused by the memory/initial conditions of the filters. The LP analysis filter is guaranteed to be minimum phase when |k_i| < 1 for i = 1, ..., p. Another advantage is that changing the order of the filter does not affect the coefficients already computed; i.e., k_i^(p) = k_i^(q) for i = 1, ..., p, where k_i^(p) and k_i^(q) are the reflection coefficients for a pth and a qth order predictor, respectively, and p ≤ q.
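The step-down recursion just described can be sketched as follows (a minimal implementation; the function and variable names are ours):

```python
import numpy as np

def lpc_to_reflection(a):
    """Step-down recursion: LPC polynomial [1, a1, ..., ap] -> reflection
    coefficients [k_1, ..., k_p], decrementing the order from p down to 1."""
    cur = np.asarray(a, dtype=float).copy()
    p = len(cur) - 1
    k = np.zeros(p)
    for i in range(p, 0, -1):
        k[i - 1] = cur[i]                      # k_i is the last coefficient a_i^(i)
        if abs(k[i - 1]) >= 1.0:
            raise ValueError("not minimum phase: |k_i| >= 1")
        # Recover the (i-1)th order predictor from the ith order one:
        # a_j^(i-1) = (a_j^(i) - k_i * a_{i-j}^(i)) / (1 - k_i^2)
        prev = (cur[1:i] - k[i - 1] * cur[i - 1:0:-1]) / (1.0 - k[i - 1] ** 2)
        cur = np.concatenate([[1.0], prev])
    return k
```

Running this on an order-2 predictor built from k = (0.5, -0.3) recovers exactly those values, consistent with the order-independence property above.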
Figure 2.14: Lattice structure of an all-pole LPC filter. The signals f (i)[n] and b(i)[n] are the ith
order forward and backward prediction errors, respectively. Reflection coefficients k(i) refer to {k i } in
the text.
Figure 2.15: Spectral sensitivity for the reflection coefficients {k_i} of 14th order PLP analysis for the range [-1, +1].
Figure 2.16: Spectral sensitivity of LPC of 14th order PLP analysis. Only sensitivity of a1 . . . a14 is
computed.
Figure 2.17: Scheme of selective LPC approach incorporated into PLP analysis.
Selective all-pole modeling is usually applied to model different parts of the signal spectrum differently. The whole signal spectrum is still modeled, but not with one unique all-pole model.
An important note is that, since we assume the availability of the discrete signal spectrum P(ω_m), any desired frequency shaping or filtering can be applied directly to the signal spectrum before LP analysis is performed.
Figure 2.18: Application of selective LP in PLP analysis: original warped spectrum (solid line), one
all-pole model approximation (no selective LP) (dashed line).
Figure 2.19: Application of selective LP in PLP analysis: original warped spectrum (solid line), two
all-pole models approximation (dashed line, p1 = 12, p2 = 6). All-pole models are concatenated in
19th band.
Figure 2.20: Application of selective LP in PLP analysis: original warped spectrum (solid line), lower
all-pole model approximation (dashed line, p1 = 12) for 1st − 18th frequency band.
Figs. 2.18-2.21 show power spectra obtained after application of ordinary LP analysis and of selective LP analysis, computed for perceptually warped spectra. In most of the experiments with PLP, the warped spectrum consists of 25 frequency bands (including repetition of the first and last band). The application of all-pole modeling to a voiced frame (8 kHz sampled speech) is shown in Fig. 2.18. In the selective LP approach the spectrum is divided into a lower and an upper part (in our case the 1st-18th bands form the lower part and the 19th-23rd bands the upper part of the spectrum, before repetition of the side bands). The lower and upper spectra are each processed independently as in standard all-pole modeling, and two different all-pole models are obtained (Fig. 2.20 and 2.21). After concatenation of these two all-pole models, the power spectrum shown in Fig. 2.19 is obtained.
There can be several reasons to use selective linear prediction in speech recognition or coding instead of the classical approach. For instance, in speech recognition the main region of interest is the
Figure 2.21: Application of selective LP in PLP analysis: original warped spectrum (solid line), upper
all-pole model approximation (dashed line, p2 = 6) for 19th − 25th frequency band.
0-5 kHz region (it is well known that the first two formants of all vowels lie approximately in the frequency range up to 3 kHz). The spectrum at the upper frequencies is important mainly for the recognition of fricatives, in which case the total energy in that region might be sufficient. In LP analysis the spectral matching is performed uniformly over the whole frequency range, which is not desirable in this case. Moreover, the all-pole assumption is less applicable for many speech sounds at frequencies greater than 5 kHz. Therefore, instead of modeling the whole spectrum, we use selective LP to model the lower part by a lower-order all-pole spectrum. Then we can fit a very low-order all-pole spectrum to the upper frequency region.
An interesting problem is to attempt the same analysis in the time domain (though not in the PLP case). Many down-sampling operations with sharp filtering would have to be performed, and the problems would increase if we wanted to choose arbitrarily divided frequency regions.
Experiments:
Many experiments have been done with application of selective LP analysis in PLP feature extraction. The preprocessing operations applied before all-pole modeling have not been changed, so the perceptually warped power spectrum was weighted by the equal-loudness (EQL) curve with application of the power law. In standard PLP analysis, such a spectrum is processed to be conjugate symmetric and the autocorrelation coefficients are computed using Eq. 2.9. In the selective LP approach:
– The spectrum is split into an upper and a lower part. The first frequency band of the lower part and the last frequency band of the upper part are repeated, respectively. These two parts are modified to be symmetric, and two sets of real autocorrelation coefficients are obtained.
– The lower and upper parts of the spectrum are modeled independently by all-pole modeling, and two all-pole models P̂lower(ω) and P̂upper(ω) are obtained.
– In order to get features for speech recognition, we experimented with two possibilities:
∗ Compute LPCs (and then cepstrum or LSFs, . . . ) for both all-pole models independently, and concatenate the features to obtain one feature set.
∗ Concatenate P̂lower(ω) and P̂upper(ω) and use the discrete least squares method to get the parameters of one all-pole model. One set of features can then easily be computed from the resulting all-pole model. It is important to note that P̂lower(ω) and P̂upper(ω) are continuous spectra (spectra with high frequency resolution), so that their approximation is accurate.
In selective all-pole modeling there are more parameters to be chosen, such as the orders of the all-pole models and the frequency band at which the warped power spectrum is split into lower and upper parts. Only cepstral coefficients have been taken for the speech recognition experiments.
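The splitting-and-modeling steps listed above can be sketched as follows (the band split, model orders, and symmetric mirroring are illustrative; the report's exact band bookkeeping may differ):

```python
import numpy as np

def levinson(r, p):
    """Levinson-Durbin: autocorrelation r[0..p] -> ([1, a1, ..., ap], residual energy)."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    e = float(r[0])
    for i in range(1, p + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / e                      # reflection coefficient for order i
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        e *= (1.0 - k * k)
    return a, e

def selective_lp(power_spec, split, p_low, p_high):
    """Model the lower (bands 0..split) and upper (bands split..end) parts of
    a warped power spectrum by two independent all-pole models."""
    def fit(seg, p):
        # Mirror the segment to a conjugate-symmetric spectrum and take the
        # IDFT to get real autocorrelation coefficients (cf. Eq. 2.9).
        sym = np.concatenate([seg, seg[-2:0:-1]])
        r = np.fft.ifft(sym).real
        return levinson(r, p)
    return fit(power_spec[:split + 1], p_low), fit(power_spec[split:], p_high)
```

Each call to `fit` mirrors the standard PLP autocorrelation computation, just restricted to one spectral region, so the two resulting models can be used for either of the two feature-generation options described above.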
– PLP12: Up to the spectrum processing block, the same algorithm as in the PLP7 experiment. Then the spectrum is split into lower and upper parts (23 bands divided into 15 + 8 bands). Two all-pole models are computed with plower = 12, pupper = 5. Number of cepstral streams Nc: Nclower = 9, Ncupper = 5. These two cepstral streams are linearly merged to create the final stream. The 0th cepstral coefficient is the log of the mean of the warped power spectrum, as in PLP7.
– PLP13: Up to the spectrum processing block, the same algorithm as in the PLP7 experiment. Then the spectrum is split into lower and upper parts (23 bands divided into 15 + 8 bands), plower = 10, pupper = 6. The lower and upper frequency responses of the all-pole models are concatenated, LPCs of the resulting frequency response are computed (p = 14), and the LPCs are transformed into 14 cepstral coefficients. The 0th cepstral coefficient is the log of the mean of the warped power spectrum, as in PLP7.
– PLP14: Similar to PLP12, plower = 8, pupper = 4.
– PLP30: Similar to PLP12, 23 bands divided into 18 + 5 bands. Nclower = 10, Ncupper = 4.
– PLP31: Similar to PLP12, 23 bands divided into 18 + 5 bands. Nclower = 10, Ncupper = 4, plower = 10, pupper = 4.
– PLP32: Similar to PLP12, 23 bands divided into 15 + 8 bands. Nclower = 10, Ncupper = 4, plower = 10, pupper = 5.
– PLP33: Similar to PLP30. Slightly different repetition of side frequency bands of upper spectrum.
– PLP34: Similar to PLP13, plower = 12, pupper = 5. Interpolation between the last spectral sample of the lower spectrum and the first sample of the upper spectrum for a better transition (d_f(end) = (d_f(end-1) + d_s(1))/2).
and P̂(ω) is then given by Eq. 2.41. In general the frequencies ω_m, which include both positive and negative frequencies, can be arbitrary and do not have to be equally spaced. Note that the predictor coefficients are denoted {d_k} to distinguish them from the standard LP coefficients, which do not have the gain G incorporated. The minimization of E_LP with respect to the {d_k} results in the well-known set of linear equations 2.47 and 2.48. In these equations R_k is the autocorrelation of the discrete signal spectrum P(ω_m). When minimizing E_LP, we are matching the autocorrelation of the continuous LP envelope P̂(ω) to the autocorrelation of the given discrete spectrum:
R̂_LP,i = R_i,   0 ≤ i ≤ p,   (2.69)
where
R̂_LP,i = (1/2π) ∫_{-π}^{π} P̂(ω) e^{jωi} dω.   (2.70)
Figure 2.22: The LP envelopes (p = 10) computed for different numbers of discrete frequencies of the signal spectrum. Thick solid line: LP envelope computed for 1024 frequency points of the original spectrum. Solid lines: LP envelopes for 640, 512, 256, 128 and 64 frequency points. Dashed line: LP envelope for 32 frequency points.
A typical behavior of LP spectral analysis of a discrete voiced spectrum is shown in Fig. 2.22. The spectral envelope of the "continuous" signal spectrum (here not a warped signal spectrum), shown as the thick solid line, does not match the LP envelopes of the discrete spectra (which differ in the number of discrete frequency points). It has been shown that for discrete spectra the LP error measure given in Eq. 2.68 contains an error cancellation property.
Let us define R_org to be the autocorrelation corresponding to the original all-pole filter with spectrum P(ω). Their relation is given by the inverse Fourier transform:
R_org,i = (1/2π) ∫_{-π}^{π} P(ω) e^{jωi} dω   (2.71)
and
P(ω) = Σ_{l=-∞}^{∞} R_org,l e^{-jωl}.   (2.72)
The autocorrelation R corresponding to the discrete samples of the LP envelope is defined in general as:
R_i = (1/N) Σ_{m=0}^{N-1} P(ω_m) e^{jω_m i},   (2.73)
and in the case of a symmetric P(ω_m) it is given by Eq. 2.9. By substituting Eq. 2.72 into 2.73 we obtain:
R_i = (1/N) Σ_{m=0}^{N-1} Σ_{l=-∞}^{∞} R_org,l e^{-jω_m (l-i)}.   (2.74)
This equation shows why LP analysis cannot recover the original envelope from the discrete spectral samples: aliasing occurs in the autocorrelation domain whenever a spectral envelope is sampled at a discrete set of frequencies. If we consider the periodic excitation case with the frequencies spaced equally at ω_m = 2π(m - 1)/N, then Eq. 2.74 reduces to:
R_i = Σ_{l=-∞}^{∞} R_org,(i-lN),   for all i.   (2.75)
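Eq. 2.75 can be checked numerically. The sketch below uses a first-order all-pole envelope, for which the true autocorrelation is known in closed form (R_org,i = b^|i|/(1 - b²)); the truncation length of the aliased sum is an arbitrary choice of ours:

```python
import numpy as np

b, N = 0.9, 16                                      # pole radius, spectral samples
w = 2 * np.pi * np.arange(N) / N                    # equally spaced frequencies
P = 1.0 / np.abs(1.0 - b * np.exp(-1j * w)) ** 2    # sampled AR(1) envelope

def R(i):
    """Autocorrelation of the discrete spectrum (Eq. 2.73)."""
    return np.mean(P * np.exp(1j * w * i)).real

def R_aliased(i, terms=50):
    """Aliased sum of the true autocorrelation (Eq. 2.75), truncated."""
    ls = np.arange(-terms, terms + 1)
    return np.sum(b ** np.abs(i - ls * N) / (1.0 - b ** 2))
```

The two agree to machine precision, and R(0) exceeds the true zero-lag value 1/(1 - b²) — the aliasing bias that grows as the spectral samples become sparser.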
Figure 2.23: The autocorrelation sequences and impulse response sequence of the 10th order LP model:
a) Autocorrelation R_org,i for lags 0 ≤ i ≤ 120.
b) Autocorrelation R_i, 0 ≤ i ≤ 120, for N = 30 spectral samples, corresponding to the discrete spectrum.
c) Autocorrelation R̂_LP,i, 0 ≤ i ≤ 120, corresponding to the LP envelope.
d) Discrete-frequency-sampled impulse response ĥ_i, -60 ≤ i ≤ 60.
Some examples of autocorrelation sequences are shown in Fig. 2.23. Note that R_i is obtained by aliasing R_org,i. According to Eq. 2.69, LP matches the autocorrelation of the continuous model spectrum to that of the given spectrum. Therefore:
R̂_LP,i = R_i = Σ_{l=-∞}^{∞} R_org,(i-lN),   0 ≤ i ≤ p,
        ≠ R_org,i,   0 ≤ i ≤ p.   (2.76)
Since the autocorrelation corresponding to the LP envelope will always equal an aliased version of R_org (in the discrete spectrum case), the LP envelope will not equal the original envelope. LP analysis produces a unique all-pole model for a given set of autocorrelations, which means that the original all-pole model is not a possible solution to Eq. 2.47. The LP criterion given in Eq. 2.68 does not take into account the aliasing that has occurred in the discrete spectrum, and matches the autocorrelation of the continuous all-pole model to the autocorrelation of the given signal.
It is well known that LP analysis fails when modeling high-pitched speech, where the spectral harmonics are widely separated (problems in formant tracking): the peaks of the LP spectral estimates are highly biased towards the pitch harmonics. It has been shown that the drawbacks of LP are inherent to its error criterion given in Eq. 2.68. This problem of high-pitched speech modeling is related to autocorrelation aliasing. As the pitch increases, we have fewer and fewer harmonics (spectral samples), the autocorrelation aliasing becomes more and more severe, and this leads to worse LP models.
The previous examples show that LP is the wrong approach to envelope estimation for discrete spectra, since it does not account for the aliasing caused by spectral sampling. Several methods have been proposed to improve the LP estimate. One of them, called discrete all-pole modeling (DAP), is in general superior to standard LP analysis. DAP uses a different error measure, the Itakura-Saito (I-S) measure, originally defined for continuous spectra. After its adaptation to the discrete case we obtain:
E_IS = (1/N) Σ_{m=1}^{N} [ P(ω_m)/P̂(ω_m) - ln( P(ω_m)/P̂(ω_m) ) - 1 ],   (2.77)
where P(ω_m) is the given discrete spectrum defined at N frequencies, and P̂(ω_m) is the all-pole model spectrum (Eq. 2.41) defined at the same frequencies. E_IS is always nonnegative and is equal to zero only when P(ω_m) = P̂(ω_m) for all ω_m (P(ω_m) = P̂(ω_m) gives a minimum for E_IS, but not necessarily for E_LP).
It is important to note that the continuous form of this error measure, used as part of a maximum likelihood approach to linear prediction, produces the same result as standard LP for continuous spectra (the optimal all-pole model is the same as the one produced by LP). Hence, by using the I-S error measure, we do not lose any of the advantages or performance of LP in unvoiced segments of speech, where LP behaves very well.
A spectral flatness interpretation of this discrete I-S error measure makes it a very reasonable choice for the problem of fitting an envelope to a set of discrete spectral values. Minimizing the error in Eq. 2.77 is equivalent to maximizing the spectral flatness of the error spectrum P(ω_m)/P̂(ω_m), where the spectral flatness is defined as the geometric mean of the spectral samples divided by their arithmetic mean. In our case this means that the optimal model is the one which makes the residual (error) spectrum as flat as possible.
If E_IS is small, the I-S error approximates the mean-squared distance between the log spectra:
E_dB = 6.142 √E_IS = sqrt( (1/N) Σ_{m=1}^{N} [ 10 log10 P(ω_m) - 10 log10 P̂(ω_m) ]² ),   for small E_IS.   (2.78)
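The discrete I-S measure and its dB approximation are straightforward to compute; a small sketch (function names are ours):

```python
import numpy as np

def itakura_saito(P, P_hat):
    """Discrete Itakura-Saito error between a given spectrum P(w_m) and an
    all-pole model spectrum P_hat(w_m) at the same N frequencies (Eq. 2.77)."""
    r = np.asarray(P, dtype=float) / np.asarray(P_hat, dtype=float)
    return np.mean(r - np.log(r) - 1.0)

def edb_approx(e_is):
    """RMS log-spectral distance in dB approximated from a small E_IS (Eq. 2.78)."""
    return 6.142 * np.sqrt(e_is)
```

For two nearly equal spectra, `edb_approx` matches the directly computed RMS distance between the log spectra to within a small fraction of a dB.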
(compare with Eq. 2.39 and 2.40), where {d_k} is a set of coefficients with d_0 ≠ 1 and with G incorporated in the coefficients. Note that g_0 is equal to the zero-lag autocorrelation. The result yields a set of standard correlation matching conditions, given by:
R̂_i = R_i,   0 ≤ i ≤ p,   (2.82)
where R_i is the autocorrelation corresponding to the given discrete spectrum (given in Eq. 2.73) and R̂_i is the autocorrelation corresponding to the all-pole model sampled at the same discrete frequencies as the given spectrum:
R̂_i = (1/N) Σ_{m=0}^{N-1} P̂(ω_m) cos(iω_m).   (2.83)
Eq. 2.82 looks similar to the usual LP autocorrelation matching condition (Eq. 2.69). However, the difference is that R̂_LP is the autocorrelation of the continuous all-pole spectrum P̂(ω), whereas here R̂ is the autocorrelation of a discrete sampling of the all-pole spectrum.
DAP requires matching the given aliased autocorrelation to the autocorrelation of the all-pole
aliased in the same manner (Eq. 2.82). It is this improved correlation matching condition, which
incorporates the autocorrelation aliasing, that makes DAP more suitable than LP for modeling voiced
frames and discrete spectra in general.
Rd = R̂d, (2.85)
where d is the column vector of predictor coefficients, and R and R̂ are symmetric Toeplitz matrices.
Hence the next step is to solve a set of p + 1 nonlinear equations in p + 1 unknowns. The equations
are nonlinear because R̂ is a function of d.
To simplify the solution of the nonlinear problem given in Eq. 2.84, we use the following property of sampled all-pole filters:
Σ_{k=0}^{p} d_k R̂_{i-k} = ĥ_{-i},   0 ≤ i ≤ p,   (2.86)
where ĥ_{-i} is the (time-reversed) impulse response of the discrete-frequency sampled all-pole model, given by:
ĥ_{-i} = (1/N) Σ_{m=1}^{N} e^{-jω_m i} / D(ω_m),   (2.87)
where:
P̂(ω_m) = |Ĥ(ω_m)|² = 1 / |D(ω_m)|² = 1 / |Σ_{k=0}^{p} d_k e^{-jω_m k}|².   (2.88)
We want to prove the property given in Eq. 2.86, and start with the identity:
Multiplying both sides by e^{jω_m i}, averaging over all ω_m and applying the definition of R̂_i from Eq. 2.83, we obtain:
Σ_{k=0}^{p} d_k R̂_{i-k} = (1/N) Σ_{m=0}^{N-1} Ĥ(ω_m) e^{-jω_m i} = ĥ_{-i},   for all i.   (2.92)
Now we substitute Eq. 2.86 into the minimization condition in Eq. 2.84 and obtain the following set of equations relating the all-pole predictor coefficients to the given autocorrelation sequence:
Σ_{k=0}^{p} d_k R_{i-k} = ĥ_{-i},   0 ≤ i ≤ p.   (2.93)
• Given the new estimate of ĥ_{-i}, solve the now "linear" set of Eqs. 2.93 for a new estimate of the predictors.
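The identity in Eq. 2.86, which makes this iteration workable, can be verified numerically; the sketch below samples an arbitrary stable all-pole model on a full-circle frequency grid (the model coefficients are illustrative):

```python
import numpy as np

p, N = 2, 32
d = np.array([1.0, -0.8, 0.15])            # D(z) with roots 0.5 and 0.3 (stable)
w = 2 * np.pi * np.arange(N) / N
D = np.array([np.sum(d * np.exp(-1j * wm * np.arange(p + 1))) for wm in w])
P_hat = 1.0 / np.abs(D) ** 2               # sampled all-pole spectrum (Eq. 2.88)

# Autocorrelation of the discretely sampled model (Eq. 2.83).
R_hat = np.array([np.mean(P_hat * np.cos(i * w)) for i in range(p + 1)])
# Time-reversed impulse response of the sampled model (Eq. 2.87).
h_rev = np.array([np.mean(np.exp(-1j * w * i) / D).real for i in range(p + 1)])
# Left-hand side of Eq. 2.86: sum_k d_k * R_hat[i-k] (R_hat is symmetric in i).
lhs = np.array([np.sum(d * R_hat[np.abs(i - np.arange(p + 1))])
                for i in range(p + 1)])
```

`lhs` equals `h_rev` to machine precision, so each DAP step reduces to solving the linear system of Eq. 2.93 for the current estimate of ĥ_{-i}.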
Experiments:
We have tried to incorporate DAP into the standard PLP-based analysis. First the warped power spectrum is computed (with application of the power law and possibly the equal loudness approximation). After the spectrum processing, the IDFT is applied and an autocorrelation sequence is obtained. For this autocorrelation sequence, the {d_k} coefficients are computed using standard LP analysis. Iteratively, a new set of {d_k} is obtained with respect to the new estimate of ĥ_{-i}. This iterative approach is computationally expensive, so we have used a small number of iterations.
– PLP38: Application of discrete all-pole modeling. First standard LP analysis is done. For this all-pole model, the impulse response is computed. Then the set of nonlinear equations is solved iteratively (number of iterations is 2).
– PLP39: Similar to PLP38, number of iterations is 5.
– PLP40: Similar to PLP38, number of iterations is 1.
SDC-Accuracy [%]   Italian
test        hm      mm      wm
PLP38       38.40   84.58   93.26
PLP39       37.69   80.54   93.30
• Compute the IDFT of P^{-1}(ω_m) using Eq. 2.73, which yields the autocorrelation coefficients.
• P(ω) = P^{-1}(ω).
However, the mentioned algorithm assumes that the original P(ω_m) is an all-pole spectrum given by a finite number of equally spaced points on it. Therefore, its application to more general cases can be expected to cause problems. On the other hand, if the initial discrete spectrum is all-pole, the LP analysis ensures that the solution will be unique (only one correct solution). The only restriction is that the number of harmonics in the spectrum must be at least equal to the number of poles. The previous algorithm can be applied in cases where, for example, a harmonic all-pole but noisy spectrum is given (e.g. as a result of quantization).
An interesting note: if an Analysis-by-Synthesis (AbS) approach is applied to estimate an all-pole model, the AbS model spectrum will be identical to the desired discrete spectrum (which is not true for LP).
Chapter 3
Feature normalization
The variability in the acoustic signal is caused by several sources. There is a "useful" variability that is necessary to discriminate between different speech units (e.g. phonemes). However, there are also "harmful" sources of variability which are irrelevant to the speech recognition process, for example varying transmission channels, different speakers, different speaking styles or accents, channel noise, etc. We focus here on the variability caused by different types of noise. Hence we assume that the speech is corrupted by unknown additive and/or convolutional noise.
where x[n] is the desired signal and b[n] is the unwanted background noise. For the next steps of the derivation we assume x[n] and b[n] to be wide-sense stationary, uncorrelated random processes, with power spectral density functions P_x(ω) and P_b(ω), respectively. From the linearity of the Fourier transform we can write:
P_y(ω) = P_x(ω) + P_b(ω).   (3.2)
In speech processing we apply short-time Fourier analysis (STFT), so that we work with the short segments given by:
y_pL[n] = w[pL - n](x[n] + b[n]),   (3.3)
where L is the frame length and p is an integer. In the frequency domain, Y(pL, ω), X(pL, ω) and B(pL, ω) are the STFTs of y[n], x[n] and b[n], respectively, computed at the frame interval L. The squared STFT magnitude of y[n] is given by:
|Y(pL, ω)|² = |X(pL, ω)|² + |B(pL, ω)|² + X*(pL, ω)B(pL, ω) + X(pL, ω)B*(pL, ω).   (3.5)
We want to recover x[n] from y[n] without any a priori knowledge of g[n] (sometimes referred to as blind deconvolution). In short-time analysis we assume that the window w[n] is long and smooth relative to the distortion g[n], so that a short-time segment of y[n] can be written as:
y_pL[m] = w[pL - m](x[n] ∗ g[n])
        ≈ (x[m]w[pL - m]) ∗ g[m].   (3.7)
The STFT of the degraded signal is:
Y(pL, ω) = Σ_{m=-∞}^{∞} w[pL - m](x[m] ∗ g[m]) e^{-jωm}
         ≈ Σ_{m=-∞}^{∞} ((w[pL - m]x[m]) ∗ g[m]) e^{-jωm}
         = X(pL, ω)G(ω),   (3.8)
where G(ω) is the Fourier transform of the distortion.
3.4 Spectral Mean Subtraction
Now we consider the problem of recovering a desired signal x[n] from the convolution y[n] = x[n] ∗ g[n]. With respect to Eq. 3.8 we apply the nonlinear logarithm operator to the STFT of y[n] to obtain:
log[Y(pL, ω)] ≈ log[X(pL, ω)] + log[G(ω)].   (3.13)
The distortion g[n] is time-invariant; therefore the STFT views log[G(ω)] at each frequency as fixed along the time index p. If we assume that the mean of the speech component log[X(pL, ω)] is zero along the time dimension, then the convolutional distortion g[n] can be removed while the speech contribution is kept unaffected. This can be achieved in the quefrency domain by computing cepstra along each STFT time trajectory:
ŷ[n, ω] ≈ F_p^{-1}( log[X(pL, ω)] ) + F_p^{-1}( log[G(ω)] )
        = x̂[n, ω] + ĝ[n, ω]
        = x̂[n, ω] + ĝ[0, ω]δ[n],   (3.14)
where F_p^{-1} denotes the inverse Fourier transform of sequences along the time dimension p. Applying a cepstral lifter, we get:
x̂[n, ω] ≈ l[n]ŷ[n, ω],   (3.15)
where l[n] = 0 at n = 0 and unity elsewhere. This method is called cepstral mean subtraction, because the 0th value of the cepstrum equals the mean of log[Y(pL, ω)] for each ω (along the time dimension). Despite the fact that this technique is limited by the strictness of the zero-mean speech assumption, it has significant advantages in feature extraction for speech recognition. Since in practice the mean is computed over a finite number of frames, we can think of cepstral mean subtraction as a high-pass, non-causal filtering operation.
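On a matrix of cepstral features, cepstral mean subtraction is a one-line operation; the sketch below also illustrates why it removes a fixed channel (any constant per-coefficient offset cancels):

```python
import numpy as np

def cepstral_mean_subtraction(C):
    """C: (frames x coefficients) cepstral features.  Subtracting the
    per-coefficient mean over time removes a time-invariant log-spectral
    (convolutional) offset such as log G(w)."""
    return C - C.mean(axis=0, keepdims=True)
```

`cepstral_mean_subtraction(C + channel)` equals `cepstral_mean_subtraction(C)` for any constant channel vector, which is exactly the channel-invariance property derived above.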
The goal is to find a linear filter h[n] such that the sequence x̂[n] = y[n] ∗ h[n] minimizes the expected value of (x̂[n] - x[n])². Remembering that the signals x[n] and b[n] are uncorrelated and stationary, the frequency-domain solution to this stochastic optimization problem is given by:
H_s(ω) = P_x(ω) / (P_x(ω) + P_b(ω)),   (3.16)
which is referred to as the Wiener filter. When x[n] and b[n] meet the conditions under which the Wiener filter is derived (uncorrelated and stationary), the filter provides noise suppression without considerable distortion in the signal estimate and background residual. The required power spectra P_x(ω) and P_b(ω) can be estimated by averaging over multiple frames when sample functions of x[n] and b[n] are provided. Typically, however, the desired signal and background are non-stationary in the sense that their power spectra change over time, i.e. they can be expressed as time-varying functions P_x(n, ω) and P_b(n, ω). Therefore, ideally, each frame of the STFT is processed by a different Wiener filter. For the simplifying case of a stationary background, we can express the time-varying Wiener filter as:
H_s(pL, ω) = P̂_x(pL, ω) / (P̂_x(pL, ω) + P̂_b(ω)),   (3.17)
where P̂_x(pL, ω) is an estimate of the time-varying power spectrum of x[n], P_x(n, ω), on each frame, and P̂_b(ω) is an estimate of the power spectrum of the stationary background. The time-varying Wiener filter can also be written as:
H_s(pL, ω) = [1 + 1/R(pL, ω)]^{-1},   (3.18)
with the signal-to-noise ratio R(pL, ω) = P̂_x(pL, ω)/P̂_b(ω). When the suppression curves for spectral subtraction and Wiener filtering are computed, it can be shown that the attenuation of low-SNR regions relative to high-SNR regions is stronger for the Wiener filter. A second important difference from spectral subtraction is that the Wiener filter does not invoke an absolute thresholding.
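The suppression behavior of Eq. 3.18 is easy to see as a gain curve over SNR (a minimal sketch; the SNR floor is ours, to avoid division by zero):

```python
import numpy as np

def wiener_gain(snr):
    """Wiener suppression gain H = [1 + 1/R]^(-1) (Eq. 3.18) for a per-bin
    signal-to-noise ratio R = P_x / P_b."""
    snr = np.maximum(np.asarray(snr, dtype=float), 1e-12)
    return 1.0 / (1.0 + 1.0 / snr)
```

The gain rises monotonically from 0 toward 1 with SNR and never clips to exactly zero, unlike the absolute thresholding of spectral subtraction.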
In feature extraction it is required to do MVN on-line (we cannot afford to compute the mean and variance from one long sentence). Therefore, the estimates of the mean and variance have to be time-dependent. Let Ψ̄_t be the normalized cepstral features, where t denotes time. Then the mean m̄_t and variance σ̄²_t are computed as:
m̄_t = m̄_{t-1} + α(C̄_{t-1} - m̄_{t-1}),   (3.22)
σ̄²_t = σ̄²_{t-1} + α((C̄_{t-1} - m̄_{t-1})² - σ̄²_{t-1}),   (3.23)
where α is an updating factor that needs to be chosen; in our experiments α = 0.02 was found to work well. The normalized cepstral features are then computed as:
Ψ̄_t = (C̄_t - m̄_t) / (σ̄_t + ø).   (3.24)
The additive parameter ø is set to 1 and prevents the denominator in Eq. 3.24 from becoming very small, which may happen in long silent regions, where the variance estimate tends to 0. In such regions the normalized cepstral features contain only noise, which would be amplified, and as a result the recognition performance would significantly degrade.
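The recursive update of Eqs. 3.22-3.24 can be sketched as below (the initial mean and variance default to 0 and 1 here; in the report they are the global training-set statistics):

```python
import numpy as np

def online_mvn(C, alpha=0.02, floor=1.0, m0=None, v0=None):
    """On-line mean and variance normalization of cepstral features.
    C: (frames x coefficients).  alpha is the update factor (0.02 in the
    report); `floor` is the additive term (the report's o-slash symbol,
    set to 1) that keeps the denominator away from zero in long silences."""
    C = np.asarray(C, dtype=float)
    m = np.zeros(C.shape[1]) if m0 is None else np.asarray(m0, float).copy()
    v = np.ones(C.shape[1]) if v0 is None else np.asarray(v0, float).copy()
    out = np.empty_like(C)
    for t in range(C.shape[0]):
        if t > 0:
            delta = C[t - 1] - m                    # C_{t-1} - m_{t-1}
            m = m + alpha * delta                   # Eq. 3.22
            v = v + alpha * (delta ** 2 - v)        # Eq. 3.23
        out[t] = (C[t] - m) / (np.sqrt(v) + floor)  # Eq. 3.24
    return out
```

Whether Eq. 3.23 uses the pre- or post-update mean is ambiguous in the text; the sketch uses the pre-update value.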
Experiments:
We have done many experiments related to mean and variance normalization either applied on MFCC
or PLP coefficients. First important issue that highly affect speech recognition performance is the global
mean and variance which has to be computed prior application of MVN.
We have tried to compute global mean and variance for the whole training data set:
– MFCC ONLN1: For TI-digits (noisy training) and Italian DB-global mean and variance were trained
on the both training data set together. For SP and FIN DB-global mean and variance were trained
on each training part of MFCC features separately. MFCC analysis is to be taken from MFCC
experiment.
– PLP7 ONLN1: Global mean and variance was trained similarly as in MFCC ONLN1 (trained
obviously on PLP7 ). MVN was applied on PLP-cepstrum (similar to experiment PLP7 ).
Baseline MVN experiments shows that even though PLP7 gives better recognition performance (overall),
after application on line normalization the improvement is not such significant as in case of MFCC.
Therefore, we have tried to employ mean normalization and variance normalization separately. In case
of mean normalization, Eq. 3.24 has changed:
In the case of variance normalization only:
Ψ̄_t = C̄_t / (σ̄_t + ø).   (3.26)
– MFCC MEAN: Only mean normalization given by Eq. 3.25. The rest is similar to MFCC ONLN1.
– PLP7 MEAN: Only mean normalization given by Eq. 3.25. The rest is similar to PLP7 ONLN1.
The MVN has also been applied on top of PLP6 nocoder and on top of subfea 10 frame orig noonline nocoder (see sect. 2.2.2.11):
– PLP6 onln nocoder new2: PLP analysis similar to PLP6 nocoder. MVN applied (global mean and variance trained on the IT DB and TIdigits (noisy) training data set of PLP6 nocoder; features at the end of the front-end, no voice activity detection).
– subfea 10 frame orig nocoder new: MFCC analysis similar to subfea 10 frame orig noonline nocoder. MVN applied (global mean and variance trained on the IT DB and TIdigits (noisy) training data set of subfea 10 frame orig noonline nocoder; features at the end of the front-end, no voice activity detection).
The results of the MFCC experiments and of the PLP analysis experiments show that in general PLP analysis gives better recognition performance. However, when MVN is then used, the improvement is not as significant in the PLP experiments.
When only mean normalization is used (MFCC MEAN and PLP7 MEAN), the behavior of MFCC and PLP analysis is similar. The difference comes from variance normalization, which works better in the MFCC experiments. The all-pole model employed in PLP analysis smooths the power spectrum of the signal in general; therefore the global variance is typically smaller than in MFCC analysis. In our experiments we have dealt with different values of the global variance. The results show that in the case of PLP analysis, noisy data should be taken for training the global variance, because of its smaller value. Then the recognition improvement with application of MVN is similar to that achieved in MFCC analysis:
– PLP6 onln nocoder new5: PLP processing is similar to PLP6 nocoder. Global mean and variance are trained on front-end features (so no VAD), where noise suppression was switched off (noisy data used). TIdigits (noisy) and the IT training DB were used. In feature extraction, noise suppression was obviously switched on.
– subfea 10 frame orig nocoder new4: MFCC analysis is similar to subfea 10 frame orig noonline nocoder. Global mean and variance were trained on clean data of front-end features (noise suppression was switched on). TIdigits (clean+noisy) and the IT training DB were used. It seems to be better to use clean training data for training the global mean and variance in the MFCC experiments.
Figure 3.2: Terminal feature extraction block diagram with either MFCC analysis or PLP analysis
employed (feature extraction only for 0 − 4kHz).
– Italian database:
∗ LSF1: LSF coefficients computed from the all-pole model (the all-pole model is obtained in exactly the same way as in the PLP7 experiment).
∗ LSF1 PCA: On top of LSF1, a PCA transformation was applied (covariance matrix computed on the TIdigits noisy training DB).
∗ LSF1 ONLN: On top of LSF1 PCA, MVN was applied (global mean and variance trained on the IT train DB).
∗ REFL1: Reflection coefficients computed from the all-pole model (obtained in exactly the same way as in the PLP7 experiment).
∗ REFL1 PCA: Application of PCA on top of REFL1 (covariance matrix computed from the IT training DB).
∗ REFL1 ONLN: Application of MVN on top of REFL1 PCA (global mean and variance trained on the IT train DB).
∗ LAR1: Log-area-ratio coefficients computed from the all-pole model (obtained in exactly the same way as in the PLP7 experiment).
∗ LAR1 PCA: Application of PCA on top of LAR1 (covariance matrix computed from the IT
training DB).
∗ LAR1 ONLN: Application of MVN on top of LAR1 PCA (global mean and variance trained on
the IT training DB).
∗ PLP16: Linear predictive coefficients (all-pole model) obtained in exactly the same way as in the
PLP7 experiment.
∗ PLP16 PCA: Application of PCA on top of PLP16.
∗ PLP16 ONLN: Application of MVN on top of PLP16 PCA.
– Spanish and Finnish databases:
∗ LSF1: LSF coefficients computed from the all-pole model (obtained in exactly the same way as
in the PLP7 experiment), Fstart = 64 Hz.
∗ LSF1 PCA: On top of LSF1, a PCA transformation was applied (covariance matrix computed
on the SP and FIN training DBs, separately).
∗ LSF1 ONLN: On top of LSF1 PCA, MVN was applied (global mean and variance trained on the
SP and FIN training DBs, separately).
∗ REFL1: Reflection coefficients computed from the all-pole model (obtained in exactly the same
way as in the PLP7 experiment), Fstart = 0 Hz.
∗ REFL1 PCA: Application of PCA on top of REFL1 (covariance matrix computed on the SP and
FIN training DBs, separately).
∗ REFL1 ONLN: Application of MVN on top of REFL1 PCA (global mean and variance trained
on the SP and FIN training DBs, separately).
∗ LAR1: Log-area-ratio coefficients computed from the all-pole model (obtained in exactly the
same way as in the PLP7 experiment), Fstart = 64 Hz.
∗ LAR1 PCA: Application of PCA on top of LAR1 (covariance matrix computed on the SP and
FIN training DBs, separately).
∗ LAR1 ONLN: Application of MVN on top of LAR1 PCA (global mean and variance trained on
the SP and FIN training DBs, separately).
∗ PLP16:
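The PCA step applied on top of the LSF/REFL/LAR streams amounts to a standard covariance eigendecomposition estimated on the respective training DB. The sketch below is a generic illustration (helper names are hypothetical, and the actual implementation may differ, e.g. in whether any dimensions are discarded):

```python
import numpy as np

def train_pca(train_frames):
    """Estimate a PCA (decorrelating) transform from training frames.

    train_frames: (n_frames, n_dims) array pooled over the training DB.
    Returns the training mean and the eigenvector matrix of the
    covariance, columns sorted by decreasing eigenvalue.
    """
    mean = train_frames.mean(axis=0)
    cov = np.cov(train_frames - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return mean, eigvecs[:, order]

def apply_pca(frames, mean, eigvecs):
    """Project feature frames onto the principal axes."""
    return (frames - mean) @ eigvecs
```

After projection, the sample covariance of the transformed training data is diagonal, which is the property the correlation-index experiments below measure.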
SDC-Accuracy [%] Italian Finnish Spanish
test hm mm wm hm mm wm hm mm wm
(wm: well-matched, mm: medium mismatch, hm: high mismatch conditions)
LSF1 ONLN 64.72 86.9 94.09 61.31 86.39 94.8 70.68 82.21 90.83
REFL1 32.81 82.62 92.48 37.56 58.48 90.15 29.32 58.33 85.38
REFL1 PCA 39.95 82.18 93.95 32.9 51.37 85.09 29.05 68.15 82.1
REFL1 ONLN 56.88 69.12 89.16 56.25 84.13 93.32 62.83 77.06 88.52
LAR1 32.73 83.10 92.85 40.85 55.2 91.41 36.12 69.87 85.08
LAR1 PCA 43.75 83.62 93.41 48.8 61.76 89.88 32.09 66.67 83.24
LAR1 ONLN 61.18 86.78 93.64 61.52 85.84 94.48 65.95 80.45 88.88
PLP16 29.92 76.47 91.15 34.95 55.4 88.25 30.26 66.1 84.86
PLP16 PCA 35.20 74.83 90.50 40.60 57.25 90.18 31.07 64.49 82.10
PLP16 ONLN 60.29 82.50 93.26 59.82 83.65 93.78 65.35 78.98 89.75
where Σ is the feature covariance matrix. The norm of C (called the correlation index) is then a
measure of the correlation between the features. If ||C|| = 1, the features are perfectly uncorrelated;
as ||C|| increases, the features are more correlated. Low correlation between features results in a
better (more unique) description of the speech characteristics. Moreover, the covariance matrix of
the features can then be approximated by a diagonal matrix, which means a substantial reduction in
the number of model parameters.
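The report does not spell out which matrix norm is used. One reading consistent with the stated properties (||C|| = 1 for perfectly uncorrelated features, growing as correlation grows) is the Frobenius norm of the correlation matrix scaled by √N; the sketch below uses that assumption and should be read as illustrative, not as the exact definition used here:

```python
import numpy as np

def correlation_index(frames):
    """Correlation index of one utterance.

    frames: (n_frames, n_dims) feature array. C is the correlation
    matrix (covariance normalized by the feature standard deviations).
    The Frobenius norm is scaled by sqrt(n_dims) so that ||C|| = 1
    for perfectly uncorrelated features; the exact norm used in the
    report is not specified, so this is one consistent reading.
    """
    sigma = np.cov(frames, rowvar=False)   # feature covariance matrix
    d = np.sqrt(np.diag(sigma))
    c = sigma / np.outer(d, d)             # correlation matrix
    n = c.shape[0]
    return np.linalg.norm(c, 'fro') / np.sqrt(n)
```

With this definition, duplicating a feature dimension (perfect correlation) strictly increases the index.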
Figs. 3.4 and 3.5 show ||C|| for the whole training data set of the Italian database (2951 files) and
for several different types of features. In order to evaluate the correlation index of the whole data set,
we have chosen three statistics (see Tab. 3.4):
• C1 - mean value of the correlation index over all sentences (mean value of the histogram).
• C2 - mean value of the correlation index over sentences with correlation index in the range 1 − 50.
[Four histogram panels a)–d), counts vs. correlation index 0–60.]
Figure 3.4: Histogram of the correlation index over sentences of the IT training DB for experiments:
a) MFCC, b) MFCC ONLN1, c) PLP7, d) PLP7 ONLN1.
[Four histogram panels a)–d), counts vs. correlation index 0–60.]
Figure 3.5: Histogram of the correlation index over sentences of the IT training DB for experiments:
a) LSF1, b) LSF1 PCA, c) REFL1, d) REFL1 PCA.
Corr. index [-] C1 C2 C3
MFCC 25.75 17.55 9.97
MFCC ONLN1 17.16 14.05 7.47
PLP7 30.25 20.96 9.98
PLP7 ONLN1 23.72 18.82 7.49
LSF1 262.7 31.44 29.85
LSF1 PCA 17.06 16.19 9.97
LSF1 ONLN 12.39 12.2 7.43
REFL1 40.16 20.77 12.49
REFL1 PCA 17.97 17.14 9.98
REFL1 ONLN 13.71 13.45 7.48
LAR1 40.24 20.77 9.99
LAR1 PCA 18.11 17.25 9.97
LAR1 ONLN 13.30 13.06 7.40
Table 3.4: Evaluation of ||C|| over the whole Italian training data set.
With respect to the normalization schemes derived from a physical model, we can distinguish between
model-based and data-distribution-based normalization.
In the first case, the normalization is based on some model of speech production, transmission,
or perception. A small number of model parameters is estimated on the test data and used, with
respect to the given model, to normalize the acoustic vectors. Channel and environment normalization
techniques (e.g. cepstral mean normalization), as well as noise suppression techniques that rely on an
accurate SNR estimation, belong to this category.
In distribution-based normalization, the acoustic vectors are transformed into a domain better
suited to speech recognition. The transformation parameters are obtained from the distributions of
the training and testing data. The goal of such an approach is to transform the test vectors such that
their distribution matches the distribution of the training data.
The simplifying assumption behind the application of histogram normalization is that each feature
space dimension can be normalized independently of the others. Under this assumption, histogram
equalization can account for any type of non-linear distortion of each feature space dimension
(scaling, shifting), but it cannot rotate the feature space.
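A per-dimension HEQ of the kind used below can be sketched as CDF matching against the training (target) data. This is an illustrative version only (function names are hypothetical and the bin handling is simplified relative to the actual implementation):

```python
import numpy as np

def heq_one_dim(x, target, n_bins=300):
    """Equalize one feature stream x to the histogram of `target`.

    Maps each value through the empirical CDF of x and then through
    the inverse empirical CDF of the target (training) data, so the
    equalized values follow the target distribution.
    """
    # empirical CDF of the sentence being normalized
    hist, edges = np.histogram(x, bins=n_bins)
    cdf = np.cumsum(hist) / hist.sum()
    probs = np.interp(x, edges[1:], cdf)   # P(X <= x) for each value
    return np.quantile(target, probs)      # matching target quantile

def heq(features, target_features, n_bins=300):
    """Apply HEQ independently to each feature dimension.

    Each dimension is equalized on its own, so the transform can undo
    per-dimension non-linear distortions but never rotates the space.
    """
    return np.column_stack([
        heq_one_dim(features[:, i], target_features[:, i], n_bins)
        for i in range(features.shape[1])
    ])
```

Because the source CDF here is estimated from a single sentence, the number of bins cannot usefully grow without bound, which is exactly what the Nb experiments below observe.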
Experiments:
– MFCC HIST1: HEQ-based normalization, where each speech sentence is equalized independently.
The normalized data were MFCCs. As the target histogram: for IT-SDC, the Italian training DB;
for SP-SDC, the Spanish training DB; and similarly for FIN-SDC. Number of bins for histogram
computation: Nb = 300.
[Five histogram panels, one per cepstral coefficient c0 − c4.]
Figure 3.6: Mismatch between training and testing data sets for the Italian DB (100 sentences).
Histograms are plotted for MFCCs c0 − c4 (dot-dashed line: test data, solid line: train data).
Conclusion:
In our experiments we examined many parameters of the HEQ algorithm:
– The number of bins Nb should not be less than 300; however, increasing it further does not improve
performance (because only one sentence is used for the computation of the source distribution, the
amount of data is insufficient).
– HEQ behaved best when applied to untransformed data (MFCCs, PLP). The performance decreased
rapidly in experiments where the MFCCs were first processed by cepstral mean subtraction.
– Contrary to some previous experiments, performance was highest when HEQ was applied to
uncorrelated data (e.g. cepstral coefficients). Transformation of log energies did not bring good
results.
[Five histogram panels, one per cepstral coefficient c0 − c4.]
Figure 3.7: Mismatch between clean and noisy (SNR = 10 dB) data sets for the TI-digits DB (100
files). Histograms are plotted for MFCCs c0 − c4 (dot-dashed line: clean data, solid line: noisy data).
training as well as testing data. The great advantage of such an approach is that the transfer
lookup table is computed only once, at the beginning of the normalization algorithm. The transfer
function is computed between the training data distribution and a Gaussian distribution with zero
mean and unit variance (independently for each feature stream, of course):
– MFCC HIST5: HEQ with a Gaussian reference distribution. The data processed by on-line mean
normalization (MFCC MEAN ) were used for HEQ. Both training and testing data were transformed
in the same way (one lookup table). The whole training data set is used for the computation of the
source distribution. The range of the Gaussian distribution is −3, +3 (values out of the range are
considered to be 0). Number of bins Nb = 1000.
– MFCC HIST6: Same as MFCC HIST5; HEQ normalization applied on MFCC ONLN1.
– MFCC HIST7: Same as MFCC HIST5; HEQ normalization applied on log energies (23 bands).
Then mean normalization and a DCT transformation (23 × 15) were used.
– MFCC HIST9: Same as MFCC HIST5; only training data recorded with the hands-free microphone
(*.it1 files for IT-SDC) were used for the computation of the source distribution.
– MFCC HIST10: Same as MFCC HIST5; only training data recorded with the close-talk microphone
(*.it0 files for IT-SDC) were used for the computation of the source distribution.
– MFCC HIST12: Same as MFCC HIST5; HEQ normalization applied on MFCC.
– MFCC HIST16: Same as MFCC HIST9; MVN applied on top of the HEQ-normalized data.
– MFCC HIST18: Same as MFCC HIST5; range of the Gaussian distribution is −4, +4.
– MFCC HIST20: Same as MFCC HIST18; for IT-SDC, two lookup tables were computed separately:
one for files recorded with the close-talk microphone (*.it0) and one for files recorded with the
hands-free microphone (*.it1).
– MFCC HIST22: Same as MFCC HIST18; HEQ applied on the cepstral coefficients (MFCC MEAN )
except c0 (energy). The reason is that the c0 stream is more or less bi-modal, and the application
of HEQ with a unit Gaussian distribution can cause degradation; for the c0 stream, MVN
normalization has been used.
– PLP HIST1: Same as MFCC HIST5; instead of MFCCs, PLP cepstrum processed by mean
normalization (PLP7 MEAN ) has been used.
– PLP HIST3: Same as PLP HIST1; PLP7 used as the source distribution.
– PLP HIST4: Same as PLP HIST1; PLP7 ONLN1 used as the source distribution.
[Six scatter panels: log energies of noisy data vs. clean data for the first Mel filter bank bands.]
Figure 3.8: Distribution of clean data (log energies after Mel filter bank application) versus the same
data corrupted by noise (TI-digits) with SNR = 10 dB.
– PLP HIST5: Same as PLP HIST1; training data recorded with the hands-free microphone used as
the source distribution.
– PLP HIST8: Same as PLP HIST1; HEQ applied on the cepstral coefficients (PLP7 MEAN ) except
c0 (energy). The reason is that the c0 stream is more or less bi-modal, and the application of HEQ
with a unit Gaussian distribution can cause degradation; for the c0 stream, MVN normalization has
been used. Range of the Gaussian distribution is −4, +4.
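The Gaussian-reference variant (the MFCC HIST5 family above) can be sketched as a single lookup table mapping the pooled training distribution onto a zero-mean, unit-variance Gaussian truncated to the stated range, computed once and then applied unchanged to both training and test data. This is a hedged illustration (helper names are hypothetical, and details such as out-of-range handling are simplified):

```python
import numpy as np
from math import erf, sqrt

def build_lookup(train_x, n_bins=1000, gauss_range=3.0):
    """One-off lookup table mapping one feature stream to N(0, 1).

    The source CDF comes from the pooled training data; the target is
    a zero-mean, unit-variance Gaussian truncated to +/- gauss_range.
    """
    hist, edges = np.histogram(train_x, bins=n_bins)
    src_cdf = np.cumsum(hist) / hist.sum()
    # inverse CDF of the truncated Gaussian on a fine grid
    grid = np.linspace(-gauss_range, gauss_range, 4001)
    gauss_cdf = 0.5 * (1.0 + np.vectorize(erf)(grid / sqrt(2.0)))
    gauss_cdf = (gauss_cdf - gauss_cdf[0]) / (gauss_cdf[-1] - gauss_cdf[0])
    # for each source bin edge, the Gaussian value with the same CDF
    table = np.interp(src_cdf, gauss_cdf, grid)
    return edges[1:], table

def apply_lookup(x, edges, table):
    """Map feature values (train or test) through the fixed table."""
    return np.interp(x, edges, table)
```

Since the source distribution is pooled over the whole training set, a much larger Nb (1000 here, versus 300 for per-sentence HEQ) is well supported by the data.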
[Scatter plot: log energy of the noisy signal vs. log energy of the clean signal, first band.]
Figure 3.9: Distribution of clean data (first log-energy band) versus the same data corrupted by noise
(TI-digits) with SNR = 10 dB. The solid curve shows the transformation obtained by histogram
equalization, the solid line represents the mean normalization transformation, the dashed line the
MVN transformation, and the dotted line no compensation.
[Fifteen histogram panels, one per cepstral coefficient.]
Figure 3.10: Histograms of 15 cepstral coefficients c0 − c14 (MFCCs from the MFCC feature
extraction).
[Fifteen cumulative-histogram panels, one per cepstral coefficient.]
Figure 3.11: Cumulative histograms of 15 cepstral coefficients c0 − c14 (PLP cepstrum from the PLP7
feature extraction). Solid lines represent cumulative histograms over the IT-SDC training data set
(reference histogram). Dashed lines represent cumulative histograms over one IT-SDC sentence.