You are on page 1of 4

COMPUTING MEL-FREQUENCY CEPSTRAL COEFFICIENTS

ON THE POWER SPECTRUM

Sirko Molau, Michael Pitz, Ralf Schliitel; and Hermann Ney

Lehrstuhl fur Informatik VI, Computer Science Department,


RWTH Aachen - University of Technology, 52056 Aachen, Germany
{molau, pitz, schlueter, ney} @informatik.rwth-aachen.de

Speech Waveform
ABSTRACT
In this paper we present a method to derive Mel-frequency cep-
stral coef cients directly from the power spectrum of a speech
signal. We show that omitting the ‘lterbank in signal analy-
sis does not affect the word error rate. The presented approach
simpli‘es the speech recognizer s front end by merging subse-
& Windowiny.

t
quent signal analysis steps into a single one. It avoids possible IFFTI*
interpolation and discretization problems and results in a com-
pact implementation. We show that frequency warping schemes r _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
; VTN Warping
like vocal tract normalization (VTN) can be integrated easily in
our concept without additional computational efforts. Recog-
nition test results obtained with the RWTH large vocabulary
speech recognition system are presented for two different cor-
pora: The German VerbMobil I1 dev99 corpus, and the English Filter Bank
North American Business News 94 20k development corpus.

1. INTRODUCTION

Most of today s automatic speech recognition (ASR) systems I DCT I

e,
are based on some type of Mel-frequency cepstral coef cients
(MFCCs), which have proven to be effective and robust under
various conditions. This paper describes an altemative concept
to derive MFCCs directly from the power spectrum of the speech
signal. A number of subsequent steps of the traditional signal Derivatives
analysis are integrated into the cepstrum transformation, whch
avoids possible discretization and interpolation errors. The new
concept yields equally good recognition performance without a
‘lterbank, thus reduces the number of parameters that need to be
optimized. Acoustic Vector
The remainder of this paper is organized as follows: In the
next section we will brie’y recapitulate the typical signal analy- Figure 1: Typical signal analysis front end
sis procedure. Then we discuss in detail implementational issues
of the traditional MFCC computation and present our integrated
approach. In section 4 we will demonstrate that frequency warp-
ing schemes like VTN can be easily integrated as well. Finally, MFCC vector. The highest cepstral coef cients are omitted to
we will present recognition test results for the VerbMobil I1 and smooth the cepstra and minimize the in’uenceof the pitch which
the North American Business News Corpus, and draw the con- is irrelevant for the speech recognition process. The mean of
clusions of our work. each cepstral component is subtracted, and the variance of each
component may also be normalized. Finally, the MFCC vector
is augmented with time derivatives. Additional transformations
2. SIGNAL ANALYSIS like linear discriminant analysis (LDA) may further increase the
temporal context and the discriminance of the acoustic vector.
Figure 1 shows the signal analysis front end of a typical ASR
As a result signal analysis provides every 10 ms an acoustic vec-
system. The speech waveform, sampled at 8 or 16 kHz, is ‘rst
tor, which is typically of dimension 25 to 50.
differentiated (preemphasis) and cut into a number of overlap-
ping segments (windowing), each 25 ms long and shifted by
10 ms. A Hamming window is multiplied and the Fourier trans- 3. COMPUTATION OF MFCCS
form (FFT) is computed for each frame. The power spectrum
is warped according to the Mel-scale in order to adapt the fre- We now want to have a closer look at the computation of cep-
quency resolution to the properties of the human ear. Then the stral coef cients from speech spectra, i.e. the signal analysis
spectrum is segmented into a number of critical bands by means steps between F I T and DCT. We will discuss problems of dif-
of a ‘lterbank. The ‘lterbank typically consists of overlapping ferent implementations, and ’nally present a method to compute
triangular ‘lters. A discrete cosine transformation (DCT) ap- MFCCs directly on the power spectrum. Both the traditional and
plied to the logarithm of the ‘lterbank outputs results in the raw the integrated approach suggested here are depicted in Figure 2.

0-7803-7041-4/01/$10.00 02001 IEEE 73


In all cases the logarithm of the ‘Iterbank output is cosine
transformed to obtain MFCCs.

t._ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ . ,
,..__________________ 3.2. Computing MFCCs Directly On The Power Spectrum
i VTN Warping
.....................
i
...___.__.._____.._.I

We have investigated an altemative method to compute Mel-


frequency warped cepstral coef cients directly on the power
spectrum and thereby avoid possible problems of the standard
approach.
Ignoring any spectral warping for a moment, cepstral coef -
cients C k can be derived by Eq. (1):
I Logarithm I

DCT
1
DCT with integrated VTN-
Ck = 2n
-
!L /
-?r
dw Ig lX(eJw)l . elwk (1)

Depending on whether a ‘Iterbank is used or not, IX(.)l


stands for either the ‘Iterbank outputs or the power spectrum.
Figure 2: Comparison of the traditional MFCC computation The sequential application of a monotone invertible fre-
(left) with the integrated approach (right) investigated here. quency warping function g : [-n, n] + [-r, n] and DCT can
be expressed as follows:

3.1. Traditional Filterbank Approach w -+ LJ =g(w)

Mel-frequency warping and the ‘Iterbank can be implemented Ck = dLjlg IX(eJg-’(’))l . elGk
easily in the frequency domain (see Figure 3). One method is 2a (2)
to transform the power spectrum, i.e. to compute a Mel-warped -77

spectrum by interpolation from the original discrete-frequency


To incorporate warping directly into the cosine transforma-
power spectrum. The advantage is that the following triangular
tion, we change the integration variable and use the derivative of
‘Iters all have the same shape and can be placed uniformly at
the warping function dLJ/dw (Eq. 3). The continuous integral is
the Mel-warped spectrum. On the other hand, the discretization
later approximated in the standard way by a discrete sum (Eq. 4):
may be especially critical due to the large dynamic range of the
power spectrum.
Ck = 1 dw lg IX(eJw)l. eJg(w)k. g’(w) (3)
2n
-7r

-
N

One speci’c type of frequency warping is the Mel-frequency


scaling p ( . ) , which is usually carried out according to formula
( 5 ) with the sampling frequency fs [ 6 ] :

Figure 3: Schematic plot of different triangular ‘Iterbank imple-


mentations. The ‘Iters are cither uniformly distributed at the For integration into the cosine transformation, the Mel-
Mel-warped spectrum, or non uniformly at the original spec- warping function needs to be normalized in order to meet the
trum. In the latter case, they should be asymmetric as well.
criterion b ( a ) = T .

Another way is to place the triangular ‘Iters non uniformly


at the unwarped spectrum and thereby implicitely incorporate
Mel-frequency scaling [ 11. However, discretization errors may
then occur if the spectral resolution is not appropriate. The low-
est ‘Iters could be placed at a very few spectral lines only, and with
the maximum of one of the ‘Iters may fall just inbetween two
spectral lines. In addition, the ‘Iters should not be triangular
and symmetric anymore, but bend according to the shape of the
Mel-function at the position of the ‘Iter.
Last but not least it is not clear how many ‘Iters are required Replacing g ( . ) in Eq. (4) by b ( . )leads to a compact imple-
and which ‘Iter shape is optimal. Triangular ‘Iters are occa- mentation of MFCC computation with only a few lines of code.
sionally replaced by trapezoidal or more complex shaped ones A look-up table for constants like the derivative and the cosine
derived from auditory models, and we sometimes observed bet- term can be precomputed, all that remains is a matrix multipli-
ter word error rates when using ‘Iters with cosine shape. cation on the logarithm of the power spectrum. Figure 4 shows

14
the effect of the modi'ed signal analysis on two cepstrum coef -
cients for a test sentence from the VerbMobil I1 corpus. Whereas
the lower order coef cients are almost identical, the difference
increases with higher coef cient orders due to the discarted '1-
terbank.

4000 I

V I
original frequency
8n

Figure 5: Warping function for piece-wise linear VTN

with parameters Pw and yw. Although these parameters formally


Baseline -
800 - Integrated - depend on w , they can take on only two values:
600 -

400

200
ow = { TT'n=ff2
ff w
w>wo
5 WO

I WO
0

200
Yw = { 0
7r.W
(ff-l).T--o w>wo
w

-400 Mel-warping is applied after the spectra are scaled according


-600 to VTN. Hence, the combination x ( . ) of Mel- and VTN warping
-800
becomes
-1000 L ' J
0 50 100 150 200 250 300 350 400
x(w) = 'CI(va(w))
Time Frame

Figure 4: Comparison of cepstrum coef cients 1 (upper curve)


and 15 (lower curve) for a test sentence from the VerbMobil I1 with the derivative:
corpus (baseline: traditional 'Iterbank approach; integrated:
DCT with integrated Mel-frequency warping).

Cepstrum coefcients with integrated VTN and Mel-


frequency w q i n g are obtained by replacing g(.) in Eq. (4)
4. INTEGRATION OF VTN
by x ( . ) .
Vocal tract length normalization (VTN) is a speaker normaliza-
tion scheme that also relies on warping the power spectrum. The 5. RECOGNITION TESTS
idea is to compensate for the shift of formants in speech spectra
caused by the speaker-speci'c length of the vocal tract. To evaluate the proposed signal analysis approach, we performed
It has been shown before that one possible VTN implemen- recognition tests with the RWTH large vocabulary speech recog-
tation is to modify the location of 'Iters in the 'Iterbank just as nition system (see [3] , [4],and [5] for detailed system descrip-
for Mel-frequency scaling [2]. From what we have presented in tions) on two different corpora. The VerbMobil I1 task (VM 11)
the previous section it is clear, however, that VTN can also be is German spontaneous speech with a I0k-word vocabulary, and
fully integrated into the cepstrum transformation. the North American Business News task (NAB) is clean read
The VTN warping function v, : [0,7r] t [0, 7r] needs to speech of Wall Street Joumal texts with a recognition vocabu-
be monotone and invertible as well. A simple choice is a piece- lary of 20k. Details of the training and test corpora are given in
wise linear warping function as shown in Figure 5 . The in'exion Table 1.
frequency W O at which the slope of the function changes depends
on c y : Table 1: Statistics of the training and test comora
corpus VerbMobil I1 )I Wall Street Joumal
11 CD1-4i I DEV99 1) WSJO+i 1 DEV-94 I
11% 27% 19%
In order to avoid complicated case distinctions for different # Speakers
warping factors and frequencies, we write the warping function # Sent.
I
w -+ va ( w ) in the following convenient form #Words
Trigram PP. 62.0 126.6
va(w) = pww+rw (7)

75
We found that the recognition performance of both meth-
ods is similar. In most cases the integrated approach performed
almost as good or slightly better than the traditional sequential
analysis with a ‘Iterbank. A similar behaviour was found on
smaller German and Italian telephone speech corpora (VerbMo-
bil and EuTrans).

6. CONCLUSIONS
In this paper we have presented an altemative signal analy-
sis approach that merges a number of subsequent analysis step
into one. Omitting the ‘Iterbank and integrating Mel-frequency
alpha warping into the cepstrum transformation simpli’es the signal
250
545 male speakers - analysis (no ‘Iterbank parameters need to be optimized), avoids
581 female speakers ------
possible interpolation and discretization problems, and leads to

2w t a compact implementation of the MFCC front end. We have


shown that concepts like VTN that rely on warping speech spec-
tra can be easily integrated as well. Recognition tests on the
VerbMobil I1 and the North American Business News corpus re-
vealed that the new approach performs as good as the traditional
signal analysis.

7. REFERENCES

088 092 096 1 104 158 112


S. B. Davis and P. Mermelstein: “Comparison of para-
alpha metric representations for monosyllabic word recognition
in continuously spoken sentences”, IEEE Transactions
Figure 6: Warping factor distribution of the VM I1 training on Acoustic, Speech, and Signal Processing, Vol. 28,
speakers. The upper histogram was obtained with Mel- and VTN NO.4, August 1980, pp. 357-366.
warped spectra obtained by linear interpolation, the lower his-
togram with integrated Mel-frequency and VTN warping. L. Lee and R. Rose: “Speaker normalization using
e f cient frequency warping procedures”, Proc. IEEE
Int. Conf on Acoustics, Speech and Signal Processing,
Atlanta, GA, Mai 1996, pp. 353-356.
The ‘rst result of using the integrated approach in VTN
training was a much smoother distribution of warping factors. H. Ney, L. Welling, S. Ortmanns, K. Beulen, and F. Wes-
Figure 6 shows the corresponding histograms for the VM I1 sel: “The RWTH Large Vocabulary Continuous Speech
training corpus. A closer inspection revealed that linear inter- Recognition System”, Proc. IEEE Int. Conf on Acous-
polation of spectral lines when transforming the power spectrum tics, Speech and Signal Processing, Seattle, WA, May
for VTN warping was the main reason for the erratic distribution 1998, pp. 853-856.
observed before. It tumed out, however, that the word error rate A. Sixtus, S. Molau, S. Kanthak, R. Schluter, andH. Ney:
(WER) was only marginally affected by this difference. “Recent Improvements of the RWTH Large Vocabulary
Next, we compared the recognition performance of the tra- Speech Recognition System on Spontaneous Speech”,
ditional signal analysis approach (baseline) with the integrated Proc. IEEE lnt. Conf on Acoustics, Speech and Signal
MFCC computation. Additional tests were carried out with two- Processing, Istanbul, Turkey, June 2000, pp. 167 1-1674.
pass and fast VTN as described in [4]. The best results of each L. Welling, N. Haberland, and H. Ney: “Acoustic Front-
setup are summarized in Table 2. End Optimization for Large Vocabulary Speech Recogni-
tion”, Proc. EUROSPEECH, Rhodes, Greece, September
1997, pp. 2099-2102.
S. J. Young: “HTK: hidden markov model toolkit V1.4”,
User Manual, Cambridge University Engineering De-
partment, Cambridge, England, February 1993.
Corp. VTN Cepstrum #Dns Overall [%I
[k] Del - Ins I WER

76

You might also like