
An improved version of the SPACE algorithm for noise robust speech recognition

Khalid Daoudi and Christophe Cerisara

Abstract: Recently we introduced a simple denoising algorithm, called SPACE, that yielded promising preliminary results in noise robust speech recognition. SPACE is essentially based on GMM modeling of clean and noisy speech and a Gaussian correspondence assumption. In this paper, we first evaluate the performance of SPACE on test set A of Aurora2 and show that it is globally not satisfactory. We argue that this is due to the non-fulfillment of the Gaussian correspondence assumption. We then propose a new training procedure for the GMMs that achieves a better Gaussian correspondence. We evaluate the new denoising algorithm on test set A of Aurora2. The results show that it outperforms not only SPACE but also the multistyle training system. They also show that it is robust to small SNR change.

clean speech signal (additive and/or convolutional...). This assumption may be unrealistic, which may consequently lead to erroneous estimates of the corrupted speech distributions. Another weakness is their strong dependence on the front-end (MFCCs) and the probabilistic models (HMMs) which are used. Recently, we introduced in [7] a new and simple algorithm, named MAP-SPACE, which can be seen as a hybrid between a denoising and an adaptation technique. SPACE stands for Stereo-based Piecewise Affine Compensation for Environments, in reference to the SPLICE algorithm [8], [9], [10]. MAP stands for the Maximum A Posteriori adaptation used to handle unknown environments. The first step in SPACE is to model noisy speech by a GMM. Then, a clean speech GMM is learned using a minimum mean square criterion that attempts to establish a correspondence between pairs of clean and noisy Gaussians. That is, we try to obtain a clean speech GMM clustering such that the acoustic region modeled by a clean Gaussian is the same as the one modeled by the corresponding noisy Gaussian. The affine mapping that transforms a noisy Gaussian into its corresponding clean one is then used to build the SPACE denoiser. SPLICE (Stereo-based Piecewise Linear Compensation for Environments) can actually be seen as a special case of SPACE. The principle of MAP-SPACE is to first use test observations in a MAP criterion to adapt the noisy GMM to the new environment. The initial noisy GMM parameters are then replaced by the new ones in the denoiser. MAP-SPACE has many advantages compared to SPLICE and traditional adaptation techniques [7], and the preliminary experiments reported in [7] showed promising results. It was then natural to carry out a more thorough evaluation. In this paper, we first evaluate the performance of SPACE and MAP-SPACE on test set A of Aurora2. The results show that this performance is not satisfactory.
We then argue that this is due to the non-fulfillment of the Gaussian correspondence assumption. We therefore propose a new training procedure, based on joint probability modeling of clean and noisy speech, to achieve such correspondence. We evaluate the new denoising algorithm on test set A of Aurora2. The results show that it not only significantly outperforms SPACE and MAP-SPACE, but also outperforms multistyle training (one of the most popular techniques in robust speech recognition).

I. INTRODUCTION

In real world applications of automatic speech recognition (ASR) systems, noise robustness is one of the most difficult and challenging issues. Many techniques [1] have been proposed to handle the difficult problem of mismatch between training and application conditions. These can be classified into two categories: signal processing and adaptation techniques. In the former, the input speech signal is processed so as to achieve noise robustness (RASTA [2] or CMN, for instance) or to denoise the signal (speech enhancement [3], for instance). In the latter, initial canonical models (generally clean speech models) are transformed to represent the new environment. In adaptation schemes such as MAP [4] and MLLR [5], the observed corrupted speech data is used to transform the initial models. One weakness of these schemes is that they require a relatively large amount of adaptation data to yield good estimates of the new models. A more important weakness is the need for a transcription of the speech data: in unsupervised mode this transcription is usually obtained from the initial models, which limits performance when the transcription accuracy is poor. Adaptation schemes such as PMC [6] do not rely on observed speech data; rather, they use noise observations and models to estimate the speech models in the new environment. The adapted model is a combination of the initial one and a (parametric) noise model whose parameters are estimated from noise observations. The combination is obtained using a (mismatch) function that hypothesizes how speech is corrupted by the noise sources [6]. A weakness of these schemes is the assumption made about the effects of the new acoustic environment on the clean speech, i.e., how the noise sources alter the
K. Daoudi is with the SAMOVA group of IRIT-CNRS, 31062 Toulouse, France. C. Cerisara is with the SPEECH group of INRIA-LORIA, 54602 Villers-lès-Nancy, France.





The first step of SPACE is to use the Expectation-Maximization (EM) algorithm in order to model the noisy speech y by a mixture of I Gaussians:


A. Experimental setup

Aurora2 is a standard corpus for comparing noise robust speech recognition algorithms [11]. The training corpus is divided into two parts: a clean and a noisy corpus. The noisy training corpus is a copy of the clean one that has been corrupted by four noise types (subway, babble, car and exhibition hall) at five SNR conditions (5, 10, 15, 20 dB plus the clean condition). The noisy training corpus is thus composed of 20 environmental conditions, each of which corresponds to a subset of the clean corpus. Such a corpus is called a stereo corpus, because every speech file occurs twice: in the clean condition and in one of the 20 noisy conditions. The test corpus is divided into three sets: A, B and C. Test set A is composed of clean speech corrupted with the same four noises as in the training corpus (subway, babble, car and exhibition hall), but at seven SNR conditions (-5, 0, 5, 10, 15, 20 dB plus clean speech). Thus, there is an SNR mismatch when testing at -5 dB and 0 dB, because these SNRs are not seen in training. Test set B is composed of clean speech corrupted with four different (unknown) noises, and test set C is composed of clean speech corrupted with two noises of test set B and further filtered to create convolutional mismatch. In our experiments, the whole database is parameterized with the standard HTK scripts provided with the Aurora2 database. Each acoustic vector is thus composed of 13 static + 13 delta + 13 acceleration = 39 MFCC coefficients, with the energy instead of c0. Note that cepstral mean normalization is not used in this standard front-end. There are 11 digit models (including oh and zero), plus two models for the silence and the optional short pause. Every digit is modeled by an 18-state HMM with 16 emitting states. The observation density in each state is modeled by a 3-mixture GMM. The short pause is a 3-state model with skip transitions. It has one emitting state that is shared with the silence model. The silence model has 5 states in total, with 3 emitting states, each modeled by a 6-mixture GMM.
All Gaussians have diagonal covariances. The standard baseline and multistyle HMMs are respectively trained on the clean corpus and on the whole noisy training corpus. The recognition accuracy scores of the baseline and multistyle systems are shown in Figures 1 and 2, respectively. As for SPACE (and MAP-SPACE), for each noisy condition of the training corpus, a GMM with 128 Gaussian components (I = 128) is trained using the EM algorithm. The corresponding clean GMM is trained on the corresponding clean sentences using the MMSE criterion. This couple of GMMs is used to denoise this noisy training corpus (using (1)). Finally, the pseudo-clean acoustic HMMs are trained with the HTK scripts on all the denoised sentences of the training corpus. For each testing condition, we first detect the closest training environment using a two-step process. First, the SNR is estimated with the following algorithm: for each test sentence, the energy is computed on a sliding window of 64 ms length. The window with the highest energy is

P(y) = Σ_{i=1}^{I} α_i G̃_i(y)

where G̃_i is a Gaussian of mean μ̃_i and diagonal covariance Q̃_i, and α_i is the prior of G̃_i. Then, using a Minimum Mean Square Error (MMSE) criterion, it models clean speech x by a mixture of I Gaussians:

P(x) = Σ_{i=1}^{I} α_i G_i(x)

(where G_i is a Gaussian of mean μ_i and diagonal covariance Q_i) in a way that attempts to make each clean Gaussian G_i correspond to the noisy Gaussian G̃_i. Such correspondences would indicate how the clean feature distribution is related to the noisy one in acoustic region i. We assume that this relationship is affine in each acoustic region i. One thus obtains the mapping that transforms y ~ N(μ̃_i, Q̃_i) into x ~ N(μ_i, Q_i) as

x = (Q_i Q̃_i^{-1})^{1/2} (y - μ̃_i) + μ_i.

This mapping is the basis of SPACE; that is, we assume that

E[x|y, i] = (Q_i Q̃_i^{-1})^{1/2} (y - μ̃_i) + μ_i.

Clean feature estimates are then given by:

x̂(y) = Σ_{i=1}^{I} P(i|y) E[x|y, i],   (1)

where P(i|y) = α_i G̃_i(y) / Σ_j α_j G̃_j(y). In order to estimate the parameters (μ_i, Q_i) of the clean speech GMM from the stereo training data (x_t, y_t), 1 ≤ t ≤ T, we minimize the MMSE objective function:

F = Σ_{t=1}^{T} E[(x_t - x̂(y_t))^2 | y_t].
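The denoiser (1) can be sketched as follows. This is a minimal illustration with numpy, not the authors' implementation: the function names, the toy parameters, and the use of a single feature vector at a time are assumptions; only the posterior-weighted affine mapping itself comes from the equations above.

```python
import numpy as np

def space_denoise(y, w, mu_n, var_n, mu_c, var_c):
    """MMSE denoiser of eq. (1): x_hat(y) = sum_i P(i|y) E[x|y,i], where
    E[x|y,i] = sqrt(var_c_i / var_n_i) * (y - mu_n_i) + mu_c_i for the
    diagonal-covariance case. Shapes: y is (d,), all GMM params are (I, d),
    w (the priors) is (I,)."""
    # log N(y; mu_n_i, diag(var_n_i)) for each component i
    log_norm = -0.5 * (np.sum(np.log(2 * np.pi * var_n), axis=1)
                       + np.sum((y - mu_n) ** 2 / var_n, axis=1))
    log_post = np.log(w) + log_norm
    log_post -= log_post.max()            # numerical stability
    post = np.exp(log_post)
    post /= post.sum()                    # posteriors P(i|y)
    scale = np.sqrt(var_c / var_n)        # diagonal (Q_i Q~_i^{-1})^{1/2}
    mapped = scale * (y - mu_n) + mu_c    # E[x|y,i] for every i
    return post @ mapped                  # posterior-weighted combination

# toy example: 2 components, 3-dimensional features
rng = np.random.default_rng(0)
w = np.array([0.4, 0.6])
mu_n = rng.normal(size=(2, 3)); var_n = np.full((2, 3), 2.0)
mu_c = rng.normal(size=(2, 3)); var_c = np.full((2, 3), 0.5)
x_hat = space_denoise(rng.normal(size=3), w, mu_n, var_n, mu_c, var_c)
```

With a single component the estimate reduces to the affine map itself, which is a quick sanity check on the implementation.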

The parameter estimation formulas can be found in [7]. When test observations are given in a new environment, the basic idea of the MAP-SPACE algorithm is to use these observations in a MAP criterion to adapt the initial noisy speech GMM to the new environment. The expectation is that such adaptation would preserve the correspondence between the initial and the new model parameters. That is, if P'(y) = Σ_{i=1}^{I} α'_i G'_i(y) is the adapted noisy speech GMM, where G'_i(y) = N(y; μ'_i, Q'_i), then each Gaussian G'_i would correspond to the Gaussian G̃_i (and thus to G_i). MAP-SPACE then consists in replacing the original noisy GMM parameters (μ̃_i, Q̃_i) by the new ones (μ'_i, Q'_i) in (1).
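The MAP mean update used in this kind of adaptation can be sketched as below. This follows the standard Gauvain-Lee style relevance-factor formulation rather than the exact criterion of [7] (which is not spelled out here): the relevance factor tau and the choice to adapt means only are assumptions made for illustration.

```python
import numpy as np

def map_adapt_means(Y, w, mu, var, tau=16.0):
    """MAP adaptation of GMM means from observations Y (T x d):
    mu'_i = (tau * mu_i + sum_t gamma_t(i) y_t) / (tau + sum_t gamma_t(i)),
    a standard relevance-factor form (tau is an assumed hyperparameter)."""
    # responsibilities gamma_t(i) under the initial (noisy) GMM
    log_p = (np.log(w)
             - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
             - 0.5 * np.sum((Y[:, None, :] - mu) ** 2 / var, axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)       # stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)       # shape (T, I)
    n_i = gamma.sum(axis=0)                         # soft counts per component
    y_bar = gamma.T @ Y                             # weighted observation sums
    return (tau * mu + y_bar) / (tau + n_i)[:, None]
```

The relevance factor interpolates between the prior means (large tau, little data) and the sample means (small tau, much data), which is why a large tau leaves the initial model essentially unchanged.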

assumed to represent speech, while the window with the lowest energy is assumed to represent noise. The SNR of each sentence is estimated from the ratio of the energies in these two windows. The average SNR over all sentences is then computed, and the closest corresponding SNR of the training corpus is found: four training conditions match this SNR. In the second step, the four noisy GMMs of these four training conditions are compared, and the one that maximizes the likelihood is chosen. The test corpus is then denoised using the parameters of this GMM and its corresponding clean GMM. After denoising, the test corpus is recognized with the pseudo-clean acoustic models.

B. Results

In this section, we evaluate the SPACE and MAP-SPACE algorithms on test set A of Aurora2. The recognition accuracy scores of SPACE are shown in Figure 3. Even though SPACE outperforms the baseline system, its performance is globally not satisfactory when compared to the multistyle system (which is the reference system). In particular, the accuracy score on the 5 dB test is significantly lower than the multistyle one, even though this condition has been seen in training. The very poor performance at 0 dB and -5 dB could be explained (at first sight) by the fact that SPACE does not handle SNR mismatch, and that we should use MAP-SPACE instead. The recognition accuracy scores of MAP-SPACE are shown in Figure 4. One can see that the performance of MAP-SPACE is even worse than that of SPACE. These results suggest that the assumptions made in designing SPACE and MAP-SPACE, particularly the Gaussian correspondence, are not fulfilled. This means that the MMSE criterion is not the best way to build such correspondences. We thus have to find another way to train the clean and noisy GMMs in order to achieve such correspondence. This is the purpose of the next section.

IV. MODELING OF THE JOINT PROBABILITY DISTRIBUTION OF CLEAN AND NOISY SPEECH

A. The SPACE-JM algorithm

Recall that stereo data (x_t, y_t), 1 ≤ t ≤ T, are available. Thus, a simple idea to (hopefully) achieve a better correspondence between clean and noisy Gaussians is to model the joint distribution P(x, y) of the clean and noisy speech. Formally, using the EM algorithm, we model P(x, y) by a diagonal-covariance GMM:

Then, using the same principle as in SPACE, the new denoiser is given by:

x̂(y) = Σ_{i=1}^{I} P(i|y) [ (R_i R̃_i^{-1})^{1/2} (y - μ̃_i) + μ_i ],   (3)

where P(i|y) = α_i N(y; μ̃_i, R̃_i) / Σ_j α_j N(y; μ̃_j, R̃_j).

We call this denoiser SPACE-JM, where JM stands for Joint Modeling. Of course, the choice of a diagonal form for Σ_i is made for implementation simplicity. A zero cross-covariance between x and y is definitely not a valid assumption. We use this assumption, however, only for clustering purposes, to achieve Gaussian correspondence between the clean and noisy GMMs. As we will see in the next section, SPACE-JM indeed leads to much better performance than SPACE.

B. Evaluation of SPACE-JM

In this section, we evaluate the performance of SPACE-JM. The experimental setup of SPACE-JM is the same as for SPACE; the only change is in the way the GMM parameters are estimated. In SPACE-JM, (2) is used to estimate these parameters (instead of the MMSE criterion of SPACE), and the denoiser is given by (3). Figure 5 shows the recognition accuracy scores of SPACE-JM on test set A of Aurora2. One can see that SPACE-JM significantly outperforms SPACE and MAP-SPACE. This suggests that we are indeed achieving a better Gaussian correspondence (than in SPACE) by using joint probability modeling. More importantly, SPACE-JM outperforms the multistyle system. In particular, the improvement is significant on the 0 dB test, where there is an SNR mismatch (on the -5 dB test, the performance of all systems is poor, and comparison does not make sense in this case). This shows that SPACE-JM is more robust to (relatively small) SNR change than multistyle training. In addition, such performance should improve further with a good adaptation technique. Indeed, when there is a mismatch between training and test conditions, our strategy is to adapt the noisy GMM to the new environment and use the new GMM parameters in the denoiser. One idea would be to use MAP adaptation as in MAP-SPACE. Figure 6 shows the recognition scores of MAP-SPACE-JM, i.e., when MAP-adapting the noisy GMM given by SPACE-JM using test observations. This is clearly a bad idea, because MAP-SPACE-JM degrades the performance. On the other hand, this confirms that MAP adaptation destroys the Gaussian correspondence provided by SPACE-JM, and it shows the importance of such correspondence. We thus have to find an alternative adaptation technique that preserves the Gaussian correspondence. In principle, this would allow improving SPACE-JM performance in the presence of SNR and noise type mismatch (test sets B and C).
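The role of the diagonal-covariance restriction mentioned above can be made concrete with a small numerical illustration (an assumption-laden sketch, not part of the original experiments; the variance values are invented): with diagonal covariances the per-component mapping matrix (R_i R̃_i^{-1})^{1/2} is itself diagonal, i.e. a pure per-dimension dilation, whereas the symmetric square root of a full covariance product is generally non-diagonal and could also rotate the feature space.

```python
import numpy as np

# Diagonal case: the mapping matrix is diagonal, so each feature
# dimension is scaled independently -- no rotation is possible.
R_clean = np.array([4.0, 1.0])   # hypothetical clean variances
R_noisy = np.array([1.0, 4.0])   # hypothetical noisy variances
A_diag = np.diag(np.sqrt(R_clean / R_noisy))

# Full-covariance case: the symmetric square root of a non-diagonal
# matrix is generally non-diagonal, so the mapping can rotate as well.
C = np.array([[2.0, 1.0], [1.0, 2.0]])
vals, vecs = np.linalg.eigh(C)                    # eigendecomposition
A_full = vecs @ np.diag(np.sqrt(vals)) @ vecs.T   # symmetric sqrt of C
```

This is exactly the limitation the conclusion's remark on sparse covariance matrices aims to lift.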

P(x, y) = Σ_{i=1}^{I} α_i J_i(x, y)   (2)

where J_i is a Gaussian of mean m_i and diagonal covariance Σ_i. Obviously, we can always write m_i = (μ_i, μ̃_i) and Σ_i = [ R_i 0 ; 0 R̃_i ] (block-diagonal) in such a way that

P(x) = Σ_{i=1}^{I} α_i N(x; μ_i, R_i) ;  P(y) = Σ_{i=1}^{I} α_i N(y; μ̃_i, R̃_i).
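The block structure above means that, once a diagonal-covariance GMM has been fitted on stacked vectors z = (x, y), the corresponding clean and noisy GMMs are obtained simply by slicing each mean and variance vector; component i of the clean marginal corresponds to component i of the noisy one by construction, which is the Gaussian correspondence SPACE-JM relies on. A minimal sketch (the function name and toy parameters are assumptions; the fitting step itself is omitted):

```python
import numpy as np

def split_joint_gmm(w, means, variances, d):
    """Split a diagonal-covariance GMM over stacked z = (x, y) into its
    clean and noisy marginals. means/variances have shape (I, 2*d):
    the first d columns are the clean part (mu_i, R_i), the last d the
    noisy part (mu~_i, R~_i). Priors w are shared by both marginals."""
    clean = (w, means[:, :d], variances[:, :d])
    noisy = (w, means[:, d:], variances[:, d:])
    return clean, noisy

# toy joint GMM: 2 components over a 3+3 dimensional stacked space
w = np.array([0.5, 0.5])
means = np.arange(12.0).reshape(2, 6)
variances = np.ones((2, 6))
(cw, cm, cv), (nw, nm, nv) = split_joint_gmm(w, means, variances, 3)
```

Each pair of sliced components (index i on both sides) then supplies the parameters (μ_i, R_i) and (μ̃_i, R̃_i) used in the denoiser (3).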

V. CONCLUSION

We proposed a simple yet effective denoising algorithm, SPACE-JM, for robust speech recognition. The evaluation on test set A of Aurora2 shows that this algorithm outperforms the multistyle training system and is robust to small SNR mismatch. Moreover, many perspectives for further improvement are possible. For instance, one possibility is to improve the quality of denoising by using sparse covariance matrices for the joint probability distribution. This would allow rotations in the Gaussian mapping in addition to dilatations (only the latter are possible when using diagonal covariances). But definitely the most important perspective is to develop a GMM adaptation technique that preserves the Gaussian correspondence and allows handling unknown test environments (such as test sets B and C). This will be the purpose of future work.

REFERENCES
[1] Gong, Y., "Speech recognition in noisy environments: A survey," Speech Communication, 16:261-291, 1995.
[2] Hermansky, H. and Morgan, N., "RASTA processing of speech," IEEE Trans. Speech and Audio Processing, 2:578-589, 1994.
[3] Ephraim, Y., "Gain-adapted hidden Markov models for recognition of clean and noisy speech," IEEE Trans. Signal Processing, 40:1303-1316, 1992.
[4] Gauvain, J.L. and Lee, C.H., "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech and Audio Processing, 2:291-298, 1994.
[5] Leggetter, C.J. and Woodland, P.C., "Maximum likelihood linear regression for speaker adaptation of continuous density HMMs," Computer Speech and Language, 9:171-186, 1995.
[6] Gales, M.J.F., "Predictive model-based compensation schemes for robust speech recognition," Speech Communication, 25(1-3):49-74, 1998.
[7] Daoudi, K. and Cerisara, C., "The MAP-SPACE denoising algorithm for noise robust speech recognition," Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2005), Nov. 28-Dec. 1, 2005, San Juan, Puerto Rico.
[8] Deng, L. et al., "Large-vocabulary speech recognition under adverse acoustic environments," Proc. ICSLP, 2000.
[9] Droppo, J., Acero, A. and Deng, L., "Evaluation of the SPLICE algorithm on the Aurora2 database," Proc. EUROSPEECH, 2001.
[10] Droppo, J., Deng, L. and Acero, A., "Evaluation of SPLICE on the Aurora 2 and 3 tasks," Proc. ICSLP, 2002.
[11] Hirsch, H.G. and Pearce, D., "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," ISCA ITRW ASR2000, Paris, 2000.
Aurora2 test set A, multistyle results
          Subway   Babble   Car      Exhibition   Average
Clean     98.59    98.52    98.48    98.55        98.54
20 dB     97.64    97.61    97.85    96.98        97.52
15 dB     96.75    96.80    97.64    96.58        96.94
10 dB     94.38    95.22    95.65    93.12        94.59
5 dB      88.42    87.67    87.24    86.92        87.56
0 dB      65.67    61.03    50.88    61.83        59.85
-5 dB     26.01    26.18    19.12    22.49        23.45
Average   81.07    80.43    78.12    79.50        79.78

Fig. 2. Recognition accuracy scores (in %) of the multistyle system on test set A

Aurora2 test set A, SPACE results
          Subway   Babble   Car      Exhibition   Average
Clean     98.40    98.55    98.60    98.70        98.56
20 dB     97.76    97.64    97.11    97.53        97.51
15 dB     96.87    96.13    96.54    95.53        96.27
10 dB     91.25    93.41    92.84    94.29        92.95
5 dB      60.55    85.31    34.77    89.32        67.49
0 dB      29.54    -3.84    10.23    73.46        27.35
-5 dB     -1.44    -29.66   -11.12   36.44        -1.45
Average   67.56    62.51    59.85    83.61        68.38

Fig. 3. Recognition accuracy scores (in %) of SPACE on test set A

Aurora2 test set A, MAP-SPACE results
          Subway   Babble   Car      Exhibition   Average
Clean     93.46    96.01    95.82    94.88        95.04
20 dB     95.64    97.07    96.03    97.32        96.52
15 dB     71.14    94.83    95.32    63.10        81.10
10 dB     64.57    93.41    91.17    82.38        82.88
5 dB      47.62    83.10    42.26    62.63        58.90
0 dB      26.99    13.24    18.25    47.02        26.38
-5 dB     11.88    7.26     7.04     26.32        13.13
Average   58.76    69.27    63.70    67.66        64.85

Fig. 4. Recognition accuracy scores (in %) of MAP-SPACE on test set A

Aurora2 test set A, SPACE-JM results
          Subway   Babble   Car      Exhibition   Average
Clean     98.77    98.97    98.84    99.20        98.95
20 dB     97.91    97.91    97.76    98.06        97.91
15 dB     97.91    96.83    96.21    97.28        97.06
10 dB     95.30    94.65    93.77    95.28        94.75
5 dB      93.00    88.21    84.85    89.82        88.97
0 dB      80.69    65.96    57.92    77.11        70.42
-5 dB     -0.55    29.41    20.88    40.94        22.67
Average   80.43    81.71    78.60    85.38        81.53

Fig. 5. Recognition accuracy scores (in %) of SPACE-JM on test set A

Aurora2 test set A, baseline results
          Subway   Babble   Car      Exhibition   Average
Clean     98.83    98.97    98.81    99.14        98.94
20 dB     96.93    89.99    96.84    96.20        94.99
15 dB     92.91    73.43    89.50    91.85        86.92
10 dB     78.72    49.03    66.18    75.10        67.26
5 dB      53.42    26.93    33.58    43.44        39.34
0 dB      27.33    11.82    13.15    15.83        17.03
-5 dB     12.50    5.38     8.41     7.56         8.46
Average   65.81    50.79    58.07    61.30        58.99

Fig. 1. Recognition accuracy scores (in %) of the baseline system on test set A

Aurora2 test set A, MAP-SPACE-JM results
          Subway   Babble   Car      Exhibition   Average
Clean     93.83    93.89    94.69    96.20        94.65
20 dB     92.11    97.10    96.96    92.84        94.75
15 dB     90.39    95.62    95.38    84.67        91.52
10 dB     92.05    92.44    92.66    87.35        91.13
5 dB      87.69    82.65    86.85    81.83        84.76
0 dB      73.01    54.87    66.78    68.56        65.81
-5 dB     7.15     21.67    30.81    35.02        23.66
Average   76.60    76.89    80.59    78.07        78.04

Fig. 6. Recognition accuracy scores (in %) of MAP-SPACE-JM on test set A