AN EFFICIENT INTEGRATED GENDER DETECTION SCHEME AND TIME MEDIATED AVERAGING OF GENDER DEPENDENT ACOUSTIC MODELS
Peder A. Olsen and Satya Dharanipragada
IBM, T. J. Watson Research Center, 134 and Taconic Parkway, Yorktown Heights, NY 10598
{pederao,satya}@us.ibm.com
ABSTRACT
This paper discusses building gender dependent gaussian mixture models (GMMs) and how to integrate these with an efficient gender detection scheme. Gender specific acoustic models of half the size of a corresponding gender independent acoustic model substantially outperform the larger gender independent acoustic models. With perfect gender detection, gender dependent modeling should therefore yield higher recognition accuracy without consuming more memory. Furthermore, as certain phonemes are inherently gender independent (e.g. silence), much of the male and female specific acoustic models can be shared. This paper proposes how to discover which phonemes are inherently similar for male and female speakers and how to efficiently share this information between gender dependent GMMs. A highly accurate gender detection scheme is suggested that takes advantage of computations inherently done in the speech recognizer to detect the gender at a negligible computational cost. By making the gender assignment probabilistic, the increase in word error rate (WER) seen for erroneously gender labeled speakers is avoided. The method of gender detection and probabilistic use of gender is novel and should be of interest beyond mere gender detection. The only requirement for the method to work is that the training data be appropriately labeled.
1. INTRODUCTION
Gender specific models are known to yield improved accuracy over gender independent models and have previously been considered extensively in the literature. The most typical use is a two-pass approach where in the first pass a gender detection scheme is used to detect the gender of a speaker and in the second pass the speech is recognized with the corresponding gender specific acoustic model. See [1] for an example of sophisticated use of gender information. Other references are [2, 3].

The experiments described in this paper were performed on an IBM internal database [4, 5]. The baseline acoustic model consisted of a standard 39 dimensional FFT-based MFCC frontend (13 dimensional cepstral vectors and corresponding $\Delta$ and $\Delta\Delta$ cepstral vectors spliced together). Digits are modeled by defining word specific digit phonemes, yielding word models for digits. In total 680 word internal triphones are used to model acoustic context, and the gaussian mixture models used to model the individual allophones consisted of a total of 10253 gaussians. The number of gaussians assigned to each allophone was determined using the Bayesian Information Criterion as described in [6]. The database used for training was well balanced between the genders. It consisted of a total of 462388 utterances, of which 228693 corresponded to female speakers and 233695 corresponded to male speakers. The test set was similarly well balanced, with a total of 73743 words, of which 36241 words were uttered by female speakers and 37502 by male speakers.
2. COMPARISON OF GENDER DEPENDENT AND GENDER INDEPENDENT MODELS
By a male, female or gender dependent GMM we mean a GMM built from the portion of the training data uttered by speakers of that specific gender. Since a gender dependent GMM is built from roughly half of the training data, it is strictly speaking not obvious that a gender dependent model will outperform a gender independent model built from the entire training data. One test of the usefulness of gender is that a gender dependent GMM of the same size as the gender independent GMM should outperform the gender independent model on test data for speakers of that same gender. Table 1 shows performance on diagonal covariance GMMs corresponding to the gender dependent and gender independent models, each with a total of 10253 gaussians. Also listed in Table 1 is the performance for MLLT (semi-tied covariance) gaussians [7, 8]. Two points are worth noting in the table. First, the oracle yields a 29.7% and a 28.5% relative improvement in the error rate over the baseline diagonal and MLLT models, respectively. Second, the cross-gender performance, i.e. a female GMM decoding male speech or a male GMM decoding female speech, is dramatically worse than the gender independent performance. The first point implies that there is a lot of room for improvement. The second point implies that a gender classification error will be very costly. On the other hand, the high cross-gender error indicates that the models are quite different, so one may suspect that gender classification will be a simple task.

In our target application memory is severely constrained.
 
Fig. 1. Graph of $p_f(T)$ and $\pi_f(T)$ as a function of $T$ for the first 1000 cepstral vectors uttered by a female speaker.
$p_f(T)\,\mathrm{GMM}_f + p_m(T)\,\mathrm{GMM}_m$ is given in Table 4. This acoustic model did not capture much of the gain inherently available in the oracle model. Detailed analysis shows that this is due to $p_f(T)$ and $p_m(T)$ being very close to 0.5. This could mean that $p_f(T)$ is not a good predictor that speech originated from a female speaker, but luckily this is not so. $p_f(T)$ does indeed tend to be greater than $0.5$ for female speech, as can be seen in Fig. 1. The cure that is needed is a “sharpening” of the a posteriori probabilities $p_f(T)$ and $p_m(T)$. Introduce the boosted gender detection probabilities $\pi_f(T)$ and $\pi_m(T)$ by

$$\pi_f(T) = \frac{p_f(T)^{\beta}}{p_f(T)^{\beta} + p_m(T)^{\beta}}. \qquad (1)$$

The larger $\beta$, the sharper the $\pi_f(T)$, $\pi_m(T)$ probabilities become. Table 4 shows results for decoding with the model $\pi_f(T)\,\mathrm{GMM}_f + \pi_m(T)\,\mathrm{GMM}_m$ for a fixed $\beta$. As can be seen, almost all of the gain in the oracle model, which has an error rate of 2.75%, is captured by this acoustic model.

Test gender   baseline   $p_f\,\mathrm{GMM}_f + p_m\,\mathrm{GMM}_m$   $\pi_f\,\mathrm{GMM}_f + \pi_m\,\mathrm{GMM}_m$
both          3.34%      3.29%                                         2.88%
female        4.40%      4.26%                                         3.61%
male          2.32%      2.34%                                         2.18%

Table 4. Word error rates for time mediated averaging of the gender dependent diagonal GMMs.
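The sharpening step of equation (1) and the weighted averaging of the two gender dependent models can be illustrated with a minimal Python sketch. The power-normalised form of the boost, the exponent name `beta`, the value 4.0 used in the usage lines, and the assumption that the female and male posteriors sum to one are assumptions of this sketch, not details confirmed by the text. The function names are hypothetical.

```python
def boost(p_f, beta):
    """Sharpen a female posterior p_f in [0, 1].

    Assumed form: p_f**beta / (p_f**beta + p_m**beta), with p_m = 1 - p_f.
    beta = 1 leaves the posterior unchanged; larger beta pushes it
    towards 0 or 1.
    """
    num = p_f ** beta
    return num / (num + (1.0 - p_f) ** beta)


def averaged_likelihood(lik_f, lik_m, w_f):
    """Frame likelihood under the weighted sum of the two gender models.

    lik_f, lik_m: likelihoods of one frame under the female and male GMMs.
    w_f: weight given to the female model (plain or boosted posterior).
    """
    return w_f * lik_f + (1.0 - w_f) * lik_m


# Usage with illustrative numbers only (not taken from the paper).
p_f = 0.6
pi_f = boost(p_f, 4.0)          # roughly 0.835, much sharper than 0.6
frame_lik = averaged_likelihood(2.0e-3, 5.0e-4, pi_f)
```

A posterior of exactly 0.5 is left at 0.5 for any exponent, so the boost only helps when the unboosted posterior already leans towards the correct gender, which is what Fig. 1 suggests.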
4. SHARING OF GAUSSIANS BETWEEN GENDER DEPENDENT MODELS
It is clear that silence is inherently gender independent, and thus many of the gaussians modeling silence are bound to be unnecessary. Possibly some of the other phonemes are also inherently unchanged under gender variations. If we share the gaussians for the sounds that are inherently gender independent, we may be able to squeeze out some of the difference between the 10K oracle and 5K oracle models. To measure the difference between two acoustic models $f$ and $g$ for a phoneme we use the Kullback Leibler divergence

$$D(f \| g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx. \qquad (2)$$

If the two gender dependent GMMs for the phoneme each consist of a single gaussian, (2) can be computed exactly. Otherwise, the distance must be computed numerically. Monte Carlo estimation can be used to compute the integral in the general case. Let $\{x_i\}_{i=1}^{n}$ be $n$ samples from the distribution $f(x)$, then

$$\int f(x) \log g(x) \, dx \approx \frac{1}{n} \sum_{i=1}^{n} \log g(x_i).$$
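The exact single-gaussian case and the Monte Carlo estimate can be compared in a short Python sketch. It is restricted to one-dimensional GMMs for brevity, and every function name, the `(weight, mean, variance)` tuple layout, the sample count and the seed are assumptions made for illustration. The sketch estimates both $\int f \log f$ and $\int f \log g$ from the same samples and takes their difference, which is one straightforward way to turn the sample average above into an estimate of (2).

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of a univariate gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def kl_gauss(mean_f, var_f, mean_g, var_g):
    """Exact D(f||g) when f and g are single univariate gaussians."""
    return 0.5 * (math.log(var_g / var_f)
                  + (var_f + (mean_f - mean_g) ** 2) / var_g - 1.0)


def log_gmm(x, gmm):
    """Log density of a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def sample_gmm(gmm, rng):
    """Draw one sample: pick a component by weight, then sample from it."""
    weights = [w for w, _, _ in gmm]
    _, mean, var = rng.choices(gmm, weights=weights, k=1)[0]
    return rng.gauss(mean, math.sqrt(var))


def kl_monte_carlo(f, g, n, rng):
    """Estimate D(f||g) as the average of log f(x_i) - log g(x_i), x_i ~ f."""
    total = 0.0
    for _ in range(n):
        x = sample_gmm(f, rng)
        total += log_gmm(x, f) - log_gmm(x, g)
    return total / n


# Usage: with single-gaussian models the two computations should agree.
f = [(1.0, 0.0, 1.0)]
g = [(1.0, 1.0, 2.0)]
exact = kl_gauss(0.0, 1.0, 1.0, 2.0)
estimate = kl_monte_carlo(f, g, 20000, random.Random(0))
```

The estimate converges at the usual Monte Carlo rate, so the number of samples trades accuracy against the cost of evaluating both GMMs at each sample.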
Using the Kullback Leibler distance we can now decide which phonemes vary little between the genders. To take advantage of this we built gender dependent acoustic models with 6.3K gaussians and gender independent models with 7K gaussians. To combine these we computed the Kullback Leibler distance between the gender dependent models of each context dependent phoneme. We can afford a total of 10K gaussians. Combining the 6.3K male and female acoustic models gives a total of 12.6K gaussians. To reduce the number of gaussians we sort the context dependent phonemes according to the Kullback Leibler distance and replace their gender dependent gaussians with gaussians from the gender independent model, starting with the smallest distance. When the total comes below 10K we stop. Table 5 shows the decoding results. Table 6 shows the list of phonemes with smallest and largest Kullback Leibler distance.

Test gender   baseline   $\pi_f\,\mathrm{GMM}_f + \pi_m\,\mathrm{GMM}_m$
both          3.34%      2.80%
female        4.40%      3.55%
male          2.32%      2.07%

Table 5. Word error rates for time mediated averaging of the gender dependent diagonal GMMs with shared gaussians.
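The selection procedure described above (sort by distance, share the most similar phonemes first, stop once the budget is met) can be sketched as a simple greedy loop in Python. The dictionary keys, the function name and the toy numbers are hypothetical, and the sketch assumes that sharing a phoneme always reduces the gaussian count.

```python
def choose_shared(phonemes, budget):
    """Pick the context dependent phonemes whose gaussians are shared.

    phonemes: list of dicts with keys
        "name"     - label of the context dependent phoneme
        "kl"       - distance between its female and male models
        "n_female" - gaussians in the female model
        "n_male"   - gaussians in the male model
        "n_indep"  - gaussians in the gender independent model
    budget: the total number of gaussians that can be afforded.
    Returns the names of the shared phonemes and the final total.
    """
    total = sum(p["n_female"] + p["n_male"] for p in phonemes)
    shared = []
    for p in sorted(phonemes, key=lambda p: p["kl"]):
        if total < budget:
            break  # stop as soon as the count is below the budget
        # Replace the two gender dependent models by the shared one.
        total += p["n_indep"] - (p["n_female"] + p["n_male"])
        shared.append(p["name"])
    return shared, total


# Usage with toy numbers (not from the paper).
toy = [
    {"name": "sil", "kl": 0.5, "n_female": 10, "n_male": 10, "n_indep": 12},
    {"name": "aa", "kl": 16.0, "n_female": 10, "n_male": 10, "n_indep": 12},
    {"name": "eh", "kl": 3.0, "n_female": 10, "n_male": 10, "n_indep": 12},
]
shared, total = choose_shared(toy, 50)
```

Because the phonemes with the largest distance are visited last, the models that differ most between the genders are the ones most likely to stay gender dependent.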
$D(f \| g)$   phoneme   $D(f \| g)$   phoneme
0.5059        849@      18.3031       ACBD
0.5322        ED        16.8553       F4GD
0.5626        EH        16.6865       F4I@
0.6652        E@        16.3531       F4P@
0.7608        GD        16.3488       F4GD
0.7662        QRTSD     16.3469       F4GH

Table 6. Top few context dependent phonemes (allophones) with largest and smallest Kullback Leibler distance.