[Figure 1: plot of $p_f$ and $\pi_f$ as a function of $T$ for 1000 vectors.]
Fig. 1. Graph of $p_f$ and $\pi_f$ for the first 1000 cepstral vectors uttered by a female speaker.
The result for time mediated averaging of the gender dependent GMMs, weighted by the female and male a posteriori probabilities, is given in Table 4. This acoustic model did not capture much of the gain inherently available in the oracle model. Detailed analysis shows that this is because the female and male a posteriori probabilities are both very close to 0.5. This could mean that the female probability is not a good predictor that speech originated from a female speaker, but luckily this is not so. It does indeed tend to be greater than 0.5 for female speech, as can be seen in Fig. 1. The cure that is needed is a "sharpening" of the a posteriori probabilities. Introduce the boosted gender detection probabilities by

[Equation (1): definition of the boosted female and male probabilities in terms of the a posteriori probabilities and a boosting parameter]

The larger the boosting parameter, the sharper the boosted probabilities become. Table 4 shows results for decoding with the model that weights the gender dependent GMMs by the boosted probabilities, for a fixed value of the boosting parameter. As can be seen, almost all of the gain in the oracle model, which has an error rate of 2.75%, is captured by this acoustic model.

Test gender   baseline   unboosted   boosted
both          3.34%      3.29%       2.88%
female        4.40%      4.26%       3.61%
male          2.32%      2.34%       2.18%

Table 4. Word error rates for time mediated averaging of the gender dependent diagonal GMMs.
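The sharpening idea discussed before Table 4 can be illustrated with a small sketch. The rule used here, raising each posterior to a power `beta` and renormalizing, is an assumption chosen for illustration; it is not confirmed to be the form of equation (1). The function name `sharpen` and the sample values are likewise hypothetical.

```python
import numpy as np

def sharpen(post_female, beta):
    """Sharpen a two-class posterior by power-and-renormalize.

    ASSUMPTION: this is one common sharpening rule, used only for
    illustration; it may differ from the paper's equation (1).
    """
    pf = np.asarray(post_female, dtype=float)
    pm = 1.0 - pf                      # male posterior (two classes sum to 1)
    num = pf ** beta
    return num / (num + pm ** beta)    # boosted female probability

# Hypothetical posteriors hovering near 0.5, as described in the text.
posteriors = np.array([0.50, 0.52, 0.55, 0.45])
for beta in (1, 5, 20):
    print(beta, np.round(sharpen(posteriors, beta), 3))
```

With `beta = 1` the posteriors are unchanged; larger values push values above 0.5 toward 1 and values below 0.5 toward 0, which matches the qualitative behavior the text describes.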
4. SHARING OF GAUSSIANS BETWEEN GENDER DEPENDENT MODELS
It is clear that silence is inherently gender independent, and thus many of the gaussians modeling silence are bound to be unnecessary. Possibly some of the other phonemes are also inherently unchanged under gender variation. If we share the gaussians for the sounds that are inherently gender independent, we may be able to squeeze out some of the difference between the 10K oracle and 5K oracle models. To measure the difference between two acoustic models $f$ and $g$ for a phoneme we use the Kullback Leibler divergence

$$D(f \,\|\, g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx. \qquad (2)$$

If the two models for the phoneme each consist of a single gaussian, (2) can be computed exactly. Otherwise, the distance must be computed numerically. Monte Carlo estimation can be used to compute the integral in the general case. Let $x_1, \ldots, x_n$ be $n$ samples from the distribution $f(x)$, then

$$\int f(x) \log g(x) \, dx \approx \frac{1}{n} \sum_{i=1}^{n} \log g(x_i).$$

The term $\int f(x) \log f(x) \, dx$ is estimated in the same way from the same samples, and the difference of the two estimates approximates (2).
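As a rough illustration of the two cases above, the following sketch computes the Kullback Leibler divergence exactly for a pair of single diagonal gaussians and estimates it by Monte Carlo sampling for diagonal GMMs. The function names, the representation of a GMM as a `(weights, means, variances)` tuple, and the toy parameters are all assumptions made for this example.

```python
import numpy as np

def kl_diag_gauss(mu_f, var_f, mu_g, var_g):
    """Exact KL(f || g) for two single gaussians with diagonal covariance."""
    mu_f, var_f, mu_g, var_g = map(np.asarray, (mu_f, var_f, mu_g, var_g))
    return 0.5 * np.sum(np.log(var_g / var_f)
                        + (var_f + (mu_f - mu_g) ** 2) / var_g - 1.0)

def log_gmm(x, weights, means, variances):
    """Log density of a diagonal GMM at each row of x (shape (n, d))."""
    diff = x[:, None, :] - means                      # (n, k, d)
    log_comp = -0.5 * np.sum(np.log(2 * np.pi * variances)
                             + diff ** 2 / variances, axis=2)
    a = log_comp + np.log(weights)                    # (n, k)
    m = a.max(axis=1, keepdims=True)                  # log-sum-exp for stability
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True)))[:, 0]

def sample_gmm(rng, n, weights, means, variances):
    """Draw n samples from a diagonal GMM."""
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + np.sqrt(variances[comp]) * noise

def kl_monte_carlo(rng, n, f, g):
    """Monte Carlo estimate of KL(f || g) using n samples drawn from f."""
    x = sample_gmm(rng, n, *f)
    return float(np.mean(log_gmm(x, *f) - log_gmm(x, *g)))

# Toy single-component models, so the exact value is available for comparison.
f = (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]))
g = (np.array([1.0]), np.array([[1.0, 0.0]]), np.array([[2.0, 1.0]]))
rng = np.random.default_rng(0)
exact = float(kl_diag_gauss(f[1][0], f[2][0], g[1][0], g[2][0]))
estimate = kl_monte_carlo(rng, 200_000, f, g)
print(exact, estimate)
```

For models with several components per phoneme only `kl_monte_carlo` applies; the single-gaussian case is included here as a sanity check on the estimator.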
Using the Kullback Leibler distance we can now decide which phonemes vary little between the genders. To take advantage of this we built gender dependent acoustic models with 6.3K gaussians each and a gender independent model with 7K gaussians. To combine these we computed the Kullback Leibler distance between the male and female models of every context dependent phoneme. We can afford a total of 10K gaussians, whereas combining the 6.3K male and female acoustic models gives a total of 12.6K gaussians. To reduce the number of gaussians we sort the context dependent phonemes by Kullback Leibler distance and, starting with the smallest distance, replace their gender dependent gaussians with the gaussians from the gender independent model. When the total comes below 10K we stop. Table 5 shows the decoding results. Table 6 shows the list of phonemes with smallest and largest Kullback Leibler distance.

Test gender   baseline   shared gaussians
both          3.34%      2.80%
female        4.40%      3.55%
male          2.32%      2.07%

Table 5. Word error rates for time mediated averaging of the gender dependent diagonal GMMs with shared gaussians.
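The selection procedure described before Table 5 can be sketched as a simple greedy loop. The function name `choose_shared`, the dictionary layout, and the toy counts below are hypothetical; in particular, `gd_count` is assumed to hold the combined number of male and female gaussians for each context dependent phoneme.

```python
def choose_shared(kl_by_phone, gd_count, gi_count, budget):
    """Pick phonemes whose gender dependent gaussians are replaced.

    kl_by_phone: phoneme -> Kullback Leibler distance between genders
    gd_count:    phoneme -> male + female gaussians (ASSUMED layout)
    gi_count:    phoneme -> gaussians in the gender independent model
    budget:      stop once the total number of gaussians is below this
    """
    total = sum(gd_count.values())
    shared = []
    # Visit phonemes from smallest to largest distance.
    for phone in sorted(kl_by_phone, key=kl_by_phone.get):
        if total < budget:
            break
        # Swap the two gender dependent sets for the shared set.
        total += gi_count[phone] - gd_count[phone]
        shared.append(phone)
    return shared, total

# Hypothetical toy inventory of four phonemes.
kl = {"a": 0.5, "b": 0.8, "c": 16.3, "d": 18.3}
gd = {"a": 40, "b": 40, "c": 40, "d": 40}
gi = {"a": 22, "b": 22, "c": 22, "d": 22}
shared, total = choose_shared(kl, gd, gi, budget=130)
print(shared, total)
```

In this toy run the two phonemes with the smallest distance are shared and the two with the largest distance keep their gender dependent gaussians.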
KL distance   phoneme   KL distance   phoneme
0.5059        849@      18.3031       ACBD
0.5322        ED        16.8553       F4GD
0.5626        EH        16.6865       F4I@
0.6652        E@        16.3531       F4P@
0.7608        GD        16.3488       F4GD
0.7662        QRTSD     16.3469       F4GH

Table 6. Top few context dependent phonemes (allophones) with largest and smallest Kullback Leibler distance.