
Adaptation Techniques in Automatic Speech Recognition


Tor André Myrvoll
Telektronikk 99(2), Issue on Spoken Language
Technology in Telecommunications, 2003.

Goal and Objective
Make ASR robust to speaker and
environmental variability.
Model adaptation: automatically adapt an HMM using limited but representative
new data to improve performance.
Train ASRs for applications with insufficient data.
What Do We Have/Adapt?
An HMM-based ASR trained in the usual manner.
The output probabilities are parameterized by GMMs.
No improvement when adapting state transition probabilities and mixture weights.
Covariance matrices are difficult to estimate robustly.
Mixture means can be adapted optimally and have proven useful.
Adaptation Principles
Main assumption: the original model is good enough; model adaptation cannot
amount to full re-training!
Offline Vs. Online
If possible, adapt offline (performance is not compromised for computational reasons).
Decode the adaptation speech data using the current model.
Use this to estimate the speaker-dependent model statistics.
Online Adaptation Using Prior Evolution
The present posterior is the next prior.
With $g_{i-1}(\Lambda) = p(\Lambda \mid O_1,\ldots,O_{i-1}, W_1,\ldots,W_{i-1})$ denoting the prior after block $i-1$, Bayes' rule gives

$g_i(\Lambda) = p(\Lambda \mid O_1,\ldots,O_i, W_1,\ldots,W_i)
= \dfrac{p(O_i \mid W_i, \Lambda)\, g_{i-1}(\Lambda)}{\int p(O_i \mid W_i, \Lambda')\, g_{i-1}(\Lambda')\, d\Lambda'}
\propto p(O_i \mid W_i, \Lambda)\, g_{i-1}(\Lambda).$
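To make the recursion concrete, here is a minimal Python sketch of the "posterior becomes the next prior" update for a single Gaussian mean with known variance. The conjugate closed form used here is a deliberate simplification (the next slide notes that full HMMs have no such sufficient statistics); the data values and the obs_var parameter are illustrative assumptions.

```python
import numpy as np

def prior_evolution_step(prior_mean, prior_var, obs, obs_var):
    """One step of incremental Bayesian adaptation for a scalar Gaussian mean.

    The current posterior p(mu | O_1..O_i) becomes the prior for block i+1.
    This conjugate (Gaussian) case is an illustrative simplification: for
    full HMMs no such closed form exists and the posterior is approximated.
    """
    obs = np.asarray(obs, dtype=float)
    n = obs.size
    # Posterior precision = prior precision + data precision.
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs.sum() / obs_var)
    return post_mean, post_var

# Online adaptation over successive utterance blocks (hypothetical data).
rng = np.random.default_rng(0)
mean, var = 0.0, 10.0                                # initial (SI) prior
for i in range(5):
    block = rng.normal(1.5, 1.0, size=20)            # frames for this block
    mean, var = prior_evolution_step(mean, var, block, obs_var=1.0)
    print(f"block {i}: posterior mean={mean:.3f}, var={var:.4f}")
```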
MAP Adaptation
HMMs have no sufficient statistics, so conjugate prior-posterior pairs cannot be used.
Find the posterior via EM.
Find the prior empirically (multi-modal; the first model is estimated using ML training).
$\Lambda_{MAP} = \arg\max_{\Lambda}\, p(O \mid W, \Lambda)\, g(\Lambda)$
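As an illustration, the sketch below applies the commonly used MAP update for GMM mean vectors, in which the empirical prior $g(\Lambda)$ is reduced to the SI means plus a relevance factor tau. The function name, the value of tau, and the diagonal-covariance assumption are choices made for this example, not part of the original slides.

```python
import numpy as np

def map_adapt_means(means, covars, weights, data, tau=10.0):
    """MAP adaptation of GMM means (diagonal covariances assumed).

    means:   (K, D) prior / SI means, covars: (K, D) diagonal variances,
    weights: (K,) mixture weights, data: (T, D) adaptation frames.
    tau is the relevance factor controlling prior strength (assumed value).
    """
    # Component log-likelihoods for every frame (diagonal Gaussians).
    diff = data[:, None, :] - means[None, :, :]                # (T, K, D)
    log_gauss = -0.5 * (np.sum(diff**2 / covars, axis=2)
                        + np.sum(np.log(2 * np.pi * covars), axis=1))
    log_post = np.log(weights) + log_gauss                     # (T, K)
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # occupancies

    n_k = gamma.sum(axis=0)                                    # (K,)
    first_order = gamma.T @ data                               # (K, D)
    # MAP interpolation between the prior mean and the ML estimate.
    return (tau * means + first_order) / (tau + n_k)[:, None]
```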
EMAP
Not all phonemes in every context occur in the adaptation data; we need to store
correlations between variables.
EMAP only considers correlations between mean vectors, under a jointly Gaussian assumption.
For large model sizes, share means across models.
$\tilde{\Sigma}_0 = E\big[(\tilde{\mu} - \tilde{\mu}_0)(\tilde{\mu} - \tilde{\mu}_0)^T\big]$
Transformation Based Model Adaptation
Estimate a transform $T_{\theta}$ parameterized by $\theta$.
ML: $\hat{\theta}_{ML} = \arg\max_{\theta}\, p(O \mid T_{\theta}(\Lambda_{SI}), W)$
MAP: $\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(O \mid T_{\theta}(\Lambda_{SI}), W)\, g(\theta)$
Bias, Affine and Nonlinear Transformations
ML estimation of a bias: $r(\mu) = \mu + b$
Affine transformation: $r(\mu) = A\mu + b$, $r(\Sigma) = A\Sigma A^T$
Nonlinear transformation: $r(\mu) = g_{\theta}(\mu)$ ($g_{\theta}$ may be a neural network)
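A small sketch of how these three transform families act on a Gaussian's parameters. The example dimensions, matrices, and the use of tanh as a stand-in nonlinearity are illustrative assumptions.

```python
import numpy as np

def bias_transform(mu, b):
    """Bias:      r(mu) = mu + b  (covariance unchanged)."""
    return mu + b

def affine_transform(mu, sigma, A, b):
    """Affine:    r(mu) = A mu + b,  r(Sigma) = A Sigma A^T."""
    return A @ mu + b, A @ sigma @ A.T

def nonlinear_transform(mu, g):
    """Nonlinear: r(mu) = g(mu), where g could be e.g. a small neural net."""
    return g(mu)

# Hypothetical 3-dimensional Gaussian from an SI model.
mu = np.array([1.0, -0.5, 2.0])
sigma = np.diag([0.5, 1.0, 0.8])
A = np.eye(3) * 0.9
b = np.array([0.1, 0.0, -0.2])

print(bias_transform(mu, b))
print(affine_transform(mu, sigma, A, b))
print(nonlinear_transform(mu, g=np.tanh))   # stand-in for a learned nonlinearity
```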
MLLR
Adapt the Gaussian means with an affine transform: $\hat{\mu}_m = A\mu_m + b = W\xi_m$,
where $W = [\,b \;\; A\,]$ and $\xi_m = [\,1 \;\; \mu_m^T\,]^T$ is the extended mean vector.
Covariances: $\hat{\Sigma}_m = B_m^T H_m B_m$.
The transform is estimated by maximum likelihood: $\hat{W} = \arg\max_{W}\, p(O \mid \Lambda, W)$.
Apply separate transformations to different parts of the model (HEAdapt in HTK).
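Below is a hedged sketch of global MLLR mean-transform estimation. To keep it short, all component covariances are assumed to be the identity, which reduces the ML solution to weighted least squares over the sufficient statistics; real implementations (e.g. HEAdapt) solve the transform row by row with the true covariances and regression classes. The statistics layout and function names are assumptions for this illustration.

```python
import numpy as np

def estimate_mllr_transform(means, gammas, first_order):
    """Estimate a global MLLR mean transform W = [b A] (mu_hat = W xi).

    Simplification (assumption): all component covariances are the identity,
    so the ML solution reduces to a weighted least-squares regression.
    means:       (M, D) speaker-independent component means
    gammas:      (M,)   occupancy counts sum_t gamma_m(t)
    first_order: (M, D) first-order statistics sum_t gamma_m(t) x_t
    Returns W with shape (D, D+1).
    """
    M, D = means.shape
    xi = np.hstack([np.ones((M, 1)), means])       # extended mean vectors (M, D+1)
    G = (xi * gammas[:, None]).T @ xi              # sum_m gamma_m xi xi^T
    Z = first_order.T @ xi                         # sum_m (sum_t gamma_m(t) x_t) xi^T
    return Z @ np.linalg.inv(G)

def apply_mllr(means, W):
    """Adapted means: mu_hat_m = W xi_m = A mu_m + b."""
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ W.T
```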
SMAP
Model the mismatch between the SI model ($x$) and the test environment.
Normalize each frame with respect to each Gaussian: $y_{mt} = \Sigma_m^{-1/2}(x_t - \mu_m)$.
No mismatch: $y_{mt} \sim N(0, I)$.
Mismatch: $y_{mt} \sim N(\nu, \eta)$.
Adapted model: $\hat{\mu}_m = \mu_m + \Sigma_m^{1/2}\nu$, $\hat{\Sigma}_m = \Sigma_m^{1/2}\,\eta\,\Sigma_m^{1/2}$.
$\nu$ and $\eta$ are estimated by the usual ML methods on the adaptation data.
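The sketch below implements this mismatch model in its simplest form: a single shared $(\nu, \eta)$ estimated by plain ML in the normalized domain. The actual SMAP method arranges the Gaussians in a tree and smooths these estimates with MAP priors; the diagonal-covariance assumption and all argument names here are mine.

```python
import numpy as np

def smap_adapt(means, covars, gammas_t, data):
    """Single-node sketch of SMAP-style mismatch adaptation.

    means/covars: (M, D) diagonal SI Gaussians; gammas_t: (T, M) occupancies;
    data: (T, D) adaptation frames.
    """
    std = np.sqrt(covars)                                          # Sigma^{1/2}
    # Normalized residuals y_mt = Sigma_m^{-1/2} (x_t - mu_m)
    y = (data[:, None, :] - means[None, :, :]) / std[None, :, :]   # (T, M, D)
    w = gammas_t[:, :, None]                                       # (T, M, 1)
    total = w.sum()
    nu = (w * y).sum(axis=(0, 1)) / total                          # mismatch bias
    eta = (w * (y - nu) ** 2).sum(axis=(0, 1)) / total             # mismatch scale
    # Map the shared mismatch back into every Gaussian.
    return means + std * nu, covars * eta
```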
Adaptive Training
Gender dependent model selection
VTLN (in HTK using WARPFREQ)
Speaker Adaptive Training
Assumption: there exists a compact canonical model ($\Lambda_C$) which relates to
all speaker-dependent models via an affine transformation $T$ (as in MLLR).
The model and the transformations are found using EM.
$\{\hat{T}_r\}_{r=1}^{R},\ \hat{\Lambda}_C = \arg\max_{\{T_r\},\,\Lambda_C}\ \prod_{r=1}^{R} p(O_r \mid T_r(\Lambda_C), W_r)$
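A minimal sketch of the SAT alternation, assuming bias-only speaker transforms and identity covariances so that both steps have closed forms; full SAT uses MLLR-style affine transforms inside EM. The statistics format and function name are assumptions for this illustration.

```python
import numpy as np

def sat_bias_only(canonical_means, speaker_stats, n_iter=5):
    """Simplified Speaker Adaptive Training with bias-only transforms.

    canonical_means: (M, D) initial canonical model means.
    speaker_stats: list of (gammas, first_order) per speaker, where
        gammas (M,) = sum_t gamma_m(t), first_order (M, D) = sum_t gamma_m(t) x_t.
    """
    mu = canonical_means.copy()
    for _ in range(n_iter):
        # Step 1: per-speaker transform (here just a bias) given the canonical model.
        biases = []
        for gammas, first_order in speaker_stats:
            b_r = (first_order.sum(axis=0) - gammas @ mu) / gammas.sum()
            biases.append(b_r)
        # Step 2: canonical means given the speaker transforms.
        num = np.zeros_like(mu)
        den = np.zeros(mu.shape[0])
        for (gammas, first_order), b_r in zip(speaker_stats, biases):
            num += first_order - gammas[:, None] * b_r
            den += gammas
        mu = num / den[:, None]
    return mu, biases
```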
Cluster Adaptive Training
Group the speakers in the training set into clusters. For a test speaker,
find the closest cluster.

Use canonical models: $\hat{\mu}_m^{(r)} = M_m \lambda^{(r)}$, where
$M_m = [\,\mu_m^{(1)} \;\cdots\; \mu_m^{(C)} \;\; b\,]$ collects the cluster mean
vectors and $\lambda^{(r)}$ holds the interpolation weights for speaker $r$.
Eigenvoices
Similar to Cluster Adaptive Training.
Concatenate the means from R speaker-dependent models into supervectors and
perform PCA on the resulting vectors.
Store $K \ll R$ eigenvoice vectors.
Form a supervector of means from the SI model too.
Given a new speaker, the mean supervector is a linear combination of the SI
vector and the eigenvoice vectors.
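A compact sketch of the eigenvoice construction using PCA (via SVD) on speaker-dependent supervectors. Projecting the new speaker by least squares, as done below, is a simplification: in practice the combination weights are estimated by maximum likelihood on the adaptation data (MLED). The SI supervector could equally serve as the origin instead of the sample mean used here.

```python
import numpy as np

def train_eigenvoices(sd_supervectors, K):
    """PCA on R speaker-dependent supervectors (means concatenated per speaker).

    sd_supervectors: (R, P) matrix, one row per speaker-dependent model.
    Returns the mean supervector and the top-K eigenvoices (K, P).
    """
    mean_sv = sd_supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(sd_supervectors - mean_sv, full_matrices=False)
    return mean_sv, vt[:K]

def adapt_new_speaker(mean_sv, eigenvoices, adapt_supervector):
    """Least-squares projection of a rough new-speaker supervector onto the
    eigenvoice space (ML weight estimation would be used in practice)."""
    weights = eigenvoices @ (adapt_supervector - mean_sv)
    return mean_sv + weights @ eigenvoices, weights
```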
Summary
Two major approaches: MAP (and EMAP) and MLLR.
MAP needs more data than MLLR (it uses a simple prior); with enough data MAP
converges to the SD model.
Adaptive training is gaining popularity.
For mobile applications, complexity and memory are major concerns.
