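MAP Adaptation
HMMs have no sufficient statistics of fixed dimension, so conjugate prior-posterior pairs can't be used directly.
Find the posterior mode via EM.
Find the prior empirically (it is multi-modal; the first model is estimated using ML training).
$$\hat{\Lambda}_{\mathrm{MAP}} = \arg\max_{\Lambda} \, p(O \mid \Lambda, W)\, g(\Lambda)$$
Here O is the adaptation data, W its transcription, $\Lambda$ the HMM parameters and g the prior density.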
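To make the MAP estimate concrete, here is a minimal sketch of a commonly used MAP update for a single Gaussian mean, with the occupation probabilities assumed to come from the EM E-step. The function name, the prior weight `tau` and the toy data are illustrative assumptions, not part of the slides.

```python
def map_mean_update(prior_mean, frames, posteriors, tau):
    """MAP re-estimate of one Gaussian mean, dimension by dimension.

    prior_mean : mean of the prior (e.g. taken from the SI model)
    frames     : adaptation feature vectors (lists of floats)
    posteriors : occupation probabilities gamma_t for this Gaussian (from EM)
    tau        : prior weight (hypothetical name); larger values trust the prior more
    """
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    weighted_sum = [0.0] * dim
    for x, gamma in zip(frames, posteriors):
        for d in range(dim):
            weighted_sum[d] += gamma * x[d]
    # Interpolate between the prior mean and the data, weighted by tau vs. occupancy.
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
            for d in range(dim)]


prior = [0.0, 0.0]
frames = [[2.0, 4.0], [4.0, 0.0]]   # sample mean is [3.0, 2.0]
gammas = [1.0, 1.0]

few = map_mean_update(prior, frames, gammas, tau=2.0)
many = map_mean_update(prior, frames * 500, gammas * 500, tau=2.0)
```

With two frames the estimate stays halfway between the prior and the sample mean; with a thousand frames it is close to the sample mean, which is the sense in which MAP approaches the speaker-dependent estimate as adaptation data grows.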
EMAP
Not every phoneme in every context occurs in the adaptation data, so correlations between variables need to be stored.
EMAP only considers correlations between mean vectors, under a jointly Gaussian assumption.
For large model sizes, share means across models.
$$S_0 = E\left[(\tilde{\mu} - \tilde{\mu}_0)(\tilde{\mu} - \tilde{\mu}_0)^T\right]$$
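Transformation Based Model Adaptation
Estimate a transform T parameterized by $\theta$.
ML:
$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \, p(O \mid T_{\theta}(\Lambda_{\mathrm{SI}}), W)$$
MAP:
$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \, p(O \mid T_{\theta}(\Lambda_{\mathrm{SI}}), W)\, g(\theta)$$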
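Bias, Affine and Nonlinear Transformations
ML estimation of bias:
$$\hat{\mu}_m = \mu_m + b_{r(m)}$$
Affine transformation:
$$\hat{\mu}_m = A_{r(m)}\,\mu_m + b_{r(m)}, \qquad \hat{\Sigma}_m = A_{r(m)}\,\Sigma_m\,A_{r(m)}^T$$
Nonlinear transformation ($\theta$ may be a neural network):
$$\hat{\mu}_m = g_{\theta}(\mu_m)$$
Here r(m) is the transformation class to which Gaussian m is assigned.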
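The affine case can be sketched in a few lines. This is a minimal illustration assuming a single transformation class, full covariance matrices stored as nested lists, and helper names (`matvec`, `matmul`, `adapt_gaussian`) that are hypothetical. The bias-only case corresponds to taking A as the identity.

```python
def matvec(A, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def matmul(A, B):
    """Matrix-matrix product for nested-list matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def adapt_gaussian(mean, cov, A, b):
    """Apply mean -> A mean + b and cov -> A cov A^T to one Gaussian."""
    new_mean = [m + bi for m, bi in zip(matvec(A, mean), b)]
    new_cov = matmul(matmul(A, cov), transpose(A))
    return new_mean, new_cov


mean = [1.0, 2.0]
cov = [[1.0, 0.0], [0.0, 4.0]]
A = [[2.0, 0.0], [0.0, 0.5]]   # toy transform, chosen by hand
b = [0.5, -1.0]

new_mean, new_cov = adapt_gaussian(mean, cov, A, b)
```

Transforming the covariance with A on both sides keeps it symmetric and positive definite whenever A is invertible, which is why the mean and covariance updates share the same matrix.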
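MLLR
Each Gaussian keeps its form but uses adapted parameters:
$$f(x) = \mathcal{N}(x;\, \hat{\mu}_m,\, \hat{\Sigma}_m)$$
Means, with $\xi_m = [\,1 \;\; \mu_m^T\,]^T$ the extended mean vector:
$$\hat{\mu}_m = A\mu_m + b = \mathbf{W}\xi_m, \qquad \mathbf{W} = [\, b \;\; A \,]$$
Variances, with $B_m$ the inverse of the Cholesky factor of $\Sigma_m^{-1}$ and $\hat{H}_m$ the variance transform to be estimated:
$$\hat{\Sigma}_m = B_m^T\,\hat{H}_m\,B_m$$
The transform $\mathbf{W}$ is estimated by ML on the adaptation data (W without bold is the transcription):
$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} \, p(O \mid \Lambda, \mathbf{W}, W)$$
Apply separate transformations to different parts of the model (HEAdapt in HTK).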
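As an illustration of applying separate transformations to different parts of the model, the sketch below adapts each mean with the transform of its own class, using the extended mean vector [1, mu]. The class assignment, the transform values and the function name are made up for the example; estimating the transforms from data is not shown.

```python
def mllr_adapt_means(means, class_of, transforms):
    """Adapt every Gaussian mean with the transform of its class.

    means      : list of mean vectors
    class_of   : class_of[m] is the class index of Gaussian m (assumed given)
    transforms : maps class index -> W = [b A], a d x (d + 1) matrix
    """
    adapted = []
    for m, mu in enumerate(means):
        W = transforms[class_of[m]]
        xi = [1.0] + list(mu)          # extended mean vector [1, mu]
        adapted.append([sum(w * x for w, x in zip(row, xi)) for row in W])
    return adapted


means = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]
class_of = [0, 0, 1]                   # first two Gaussians share a transform
transforms = {
    0: [[0.5, 1.0, 0.0], [0.0, 0.0, 1.0]],   # b = [0.5, 0.0], A = identity
    1: [[0.0, 2.0, 0.0], [1.0, 0.0, 2.0]],   # b = [0.0, 1.0], A = 2 * identity
}

adapted = mllr_adapt_means(means, class_of, transforms)
```

Because Gaussians in the same class share one transform, a Gaussian that never appears in the adaptation data is still moved by the data observed for the other members of its class.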
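SMAP
Model the mismatch between the SI model (x) and the test environment.
Normalize each frame against Gaussian m:
$$y_{mt} = \Sigma_m^{-1/2}\,(x_t - \mu_m)$$
No mismatch:
$$y_{mt} \sim \mathcal{N}(0, I)$$
Mismatch:
$$y_{mt} \sim \mathcal{N}(\nu, \theta)$$
Adapted parameters:
$$\hat{\mu}_m = \mu_m + \Sigma_m^{1/2}\,\nu, \qquad \hat{\Sigma}_m = \Sigma_m^{1/2}\,\theta\,(\Sigma_m^{1/2})^T$$
$\nu$ and $\theta$ are estimated by the usual ML methods on the adaptation data.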
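The normalization and the ML estimate of the mismatch can be sketched as follows. This simplified version assumes diagonal covariances, a hard frame-to-Gaussian alignment, and a single shared mismatch distribution; the function names and toy data are hypothetical, and the tree structure and MAP smoothing of full SMAP are not shown.

```python
import math


def estimate_mismatch(frames, means, variances, alignment):
    """ML estimate of the mismatch mean (nu) and variance (theta).

    alignment[t] is the index of the Gaussian that frame t is assigned to
    (a hard alignment, assumed here for simplicity).
    """
    dim = len(frames[0])
    ys = []
    for x, m in zip(frames, alignment):
        # Normalized frame: y = Sigma^{-1/2} (x - mu), diagonal case.
        ys.append([(x[d] - means[m][d]) / math.sqrt(variances[m][d])
                   for d in range(dim)])
    n = len(ys)
    nu = [sum(y[d] for y in ys) / n for d in range(dim)]
    theta = [sum((y[d] - nu[d]) ** 2 for y in ys) / n for d in range(dim)]
    return nu, theta


def apply_mismatch(mean, variance, nu, theta):
    """Map the mismatch back to model space for one Gaussian."""
    new_mean = [mu + math.sqrt(v) * n for mu, v, n in zip(mean, variance, nu)]
    new_var = [v * t for v, t in zip(variance, theta)]
    return new_mean, new_var


means = [[0.0], [10.0]]
variances = [[4.0], [1.0]]
frames = [[2.0], [6.0], [11.0], [13.0]]
alignment = [0, 0, 1, 1]

nu, theta = estimate_mismatch(frames, means, variances, alignment)
adapted_0 = apply_mismatch(means[0], variances[0], nu, theta)
adapted_1 = apply_mismatch(means[1], variances[1], nu, theta)
```

In this toy case both Gaussians see the same normalized shift, so one pair (nu, theta) moves each mean onto the sample mean of its own frames.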
Adaptive Training
Gender dependent model selection
VTLN (in HTK using WARPFREQ)
Speaker Adaptive Training
Assumption: there exists a compact model ($\Lambda_c$) that is related to every speaker-dependent model via an affine transformation T (~MLLR). The model and the transformations are found using EM.
$$(\hat{\Lambda}_c, \{\hat{T}_r\}) = \arg\max_{\Lambda_c,\, \{T_r\}} \; \prod_{r=1}^{R} p(O_r \mid T_r(\Lambda_c), W_r)$$
Cluster Adaptive Training
Group the speakers in the training set into clusters, then find the cluster closest to the test speaker.
Use canonical models:
$$M_m = [\, \mu_m^{(1)} \;\; \cdots \;\; \mu_m^{(C)} \,]$$
$$\hat{\mu}_m^{(r)} = M_m\,\lambda^{(r)} + b$$
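The step of finding the closest cluster can be sketched as a likelihood comparison. This is a deliberately reduced illustration: each cluster is assumed to be a single diagonal Gaussian rather than a full HMM set, and the cluster names, parameters and function names are hypothetical.

```python
import math


def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of one frame under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def closest_cluster(frames, cluster_models):
    """Return the name of the cluster with the highest total log-likelihood."""
    scores = {}
    for name, (mean, var) in cluster_models.items():
        scores[name] = sum(diag_gauss_loglik(x, mean, var) for x in frames)
    return max(scores, key=scores.get)


clusters = {
    "cluster_a": ([0.0, 0.0], [1.0, 1.0]),
    "cluster_b": ([5.0, 5.0], [1.0, 1.0]),
}
test_frames = [[4.5, 5.2], [5.1, 4.8]]

best = closest_cluster(test_frames, clusters)
```

Picking a single cluster is the hard-decision special case of the interpolated mean above, where the weight vector selects one column of the cluster-mean matrix.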
Eigenvoices
Similar to Cluster Adaptive Training.
Concatenate the means of each of R speaker-dependent models into one vector per speaker. Perform PCA on the resulting vectors.
Store K << R eigenvoice vectors.
Form a vector of means from the SI model too.
Given a new speaker, the mean vector is a linear combination of the SI vector and the eigenvoice vectors.
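The final step, forming the new speaker's mean vector, can be sketched as below. The eigenvoice vectors and weights are toy values assumed to be given; computing the eigenvoices by PCA and estimating the weights from adaptation data are not shown, and the function name is hypothetical.

```python
def eigenvoice_mean(si_supervector, eigenvoices, weights):
    """Combine the SI mean vector with K weighted eigenvoice vectors."""
    adapted = list(si_supervector)
    for w, e in zip(weights, eigenvoices):
        for i, e_i in enumerate(e):
            adapted[i] += w * e_i
    return adapted


si = [1.0, 2.0, 3.0, 4.0]                # concatenated SI means
eigenvoices = [
    [1.0, 0.0, -1.0, 0.0],               # K = 2 toy eigenvoices
    [0.0, 1.0, 0.0, -1.0],
]
weights = [0.5, -2.0]                    # one weight per eigenvoice

speaker_mean = eigenvoice_mean(si, eigenvoices, weights)
```

Only K weights have to be estimated per speaker, regardless of how many Gaussian means the concatenated vector holds, which is what makes the approach usable with very little adaptation data.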
Summary
Two major approaches: MAP (and EMAP) and MLLR.
MAP needs more adaptation data than MLLR because it uses a simple prior. With enough data, MAP --> SD model.
Adaptive training is gaining popularity.
For mobile applications, complexity and memory are major concerns.