
Study on Acoustic Modeling and Parameter Tying Strategy for Chinese Speech Recognition

Abstract

Acoustic modeling is one of the key problems in the field of speech recognition. This thesis studies acoustic modeling techniques and parameter tying strategies, focusing on two aspects: the proposal and implementation of the Semi-Continuous Segmental Probability Model (SCSPM), and the study of Context Dependent (CD) acoustic modeling with decision-tree-based state tying, implemented for both CD phone models and Initial/Final (IF) models. The main contributions are:
(1) The SCSPM is proposed and implemented. Building on the traditional Hidden Markov Model (HMM) and its modification, the Mixed Gaussian Continuous Probability Model (MGCPM), it integrates the Vector Quantization (VQ) technique with continuous probability density distributions and adopts tied mixtures to describe the probability distribution of each state. Mixture weight reduction is also studied and analyzed, and a new, effective method is proposed that prunes the small tied weights during the iterative training process. Compared with the MGCPM, the SCSPM significantly reduces model size and computational complexity with little degradation in recognition accuracy.
(2) The HMM Tool Kit (HTK) platform is studied and analyzed, and an effective HTK-based procedure for acoustic model training and performance evaluation is implemented.
(3) The decision tree (DT) based state tying strategy for Context Dependent (CD) acoustic modeling is studied in depth. Two different DT design methods are analyzed, and the design of the question set and the DT node splitting strategy are discussed. Furthermore, a method for handling the silence model is studied to improve robustness.
(4) A CD phone model with decision-tree-based state tying is implemented, including the phone question set and different DT structures. Compared with the baseline syllable model, the CD-Phone model reduces the syllable error rate (SER) by about 10%.
(5) A CD Initial/Final (IF) model with decision-tree-based state tying is implemented. To preserve the coupling between Initials and Finals, the Extended Initial/Final (XIF) set is proposed by adding the zero Initials to the standard IF set. Experiments show that the XIF model outperforms the IF model: the implemented CD-XIF model achieves a syllable correct rate of over 80% and reduces the SER by over 25% relative to the baseline syllable model.

Keywords: Semi-Continuous Segmental Probability Model, Parameter Tying Strategy, Decision Tree Based State Tying, Context Dependent Phone Model, Context Dependent Initial/Final Model

1.1 Speech Recognition: An Overview
1.1.1 A brief history

Research on automatic speech recognition began in the 1950s. In 1952, Davis and his colleagues built a device that recognized the ten English digits spoken by a single speaker. Around 1960, Denes introduced linguistic knowledge into the recognition process. In the 1970s, linear predictive coding (LPC) analysis, dynamic time warping (DTW) [Vintsjuk 1968] and vector quantization (VQ) made practical recognizers possible. Statistical modeling based on the Hidden Markov Model (HMM) subsequently became the dominant paradigm: CMU's Sphinx system demonstrated speaker-independent large-vocabulary continuous recognition in the late 1980s, and in the 1990s commercial dictation products such as IBM ViaVoice appeared.

1.1.2 Classification of recognition tasks

Recognition systems are usually classified along three dimensions: by speaker, into Speaker Dependent (SD) and Speaker Independent (SI) systems; by vocabulary size, into Small Vocabulary, Medium Vocabulary and Large Vocabulary systems; and by speaking style, into Isolated Word, Connected Word and Continuous Speech recognition.

1.1.3 Key techniques

A recognizer involves four key techniques.
(1) Feature extraction. Widely used features include the linear prediction coefficients (LPC), the LPC-derived cepstrum (LPCC) and the Mel-frequency cepstral coefficients (MFCC). Wilpon et al. applied such features to the recognition of a limited word set in unconstrained speech [Wilpon 1989] and to keyword recognition [Wilpon 1991].
(2) Acoustic modeling. The HMM, trained with the forward-backward algorithm and decoded with the Viterbi algorithm [Viterbi 1967], is the mainstream acoustic model. Artificial Neural Networks (ANNs) have also been studied, mostly in hybrid ANN/HMM systems [Franzini 1990][Morgan 1990], because a pure ANN handles the temporal structure of speech less naturally than an HMM.
(3) Language modeling. Language models are either statistical (Statistical Language Model) or knowledge based (Knowledge-based Language Model). The statistical N-gram model introduced by Jelinek et al. [Jelinek 1983] is the most successful; smoothing methods for sparse data were proposed by Nadas [Nadas 1985] and Katz [Katz 1987].
(4) Search. Decoding is usually organized as a frame-synchronous Viterbi search over the HMM network [Lee 1989].

1.2 The Statistical Framework of Speech Recognition
1.2.1 The fundamental equation

Figure 1-1 shows the structure of a typical recognizer. Given the acoustic observation sequence A, the recognizer searches the lexicon L for the word sequence w* with the largest posterior probability:

w* = argmax_w P(w|A) = argmax_w P(A|w) P(w) / P(A)    (1.1)

By the Bayes rule, the posterior factors into the acoustic model score P(A|w), the language model score P(w), and the observation probability P(A). Since P(A) does not depend on w, it can be dropped from the maximization:

w* = argmax_w P(A|w) P(w)    (1.2)

1.2.2 Approaches to acoustic modeling

The two main families of acoustic models are the HMM and the ANN. The HMM describes both the temporal structure and the observation distribution of speech within one probabilistic framework. The ANN is a powerful static classifier but models temporal dynamics poorly, so it is typically combined with segment models or HMMs [Verhasselt 1998], and training such hybrid systems is also more expensive. For these reasons the HMM framework is adopted in this thesis.

1.3 Modeling Units for Chinese

The choice of the base modeling unit strongly affects a Chinese recognizer [Zheng 1996]. The main candidates are:
(1) The syllable (Syllable). Chinese is a monosyllabic language with only 418 toneless base syllables [Zheng 1996], so whole-syllable models are natural and avoid modeling the transitions inside a syllable; their drawback is that no parameters are shared between syllables, so large training sets are needed.
(2) The phoneme (Phoneme) and the triphone (Triphone). Phoneme units are few in number and well shared, and context dependent triphones capture coarticulation, but the number of possible triphones is very large and many of them receive little training data [Yang 1995].
(3) The Initial/Final (IF). A Chinese syllable consists of an Initial followed by a Final, so IF units match the phonological structure of the language; context dependent IF (Context Dependent Initial/Final, CD-IF) modeling was studied in [Li 2000].

Because coarticulation (Coarticulation) makes the realization of a unit depend on its neighbors, Context Dependent (CD) modeling is adopted throughout this thesis.

1.4 Parameter Tying

Context dependent modeling multiplies the number of parameters while reducing the amount of data per unit, so parameter tying is essential. At the distribution level, the Semi-Continuous HMM (SCHMM), also known as the Tied Mixture HMM (TMHMM), lets all states share one set of Gaussian probability density functions (Probability Density Function, pdf), in contrast to the Continuous HMM (CHMM), in which each state owns its mixture densities; the SCHMM was used, for example, in the EasyTalk dictation system [Zheng 1999]. The MGCPM [Zheng 1998] is a CHMM-style segmental model with high accuracy but a large parameter count. At the state level, decision-tree-based state tying [Reichl 2000], built on classification and regression trees [Breiman 1984], clusters the states of context dependent models so that rare contexts share parameters with similar, better-trained ones.

1.5 Organization of the Thesis

The main work reported here is: (1) the proposal and implementation of the Semi-Continuous Segmental Probability Model (SCSPM); (2) the construction of an HTK-based platform for acoustic model training and evaluation; (3) the study of decision-tree-based state tying and the implementation of context dependent phone and Initial/Final models.

2.1 The Hidden Markov Model

The HMM is the foundation of modern acoustic modeling. This section reviews its definition and its training and decoding algorithms, following the tutorial of [Rabiner 1989].

2.1.1 From the Markov chain to the HMM

A Markov chain is a stochastic process {S(t): t ∈ T} whose next state depends only on the current state s_t = S(t). For a discrete-time chain over a countable state set I,

P{S(t+1) = s | S(0) = s_0, S(1) = s_1, …, S(t) = s_t} = P{S(t+1) = s | S(t) = s_t}    (2-1)

The transition probability from state i to state j at time t is

a_ij(t) = P{S(t+1) = j | S(t) = i}    (2-2)

and satisfies

a_ij(t) ≥ 0,  ∀ i, j ∈ I    (2-3)

Σ_{j∈I} a_ij(t) = 1,  ∀ i ∈ I    (2-4)

In a hidden Markov model (HMM) the state sequence itself is not observed: at each time t the process, being in state s_t, emits an observation drawn from a state-dependent output distribution, and only the observations are available. The HMM is thus a doubly stochastic process, specified by:
(1) the state set N = {1, 2, 3, …, N}, with s_t = s(t) ∈ N the state at time t;
(2) the transition matrix A(t) = [a_ij(t)]_{N×N}, with a_ij(t) = P{s(t+1) = j | s(t) = i};
(3) the output distributions B = {b_j(·)}_{N×1}, with b_j(x) = Pd{output = x | state = j};
(4) the initial distribution π = {π_i}_{N×1}, with π_i = P{s(1) = i}.
For a homogeneous HMM the transition probabilities are independent of t, so A = [a_ij]_{N×N} with a_ij = P{s(t+1) = j | s(t) = i}, and an N-state HMM is written compactly as λ = {π, A, B}.

2.1.2 HMM training and decoding

Training estimates the parameters λ from observed data. Given M training sequences {O^(m), m = 1, …, M}, the maximum likelihood criterion chooses the λ that maximizes their joint probability. For a single sequence O = o_1 o_2 … o_T, this is done with the Forward-Backward (Baum-Welch) algorithm [Baum 1972]. The forward variable is defined, for 1 ≤ j ≤ N, by

α_t(j) = P{o_1, o_2, …, o_t, s_t = j | λ} = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] b_j(o_t),  2 ≤ t ≤ T    (2-5)

α_1(j) = π_j b_j(o_1)    (2-6)

and the backward variable, for 1 ≤ i ≤ N, by

β_t(i) = P{o_{t+1}, o_{t+2}, …, o_T | s_t = i, λ} = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  1 ≤ t ≤ T-1    (2-7)

β_T(i) = 1    (2-8)

The re-estimation formulas are then [Baum 1972][Huang 1989]:

π̂_i = α_1(i) β_1(i) / Σ_{i=1}^{N} α_1(i) β_1(i)    (2-9)

â_ij = [ Σ_{t=1}^{T-1} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) ] / [ Σ_{t=1}^{T-1} α_t(i) β_t(i) ]    (2-10)

b̂_i(x) = f(λ, O)    (2-11)

where the output-distribution update (2-11) depends on the form of b_i(·) (see Section 2.1.3). The total probability of the observations is

P{O|λ} = Σ_{i=1}^{N} α_T(i)    (2-12)
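As a concrete illustration, the forward recursion (2-5)-(2-6) and the total probability (2-12) can be sketched for a discrete-output HMM as follows. This is a minimal NumPy sketch under the definitions above; the array layout is our own choice, not taken from the thesis:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: alpha[t, j] = P(o_1..o_t, s_t = j | lambda).

    pi: (N,) initial probabilities; A: (N, N) transition matrix;
    B: (N, K) discrete output probabilities; obs: list of symbol indices.
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # (2-6)
    for t in range(1, T):
        # sum_i alpha_{t-1}(i) a_ij, then multiply by b_j(o_t)   (2-5)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

# Total probability P{O | lambda} = sum_i alpha_T(i)   (2-12):
# forward(pi, A, B, obs)[-1].sum()
```

For short sequences the result can be checked against a brute-force sum over all state paths, which is exactly the quantity (2-12) computes efficiently.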

For decoding, the Viterbi algorithm [Viterbi 1967] finds the single best state sequence. Define

δ_1(j) = π_j b_j(o_1),  1 ≤ j ≤ N    (2-13)

and, for 2 ≤ t ≤ T and 1 ≤ j ≤ N,

δ_t(j) = max_{1≤i≤N} [ δ_{t-1}(i) a_ij ] b_j(o_t)    (2-14)

ψ_t(j) = argmax_{1≤i≤N} [ δ_{t-1}(i) a_ij ]    (2-15)

The best final state and the backtracked path are

s_T^(ML) = argmax_{1≤i≤N} δ_T(i)    (2-16)

s_t^(ML) = ψ_{t+1}( s_{t+1}^(ML) ),  1 ≤ t ≤ T-1    (2-17)

S^(ML) = {s_t^(ML), 1 ≤ t ≤ T} is the Maximum Likelihood (ML) state sequence, and its score is

Pd^(ML) = δ_T( s_T^(ML) )
        = π_{s_1^(ML)} b_{s_1^(ML)}(o_1) Π_{t=2}^{T} a_{s_{t-1}^(ML) s_t^(ML)} b_{s_t^(ML)}(o_t)
        = P{S^(ML) | λ} Pd{O | S^(ML), λ}    (2-18)

By contrast, the total output probability sums over all state sequences S = {s_t, 1 ≤ t ≤ T}:

Pd{O|λ} = Σ_S Pd{O, S|λ} = Σ_S Pd{O|λ, S} P{S|λ}
        = Σ_S π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)    (2-19)

The Viterbi score (2-18) is therefore a computationally cheap approximation of (2-19). For speech, the HMM is constrained to be left-to-right, i.e. a_ij = 0 for j < i, and often further constrained to allow only self-loops and single-step forward transitions (a_ij = 0 for j < i or j > i + 1).
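The Viterbi recursion (2-13)-(2-17) can be sketched in the same style as the forward algorithm above; again a minimal NumPy sketch with an illustrative array layout, not the thesis implementation:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding (2-13)-(2-17): best state path and its score."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                       # (2-13)
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A              # delta_{t-1}(i) a_ij
        psi[t] = trans.argmax(axis=0)                  # (2-15)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]    # (2-14)
    path = [int(delta[-1].argmax())]                   # (2-16)
    for t in range(T - 1, 0, -1):                      # backtracking (2-17)
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```

The returned score is Pd^(ML) of (2-18), the probability of the single best path, as opposed to the full sum (2-19).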

2.1.3 Types of HMM

According to the form of the output distribution b_j(x), HMMs fall into three classes: the discrete HMM (Discrete HMM, DHMM), the continuous HMM (Continuous HMM, CHMM) and the semi-continuous HMM (Semi-Continuous HMM, SCHMM) [Huang 1989]. In every case the output probability factorizes along the state sequence:

Pd{O|λ} = Σ_S Pd{O, S|λ} = Σ_S P{S|λ} Pd{O|λ, S}    (2-20)

Pd{O|λ, S} = Π_{t=1}^{T} b_{s_t}(o_t)    (2-21)

so the three classes differ only in how the state output probability b_{s_t}(o_t) is represented and computed.

2.1.3.1 The discrete HMM

In the DHMM the observations are first vector quantized (Vector Quantization, VQ): each feature vector is replaced by the nearest codeword (codeword) of a pre-trained codebook (codebook), so b_{s_t}(o_t) becomes b_{s_t}(V(o_t)), where V(·) denotes the quantizer. The discrete output probabilities b_i(v) are re-estimated as

b̂_i(v) = [ Σ_{t: V(o_t)=v} γ_t(i) ] / [ Σ_{t=1}^{T} γ_t(i) ]    (2-22)

where γ_t(i) is the posterior probability of occupying state i at time t. The DHMM is very cheap to evaluate, since each b_i(v) is a table lookup, but the quantization introduces distortion that limits the achievable accuracy.
2.1.3.2 The continuous HMM

The CHMM avoids quantization distortion by modeling each state's output with a Mixed Gaussian Density (MGD) [Wilpon 1989][Huang 1989]:

b_n(x) = Σ_{m=1}^{M_n} g_nm N(x; μ_nm, Σ_nm)    (2-23)

Σ_{m=1}^{M_n} g_nm = 1    (2-24)

where M_n is the number of mixture components of state n. The CHMM is the most accurate of the three types, but also the most expensive in parameters and computation, since every MGD must be stored, evaluated and trained separately.

2.1.3.3 The semi-continuous HMM

In the CHMM every state n owns its parameters (μ_nm, Σ_nm, g_nm), so with M components per state, N·M Gaussians must be kept and evaluated. The SCHMM combines the codebook idea of VQ with continuous densities: all states share one codebook of L Gaussians {V_l, 1 ≤ l ≤ L}, and

b_{s_t}(o_t) = f{o_t|s_t} = Σ_{l=1}^{L} f{o_t|V_l, s_t} P{V_l|s_t} ≈ Σ_{l=1}^{L} f{o_t|V_l} b_{s_t}^(D){V_l}    (2-25)

where b_{s_t}^(D){V_l} is a discrete weight on codeword V_l and f{o_t|V_l} is the Gaussian density of codeword V_l. Writing g_nl for the weights,

b_n(o_t) = Σ_{l=1}^{L} g_nl f{o_t|V_l}    (2-26)

That is, each state's MGD becomes a Tied Mixture Gaussian Density (TMGD) [Bellegarda 1990]: the L Gaussians are shared by all states, and only the weight vectors G = {g_nl, 1 ≤ n ≤ N} are state specific. Per frame, only L densities need to be evaluated regardless of the number of states, so the SCHMM approaches the accuracy of the CHMM at a fraction of its cost.
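The computational advantage of (2-26) is that the shared densities are evaluated once per frame and reused by every state. A minimal sketch, assuming diagonal-covariance codebook Gaussians; the function names are illustrative, not from the thesis:

```python
import numpy as np

def codebook_densities(x, means, variances):
    """Evaluate the L shared diagonal Gaussians f{o_t | V_l} at one frame x."""
    diff2 = (x - means) ** 2 / variances
    log_f = -0.5 * (diff2.sum(axis=1)
                    + np.log(2 * np.pi * variances).sum(axis=1))
    return np.exp(log_f)

def tied_emission(x, weights, means, variances):
    """b_n(o_t) = sum_l g_nl f{o_t | V_l}  (2-26), for all N states at once.

    weights: (N, L) state weight vectors; means, variances: (L, d) codebook.
    """
    f = codebook_densities(x, means, variances)   # L densities, computed once
    return weights @ f                            # (N, L) @ (L,) -> (N,)
```

The matrix-vector product makes the per-frame cost O(L·d + N·L) instead of the CHMM's O(N·M·d) density evaluations.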

2.1.4 Discussion

The strength of the HMM lies in its rigorous probabilistic formulation and in the efficient algorithms available for training and decoding, which is why HMM-based recognizers dominate the field. Its known weaknesses are the assumption that observations are conditionally independent given the state, the implicitly exponential state duration distribution, and the maximum likelihood training criterion, which does not directly minimize the recognition error. Partial remedies include feature transforms such as LDA, the choice among DHMM, CHMM and SCHMM as a trade-off between accuracy and cost, and discriminative training criteria such as Maximum Mutual Information (MMI) [Bahl 1986].

2.2 The Mixed Gaussian Continuous Probability Model (MGCPM)

In the HMM, the assignment of frames to states is obtained implicitly through Viterbi alignment during both training and decoding. The Mixed Gaussian Continuous Probability Model (MGCPM) [Zheng 1998][Mou 1998] is a segmental model: each utterance is segmented into states directly, and each state is described, as in the CHMM, by a mixed Gaussian density, so the repeated Viterbi alignment is avoided.

Let Z_k = {z_1, z_2, …, z_T} be the T feature vectors of dimension d assigned to the k-th state (k = 1, …, K). The state density is an M-component Gaussian mixture

p(z_j|λ) = Σ_{i=1}^{M} p_i p(z_j|λ_i)    (2-27)

p(z_j|λ_i) = (2π)^{-d/2} |R_i|^{-1/2} exp{ -(1/2)(z_j - u_i)^T R_i^{-1} (z_j - u_i) }    (2-28)

where u_i and R_i are the mean vector and covariance matrix of component i, and the weights satisfy Σ_{i=1}^{M} p_i = 1, p_i ≥ 0. The parameters λ = {p_i, u_i, R_i; i = 1, 2, …, M} are estimated by maximizing the likelihood of Z with the EM algorithm of Dempster [Dempster 1977]. Assuming the vectors of Z are independent,

p(Z|λ) = Π_{j=1}^{T} p(z_j|λ)    (2-29)

and the maximum satisfies

∂/∂λ [ ln p(Z|λ) ] = 0    (2-30)

Differentiating with respect to the mean u_i (i = 1, 2, …, M),

∂/∂u_i [ ln p(Z|λ) ] = Σ_{j=1}^{T} p(z_j|λ)^{-1} p_i ∂p(z_j|λ_i)/∂u_i
                     = Σ_{j=1}^{T} p(z_j|λ)^{-1} p_i p(z_j|λ_i) R_i^{-1} (z_j - u_i) = 0    (2-31)

Defining the posterior weight of component i for frame z_j as

P_{i,j} = p(z_j|λ)^{-1} p_i p(z_j|λ_i)    (2-32)

equation (2-31) yields the mean update

û_i = [ Σ_{j=1}^{T} P_{i,j} z_j ] / [ Σ_{j=1}^{T} P_{i,j} ]    (2-33, 2-34)

and analogous derivations for R_i and p_i give

R̂_i = [ Σ_{j=1}^{T} P_{i,j} (z_j - û_i)(z_j - û_i)^T ] / [ Σ_{j=1}^{T} P_{i,j} ]    (2-35)

p̂_i = T^{-1} Σ_{j=1}^{T} P_{i,j}    (2-36)

Note that (2-36) automatically satisfies Σ p̂_i = 1, p̂_i ≥ 0. The new estimates (û_i, R̂_i, p̂_i) replace the old parameters, and the procedure is iterated until the likelihood (2-29) converges.
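One iteration of the updates (2-32)-(2-36) can be sketched as follows, simplified to diagonal covariances so that R_i is stored as a vector of variances; this is a toy sketch of the EM step, not the thesis implementation:

```python
import numpy as np

def em_step(Z, p, u, R_diag):
    """One EM iteration for a diagonal-covariance Gaussian mixture.

    Z: (T, d) frames; p: (M,) weights; u: (M, d) means; R_diag: (M, d) variances.
    Implements the posterior (2-32) and the re-estimates (2-33)-(2-36).
    """
    T, d = Z.shape
    # component densities p(z_j | lambda_i), shape (M, T)
    diff2 = ((Z[None, :, :] - u[:, None, :]) ** 2) / R_diag[:, None, :]
    log_g = -0.5 * (diff2.sum(-1) + np.log(2 * np.pi * R_diag).sum(-1)[:, None])
    mix = p[:, None] * np.exp(log_g)
    P = mix / mix.sum(axis=0, keepdims=True)          # P_{i,j}   (2-32)
    w = P.sum(axis=1)                                 # sum_j P_{i,j}
    u_new = (P @ Z) / w[:, None]                      # (2-33)/(2-34)
    R_new = (P @ (Z ** 2)) / w[:, None] - u_new ** 2  # diagonal of (2-35)
    p_new = w / T                                     # (2-36)
    return p_new, u_new, R_new
```

Each call is one EM iteration; repeating it until the log of (2-29) stops improving reproduces the training loop described above (EM guarantees the likelihood never decreases).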

2.3 Summary

For Chinese recognition with whole-syllable units, 418 toneless base models are needed; once the models are made context dependent, the number of states and Gaussians quickly grows into the tens of thousands, far beyond what the training data can support. Parameter sharing is therefore indispensable, both at the mixture level with tied mixtures (Tied Mixture HMM) and at the state level with state tying (State Tying) based on decision trees (Decision Tree); these two directions are developed in Chapters 3 and 4.

The previous chapter reviewed the HMM and the MGCPM. The MGCPM achieves high recognition accuracy, but each of its states owns a full mixed Gaussian density, so the model set is large and expensive to evaluate. Borrowing the Gaussian-sharing idea of the SCHMM, this chapter proposes a tied-mixture segmental model, the Semi-Continuous Segmental Probability Model (SCSPM).

3.1 The SCSPM
3.1.1 The Segmental Probability Model (SPM)

The Segmental Probability Model (SPM) [Juang 1985][Zheng 1997] simplifies the HMM by segmenting each utterance into states directly, for example with the Equal Feature Variance Sum (EFVS) or the Non-Linear Partition (NLP) criterion [Xu 1999], instead of aligning frames to states probabilistically; each segment (state) is then described by a mixed Gaussian density (MGD) [Jiang 1989]. Compared with the HMM, the SPM is simpler to train and faster to decode.

3.1.2 The tying idea of the SCHMM

In the CHMM every state owns its pdfs; in the SCHMM all states share one codebook of pdfs and keep only state-specific weights (the Tied Mixture HMM [Bellegarda 1990]; see also [Duchateau 1998]). This reduces both the number of Gaussians and the number of density evaluations dramatically, with little loss in accuracy.

3.1.3 The SCSPM

Applying SCHMM-style tying to the segmental MGCPM yields the SCSPM: all states of all models share a single Gaussian codebook, and each state stores only a weight vector over the codebook.

(Figure 3-1  The structure of the SCSPM.)

3.2 Building the Codebook

Two problems must be solved to build an SCSPM: constructing the shared Gaussian codebook, and estimating the tied weights of each state. This section addresses the codebook.

3.2.1 Generating the codebook by clustering

The classic codebook design algorithm is the LBG algorithm (also called the Generalized Lloyd Algorithm, GLA) [Sadaoki 1985], an extension of K-means clustering:

(1) Given the training set S;
(2) set the maximum number of iterations L;
(3) choose the distortion threshold ε;
(4) choose M initial centroids C_1^(0), C_2^(0), …, C_M^(0);
(5) set D^(0) = ∞;
(6) set m = 1;
(7) partition S into M clusters S_1^(m), S_2^(m), …, S_M^(m), assigning X to S_l^(m) iff

d(X, C_l^(m-1)) ≤ d(X, C_i^(m-1)),  ∀i    (3-1)

(8) compute the total distortion

D^(m) = Σ_{l=1}^{M} Σ_{X∈S_l^(m)} d(X, C_l^(m-1))    (3-2)

(9) compute the relative distortion improvement

δ^(m) = ( D^(m-1) - D^(m) ) / D^(m)    (3-3)

(10) update the centroids

C_i^(m) = (1/N_i) Σ_{X∈S_i^(m)} X    (3-4)

(11) if δ^(m) < ε, go to (13); otherwise continue with (12);
(12) if m < L, set m = m + 1 and go to (7); otherwise go to (13);
(13) output the centroids C_1^(m), C_2^(m), …, C_M^(m) and the distortion D^(m);
(14) stop.

Here L bounds the number of iterations and ε is a small value between 0 and 1; the loop usually converges within about 20 iterations. The difference between LBG and plain K-means lies in the initialization: LBG starts from a single centroid X_0, the mean of S, perturbs it by a factor of 1.01 to obtain the pair (X_0, 1.01·X_0), runs the K-means refinement, and then splits every codeword again; after B binary splits the codebook contains M = 2^B codewords.

In this thesis the codebook is obtained not from raw feature vectors but from the Gaussian components (MGDs) of already-trained MGCPMs: the Gaussians of all states are pooled, clustered and merged stage by stage until the target codebook size N is reached. Here N = 4,096 is used, i.e. the final shared codebook contains 4,096 Gaussians.
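The clustering loop above can be sketched as follows, with squared Euclidean distortion and the binary splitting (perturbation factor 1.01) described in the text; a toy sketch, not the thesis implementation:

```python
import numpy as np

def lbg(S, n_splits, n_iter=20, eps=1e-4):
    """LBG codebook design: start from the global centroid, repeatedly
    split every codeword, and refine each stage with K-means steps
    (3-1)-(3-4). Returns 2**n_splits codewords."""
    C = S.mean(axis=0, keepdims=True)              # single initial centroid
    for _ in range(n_splits):
        C = np.vstack([C, 1.01 * C])               # binary split
        D_prev = np.inf
        for _ in range(n_iter):
            d = ((S[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(axis=1)              # nearest codeword (3-1)
            D = d[np.arange(len(S)), assign].sum() # total distortion (3-2)
            if D_prev - D <= eps * max(D, 1e-12):  # small gain (3-3): stop
                break
            D_prev = D
            for i in range(len(C)):                # centroid update (3-4)
                pts = S[assign == i]
                if len(pts):
                    C[i] = pts.mean(axis=0)
    return C
```

With well-separated data, one split already recovers the two cluster centers; repeated splitting doubles the codebook size each stage, matching M = 2^B.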

3.2.2 Shrinking the codebook by merging Gaussians

The pooled MGCPM Gaussians can also be reduced by repeatedly merging the closest pair. With the distance between two Gaussians defined on their mean vectors,

D(G_1, G_2) = || μ_1 - μ_2 ||    (3-5)

where G_t denotes a Gaussian with mean vector μ_t and variance vector σ_t. At each step the closest pair (G_i, G_j) is found and merged into a single Gaussian G with

μ = ( μ_i + μ_j ) / 2,  σ = ( σ_i + σ_j ) / 2    (3-6)

Each merge reduces the codebook size from n to n - 1; merging is repeated until the target size N is reached.
3.3

M m (m=1,2,..,M) m
Z J Z Zj Z
j(j=1,2,..J) g m m

MGD, Z j
M

p( z j | ) = {g m ( p( z j | m )}

(3-7)

m =1

g
m =1

=1

(3-8)
- 27 -

j =1

j =1

h( ) = ln( p ( Z | )) = ln( p( z j | )) = ln p (z j | )

(3-9)

3-9
M

f = h( ) + ( g m 1)

(3-10)

m =1

t=1,2,M,
J

j=1

m =1

p( z j | ) 1 gt ( g m p( z j | m )) + = 0

(3-11)

g t = p ( z j | ) 1 g t p ( z j | t )

(3-12)

j =1

= p ( z j | ) 1 g t p ( z j | t ) = J

(3-13)

t =1 j =1

3-123-13 g t
J
(
g t = J 1 p ( z j | ) 1 g t p ( z j | t )

(3-14)

j =1
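The update (3-14) can be sketched in a few lines, assuming the codebook densities p(z_j|λ_m) have already been evaluated for the state's frames; names are illustrative:

```python
import numpy as np

def reestimate_weights(dens, g):
    """One re-estimation pass (3-14) for the tied weights of one state.

    dens[j, m] = p(z_j | lambda_m): codebook densities at the state's
    J frames; g[m]: current weights. Returns the updated weight vector."""
    J = dens.shape[0]
    mix = dens @ g                        # p(z_j | lambda)   (3-7)
    post = (dens * g) / mix[:, None]      # p(z_j|lam)^-1 g_t p(z_j|lam_t)
    return post.sum(axis=0) / J           # (3-14)
```

Because each row of `post` sums to one, the updated weights automatically satisfy the constraint (3-8), and as an EM step the update never decreases the likelihood (3-9).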

3.4 Reducing the Number of SCSPM Weights

In a tied-mixture model most of a state's weights are close to zero, so removing them shrinks storage and computation considerably [Gales 1999][Fischer 1999]. Four weight reduction methods are studied:
(1) absolute thresholding: weights below a fixed threshold are set to 0;
(2) rank thresholding: only the N largest weights of each state are kept;
(3) cumulative thresholding: the largest weights are kept until their cumulative sum reaches a given fraction, and the rest are set to 0;
(4) iterative pruning, the method proposed here: after each training iteration, the very small weights are pruned before the next re-estimation pass, so that the surviving weights are retrained to compensate for the removal.

After any pruning step the surviving weights of a state are renormalized so that the constraint (3-8) still holds. Since the SCSPM state densities are tied MGDs, pruning a weight simply drops one codebook Gaussian from that state's mixture while leaving the shared codebook itself intact.
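The pruning-plus-renormalization step shared by all four methods can be sketched as follows; the threshold value is illustrative only:

```python
import numpy as np

def prune_weights(g, threshold=1e-5):
    """Zero every tied weight below `threshold`, then rescale the
    survivors so the sum-to-one constraint (3-8) still holds.

    A sketch of the pruning step applied between training iterations
    in the proposed iterative scheme; the threshold is illustrative."""
    g = np.where(g < threshold, 0.0, g)
    return g / g.sum()
```

In the iterative scheme, `prune_weights` is applied to each state's weight vector after every re-estimation pass, so subsequent passes redistribute the pruned probability mass over the remaining codewords.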

3.5 Experiments

The SCSPM is evaluated on the 863 continuous Mandarin corpus with toneless-syllable recognition. Speech is sampled at 16 kHz and parameterized as MFCC feature vectors, and results are reported as the top-1, top-5 and top-10 candidate syllable accuracy.

3.5.1 Codebook size

Codebooks of 1,024, 2,048 and 4,096 Gaussians are compared. Table 3-1 lists the resulting recognition rates: the top-1 rate rises from 60.39-61.33% at codebook size 1,024, through 69.50-70.21% at 2,048, to 75.49% at 4,096; the corresponding top-5 rates are 85.74-86.08%, 88.79-88.93% and 90.48%, and the top-10 rates 91.56-91.64%, 92.76-92.94% and 94.42%. Enlarging the codebook from 1,024 to 4,096 thus cuts the top-1 syllable error rate by 36.6% relative.

3.5.2 Weight reduction

With the 2,048-Gaussian codebook, the four weight reduction methods of Section 3.4 are compared over the training iterations, with settings th = 0.005 (absolute threshold), th = 50 (rank threshold), th = 90% (cumulative threshold) and th = 0.00001 (iterative pruning).

(Figure 3-2  Recognition rate (%) of the four weight reduction methods over the training iterations, codebook size 2,048.)

3.5.3 SCSPM versus MGCPM

Table 3-2 compares the SCSPM with MGCPMs using 4 and 8 mixture components per state.

Table 3-2  Comparison of the SCSPM and the MGCPM

                      MGCPMs (4 mix)   MGCPMs (8 mix)   SCSPMs
Number of Gaussians   9,669            18,685           4,096
Top-1 rate (%)        74.35            75.87            75.49
Top-5 rate (%)        89.15            90.62            90.48
Top-10 rate (%)       93.78            94.55            94.42

With only 4,096 shared Gaussians, the SCSPM outperforms the 4-mixture MGCPM (9,669 Gaussians) and nearly matches the 8-mixture MGCPM (18,685 Gaussians): the model size and the amount of density computation drop to roughly a quarter with almost no loss in accuracy.

3.6 Summary

This chapter proposed the SCSPM, which grafts SCHMM-style Gaussian sharing onto the segmental MGCPM. Codebook construction, tied weight estimation and weight pruning were described, and the experiments showed that the SCSPM matches the recognition accuracy of the MGCPM with a far smaller and faster model.

If context dependency were modeled at the syllable level, the 418 base syllables would generate about 73,000,000 tri-syllable units, which is clearly impractical. Context dependency must therefore be modeled with smaller units; this chapter studies context dependent phone modeling with decision-tree-based state tying.

4.1 The Phone Set
4.1.1 Phones

Following [Ma 2000], a phone set tailored to Mandarin is adopted. Table 4-1 lists the vowel phones together with the pinyin contexts in which they occur.

Table 4-1  The vowel phones and their pinyin contexts

/aI/    the a in ai, an
/a/     a
/Ie/    the e in ie
/eI/    the e in ei
/eN/    the e in en
/e/     e
/Ci/    the i in ci, si, zi
/CHi/   the i in chi, shi, zhi
/Bi/    i
/oU/    the o in ou
/o/     o
/u/     u
/v/     yu (ü)

This gives 13 vowel phones. The consonant phones are
/b/, /c/, /ch/, /d/, /f/, /g/, /h/, /j/, /k/, /l/, /m/, /n/, /ng/, /p/, /q/, /r/, /s/, /sh/, /t/, /x/, /z/, /zh/,
22 in all including /ng/. Together with /er/ and the silence phone /sil/, the phone set contains 37 units.

4.1.2 Decomposing the finals into phones

Table 4-2 gives the decomposition of every final into the phone set.

Table 4-2  Decomposition of the finals

a → /a/              ai → /aI/Bi/         an → /aI/n/
ang → /a/ng/         ao → /a/u/           e → /e/
ei → /eI/Bi/         en → /eN/n/          eng → /eN/ng/
er → /er/            o → /o/              ong → /u/ng/
ou → /oU/u/          i → /Bi/             ia → /Bi/a/
ian → /Bi/Ie/n/      iang → /Bi/a/ng/     iao → /Bi/a/u/
ie → /Bi/Ie/         in → /Bi/n/          ing → /Bi/ng/
iong → /Bi/u/ng/     iou → /Bi/oU/u/      u → /u/
ua → /u/a/           uai → /u/aI/Bi/      uan → /u/aI/n/
uang → /u/a/ng/      uei → /u/eI/Bi/      uen → /u/eN/n/
ueng → /u/eN/ng/     uo → /u/o/           v → /v/
van → /v/Ie/n/       ve → /v/Ie/          vn → /v/n/
io → /Bi/o/
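A table-driven decomposition of syllables into this phone set can be sketched as follows. Only a few entries of Table 4-2 are included, and the helper names are ours, not from the thesis:

```python
# Partial decomposition table (a few entries of Table 4-2).
FINAL_TO_PHONES = {
    "a": ["a"], "ang": ["a", "ng"], "ian": ["Bi", "Ie", "n"],
    "uang": ["u", "a", "ng"], "iou": ["Bi", "oU", "u"],
}
# The 22 consonant phones double as syllable Initials here.
INITIALS = {"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "z", "c", "s", "r"}

def syllable_to_phones(initial, final):
    """Expand an (Initial, Final) pair into its phone sequence.

    A zero-Initial syllable is passed with initial="" and expands to
    the Final's phones alone."""
    assert final in FINAL_TO_PHONES, "final not in this partial table"
    head = [initial] if initial in INITIALS else []
    return head + FINAL_TO_PHONES[final]
```

For example, the syllable "zhang" expands to /zh/ /a/ /ng/, while the zero-Initial syllable "a" expands to /a/ alone.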

4.2 The Question Set and the Splitting Criterion
4.2.1 Designing the question set

The questions used for decision-tree state tying group the phones into phonetically meaningful classes, following Chinese phonology [Wu 1989]. Table 4-3 lists the vowel classes and Table 4-4 the consonant classes.

Table 4-3  Vowel question classes

High:              /Bi/, /Ci/, /CHi/, /u/, /v/
Medium:            /e/, /Ie/, /eI/, /eN/, /er/, /o/, /oU/
Low:               /a/, /aI/
Top Vowel:         /a/, /aI/, /e/, /Ie/, /eI/, /eN/, /Bi/, /o/, /oU/, /u/, /v/
Front:             /aI/, /a/, /Bi/, /v/, /Ie/
End:               /a/, /e/, /eI/, /eN/, /u/, /o/, /oU/
Unrounded:         /a/, /aI/, /Bi/, /e/, /Ie/, /eI/, /eN/
Rounded:           /u/, /v/, /o/, /oU/
Apical Vowel:      /Ci/, /CHi/
E1 (Evowel):       /e/, /Ie/, /eI/, /eN/
E2 (Evowel2):      /e/, /eI/, /eN/
I (Ivowel):        /Bi/, /Ci/, /CHi/
O (Ovowel):        /o/, /oU/
UV (u and v):      /u/, /v/

Table 4-4  Consonant question classes

Stop:                   /b/, /d/, /g/, /p/, /t/, /k/
Aspirated Stop:         /b/, /d/, /g/
Unaspirated Stop:       /p/, /t/, /k/
Affricate:              /z/, /zh/, /j/, /c/, /ch/, /q/
Aspirated Affricate:    /z/, /zh/, /j/
Unaspirated Affricate:  /c/, /ch/, /q/
Fricative:              /f/, /s/, /sh/, /x/, /h/, /r/
Fricative2:             /f/, /s/, /sh/, /x/, /h/, /r/, /k/
Voiceless Fricative:    /f/, /s/, /sh/, /x/, /h/
Voiced Fricative:       /r/, /k/
Nasal:                  /m/, /n/, /ng/
Nasal2:                 /m/, /n/, /l/
Nasal3:                 /m/, /n/, /l/, /ng/
Labial:                 /b/, /p/, /m/
Labial2:                /b/, /p/, /m/, /f/
Apical:                 /z/, /c/, /s/, /d/, /t/, /n/, /l/, /zh/, /ch/, /sh/, /r/
Apical Front:           /z/, /c/, /s/
Apical1:                /d/, /t/, /n/, /l/
Apical2:                /d/, /t/
Apical3:                /n/, /l/
Apical End1:            /zh/, /ch/, /sh/, /r/
Apical End2:            /zh/, /ch/, /sh/
Tongue Top:             /j/, /q/, /x/
Tongue Root:            /g/, /k/, /h/, /ng/
Tongue Root2:           /g/, /k/, /h/

For each class over the 37 phones, three questions are generated, asking about the left context, the right context and the central phone. For the Stop class, for example:

Left Question:
QS "L_Stop"   {b-*, d-*, g-*, p-*, t-*, k-*}

Right Question:
QS "R_Stop"   {*+b, *+d, *+g, *+p, *+t, *+k}

Central Question:
QS "C_Stop"   {*-b+*, *-d+*, *-g+*, *-p+*, *-t+*, *-k+*, *-b, *-d, *-g,
               *-p, *-t, *-k, b+*, d+*, g+*, p+*, t+*, k+*, b, d, g, p, t, k}
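The QS lines above pair a question name with a set of shell-style context patterns; whether a question "fires" for a triphone is simply whether its label matches any pattern. A sketch of parsing and applying such questions, with helper names of our own (this is not HTK's internal API):

```python
import re
import fnmatch

def parse_qs(line):
    """Parse an HTK-style question line, e.g.
        QS "L_Stop" {b-*, d-*, g-*, p-*, t-*, k-*}
    into (name, [patterns])."""
    m = re.match(r'QS\s+"?([^"\s]+)"?\s*\{(.*)\}', line)
    name, body = m.group(1), m.group(2)
    return name, [p.strip() for p in body.split(",")]

def question_matches(patterns, label):
    """True if the triphone label (l-c+r form) matches any pattern."""
    return any(fnmatch.fnmatchcase(label, p) for p in patterns)
```

For instance, "L_Stop" matches the triphone b-aI+f (its left context b is a stop) but not f-aI+m.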

4.2.2 The node splitting criterion

Let L(S) = log P(X|S) be the log likelihood of the N feature frames X = {X_1, X_2, …, X_N} assigned to node S [Gao 1998]. A question splits S into two children, partitioning X into X^1 = {X^1_1, …, X^1_{N_1}} and X^2 = {X^2_1, …, X^2_{N_2}} with X = X^1 ∪ X^2 and X^1 ∩ X^2 = ∅. With child log likelihoods L^1_child and L^2_child, the gain of the split is

ΔL = L^1_child + L^2_child - L_parent

Evaluating L(S) exactly is expensive, so it is approximated [Reichl 2000] by modeling each node with a single Gaussian:

Q(S) = Σ_{s∈S} Σ_{x_t} γ_s(x_t) log N( x_t | μ(S), Σ(S) )    (4-1)

where γ_s(x_t) is the occupancy probability of frame x_t in state s and N(·|μ,Σ) is a Gaussian with the node's mean and covariance. The gain

ΔQ(S) = Q(S_1) + Q(S_2) - Q(S) ≈ ΔL(S)    (4-2)

is evaluated for every question at each node, and the question with the largest ΔQ(S) is selected.
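The node score (4-1) and the split gain (4-2) can be sketched as follows, simplified by taking the occupancies γ_s(x_t) = 1 and diagonal covariances; a toy sketch, not the thesis implementation:

```python
import numpy as np

def node_loglik(X):
    """Q(S) under a single diagonal Gaussian fitted to the node's frames X.

    For an ML-fitted Gaussian the quadratic term sums to n*d, which gives
    the closed form below (occupancies taken as 1 for simplicity)."""
    n, d = X.shape
    var = X.var(axis=0) + 1e-8            # variance floor for degenerate nodes
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(X, mask):
    """Delta-Q of (4-2) for splitting the node data X by a boolean mask."""
    return node_loglik(X[mask]) + node_loglik(X[~mask]) - node_loglik(X)
```

A question that separates acoustically distinct frames yields a much larger gain than an arbitrary partition, which is exactly what the greedy tree-growing step exploits.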

4.2.3 State tying with decision trees

Each of the 37 phones is expanded into context dependent triphones written l-c+r, where c is the central phone and l and r are its left and right neighbors; 8,757 distinct triphones occur in the training data. The states of the triphones sharing the same central phone are then clustered (tied) with phonetic decision trees.
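The l-c+r expansion itself is mechanical; a short sketch, using the HTK-style convention that boundary phones fall back to biphone or monophone labels (the central questions above include the *-c, c+* and bare-c patterns precisely for these boundary forms):

```python
def to_triphones(phones):
    """Expand a phone sequence into l-c+r triphone labels.

    Phones at the sequence boundary, which lack a left or right
    neighbor, keep the shorter biphone/monophone label forms."""
    out = []
    for i, c in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        name = c
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        out.append(name)
    return out
```

For example, the phone string /b/ /aI/ /f/ expands to b+aI, b-aI+f, aI-f.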

Figure 4-1 illustrates the tying procedure: the corresponding states of all triphones of a phone are pooled at the root of a tree and recursively split by phonetic questions until a stopping threshold is reached; the states arriving at the same leaf are tied.

(Figure 4-1  Decision-tree-based state tying.)

Figures 4-2 and 4-3 show, for state 2 of the phone /aI/, parts of the trees grown under the two structures studied here.

(Figure 4-2  Part of the type-I decision tree for b-aI+f.s[2], f-aI+m.s[2], j-aI+z.s[2], s-aI+t.s[2]: from the root the questions R_Affricate?, R_Nasal?, L_Stop? and L_Affricate? are asked.)

(Figure 4-3  Part of the type-II decision tree for *.s[2], which also uses central questions such as C_Affricate?.)

Under structure I, a separate tree is grown for each of the 3 states of each of the 37 phones, 111 trees in all, using the left and right questions of Section 4.2.1. Under structure II, one tree per phone covers all 3 states, and the central questions of Section 4.2.1 are used as well.
4.3 Experiments

The experiments use the 863 continuous Mandarin corpus, with 80 speakers for training and a test set of 520 utterances. Speech is sampled at 16 kHz and parameterized every 12 ms as 13 MFCCs plus normalized energy, with first- and second-order derivatives appended (42-dimensional feature vectors). Models are trained and evaluated with HTK v2.2 [Yong 1999].

4.3.1 The CI-Phone baseline

Context independent (CI) phone models are trained first. Table 4-5 lists their syllable accuracy as the number of Gaussian mixtures per state grows.

Table 4-5  Syllable accuracy (%) of the CI-Phone model

Mixtures      1      2      4      8
Accuracy (%)  31.30  37.69  43.42  47.37

Even with 8 mixtures the CI-Phone model reaches only about 47%: context independent phones are too coarse for Chinese, which motivates context dependent modeling.

4.3.2 Comparing the two tree structures

The two tree structures of Section 4.2.3 are compared. Figure 4-4 plots the syllable accuracy of three configurations: structure I (threshold 350, 6,587 tied states), structure II (threshold 350, 6,285 tied states) and structure II (threshold 100, 10,528 tied states).

(Figure 4-4  Syllable accuracy (%) of tree structures I and II versus the number of mixtures.)

At comparable model sizes, structure I performs best, even though it grows 3 trees per phone against structure II's single tree; structure I with threshold 350 is therefore used in the remaining experiments.

4.3.3 CD-Phone results

The CD-Phone model (structure I, decision-tree state tying) is compared with the baseline syllable model as the number of mixtures grows from 1 to 8; Table 4-6 lists the results. The CD-Phone model is consistently better, and with 8 mixtures it reduces the syllable error rate by about 10%.

Table 4-6  Syllable accuracy (%) of the baseline syllable model and the CD-Phone model

Mixtures         1      2      4      6      8
Syllable model   60.51  65.04  69.92  72.37  73.83
CD-Phone         68.92  72.35  74.59  75.62  76.47
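The relative error rate reductions quoted in this thesis follow directly from such accuracy pairs; a quick arithmetic check, using the 8-mixture figures of Table 4-6:

```python
def relative_ser_reduction(baseline_acc, new_acc):
    """Relative syllable error rate (SER) reduction between two
    accuracies given in percent."""
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (base_err - new_err) / base_err

# 8-mixture figures from Table 4-6: 73.83% (syllable baseline) vs.
# 76.47% (CD-Phone) give roughly a 10% relative SER reduction.
```

The same formula applied to the CD-XIF result of Chapter 5 (80.43% against the same baseline) yields the quoted reduction of over 25%.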

4.4 Summary

This chapter implemented decision-tree-based state tying for context dependent phone modeling of Chinese: the phone set, the question set, the splitting criterion and two tree structures were designed, and structure I was adopted. Compared with the baseline syllable model, the resulting CD-Phone model reduces the syllable error rate by about 10%.

5.1 The Initial/Final Set

A Chinese syllable consists of an optional Initial followed by a Final; the standard set has 21 Initials and 38 Finals [Li 2000], listed in Table 5-1.

Table 5-1  The standard Initial/Final (IF) set

Initials (21):  b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, z, c, s, r
Finals (38):    a, ai, an, ang, ao, e, ei, en, eng, er, o, ong, ou, i, i1, i2,
                ia, ian, iang, iao, ie, in, ing, iong, iou, u, ua, uai, uan,
                uang, uei, uen, ueng, uo, v, van, ve, vn

In a zero-Initial syllable the Final stands alone: the syllable "ang", for example, consists only of the final ang, in contrast to "zhang" = zh + ang.

To restore an Initial position for zero-Initial syllables, six zero Initials _a, _o, _e, _i, _u and _v are introduced, so that "ang", for example, becomes _a + ang. Adding them to the 21 standard Initials yields the Extended Initial/Final (XIF) set of Table 5-2.

Table 5-2  The Extended Initial/Final (XIF) set

Initials (27):  b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, z, c, s, r,
                _a, _o, _e, _i, _u, _v
Finals (38):    a, ai, an, ang, ao, e, ei, en, eng, er, o, ong, ou, i, i1, i2,
                ia, ian, iang, iao, ie, in, ing, iong, iou, u, ua, uai, uan,
                uang, uei, uen, ueng, uo, v, van, ve, vn

With XIFs every syllable has both an Initial and a Final, so the coupling between them can be modeled uniformly, whereas with IFs the zero-Initial syllables break this structure. In the training data, the number of distinct context dependent units drops from 122,118 with IFs to 29,047 with XIFs.

5.2 The Question Set for IF Modeling

The questions are built in the same way as for phones. For the Affricate class, for example:

QS "R_Affricate"   {*+z, *+zh, *+j, *+c, *+ch, *+q}

QS "L_Affricate"   {z-*, zh-*, j-*, c-*, ch-*, q-*}

where "+" and "-" mark the right and left context. Questions are also derived from sub-classes; for the Aspirated Affricate class of Table 4-4:

QS "R_AspiratedAffricate"   {*+z, *+zh, *+j}

QS "L_AspiratedAffricate"   {z-*, zh-*, j-*}

For the Finals, questions group those sharing the same head vowel, for example:

QS "R_Type_A"   {*+a, *+ai, *+an, *+ang, *+ao}

QS "L_Type_A"   {a-*, ia-*, ua-*}

In all, 61 classes are defined: 32 for the Initials and 29 for the Finals. In addition, a singleton question is created for every unit, for example:

QS "R_p"   {*+p}

QS "L_p"   {p-*}

For the XIF set, the zero Initials are added to the corresponding classes; the L_Type_A question, for instance, becomes:

QS "L_Type_A"   { _a-*, a-*, ia-*, ua-*}

5.3 Handling Silence and Short Pauses

To make the models robust to silence and pauses, the silence modeling is refined as follows:
1. Extra transitions with probability 0.2 are added to the 3-emitting-state silence model sil between states 2 and 4 in both directions, so that sil can absorb impulsive noise without leaving the model.
2. A short pause model (Short Pause, sp) with a single emitting state is created, and that state is tied to the center state (state 3) of sil.
3. sp is made a "tee" model: a direct entry-to-exit transition with probability 0.3 allows it to be skipped entirely, so it can model inter-syllable pauses of any length, including zero.

(Figure 5-1  The sil and sp models and the tying between them.)

5.4 Experiments

The experimental conditions are the same as in Section 4.3.

5.4.1 IF versus XIF

Context independent IF and XIF models are first compared with the CI-Phone baseline; Table 5-3 lists the syllable accuracies.

Table 5-3  Syllable accuracy (%) of the CI-Phone, CI-IF and CI-XIF models

Mixtures    1      2      4
CI-Phone    31.30  37.69  43.42
CI-IF       43.71  50.13  54.25
CI-XIF      47.09  55.26  58.67

Both IF sets clearly outperform the phone units, and with 4 mixtures the XIF set beats the standard IF set by more than 4 percentage points; the XIF set is therefore used in the following experiments.

5.4.2 The effect of the sp model

Table 5-4 shows the effect of adding the short pause (sp) model to the CD-XIF system.

Table 5-4  Syllable accuracy (%) with and without the sp model

Mixtures     1      2      4
Without sp   74.85  76.83  78.77
With sp      75.24  77.28  79.30

The sp model yields a consistent gain of about 0.5 percentage point.

5.4.3 The splitting threshold

Table 5-5 shows, for single-mixture CD-XIF models, how the node splitting threshold trades the number of tied states against accuracy.

Table 5-5  Effect of the splitting threshold

Threshold      100     200     350
Tied states    21,222  14,630  9,311
Accuracy (%)   76.23   76.19   75.24

The untied system has 42,255 states; tying reduces this by a factor of 2 to 4.5 with almost no loss in accuracy.

5.4.4 Overall comparison

Figure 5-2 compares the baseline syllable model (CI-Syllable) with the CD-Phone, CD-IF and CD-XIF models (tree structure I, threshold 350) as the number of mixtures grows from 1 to 8.

(Figure 5-2  Syllable accuracy (%) of the CI-Syllable, CD-Phone, CD-IF and CD-XIF models versus the number of mixtures.)

At 8 mixtures the ordering is CI-Syllable < CD-Phone < CD-IF < CD-XIF, showing the benefit of context dependent modeling in general and of the XIF units in particular. The CD-XIF model outperforms the CD-IF model by about 4 percentage points and reaches a syllable accuracy of 80.43%, a syllable error rate reduction of more than 25% relative to the CI-Syllable baseline.

5.5 Summary

This chapter implemented context dependent Initial/Final (CD-IF) modeling with decision-tree-based state tying. Six zero Initials were added to the 21 standard Initials to form the Extended Initial/Final (XIF) set, which gives every syllable an Initial-Final structure. The XIF models outperform the IF models, and the final CD-XIF model, about 4 points better than CD-IF, achieves 80.43% syllable accuracy, an error rate reduction of over 25% compared with the baseline syllable model.

6.1 Conclusions

This thesis studied acoustic modeling and parameter tying for Chinese speech recognition. The main results are:
(1) The Semi-Continuous Segmental Probability Model (SCSPM) was proposed and implemented. Built on the HMM and the MGCPM, it combines VQ with continuous densities through tied mixtures, and a new, effective weight reduction method that prunes small tied weights during iterative training was proposed. The SCSPM substantially reduces model size and computation while keeping the recognition accuracy of the MGCPM.
(2) The HTK platform was studied and analyzed, and an HTK-based procedure for acoustic model training and performance evaluation was implemented.
(3) Decision-tree-based state tying for context dependent modeling was studied in depth: question set design, node splitting strategies and the treatment of the silence model.
(4) A CD-Phone model with decision-tree-based state tying was implemented; it reduces the syllable error rate by about 10% relative to the baseline syllable model.
(5) A CD Initial/Final (CD-IF) model was implemented; the Extended Initial/Final (XIF) set, formed by adding the zero Initials, preserves the Initial-Final structure of every syllable. The resulting CD-XIF model achieves a syllable correct rate of over 80% and reduces the syllable error rate by over 25%.

6.2 Future Work

Appendix: Acoustic Model Training with HTK

1. The HTK toolkit
HTK (the Hidden Markov Model, HMM, Toolkit) is a toolkit for building and manipulating HMM-based recognition systems. Its main tools are:

HBuild: builds word networks and lattices for recognition;
HCopy: copies speech files and converts (parameterizes) them into feature files;
HDMan: edits and merges pronunciation dictionaries;
HLEd: edits label (transcription) files;
HList: lists the contents of HTK speech and feature files;
HLStats: gathers statistics from HTK label files;
HParse: compiles an extended Backus-Naur form (EBNF) grammar into an HTK word network;
HSGen: generates sentences from an HTK word network;
HSLab: interactive labeling of speech waveforms;
HCompV: initializes a model from the global mean and variance of the training data;
HERest: embedded Baum-Welch re-estimation of a whole HMM set (Embedded Training);
HEAdapt: adapts HMMs with MLLR/MAP;
HHEd: edits HMM definitions (cloning, parameter tying, decision trees);
HInit: bootstrap initialization of a single HMM;
HQuant: builds HTK VQ codebooks;
HRest: Baum-Welch re-estimation of a single HMM;
HSmooth: smooths HMM parameters;
HResults: scores HTK recognition output;
HVite: Viterbi recognition.

2. The basic training procedure
Training starts from a prototype model definition (Proto). HCompV initializes it with the global mean and variance; HInit then bootstraps each model from the labeled data (described by a master label file, mlf); HRest re-estimates each model in isolation with Baum-Welch; and HERest finally performs several passes (typically 5) of embedded Baum-Welch re-estimation over whole utterances, which is the main training step.

(Appendix Figure 1  The basic training flow: Proto, HCompV, HInit, HRest, HERest (5 passes).)

3. Training context dependent models
For context dependent models, HLEd expands the monophone transcriptions (mlf) into context dependent transcriptions; HHEd clones the monophone models into context dependent models and is then used again to perform decision-tree state tying; after each HHEd step, HERest re-estimates the model set (about 5 iterations each time).

(Appendix Figure 2  The context dependent training flow: HLEd, HHEd (cloning), HERest (5 passes), HHEd (tying), HERest (5 passes).)

[1] Bahl, L.R., Brown, P.F., de Souza, P.V., et al., (Bahl 1986) "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," ICASSP, pp.49-52, 1986
[2] Baum, L.E., (Baum 1972) "An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes," Inequalities, 3, 1972
[3] Bellegarda, J.R., Nahamoo, D., (Bellegarda 1990) "Tied Mixture Continuous Parameter Modeling for Speech Recognition," IEEE Trans. on ASSP, vol.ASSP-38, No.12, pp.2033-2045, 1990
[4] Breiman, L., Friedman, J., Olshen, R.A., et al., (Breiman 1984) Classification and Regression Trees, Belmont, CA: Wadsworth, 1984
[5] Dempster, A.P., Laird, N.M., Rubin, D.B., (Dempster 1977) "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc., vol.39, pp.1-83, 1977
[6] Duchateau, J., Demuynck, K., Compernolle, D., (Duchateau 1998) "Fast and Accurate Acoustic Modeling with Semi-Continuous HMMs," Speech Communication, vol.24, pp.5-17, 1998
[7] Fischer, V., Ross, T., (Fischer 1999) "Reduced Gaussian Mixture Models in a Large Vocabulary Continuous Speech Recognizer," EuroSpeech'99, Vol.3, pp.1099-1102, 1999
[8] Franzini, M.A., Lee, K.F., (Franzini 1990) "A New Hybrid Method for Continuous Speech Recognition," ICASSP-90, 1990
[9] Gales, M., Knill, K., Young, S., (Gales 1999) "State-Based Gaussian Selection in Large Vocabulary Continuous Speech Recognition Using HMMs," IEEE Trans. on Speech and Audio Processing, vol.7, No.2, pp.152-161, 1999
[10] Gao, S., Xu, B., Huang, T.Y., (Gao 1998) "Class-triphone Acoustic Modeling Based on Decision Tree for Mandarin Continuous Speech Recognition," International Symposium on Chinese Spoken Language Processing (ISCSLP'98), ASR-A1, Singapore, 1998
[11] Huang, X.D., Jack, M.A., (Huang 1989) "Semi-continuous hidden Markov models for speech signals," Computer Speech and Language, vol.3, pp.239-251, 1989
[12] Jelinek, F., Mercer, R.L., Bahl, L.R., (Jelinek 1983) "A maximum likelihood approach to continuous speech recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, pp.179-190, 1983
[13] (Jiang 1989) Thesis, 1989. (In Chinese.)
[14] Juang, B.H., Rabiner, L.R., (Juang 1985) "A probabilistic distance measure for hidden Markov models," AT&T Technical Journal, 64(2), pp.391-408, 1985
[15] Katz, S., (Katz 1987) "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp.400-401, 1987
[16] Lee, C.H., Rabiner, L.R., (Lee 1989) "A frame synchronous network search algorithm for connected word recognition," Proc. of IEEE, 77(2), pp.257-285, 1989
[17] Li, J., Zheng, F., Wu, W.H., (Li 2000) "Context-Independent Chinese Initial-Final Acoustic Modeling," International Symposium on Chinese Spoken Language Processing (ISCSLP 2000), pp.23-26, Beijing, 2000
[18] Ma, B., Huo, Q., (Ma 2000) "Benchmark results of triphone-based acoustic modeling on HKU96 and HKU99 Putonghua corpora," International Symposium on Chinese Spoken Language Processing (ISCSLP 2000), pp.359-362, Beijing, 2000
[19] Morgan, N., Bourlard, H., (Morgan 1990) "Continuous Speech Recognition Using Multilayer Perceptrons with Hidden Markov Models," ICASSP, 1990
[20] (Mou 1998) Thesis, 1998. (In Chinese.)
[21] Nadas, A., (Nadas 1985) "On Turing's formula for word probabilities," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp.1414-1416, 1985
[22] Reichl, W., Chou, W., (Reichl 2000) "Robust Decision Tree State Tying for Continuous Speech Recognition," IEEE Trans. Speech and Audio Proc., Vol.8, No.5, pp.555-566, 2000
[23] Rabiner, L.R., (Rabiner 1989) "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol.77, No.2, pp.257-286, 1989
[24] Sadaoki, F., (Sadaoki 1985) Digital Speech Processing, Synthesis, and Recognition, Tokai University Press, 1985
[25] Verhasselt, J., Martens, J.P., (Verhasselt 1998) "Context modeling in hybrid segment-based/neural network recognition systems," Proceedings of ICASSP, Vol.1, pp.501-504, 1998
[26] Vintsjuk, T.K., (Vintsjuk 1968) "Recognition of Words of Oral Speech by Dynamic Programming," Kibernetica, Vol.81, No.8, 1968
[27] Viterbi, A.J., (Viterbi 1967) "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. on Information Theory, 13(2), 1967
[28] Waibel, A., et al., (Waibel 1989) "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Trans. on ASSP, Vol.37, No.12, pp.1888-1898, 1989
[29] Wilpon, J.G., Lee, C.H., Rabiner, L.R., (Wilpon 1989) "Application of hidden Markov models for recognition of a limited set of words in unconstrained speech," ICASSP-89, Vol.3, pp.254-257, 1989
[30] Wilpon, J.G., Miller, L.G., Modi, P., (Wilpon 1991) "Improvements and Applications for Key Word Recognition Using Hidden Markov Modeling Techniques," ICASSP-91, pp.905-908, 1991
[31] (Wu 1989) 1989. (In Chinese.)
[32] Xu, M.X., Zheng, F., Wu, W.H., (Xu 1999) "A Fast and Effective State Decoding Algorithm," EuroSpeech'99, Vol.3, pp.1255-1258, Hungary, 1999
[33] (Yang 1995) 1995. (In Chinese.)
[34] Young, S., Kershaw, D., Odell, J., Ollason, D., et al., (Yong 1999) The HTK Book (for HTK Version 2.2), Cambridge University, 1999
[35] (Zheng 1996) NCMMSC-96, pp.32-35, 1996. (In Chinese.)
[36] (Zheng 1997) 3rd National Conference (NCCIIIA), pp.98-103, 1997. (In Chinese.)
[37] Zheng, F., Mou, X.L., Wu, W.H., et al., (Zheng 1998) "On the embedded multiple-model scoring scheme for speech recognition," International Symposium on Chinese Spoken Language Processing (ISCSLP'98), ASR-A3, pp.49-53, Singapore, 1998
[38] Zheng, F., Song, Z.J., Xu, M.X., (Zheng 1999) "EASYTALK: A large-vocabulary speaker-independent Chinese dictation machine," EuroSpeech'99, Vol.2, pp.819-822, Hungary, 1999

Resume

The author was born in January 1977, entered university in 1994, and graduated in June 1999.

Publications

1. Jiyong ZHANG, Fang ZHENG, Mingxing XU, Ditang FANG, "Semi-Continuous Segmental Probability Modeling for Continuous Speech Recognition," International Conference on Spoken Language Processing (ICSLP 2000), Vol.1, pp.278-281, Beijing, 2000
2. Jiyong ZHANG, Fang ZHENG, Mingxing XU, Shuqing LI, "Intra-Syllable Dependent Phonetic Modeling for Chinese Speech Recognition," International Symposium on Chinese Spoken Language Processing (ISCSLP 2000), pp.73-76, Beijing, 2000
3. A journal paper (in Chinese), Vol.10, No.11, pp.1212-1215, November 1999
4. Jiyong ZHANG, Fang ZHENG, Jing LI, Chunhua LUO, and Guoliang ZHANG, "Improved Context-Dependent Acoustic Modeling for Continuous Chinese Speech Recognition," EuroSpeech, 2001