
Study on Acoustic Modeling and Parameter Tying Strategy for Chinese Speech Recognition

Abstract

Acoustic modeling is one of the key problems in the field of speech recognition. This thesis studies acoustic modeling techniques and parameter tying strategies, focusing on two aspects: the proposal and implementation of the Semi-Continuous Segmental Probability Model (SCSPM), and the study of Context Dependent (CD) acoustic modeling with decision-tree-based state tying, implemented for both CD phone models and Initial/Final (IF) models. The main contributions are:
(1) The SCSPM is proposed and implemented. Building on the traditional Hidden Markov Model (HMM) and its modification, the Mixed Gaussian Continuous Probability Model (MGCPM), it integrates the Vector Quantization (VQ) technique with continuous probability density distributions and adopts tied mixtures to describe the probability distribution of each state. Mixture weight reduction is also studied and analyzed, and a new, effective method is proposed that prunes the small tied weights during the iterative training process. Compared with the MGCPM, the SCSPM significantly reduces model size and computational complexity with little degradation in recognition accuracy.
(2) The HMM Tool Kit (HTK) platform is studied and analyzed, and an effective HTK-based procedure for acoustic model training and performance evaluation is implemented.
(3) The decision tree (DT) based state tying strategy for Context Dependent (CD) acoustic modeling is studied in depth. Two different DT design methods are analyzed, and the design of the question set and the DT node splitting strategy are discussed. Furthermore, a method for handling the silence model is studied to improve robustness.
(4) A CD phone model with decision-tree-based state tying is implemented, including the phone question set and different DT structures. Compared with the baseline syllable model, the CD-Phone model reduces the syllable error rate (SER) by about 10%.
(5) A CD Initial/Final (IF) model with decision-tree-based state tying is implemented. To preserve the coupling between Initials and Finals, the Extended Initial/Final (XIF) set is proposed by adding the zero Initials to the standard IF set. Experiments show that the XIF model outperforms the IF model: the implemented CD-XIF model achieves a syllable correct rate of over 80% and reduces the SER by over 25% relative to the baseline syllable model.

Keywords: Semi-Continuous Segmental Probability Model, Parameter Tying Strategy, Decision Tree Based State Tying, Context Dependent Phone Model, Context Dependent Initial/Final Model

1.1 Speech Recognition: An Overview
1.1.1 A brief history

Research on automatic speech recognition began in the 1950s. In 1952, Davis and his colleagues built a device that recognized the ten English digits spoken by a single speaker. Around 1960, Denes introduced linguistic knowledge into the recognition process. In the 1970s, linear predictive coding (LPC) analysis, dynamic time warping (DTW) [Vintsjuk 1968] and vector quantization (VQ) made practical recognizers possible. Statistical modeling based on the Hidden Markov Model (HMM) subsequently became the dominant paradigm: CMU's Sphinx system demonstrated speaker-independent large-vocabulary continuous recognition in the late 1980s, and in the 1990s commercial dictation products such as IBM ViaVoice appeared.

1.1.2 Classification of recognition tasks

Recognition systems are usually classified along three dimensions: by speaker, into Speaker Dependent (SD) and Speaker Independent (SI) systems; by vocabulary size, into Small Vocabulary, Medium Vocabulary and Large Vocabulary systems; and by speaking style, into Isolated Word, Connected Word and Continuous Speech recognition.

1.1.3 Key techniques

A recognizer involves four key techniques.
(1) Feature extraction. Widely used features include the linear prediction coefficients (LPC), the LPC-derived cepstrum (LPCC) and the Mel-frequency cepstral coefficients (MFCC). Wilpon et al. applied such features to the recognition of a limited word set in unconstrained speech [Wilpon 1989] and to keyword recognition [Wilpon 1991].
(2) Acoustic modeling. The HMM, trained with the forward-backward algorithm and decoded with the Viterbi algorithm [Viterbi 1967], is the mainstream acoustic model. Artificial Neural Networks (ANNs) have also been studied, mostly in hybrid ANN/HMM systems [Franzini 1990][Morgan 1990], because a pure ANN handles the temporal structure of speech less naturally than an HMM.
(3) Language modeling. Language models are either statistical (Statistical Language Model) or knowledge based (Knowledge-based Language Model). The statistical N-gram model introduced by Jelinek et al. [Jelinek 1983] is the most successful; smoothing methods for sparse data were proposed by Nadas [Nadas 1985] and Katz [Katz 1987].
(4) Search. Decoding is usually organized as a frame-synchronous Viterbi search over the HMM network [Lee 1989].

1.2 The Statistical Framework of Speech Recognition
1.2.1 The fundamental equation

Figure 1-1 shows the structure of a typical recognizer. Given the acoustic observation sequence A, the recognizer searches the lexicon L for the word sequence w* with the largest posterior probability:

w* = argmax_w P(w|A) = argmax_w P(A|w) P(w) / P(A)    (1.1)

By the Bayes rule, the posterior factors into the acoustic model score P(A|w), the language model score P(w), and the observation probability P(A). Since P(A) does not depend on w, it can be dropped from the maximization:

w* = argmax_w P(A|w) P(w)    (1.2)

1.2.2 Approaches to acoustic modeling

The two main families of acoustic models are the HMM and the ANN. The HMM describes both the temporal structure and the observation distribution of speech within one probabilistic framework. The ANN is a powerful static classifier but models temporal dynamics poorly, so it is typically combined with segment models or HMMs [Verhasselt 1998], and training such hybrid systems is also more expensive. For these reasons the HMM framework is adopted in this thesis.

1.3 Modeling Units for Chinese

The choice of the base modeling unit strongly affects a Chinese recognizer [Zheng 1996]. The main candidates are:
(1) The syllable (Syllable). Chinese is a monosyllabic language with only 418 toneless base syllables [Zheng 1996], so whole-syllable models are natural and avoid modeling the transitions inside a syllable; their drawback is that no parameters are shared between syllables, so large training sets are needed.
(2) The phoneme (Phoneme) and the triphone (Triphone). Phoneme units are few in number and well shared, and context dependent triphones capture coarticulation, but the number of possible triphones is very large and many of them receive little training data [Yang 1995].
(3) The Initial/Final (IF). A Chinese syllable consists of an Initial followed by a Final, so IF units match the phonological structure of the language; context dependent IF (Context Dependent Initial/Final, CD-IF) modeling was studied in [Li 2000].

Because coarticulation (Coarticulation) makes the realization of a unit depend on its neighbors, Context Dependent (CD) modeling is adopted throughout this thesis.

1.4 Parameter Tying

Context dependent modeling multiplies the number of parameters while reducing the amount of data per unit, so parameter tying is essential. At the distribution level, the Semi-Continuous HMM (SCHMM), also known as the Tied Mixture HMM (TMHMM), lets all states share one set of Gaussian probability density functions (Probability Density Function, pdf), in contrast to the Continuous HMM (CHMM), in which each state owns its mixture densities; the SCHMM was used, for example, in the EasyTalk dictation system [Zheng 1999]. The MGCPM [Zheng 1998] is a CHMM-style segmental model with high accuracy but a large parameter count. At the state level, decision-tree-based state tying [Reichl 2000], built on classification and regression trees [Breiman 1984], clusters the states of context dependent models so that rare contexts share parameters with similar, better-trained ones.

1.5 Organization of the Thesis

The main work reported here is: (1) the proposal and implementation of the Semi-Continuous Segmental Probability Model (SCSPM); (2) the construction of an HTK-based platform for acoustic model training and evaluation; (3) the study of decision-tree-based state tying and the implementation of context dependent phone and Initial/Final models.

2.1 The Hidden Markov Model

The HMM is the foundation of modern acoustic modeling. This section reviews its definition and its training and decoding algorithms, following the tutorial of [Rabiner 1989].

2.1.1 From the Markov chain to the HMM

A Markov chain is a stochastic process {S(t): t ∈ T} whose next state depends only on the current state s_t = S(t). For a discrete-time chain over a countable state set I,

P{S(t+1) = s | S(0) = s_0, S(1) = s_1, …, S(t) = s_t} = P{S(t+1) = s | S(t) = s_t}    (2-1)

The transition probability from state i to state j at time t is

a_ij(t) = P{S(t+1) = j | S(t) = i}    (2-2)

and satisfies

a_ij(t) ≥ 0,  ∀ i, j ∈ I    (2-3)

Σ_{j∈I} a_ij(t) = 1,  ∀ i ∈ I    (2-4)

In a hidden Markov model (HMM) the state sequence itself is not observed: at each time t the process, being in state s_t, emits an observation drawn from a state-dependent output distribution, and only the observations are available. The HMM is thus a doubly stochastic process, specified by:
(1) the state set N = {1, 2, 3, …, N}, with s_t = s(t) ∈ N the state at time t;
(2) the transition matrix A(t) = [a_ij(t)]_{N×N}, with a_ij(t) = P{s(t+1) = j | s(t) = i};
(3) the output distributions B = {b_j(·)}_{N×1}, with b_j(x) = Pd{output = x | state = j};
(4) the initial distribution π = {π_i}_{N×1}, with π_i = P{s(1) = i}.
For a homogeneous HMM the transition probabilities are independent of t, so A = [a_ij]_{N×N} with a_ij = P{s(t+1) = j | s(t) = i}, and an N-state HMM is written compactly as λ = {π, A, B}.

2.1.2 HMM training and decoding

Training estimates the parameters λ from observed data. Given M training sequences {O^(m), m = 1, …, M}, the maximum likelihood criterion chooses the λ that maximizes their joint probability. For a single sequence O = o_1 o_2 … o_T, this is done with the Forward-Backward (Baum-Welch) algorithm [Baum 1972]. The forward variable is defined, for 1 ≤ j ≤ N, by

α_t(j) = P{o_1, o_2, …, o_t, s_t = j | λ} = [ Σ_{i=1}^{N} α_{t-1}(i) a_ij ] b_j(o_t),  2 ≤ t ≤ T    (2-5)

α_1(j) = π_j b_j(o_1)    (2-6)

and the backward variable, for 1 ≤ i ≤ N, by

β_t(i) = P{o_{t+1}, o_{t+2}, …, o_T | s_t = i, λ} = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  1 ≤ t ≤ T-1    (2-7)

β_T(i) = 1    (2-8)

The re-estimation formulas are then [Baum 1972][Huang 1989]:

π̂_i = α_1(i) β_1(i) / Σ_{i=1}^{N} α_1(i) β_1(i)    (2-9)

â_ij = [ Σ_{t=1}^{T-1} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) ] / [ Σ_{t=1}^{T-1} α_t(i) β_t(i) ]    (2-10)

b̂_i(x) = f(λ, O)    (2-11)

where the output-distribution update (2-11) depends on the form of b_i(·) (see Section 2.1.3). The total probability of the observations is

P{O|λ} = Σ_{i=1}^{N} α_T(i)    (2-12)
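As a concrete illustration, the forward recursion (2-5)-(2-6) and the total probability (2-12) can be sketched for a discrete-output HMM as follows. This is a minimal NumPy sketch under the definitions above; the array layout is our own choice, not taken from the thesis:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: alpha[t, j] = P(o_1..o_t, s_t = j | lambda).

    pi: (N,) initial probabilities; A: (N, N) transition matrix;
    B: (N, K) discrete output probabilities; obs: list of symbol indices.
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # (2-6)
    for t in range(1, T):
        # sum_i alpha_{t-1}(i) a_ij, then multiply by b_j(o_t)   (2-5)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

# Total probability P{O | lambda} = sum_i alpha_T(i)   (2-12):
# forward(pi, A, B, obs)[-1].sum()
```

For short sequences the result can be checked against a brute-force sum over all state paths, which is exactly the quantity (2-12) computes efficiently.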

For decoding, the Viterbi algorithm [Viterbi 1967] finds the single best state sequence. Define

δ_1(j) = π_j b_j(o_1),  1 ≤ j ≤ N    (2-13)

and, for 2 ≤ t ≤ T and 1 ≤ j ≤ N,

δ_t(j) = max_{1≤i≤N} [ δ_{t-1}(i) a_ij ] b_j(o_t)    (2-14)

ψ_t(j) = argmax_{1≤i≤N} [ δ_{t-1}(i) a_ij ]    (2-15)

The best final state and the backtracked path are

s_T^(ML) = argmax_{1≤i≤N} δ_T(i)    (2-16)

s_t^(ML) = ψ_{t+1}( s_{t+1}^(ML) ),  1 ≤ t ≤ T-1    (2-17)

S^(ML) = {s_t^(ML), 1 ≤ t ≤ T} is the Maximum Likelihood (ML) state sequence, and its score is

Pd^(ML) = δ_T( s_T^(ML) )
        = π_{s_1^(ML)} b_{s_1^(ML)}(o_1) Π_{t=2}^{T} a_{s_{t-1}^(ML) s_t^(ML)} b_{s_t^(ML)}(o_t)
        = P{S^(ML) | λ} Pd{O | S^(ML), λ}    (2-18)

By contrast, the total output probability sums over all state sequences S = {s_t, 1 ≤ t ≤ T}:

Pd{O|λ} = Σ_S Pd{O, S|λ} = Σ_S Pd{O|λ, S} P{S|λ}
        = Σ_S π_{s_1} b_{s_1}(o_1) Π_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t}(o_t)    (2-19)

The Viterbi score (2-18) is therefore a computationally cheap approximation of (2-19). For speech, the HMM is constrained to be left-to-right, i.e. a_ij = 0 for j < i, and often further constrained to allow only self-loops and single-step forward transitions (a_ij = 0 for j < i or j > i + 1).
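The Viterbi recursion (2-13)-(2-17) can be sketched in the same style as the forward algorithm above; again a minimal NumPy sketch with an illustrative array layout, not the thesis implementation:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding (2-13)-(2-17): best state path and its score."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                       # (2-13)
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A              # delta_{t-1}(i) a_ij
        psi[t] = trans.argmax(axis=0)                  # (2-15)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]    # (2-14)
    path = [int(delta[-1].argmax())]                   # (2-16)
    for t in range(T - 1, 0, -1):                      # backtracking (2-17)
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```

The returned score is Pd^(ML) of (2-18), the probability of the single best path, as opposed to the full sum (2-19).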

2.1.3 Types of HMM

According to the form of the output distribution b_j(x), HMMs fall into three classes: the discrete HMM (Discrete HMM, DHMM), the continuous HMM (Continuous HMM, CHMM) and the semi-continuous HMM (Semi-Continuous HMM, SCHMM) [Huang 1989]. In every case the output probability factorizes along the state sequence:

Pd{O|λ} = Σ_S Pd{O, S|λ} = Σ_S P{S|λ} Pd{O|λ, S}    (2-20)

Pd{O|λ, S} = Π_{t=1}^{T} b_{s_t}(o_t)    (2-21)

so the three classes differ only in how the state output probability b_{s_t}(o_t) is represented and computed.

2.1.3.1 The discrete HMM

In the DHMM the observations are first vector quantized (Vector Quantization, VQ): each feature vector is replaced by the nearest codeword (codeword) of a pre-trained codebook (codebook), so b_{s_t}(o_t) becomes b_{s_t}(V(o_t)), where V(·) denotes the quantizer. The discrete output probabilities b_i(v) are re-estimated as

b̂_i(v) = [ Σ_{t: V(o_t)=v} γ_t(i) ] / [ Σ_{t=1}^{T} γ_t(i) ]    (2-22)

where γ_t(i) is the posterior probability of occupying state i at time t. The DHMM is very cheap to evaluate, since each b_i(v) is a table lookup, but the quantization introduces distortion that limits the achievable accuracy.
2.1.3.2 The continuous HMM

The CHMM avoids quantization distortion by modeling each state's output with a Mixed Gaussian Density (MGD) [Wilpon 1989][Huang 1989]:

b_n(x) = Σ_{m=1}^{M_n} g_nm N(x; μ_nm, Σ_nm)    (2-23)

Σ_{m=1}^{M_n} g_nm = 1    (2-24)

where M_n is the number of mixture components of state n. The CHMM is the most accurate of the three types, but also the most expensive in parameters and computation, since every MGD must be stored, evaluated and trained separately.

2.1.3.3 The semi-continuous HMM

In the CHMM every state n owns its parameters (μ_nm, Σ_nm, g_nm), so with M components per state, N·M Gaussians must be kept and evaluated. The SCHMM combines the codebook idea of VQ with continuous densities: all states share one codebook of L Gaussians {V_l, 1 ≤ l ≤ L}, and

b_{s_t}(o_t) = f{o_t|s_t} = Σ_{l=1}^{L} f{o_t|V_l, s_t} P{V_l|s_t} ≈ Σ_{l=1}^{L} f{o_t|V_l} b_{s_t}^(D){V_l}    (2-25)

where b_{s_t}^(D){V_l} is a discrete weight on codeword V_l and f{o_t|V_l} is the Gaussian density of codeword V_l. Writing g_nl for the weights,

b_n(o_t) = Σ_{l=1}^{L} g_nl f{o_t|V_l}    (2-26)

That is, each state's MGD becomes a Tied Mixture Gaussian Density (TMGD) [Bellegarda 1990]: the L Gaussians are shared by all states, and only the weight vectors G = {g_nl, 1 ≤ n ≤ N} are state specific. Per frame, only L densities need to be evaluated regardless of the number of states, so the SCHMM approaches the accuracy of the CHMM at a fraction of its cost.
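The computational advantage of (2-26) is that the shared densities are evaluated once per frame and reused by every state. A minimal sketch, assuming diagonal-covariance codebook Gaussians; the function names are illustrative, not from the thesis:

```python
import numpy as np

def codebook_densities(x, means, variances):
    """Evaluate the L shared diagonal Gaussians f{o_t | V_l} at one frame x."""
    diff2 = (x - means) ** 2 / variances
    log_f = -0.5 * (diff2.sum(axis=1)
                    + np.log(2 * np.pi * variances).sum(axis=1))
    return np.exp(log_f)

def tied_emission(x, weights, means, variances):
    """b_n(o_t) = sum_l g_nl f{o_t | V_l}  (2-26), for all N states at once.

    weights: (N, L) state weight vectors; means, variances: (L, d) codebook.
    """
    f = codebook_densities(x, means, variances)   # L densities, computed once
    return weights @ f                            # (N, L) @ (L,) -> (N,)
```

The matrix-vector product makes the per-frame cost O(L·d + N·L) instead of the CHMM's O(N·M·d) density evaluations.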

2.1.4 Discussion

The strength of the HMM lies in its rigorous probabilistic formulation and in the efficient algorithms available for training and decoding, which is why HMM-based recognizers dominate the field. Its known weaknesses are the assumption that observations are conditionally independent given the state, the implicitly exponential state duration distribution, and the maximum likelihood training criterion, which does not directly minimize the recognition error. Partial remedies include feature transforms such as LDA, the choice among DHMM, CHMM and SCHMM as a trade-off between accuracy and cost, and discriminative training criteria such as Maximum Mutual Information (MMI) [Bahl 1986].

2.2 The Mixed Gaussian Continuous Probability Model (MGCPM)

In the HMM, the assignment of frames to states is obtained implicitly through Viterbi alignment during both training and decoding. The Mixed Gaussian Continuous Probability Model (MGCPM) [Zheng 1998][Mou 1998] is a segmental model: each utterance is segmented into states directly, and each state is described, as in the CHMM, by a mixed Gaussian density, so the repeated Viterbi alignment is avoided.

Let Z_k = {z_1, z_2, …, z_T} be the T feature vectors of dimension d assigned to the k-th state (k = 1, …, K). The state density is an M-component Gaussian mixture

p(z_j|λ) = Σ_{i=1}^{M} p_i p(z_j|λ_i)    (2-27)

p(z_j|λ_i) = (2π)^{-d/2} |R_i|^{-1/2} exp{ -(1/2)(z_j - u_i)^T R_i^{-1} (z_j - u_i) }    (2-28)

where u_i and R_i are the mean vector and covariance matrix of component i, and the weights satisfy Σ_{i=1}^{M} p_i = 1, p_i ≥ 0. The parameters λ = {p_i, u_i, R_i; i = 1, 2, …, M} are estimated by maximizing the likelihood of Z with the EM algorithm of Dempster [Dempster 1977]. Assuming the vectors of Z are independent,

p(Z|λ) = Π_{j=1}^{T} p(z_j|λ)    (2-29)

and the maximum satisfies

∂/∂λ [ ln p(Z|λ) ] = 0    (2-30)

Differentiating with respect to the mean u_i (i = 1, 2, …, M),

∂/∂u_i [ ln p(Z|λ) ] = Σ_{j=1}^{T} p(z_j|λ)^{-1} p_i ∂p(z_j|λ_i)/∂u_i
                     = Σ_{j=1}^{T} p(z_j|λ)^{-1} p_i p(z_j|λ_i) R_i^{-1} (z_j - u_i) = 0    (2-31)

Defining the posterior weight of component i for frame z_j as

P_{i,j} = p(z_j|λ)^{-1} p_i p(z_j|λ_i)    (2-32)

equation (2-31) yields the mean update

û_i = [ Σ_{j=1}^{T} P_{i,j} z_j ] / [ Σ_{j=1}^{T} P_{i,j} ]    (2-33, 2-34)

and analogous derivations for R_i and p_i give

R̂_i = [ Σ_{j=1}^{T} P_{i,j} (z_j - û_i)(z_j - û_i)^T ] / [ Σ_{j=1}^{T} P_{i,j} ]    (2-35)

p̂_i = T^{-1} Σ_{j=1}^{T} P_{i,j}    (2-36)

Note that (2-36) automatically satisfies Σ p̂_i = 1, p̂_i ≥ 0. The new estimates (û_i, R̂_i, p̂_i) replace the old parameters, and the procedure is iterated until the likelihood (2-29) converges.
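One iteration of the updates (2-32)-(2-36) can be sketched as follows, simplified to diagonal covariances so that R_i is stored as a vector of variances; this is a toy sketch of the EM step, not the thesis implementation:

```python
import numpy as np

def em_step(Z, p, u, R_diag):
    """One EM iteration for a diagonal-covariance Gaussian mixture.

    Z: (T, d) frames; p: (M,) weights; u: (M, d) means; R_diag: (M, d) variances.
    Implements the posterior (2-32) and the re-estimates (2-33)-(2-36).
    """
    T, d = Z.shape
    # component densities p(z_j | lambda_i), shape (M, T)
    diff2 = ((Z[None, :, :] - u[:, None, :]) ** 2) / R_diag[:, None, :]
    log_g = -0.5 * (diff2.sum(-1) + np.log(2 * np.pi * R_diag).sum(-1)[:, None])
    mix = p[:, None] * np.exp(log_g)
    P = mix / mix.sum(axis=0, keepdims=True)          # P_{i,j}   (2-32)
    w = P.sum(axis=1)                                 # sum_j P_{i,j}
    u_new = (P @ Z) / w[:, None]                      # (2-33)/(2-34)
    R_new = (P @ (Z ** 2)) / w[:, None] - u_new ** 2  # diagonal of (2-35)
    p_new = w / T                                     # (2-36)
    return p_new, u_new, R_new
```

Each call is one EM iteration; repeating it until the log of (2-29) stops improving reproduces the training loop described above (EM guarantees the likelihood never decreases).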

2.3 Summary

For Chinese recognition with whole-syllable units, 418 toneless base models are needed; once the models are made context dependent, the number of states and Gaussians quickly grows into the tens of thousands, far beyond what the training data can support. Parameter sharing is therefore indispensable, both at the mixture level with tied mixtures (Tied Mixture HMM) and at the state level with state tying (State Tying) based on decision trees (Decision Tree); these two directions are developed in Chapters 3 and 4.

The previous chapter reviewed the HMM and the MGCPM. The MGCPM achieves high recognition accuracy, but each of its states owns a full mixed Gaussian density, so the model set is large and expensive to evaluate. Borrowing the Gaussian-sharing idea of the SCHMM, this chapter proposes a tied-mixture segmental model, the Semi-Continuous Segmental Probability Model (SCSPM).

3.1 The SCSPM
3.1.1 The Segmental Probability Model (SPM)

The Segmental Probability Model (SPM) [Juang 1985][Zheng 1997] simplifies the HMM by segmenting each utterance into states directly, for example with the Equal Feature Variance Sum (EFVS) or the Non-Linear Partition (NLP) criterion [Xu 1999], instead of aligning frames to states probabilistically; each segment (state) is then described by a mixed Gaussian density (MGD) [Jiang 1989]. Compared with the HMM, the SPM is simpler to train and faster to decode.

3.1.2 The tying idea of the SCHMM

In the CHMM every state owns its pdfs; in the SCHMM all states share one codebook of pdfs and keep only state-specific weights (the Tied Mixture HMM [Bellegarda 1990]; see also [Duchateau 1998]). This reduces both the number of Gaussians and the number of density evaluations dramatically, with little loss in accuracy.

3.1.3 The SCSPM

Applying SCHMM-style tying to the segmental MGCPM yields the SCSPM: all states of all models share a single Gaussian codebook, and each state stores only a weight vector over the codebook.

(Figure 3-1  The structure of the SCSPM.)

3.2 Building the Codebook

Two problems must be solved to build an SCSPM: constructing the shared Gaussian codebook, and estimating the tied weights of each state. This section addresses the codebook.

3.2.1 Generating the codebook by clustering

The classic codebook design algorithm is the LBG algorithm (also called the Generalized Lloyd Algorithm, GLA) [Sadaoki 1985], an extension of K-means clustering:

(1) Given the training set S;
(2) set the maximum number of iterations L;
(3) choose the distortion threshold ε;
(4) choose M initial centroids C_1^(0), C_2^(0), …, C_M^(0);
(5) set D^(0) = ∞;
(6) set m = 1;
(7) partition S into M clusters S_1^(m), S_2^(m), …, S_M^(m), assigning X to S_l^(m) iff

d(X, C_l^(m-1)) ≤ d(X, C_i^(m-1)),  ∀i    (3-1)

(8) compute the total distortion

D^(m) = Σ_{l=1}^{M} Σ_{X∈S_l^(m)} d(X, C_l^(m-1))    (3-2)

(9) compute the relative distortion improvement

δ^(m) = ( D^(m-1) - D^(m) ) / D^(m)    (3-3)

(10) update the centroids

C_i^(m) = (1/N_i) Σ_{X∈S_i^(m)} X    (3-4)

(11) if δ^(m) < ε, go to (13); otherwise continue with (12);
(12) if m < L, set m = m + 1 and go to (7); otherwise go to (13);
(13) output the centroids C_1^(m), C_2^(m), …, C_M^(m) and the distortion D^(m);
(14) stop.

Here L bounds the number of iterations and ε is a small value between 0 and 1; the loop usually converges within about 20 iterations. The difference between LBG and plain K-means lies in the initialization: LBG starts from a single centroid X_0, the mean of S, perturbs it by a factor of 1.01 to obtain the pair (X_0, 1.01·X_0), runs the K-means refinement, and then splits every codeword again; after B binary splits the codebook contains M = 2^B codewords.

In this thesis the codebook is obtained not from raw feature vectors but from the Gaussian components (MGDs) of already-trained MGCPMs: the Gaussians of all states are pooled, clustered and merged stage by stage until the target codebook size N is reached. Here N = 4,096 is used, i.e. the final shared codebook contains 4,096 Gaussians.
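The clustering loop above can be sketched as follows, with squared Euclidean distortion and the binary splitting (perturbation factor 1.01) described in the text; a toy sketch, not the thesis implementation:

```python
import numpy as np

def lbg(S, n_splits, n_iter=20, eps=1e-4):
    """LBG codebook design: start from the global centroid, repeatedly
    split every codeword, and refine each stage with K-means steps
    (3-1)-(3-4). Returns 2**n_splits codewords."""
    C = S.mean(axis=0, keepdims=True)              # single initial centroid
    for _ in range(n_splits):
        C = np.vstack([C, 1.01 * C])               # binary split
        D_prev = np.inf
        for _ in range(n_iter):
            d = ((S[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(axis=1)              # nearest codeword (3-1)
            D = d[np.arange(len(S)), assign].sum() # total distortion (3-2)
            if D_prev - D <= eps * max(D, 1e-12):  # small gain (3-3): stop
                break
            D_prev = D
            for i in range(len(C)):                # centroid update (3-4)
                pts = S[assign == i]
                if len(pts):
                    C[i] = pts.mean(axis=0)
    return C
```

With well-separated data, one split already recovers the two cluster centers; repeated splitting doubles the codebook size each stage, matching M = 2^B.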

3.2.2 Shrinking the codebook by merging Gaussians

The pooled MGCPM Gaussians can also be reduced by repeatedly merging the closest pair. With the distance between two Gaussians defined on their mean vectors,

D(G_1, G_2) = || μ_1 - μ_2 ||    (3-5)

where G_t denotes a Gaussian with mean vector μ_t and variance vector σ_t. At each step the closest pair (G_i, G_j) is found and merged into a single Gaussian G with

μ = ( μ_i + μ_j ) / 2,  σ = ( σ_i + σ_j ) / 2    (3-6)

Each merge reduces the codebook size from n to n - 1; merging is repeated until the target size N is reached.
3.3

M m (m=1,2,..,M) m
Z J Z Zj Z
j(j=1,2,..J) g m m

MGD, Z j
M

p( z j | ) = {g m ( p( z j | m )}

(3-7)

m =1

g
m =1

=1

(3-8)
- 27 -

j =1

j =1

h( ) = ln( p ( Z | )) = ln( p( z j | )) = ln p (z j | )

(3-9)

3-9
M

f = h( ) + ( g m 1)

(3-10)

m =1

t=1,2,M,
J

j=1

m =1

p( z j | ) 1 gt ( g m p( z j | m )) + = 0

(3-11)

g t = p ( z j | ) 1 g t p ( z j | t )

(3-12)

j =1

= p ( z j | ) 1 g t p ( z j | t ) = J

(3-13)

t =1 j =1

3-123-13 g t
J
(
g t = J 1 p ( z j | ) 1 g t p ( z j | t )

(3-14)

j =1
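The update (3-14) can be sketched in a few lines, assuming the codebook densities p(z_j|λ_m) have already been evaluated for the state's frames; names are illustrative:

```python
import numpy as np

def reestimate_weights(dens, g):
    """One re-estimation pass (3-14) for the tied weights of one state.

    dens[j, m] = p(z_j | lambda_m): codebook densities at the state's
    J frames; g[m]: current weights. Returns the updated weight vector."""
    J = dens.shape[0]
    mix = dens @ g                        # p(z_j | lambda)   (3-7)
    post = (dens * g) / mix[:, None]      # p(z_j|lam)^-1 g_t p(z_j|lam_t)
    return post.sum(axis=0) / J           # (3-14)
```

Because each row of `post` sums to one, the updated weights automatically satisfy the constraint (3-8), and as an EM step the update never decreases the likelihood (3-9).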

3.4 Reducing the Number of SCSPM Weights

In a tied-mixture model most of a state's weights are close to zero, so removing them shrinks storage and computation considerably [Gales 1999][Fischer 1999]. Four weight reduction methods are studied:
(1) absolute thresholding: weights below a fixed threshold are set to 0;
(2) rank thresholding: only the N largest weights of each state are kept;
(3) cumulative thresholding: the largest weights are kept until their cumulative sum reaches a given fraction, and the rest are set to 0;
(4) iterative pruning, the method proposed here: after each training iteration, the very small weights are pruned before the next re-estimation pass, so that the surviving weights are retrained to compensate for the removal.

After any pruning step the surviving weights of a state are renormalized so that the constraint (3-8) still holds. Since the SCSPM state densities are tied MGDs, pruning a weight simply drops one codebook Gaussian from that state's mixture while leaving the shared codebook itself intact.
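The pruning-plus-renormalization step shared by all four methods can be sketched as follows; the threshold value is illustrative only:

```python
import numpy as np

def prune_weights(g, threshold=1e-5):
    """Zero every tied weight below `threshold`, then rescale the
    survivors so the sum-to-one constraint (3-8) still holds.

    A sketch of the pruning step applied between training iterations
    in the proposed iterative scheme; the threshold is illustrative."""
    g = np.where(g < threshold, 0.0, g)
    return g / g.sum()
```

In the iterative scheme, `prune_weights` is applied to each state's weight vector after every re-estimation pass, so subsequent passes redistribute the pruned probability mass over the remaining codewords.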

3.5 Experiments

The SCSPM is evaluated on the 863 continuous Mandarin corpus with toneless-syllable recognition. Speech is sampled at 16 kHz and parameterized as MFCC feature vectors, and results are reported as the top-1, top-5 and top-10 candidate syllable accuracy.

3.5.1 Codebook size

Codebooks of 1,024, 2,048 and 4,096 Gaussians are compared. Table 3-1 lists the resulting recognition rates: the top-1 rate rises from 60.39-61.33% at codebook size 1,024, through 69.50-70.21% at 2,048, to 75.49% at 4,096; the corresponding top-5 rates are 85.74-86.08%, 88.79-88.93% and 90.48%, and the top-10 rates 91.56-91.64%, 92.76-92.94% and 94.42%. Enlarging the codebook from 1,024 to 4,096 thus cuts the top-1 syllable error rate by 36.6% relative.

3.5.2 Weight reduction

With the 2,048-Gaussian codebook, the four weight reduction methods of Section 3.4 are compared over the training iterations, with settings th = 0.005 (absolute threshold), th = 50 (rank threshold), th = 90% (cumulative threshold) and th = 0.00001 (iterative pruning).

(Figure 3-2  Recognition rate (%) of the four weight reduction methods over the training iterations, codebook size 2,048.)

3.5.3 SCSPM versus MGCPM

Table 3-2 compares the SCSPM with MGCPMs using 4 and 8 mixture components per state.

Table 3-2  Comparison of the SCSPM and the MGCPM

                      MGCPMs (4 mix)   MGCPMs (8 mix)   SCSPMs
Number of Gaussians   9,669            18,685           4,096
Top-1 rate (%)        74.35            75.87            75.49
Top-5 rate (%)        89.15            90.62            90.48
Top-10 rate (%)       93.78            94.55            94.42

With only 4,096 shared Gaussians, the SCSPM outperforms the 4-mixture MGCPM (9,669 Gaussians) and nearly matches the 8-mixture MGCPM (18,685 Gaussians): the model size and the amount of density computation drop to roughly a quarter with almost no loss in accuracy.

3.6 Summary

This chapter proposed the SCSPM, which grafts SCHMM-style Gaussian sharing onto the segmental MGCPM. Codebook construction, tied weight estimation and weight pruning were described, and the experiments showed that the SCSPM matches the recognition accuracy of the MGCPM with a far smaller and faster model.

If context dependency were modeled at the syllable level, the 418 base syllables would generate about 73,000,000 tri-syllable units, which is clearly impractical. Context dependency must therefore be modeled with smaller units; this chapter studies context dependent phone modeling with decision-tree-based state tying.

4.1 The Phone Set
4.1.1 Phones

Following [Ma 2000], a phone set tailored to Mandarin is adopted. Table 4-1 lists the vowel phones together with the pinyin contexts in which they occur.

Table 4-1  The vowel phones and their pinyin contexts

/aI/    the a in ai, an
/a/     a
/Ie/    the e in ie
/eI/    the e in ei
/eN/    the e in en
/e/     e
/Ci/    the i in ci, si, zi
/CHi/   the i in chi, shi, zhi
/Bi/    i
/oU/    the o in ou
/o/     o
/u/     u
/v/     yu (ü)

This gives 13 vowel phones. The consonant phones are
/b/, /c/, /ch/, /d/, /f/, /g/, /h/, /j/, /k/, /l/, /m/, /n/, /ng/, /p/, /q/, /r/, /s/, /sh/, /t/, /x/, /z/, /zh/,
22 in all including /ng/. Together with /er/ and the silence phone /sil/, the phone set contains 37 units.

4.1.2 Decomposing the finals into phones

Table 4-2 gives the decomposition of every final into the phone set.

Table 4-2  Decomposition of the finals

a → /a/              ai → /aI/Bi/         an → /aI/n/
ang → /a/ng/         ao → /a/u/           e → /e/
ei → /eI/Bi/         en → /eN/n/          eng → /eN/ng/
er → /er/            o → /o/              ong → /u/ng/
ou → /oU/u/          i → /Bi/             ia → /Bi/a/
ian → /Bi/Ie/n/      iang → /Bi/a/ng/     iao → /Bi/a/u/
ie → /Bi/Ie/         in → /Bi/n/          ing → /Bi/ng/
iong → /Bi/u/ng/     iou → /Bi/oU/u/      u → /u/
ua → /u/a/           uai → /u/aI/Bi/      uan → /u/aI/n/
uang → /u/a/ng/      uei → /u/eI/Bi/      uen → /u/eN/n/
ueng → /u/eN/ng/     uo → /u/o/           v → /v/
van → /v/Ie/n/       ve → /v/Ie/          vn → /v/n/
io → /Bi/o/
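A table-driven decomposition of syllables into this phone set can be sketched as follows. Only a few entries of Table 4-2 are included, and the helper names are ours, not from the thesis:

```python
# Partial decomposition table (a few entries of Table 4-2).
FINAL_TO_PHONES = {
    "a": ["a"], "ang": ["a", "ng"], "ian": ["Bi", "Ie", "n"],
    "uang": ["u", "a", "ng"], "iou": ["Bi", "oU", "u"],
}
# The 22 consonant phones double as syllable Initials here.
INITIALS = {"b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "z", "c", "s", "r"}

def syllable_to_phones(initial, final):
    """Expand an (Initial, Final) pair into its phone sequence.

    A zero-Initial syllable is passed with initial="" and expands to
    the Final's phones alone."""
    assert final in FINAL_TO_PHONES, "final not in this partial table"
    head = [initial] if initial in INITIALS else []
    return head + FINAL_TO_PHONES[final]
```

For example, the syllable "zhang" expands to /zh/ /a/ /ng/, while the zero-Initial syllable "a" expands to /a/ alone.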

4.2 The Question Set and the Splitting Criterion
4.2.1 Designing the question set

The questions used for decision-tree state tying group the phones into phonetically meaningful classes, following Chinese phonology [Wu 1989]. Table 4-3 lists the vowel classes and Table 4-4 the consonant classes.

Table 4-3  Vowel question classes

High:              /Bi/, /Ci/, /CHi/, /u/, /v/
Medium:            /e/, /Ie/, /eI/, /eN/, /er/, /o/, /oU/
Low:               /a/, /aI/
Top Vowel:         /a/, /aI/, /e/, /Ie/, /eI/, /eN/, /Bi/, /o/, /oU/, /u/, /v/
Front:             /aI/, /a/, /Bi/, /v/, /Ie/
End:               /a/, /e/, /eI/, /eN/, /u/, /o/, /oU/
Unrounded:         /a/, /aI/, /Bi/, /e/, /Ie/, /eI/, /eN/
Rounded:           /u/, /v/, /o/, /oU/
Apical Vowel:      /Ci/, /CHi/
E1 (Evowel):       /e/, /Ie/, /eI/, /eN/
E2 (Evowel2):      /e/, /eI/, /eN/
I (Ivowel):        /Bi/, /Ci/, /CHi/
O (Ovowel):        /o/, /oU/
UV (u and v):      /u/, /v/

Table 4-4  Consonant question classes

Stop:                   /b/, /d/, /g/, /p/, /t/, /k/
Aspirated Stop:         /b/, /d/, /g/
Unaspirated Stop:       /p/, /t/, /k/
Affricate:              /z/, /zh/, /j/, /c/, /ch/, /q/
Aspirated Affricate:    /z/, /zh/, /j/
Unaspirated Affricate:  /c/, /ch/, /q/
Fricative:              /f/, /s/, /sh/, /x/, /h/, /r/
Fricative2:             /f/, /s/, /sh/, /x/, /h/, /r/, /k/
Voiceless Fricative:    /f/, /s/, /sh/, /x/, /h/
Voiced Fricative:       /r/, /k/
Nasal:                  /m/, /n/, /ng/
Nasal2:                 /m/, /n/, /l/
Nasal3:                 /m/, /n/, /l/, /ng/
Labial:                 /b/, /p/, /m/
Labial2:                /b/, /p/, /m/, /f/
Apical:                 /z/, /c/, /s/, /d/, /t/, /n/, /l/, /zh/, /ch/, /sh/, /r/
Apical Front:           /z/, /c/, /s/
Apical1:                /d/, /t/, /n/, /l/
Apical2:                /d/, /t/
Apical3:                /n/, /l/
Apical End1:            /zh/, /ch/, /sh/, /r/
Apical End2:            /zh/, /ch/, /sh/
Tongue Top:             /j/, /q/, /x/
Tongue Root:            /g/, /k/, /h/, /ng/
Tongue Root2:           /g/, /k/, /h/

For each class over the 37 phones, three questions are generated, asking about the left context, the right context and the central phone. For the Stop class, for example:

Left Question:
QS "L_Stop"   {b-*, d-*, g-*, p-*, t-*, k-*}

Right Question:
QS "R_Stop"   {*+b, *+d, *+g, *+p, *+t, *+k}

Central Question:
QS "C_Stop"   {*-b+*, *-d+*, *-g+*, *-p+*, *-t+*, *-k+*, *-b, *-d, *-g,
               *-p, *-t, *-k, b+*, d+*, g+*, p+*, t+*, k+*, b, d, g, p, t, k}
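The QS lines above pair a question name with a set of shell-style context patterns; whether a question "fires" for a triphone is simply whether its label matches any pattern. A sketch of parsing and applying such questions, with helper names of our own (this is not HTK's internal API):

```python
import re
import fnmatch

def parse_qs(line):
    """Parse an HTK-style question line, e.g.
        QS "L_Stop" {b-*, d-*, g-*, p-*, t-*, k-*}
    into (name, [patterns])."""
    m = re.match(r'QS\s+"?([^"\s]+)"?\s*\{(.*)\}', line)
    name, body = m.group(1), m.group(2)
    return name, [p.strip() for p in body.split(",")]

def question_matches(patterns, label):
    """True if the triphone label (l-c+r form) matches any pattern."""
    return any(fnmatch.fnmatchcase(label, p) for p in patterns)
```

For instance, "L_Stop" matches the triphone b-aI+f (its left context b is a stop) but not f-aI+m.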

4.2.2 The node splitting criterion

Let L(S) = log P(X|S) be the log likelihood of the N feature frames X = {X_1, X_2, …, X_N} assigned to node S [Gao 1998]. A question splits S into two children, partitioning X into X^1 = {X^1_1, …, X^1_{N_1}} and X^2 = {X^2_1, …, X^2_{N_2}} with X = X^1 ∪ X^2 and X^1 ∩ X^2 = ∅. With child log likelihoods L^1_child and L^2_child, the gain of the split is

ΔL = L^1_child + L^2_child - L_parent

Evaluating L(S) exactly is expensive, so it is approximated [Reichl 2000] by modeling each node with a single Gaussian:

Q(S) = Σ_{s∈S} Σ_{x_t} γ_s(x_t) log N( x_t | μ(S), Σ(S) )    (4-1)

where γ_s(x_t) is the occupancy probability of frame x_t in state s and N(·|μ,Σ) is a Gaussian with the node's mean and covariance. The gain

ΔQ(S) = Q(S_1) + Q(S_2) - Q(S) ≈ ΔL(S)    (4-2)

is evaluated for every question at each node, and the question with the largest ΔQ(S) is selected.
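The node score (4-1) and the split gain (4-2) can be sketched as follows, simplified by taking the occupancies γ_s(x_t) = 1 and diagonal covariances; a toy sketch, not the thesis implementation:

```python
import numpy as np

def node_loglik(X):
    """Q(S) under a single diagonal Gaussian fitted to the node's frames X.

    For an ML-fitted Gaussian the quadratic term sums to n*d, which gives
    the closed form below (occupancies taken as 1 for simplicity)."""
    n, d = X.shape
    var = X.var(axis=0) + 1e-8            # variance floor for degenerate nodes
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(X, mask):
    """Delta-Q of (4-2) for splitting the node data X by a boolean mask."""
    return node_loglik(X[mask]) + node_loglik(X[~mask]) - node_loglik(X)
```

A question that separates acoustically distinct frames yields a much larger gain than an arbitrary partition, which is exactly what the greedy tree-growing step exploits.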

4.2.3 State tying with decision trees

Each of the 37 phones is expanded into context dependent triphones written l-c+r, where c is the central phone and l and r are its left and right neighbors; 8,757 distinct triphones occur in the training data. The states of the triphones sharing the same central phone are then clustered (tied) with phonetic decision trees.
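The l-c+r expansion itself is mechanical; a short sketch, using the HTK-style convention that boundary phones fall back to biphone or monophone labels (the central questions above include the *-c, c+* and bare-c patterns precisely for these boundary forms):

```python
def to_triphones(phones):
    """Expand a phone sequence into l-c+r triphone labels.

    Phones at the sequence boundary, which lack a left or right
    neighbor, keep the shorter biphone/monophone label forms."""
    out = []
    for i, c in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        name = c
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        out.append(name)
    return out
```

For example, the phone string /b/ /aI/ /f/ expands to b+aI, b-aI+f, aI-f.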

Figure 4-1 illustrates the tying procedure: the corresponding states of all triphones of a phone are pooled at the root of a tree and recursively split by phonetic questions until a stopping threshold is reached; the states arriving at the same leaf are tied.

(Figure 4-1  Decision-tree-based state tying.)

Figures 4-2 and 4-3 show, for state 2 of the phone /aI/, parts of the trees grown under the two structures studied here.

(Figure 4-2  Part of the type-I decision tree for b-aI+f.s[2], f-aI+m.s[2], j-aI+z.s[2], s-aI+t.s[2]: from the root the questions R_Affricate?, R_Nasal?, L_Stop? and L_Affricate? are asked.)

(Figure 4-3  Part of the type-II decision tree for *.s[2], which also uses central questions such as C_Affricate?.)

Under structure I, a separate tree is grown for each of the 3 states of each of the 37 phones, 111 trees in all, using the left and right questions of Section 4.2.1. Under structure II, one tree per phone covers all 3 states, and the central questions of Section 4.2.1 are used as well.
4.3 Experiments

The experiments use the 863 continuous Mandarin corpus, with 80 speakers for training and a test set of 520 utterances. Speech is sampled at 16 kHz and parameterized every 12 ms as 13 MFCCs plus normalized energy, with first- and second-order derivatives appended (42-dimensional feature vectors). Models are trained and evaluated with HTK v2.2 [Yong 1999].

4.3.1 The CI-Phone baseline

Context independent (CI) phone models are trained first. Table 4-5 lists their syllable accuracy as the number of Gaussian mixtures per state grows.

Table 4-5  Syllable accuracy (%) of the CI-Phone model

Mixtures      1      2      4      8
Accuracy (%)  31.30  37.69  43.42  47.37

Even with 8 mixtures the CI-Phone model reaches only about 47%: context independent phones are too coarse for Chinese, which motivates context dependent modeling.

4.3.2 Comparing the two tree structures

The two tree structures of Section 4.2.3 are compared. Figure 4-4 plots the syllable accuracy of three configurations: structure I (threshold 350, 6,587 tied states), structure II (threshold 350, 6,285 tied states) and structure II (threshold 100, 10,528 tied states).

(Figure 4-4  Syllable accuracy (%) of tree structures I and II versus the number of mixtures.)

At comparable model sizes, structure I performs best, even though it grows 3 trees per phone against structure II's single tree; structure I with threshold 350 is therefore used in the remaining experiments.

4.3.3 CD-Phone results

The CD-Phone model (structure I, decision-tree state tying) is compared with the baseline syllable model as the number of mixtures grows from 1 to 8; Table 4-6 lists the results. The CD-Phone model is consistently better, and with 8 mixtures it reduces the syllable error rate by about 10%.

Table 4-6  Syllable accuracy (%) of the baseline syllable model and the CD-Phone model

Mixtures         1      2      4      6      8
Syllable model   60.51  65.04  69.92  72.37  73.83
CD-Phone         68.92  72.35  74.59  75.62  76.47
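The relative error rate reductions quoted in this thesis follow directly from such accuracy pairs; a quick arithmetic check, using the 8-mixture figures of Table 4-6:

```python
def relative_ser_reduction(baseline_acc, new_acc):
    """Relative syllable error rate (SER) reduction between two
    accuracies given in percent."""
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (base_err - new_err) / base_err

# 8-mixture figures from Table 4-6: 73.83% (syllable baseline) vs.
# 76.47% (CD-Phone) give roughly a 10% relative SER reduction.
```

The same formula applied to the CD-XIF result of Chapter 5 (80.43% against the same baseline) yields the quoted reduction of over 25%.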

4.4 Summary

This chapter implemented decision-tree-based state tying for context dependent phone modeling of Chinese: the phone set, the question set, the splitting criterion and two tree structures were designed, and structure I was adopted. Compared with the baseline syllable model, the resulting CD-Phone model reduces the syllable error rate by about 10%.

5.1 The Initial/Final Set

A Chinese syllable consists of an optional Initial followed by a Final; the standard set has 21 Initials and 38 Finals [Li 2000], listed in Table 5-1.

Table 5-1  The standard Initial/Final (IF) set

Initials (21):  b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, z, c, s, r
Finals (38):    a, ai, an, ang, ao, e, ei, en, eng, er, o, ong, ou, i, i1, i2,
                ia, ian, iang, iao, ie, in, ing, iong, iou, u, ua, uai, uan,
                uang, uei, uen, ueng, uo, v, van, ve, vn

In a zero-Initial syllable the Final stands alone: the syllable "ang", for example, consists only of the final ang, in contrast to "zhang" = zh + ang.

To restore an Initial position for zero-Initial syllables, six zero Initials _a, _o, _e, _i, _u and _v are introduced, so that "ang", for example, becomes _a + ang. Adding them to the 21 standard Initials yields the Extended Initial/Final (XIF) set of Table 5-2.

Table 5-2  The Extended Initial/Final (XIF) set

Initials (27):  b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, z, c, s, r,
                _a, _o, _e, _i, _u, _v
Finals (38):    a, ai, an, ang, ao, e, ei, en, eng, er, o, ong, ou, i, i1, i2,
                ia, ian, iang, iao, ie, in, ing, iong, iou, u, ua, uai, uan,
                uang, uei, uen, ueng, uo, v, van, ve, vn

With XIFs every syllable has both an Initial and a Final, so the coupling between them can be modeled uniformly, whereas with IFs the zero-Initial syllables break this structure. In the training data, the number of distinct context dependent units drops from 122,118 with IFs to 29,047 with XIFs.

5.2 The Question Set for IF Modeling

The questions are built in the same way as for phones. For the Affricate class, for example:

QS "R_Affricate"   {*+z, *+zh, *+j, *+c, *+ch, *+q}

QS "L_Affricate"   {z-*, zh-*, j-*, c-*, ch-*, q-*}

where "+" and "-" mark the right and left context. Questions are also derived from sub-classes; for the Aspirated Affricate class of Table 4-4:

QS "R_AspiratedAffricate"   {*+z, *+zh, *+j}

QS "L_AspiratedAffricate"   {z-*, zh-*, j-*}

For the Finals, questions group those sharing the same head vowel, for example:

QS "R_Type_A"   {*+a, *+ai, *+an, *+ang, *+ao}

QS "L_Type_A"   {a-*, ia-*, ua-*}

In all, 61 classes are defined: 32 for the Initials and 29 for the Finals. In addition, a singleton question is created for every unit, for example:

QS "R_p"   {*+p}

QS "L_p"   {p-*}

For the XIF set, the zero Initials are added to the corresponding classes; the L_Type_A question, for instance, becomes:

QS "L_Type_A"   { _a-*, a-*, ia-*, ua-*}

5.3 Handling Silence and Short Pauses

To make the models robust to silence and pauses, the silence modeling is refined as follows:
1. Extra transitions with probability 0.2 are added to the 3-emitting-state silence model sil between states 2 and 4 in both directions, so that sil can absorb impulsive noise without leaving the model.
2. A short pause model (Short Pause, sp) with a single emitting state is created, and that state is tied to the center state (state 3) of sil.
3. sp is made a "tee" model: a direct entry-to-exit transition with probability 0.3 allows it to be skipped entirely, so it can model inter-syllable pauses of any length, including zero.

(Figure 5-1  The sil and sp models and the tying between them.)

5.4 Experiments

The experimental conditions are the same as in Section 4.3.

5.4.1 IF versus XIF

Context independent IF and XIF models are first compared with the CI-Phone baseline; Table 5-3 lists the syllable accuracies.

Table 5-3  Syllable accuracy (%) of the CI-Phone, CI-IF and CI-XIF models

Mixtures    1      2      4
CI-Phone    31.30  37.69  43.42
CI-IF       43.71  50.13  54.25
CI-XIF      47.09  55.26  58.67

Both IF sets clearly outperform the phone units, and with 4 mixtures the XIF set beats the standard IF set by more than 4 percentage points; the XIF set is therefore used in the following experiments.

5.4.2 The effect of the sp model

Table 5-4 shows the effect of adding the short pause (sp) model to the CD-XIF system.

Table 5-4  Syllable accuracy (%) with and without the sp model

Mixtures     1      2      4
Without sp   74.85  76.83  78.77
With sp      75.24  77.28  79.30

The sp model yields a consistent gain of about 0.5 percentage point.

5.4.3 The splitting threshold

Table 5-5 shows, for single-mixture CD-XIF models, how the node splitting threshold trades the number of tied states against accuracy.

Table 5-5  Effect of the splitting threshold

Threshold      100     200     350
Tied states    21,222  14,630  9,311
Accuracy (%)   76.23   76.19   75.24

The untied system has 42,255 states; tying reduces this by a factor of 2 to 4.5 with almost no loss in accuracy.

5.4.4 Overall comparison

Figure 5-2 compares the baseline syllable model (CI-Syllable) with the CD-Phone, CD-IF and CD-XIF models (tree structure I, threshold 350) as the number of mixtures grows from 1 to 8.

(Figure 5-2  Syllable accuracy (%) of the CI-Syllable, CD-Phone, CD-IF and CD-XIF models versus the number of mixtures.)

At 8 mixtures the ordering is CI-Syllable < CD-Phone < CD-IF < CD-XIF, showing the benefit of context dependent modeling in general and of the XIF units in particular. The CD-XIF model outperforms the CD-IF model by about 4 percentage points and reaches a syllable accuracy of 80.43%, a syllable error rate reduction of more than 25% relative to the CI-Syllable baseline.

5.5 Summary

This chapter implemented context dependent Initial/Final (CD-IF) modeling with decision-tree-based state tying. Six zero Initials were added to the 21 standard Initials to form the Extended Initial/Final (XIF) set, which gives every syllable an Initial-Final structure. The XIF models outperform the IF models, and the final CD-XIF model, about 4 points better than CD-IF, achieves 80.43% syllable accuracy, an error rate reduction of over 25% compared with the baseline syllable model.

6.1 Conclusions

This thesis studied acoustic modeling and parameter tying for Chinese speech recognition. The main results are:
(1) The Semi-Continuous Segmental Probability Model (SCSPM) was proposed and implemented. Built on the HMM and the MGCPM, it combines VQ with continuous densities through tied mixtures, and a new, effective weight reduction method that prunes small tied weights during iterative training was proposed. The SCSPM substantially reduces model size and computation while keeping the recognition accuracy of the MGCPM.
(2) The HTK platform was studied and analyzed, and an HTK-based procedure for acoustic model training and performance evaluation was implemented.
(3) Decision-tree-based state tying for context dependent modeling was studied in depth: question set design, node splitting strategies and the treatment of the silence model.
(4) A CD-Phone model with decision-tree-based state tying was implemented; it reduces the syllable error rate by about 10% relative to the baseline syllable model.
(5) A CD Initial/Final (CD-IF) model was implemented; the Extended Initial/Final (XIF) set, formed by adding the zero Initials, preserves the Initial-Final structure of every syllable. The resulting CD-XIF model achieves a syllable correct rate of over 80% and reduces the syllable error rate by over 25%.

6.2 Future Work

Appendix: Acoustic Model Training with HTK

1. The HTK toolkit
HTK (the Hidden Markov Model, HMM, Toolkit) is a toolkit for building and manipulating HMM-based recognition systems. Its main tools are:

HBuild: builds word networks and lattices for recognition;
HCopy: copies speech files and converts (parameterizes) them into feature files;
HDMan: edits and merges pronunciation dictionaries;
HLEd: edits label (transcription) files;
HList: lists the contents of HTK speech and feature files;
HLStats: gathers statistics from HTK label files;
HParse: compiles an extended Backus-Naur form (EBNF) grammar into an HTK word network;
HSGen: generates sentences from an HTK word network;
HSLab: interactive labeling of speech waveforms;
HCompV: initializes a model from the global mean and variance of the training data;
HERest: embedded Baum-Welch re-estimation of a whole HMM set (Embedded Training);
HEAdapt: adapts HMMs with MLLR/MAP;
HHEd: edits HMM definitions (cloning, parameter tying, decision trees);
HInit: bootstrap initialization of a single HMM;
HQuant: builds HTK VQ codebooks;
HRest: Baum-Welch re-estimation of a single HMM;
HSmooth: smooths HMM parameters;
HResults: scores HTK recognition output;
HVite: Viterbi recognition.

2. The basic training procedure
Training starts from a prototype model definition (Proto). HCompV initializes it with the global mean and variance; HInit then bootstraps each model from the labeled data (described by a master label file, mlf); HRest re-estimates each model in isolation with Baum-Welch; and HERest finally performs several passes (typically 5) of embedded Baum-Welch re-estimation over whole utterances, which is the main training step.

(Appendix Figure 1  The basic training flow: Proto, HCompV, HInit, HRest, HERest (5 passes).)

3. Training context dependent models
For context dependent models, HLEd expands the monophone transcriptions (mlf) into context dependent transcriptions; HHEd clones the monophone models into context dependent models and is then used again to perform decision-tree state tying; after each HHEd step, HERest re-estimates the model set (about 5 iterations each time).

(Appendix Figure 2  The context dependent training flow: HLEd, HHEd (cloning), HERest (5 passes), HHEd (tying), HERest (5 passes).)

[1] Bahl, L.R., Brown, P.F., de Souza, P.V., et al., (Bahl 1986) "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," ICASSP, pp.49-52, 1986
[2] Baum, L.E., (Baum 1972) "An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes," Inequalities, 3, 1972
[3] Bellegarda, J.R., Nahamoo, D., (Bellegarda 1990) "Tied Mixture Continuous Parameter Modeling for Speech Recognition," IEEE Trans. on ASSP, vol.ASSP-38, No.12, pp.2033-2045, 1990
[4] Breiman, L., Friedman, J., Olshen, R.A., et al., (Breiman 1984) Classification and Regression Trees, Belmont, CA: Wadsworth, 1984
[5] Dempster, A.P., Laird, N.M., Rubin, D.B., (Dempster 1977) "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc., vol.39, pp.1-83, 1977
[6] Duchateau, J., Demuynck, K., Compernolle, D., (Duchateau 1998) "Fast and Accurate Acoustic Modeling with Semi-Continuous HMMs," Speech Communication, vol.24, pp.5-17, 1998
[7] Fischer, V., Ross, T., (Fischer 1999) "Reduced Gaussian Mixture Models in a Large Vocabulary Continuous Speech Recognizer," EuroSpeech'99, Vol.3, pp.1099-1102, 1999
[8] Franzini, M.A., Lee, K.F., (Franzini 1990) "A New Hybrid Method for Continuous Speech Recognition," ICASSP-90, 1990
[9] Gales, M., Knill, K., Young, S., (Gales 1999) "State-Based Gaussian Selection in Large Vocabulary Continuous Speech Recognition Using HMMs," IEEE Trans. on Speech and Audio Processing, vol.7, No.2, pp.152-161, 1999
[10] Gao, S., Xu, B., Huang, T.Y., (Gao 1998) "Class-triphone Acoustic Modeling Based on Decision Tree for Mandarin Continuous Speech Recognition," International Symposium on Chinese Spoken Language Processing (ISCSLP'98), ASR-A1, Singapore, 1998
[11] Huang, X.D., Jack, M.A., (Huang 1989) "Semi-continuous hidden Markov models for speech signals," Computer Speech and Language, vol.3, pp.239-251, 1989
[12] Jelinek, F., Mercer, R.L., Bahl, L.R., (Jelinek 1983) "A maximum likelihood approach to continuous speech recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, pp.179-190, 1983
[13] (Jiang 1989) Thesis, 1989. (In Chinese.)
[14] Juang, B.H., Rabiner, L.R., (Juang 1985) "A probabilistic distance measure for hidden Markov models," AT&T Technical Journal, 64(2), pp.391-408, 1985
[15] Katz, S., (Katz 1987) "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp.400-401, 1987
[16] Lee, C.H., Rabiner, L.R., (Lee 1989) "A frame synchronous network search algorithm for connected word recognition," Proc. of IEEE, 77(2), pp.257-285, 1989
[17] Li, J., Zheng, F., Wu, W.H., (Li 2000) "Context-Independent Chinese Initial-Final Acoustic Modeling," International Symposium on Chinese Spoken Language Processing (ISCSLP 2000), pp.23-26, Beijing, 2000
[18] Ma, B., Huo, Q., (Ma 2000) "Benchmark results of triphone-based acoustic modeling on HKU96 and HKU99 Putonghua corpora," International Symposium on Chinese Spoken Language Processing (ISCSLP 2000), pp.359-362, Beijing, 2000
[19] Morgan, N., Bourlard, H., (Morgan 1990) "Continuous Speech Recognition Using Multilayer Perceptrons with Hidden Markov Models," ICASSP, 1990
[20] (Mou 1998) Thesis, 1998. (In Chinese.)
[21] Nadas, A., (Nadas 1985) "On Turing's formula for word probabilities," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp.1414-1416, 1985
[22] Reichl, W., Chou, W., (Reichl 2000) "Robust Decision Tree State Tying for Continuous Speech Recognition," IEEE Trans. Speech and Audio Proc., Vol.8, No.5, pp.555-566, 2000
[23] Rabiner, L.R., (Rabiner 1989) "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol.77, No.2, pp.257-286, 1989
[24] Sadaoki, F., (Sadaoki 1985) Digital Speech Processing, Synthesis, and Recognition, Tokai University Press, 1985
[25] Verhasselt, J., Martens, J.P., (Verhasselt 1998) "Context modeling in hybrid segment-based/neural network recognition systems," Proceedings of ICASSP, Vol.1, pp.501-504, 1998
[26] Vintsjuk, T.K., (Vintsjuk 1968) "Recognition of Words of Oral Speech by Dynamic Programming," Kibernetica, Vol.81, No.8, 1968
[27] Viterbi, A.J., (Viterbi 1967) "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. on Information Theory, 13(2), 1967
[28] Waibel, A., et al., (Waibel 1989) "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Trans. on ASSP, Vol.37, No.12, pp.1888-1898, 1989
[29] Wilpon, J.G., Lee, C.H., Rabiner, L.R., (Wilpon 1989) "Application of hidden Markov models for recognition of a limited set of words in unconstrained speech," ICASSP-89, Vol.3, pp.254-257, 1989
[30] Wilpon, J.G., Miller, L.G., Modi, P., (Wilpon 1991) "Improvements and Applications for Key Word Recognition Using Hidden Markov Modeling Techniques," ICASSP-91, pp.905-908, 1991
[31] (Wu 1989) 1989. (In Chinese.)
[32] Xu, M.X., Zheng, F., Wu, W.H., (Xu 1999) "A Fast and Effective State Decoding Algorithm," EuroSpeech'99, Vol.3, pp.1255-1258, Hungary, 1999
[33] (Yang 1995) 1995. (In Chinese.)
[34] Young, S., Kershaw, D., Odell, J., Ollason, D., et al., (Yong 1999) The HTK Book (for HTK Version 2.2), Cambridge University, 1999
[35] (Zheng 1996) NCMMSC-96, pp.32-35, 1996. (In Chinese.)
[36] (Zheng 1997) 3rd National Conference (NCCIIIA), pp.98-103, 1997. (In Chinese.)
[37] Zheng, F., Mou, X.L., Wu, W.H., et al., (Zheng 1998) "On the embedded multiple-model scoring scheme for speech recognition," International Symposium on Chinese Spoken Language Processing (ISCSLP'98), ASR-A3, pp.49-53, Singapore, 1998
[38] Zheng, F., Song, Z.J., Xu, M.X., (Zheng 1999) "EASYTALK: A large-vocabulary speaker-independent Chinese dictation machine," EuroSpeech'99, Vol.2, pp.819-822, Hungary, 1999

Resume

The author was born in January 1977, entered university in 1994, and graduated in June 1999.

Publications

1. Jiyong ZHANG, Fang ZHENG, Mingxing XU, Ditang FANG, "Semi-Continuous Segmental Probability Modeling for Continuous Speech Recognition," International Conference on Spoken Language Processing (ICSLP 2000), Vol.1, pp.278-281, Beijing, 2000
2. Jiyong ZHANG, Fang ZHENG, Mingxing XU, Shuqing LI, "Intra-Syllable Dependent Phonetic Modeling for Chinese Speech Recognition," International Symposium on Chinese Spoken Language Processing (ISCSLP 2000), pp.73-76, Beijing, 2000
3. A journal paper (in Chinese), Vol.10, No.11, pp.1212-1215, November 1999
4. Jiyong ZHANG, Fang ZHENG, Jing LI, Chunhua LUO, and Guoliang ZHANG, "Improved Context-Dependent Acoustic Modeling for Continuous Chinese Speech Recognition," EuroSpeech, 2001