Lecture 10:
ASR: Sequence Recognition
1 Signal template matching
[Figure: test utterance (x-axis, time/frames) scored against reference templates for ONE, TWO, THREE, FOUR, FIVE (y-axis)]
- distance metric?
- comparison between templates?
- constraints?
E6820 SAPR - Dan Ellis L10 - Sequence recognition 2002-04-15 - 2
Dynamic Time Warp (DTW)
• Find lowest-cost constrained path:
- matrix d(i,j) of distances between input frame f_i and reference frame r_j
- allowable predecessors & transition costs T_xy
- lowest cost to (i,j):
  D(i,j) = d(i,j) + min{ D(i-1,j) + T_10, D(i,j-1) + T_01, D(i-1,j-1) + T_11 }
[Figure: grid of input frames f_i (x-axis) vs. reference frames r_j (y-axis); each cell shows its local match cost and best predecessor, including transition cost]
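As a rough sketch of this recurrence (the transition costs T_10/T_01/T_11 and the frame distances below are illustrative values, not ones from the lecture):

```python
import numpy as np

def dtw_cost(dist, T10=0.1, T01=0.1, T11=0.0):
    """Minimum cost D(I-1, J-1) through the local-distance matrix
    dist[i, j] (input frame i vs. reference frame j), via
    D(i,j) = d(i,j) + min(D(i-1,j)+T10, D(i,j-1)+T01, D(i-1,j-1)+T11)."""
    I, J = dist.shape
    D = np.full((I, J), np.inf)
    D[0, 0] = dist[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            preds = []
            if i > 0:
                preds.append(D[i - 1, j] + T10)      # advance input only
            if j > 0:
                preds.append(D[i, j - 1] + T01)      # advance reference only
            if i > 0 and j > 0:
                preds.append(D[i - 1, j - 1] + T11)  # advance both (diagonal)
            D[i, j] = dist[i, j] + min(preds)
    return D[I - 1, J - 1]

# Identical sequences align along the diagonal at zero cost
d = np.abs(np.subtract.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]))
print(dtw_cost(d))   # → 0.0
```

A warped input (e.g. a repeated frame) still aligns, paying only the off-diagonal transition cost.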
DTW-based recognition (2)
+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost
- Distance metric?
- pseudo-Euclidean space?
- Warp penalties?
- How to choose templates?
- several templates per word?
- choose ‘most representative’?
- align and average?
→ need a rigorous foundation...
3 Acoustic modeling
• Questions:
- what form of model to use for p(X | M_j, Θ_A)?
- how to find Θ_A (training)?
- how to solve for M_j (decoding)?
X_1^N : x_1 x_2 x_3 x_4 x_5 x_6 ... (observed feature vectors over time)
p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)
- sum over all possible state sequences Q_k
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model)]
• Questions
- what to use for the acoustic classifier?
- how to represent ‘model’ sequences?
- how to score matches?
Outline
1 Signal template matching
3 Acoustic modeling
- defining targets
- neural networks & Gaussian models
[Figure: defining training targets — feature vectors over time are labeled with per-frame phone targets; either an existing classifier supplies phone posterior probabilities (e.g. "ow th r iy"), or targets come from the known word sequence; the labeled frames feed classifier training]
3 Acoustic classification
[Figure: left, per-state emission distributions p(x|q) over observation x (0–4); right, observation sequence x_n over time steps n (0–30)]
- can still tell something about state seq...
(Generative) Markov models (2)
• HMM is specified by:
- transition probabilities p(q_n | q_{n-1}) ≡ a_ij
- emission distributions p(x | q_i) ≡ b_i(x)
- states q_i: • k a t •
- transition probabilities a_ij (row = current state, column = next state):

        k     a     t     •
   •   1.0   0.0   0.0   0.0
   k   0.9   0.1   0.0   0.0
   a   0.0   0.9   0.1   0.0
   t   0.0   0.0   0.9   0.1
- emission distributions b_i(x):
[Figure: per-state densities p(x|q) for states k, a, t over x, with an example observation sequence x_1 x_2 x_3 x_4 x_5]
p(X | Q) = ∏_{n=1}^N p(x_n | q_n) = ∏_{n=1}^N b_{q_n}(x_n)
• Markov transitions:
  - transition to next state q_{i+1} depends only on q_i
    p(Q | M) = p(q_1, q_2, ..., q_N | M)
             = p(q_N | q_1 ... q_{N-1}) · p(q_{N-1} | q_1 ... q_{N-2}) ... p(q_2 | q_1) · p(q_1)
             = p(q_N | q_{N-1}) · p(q_{N-1} | q_{N-2}) ... p(q_2 | q_1) · p(q_1)
             = p(q_1) · ∏_{n=2}^N p(q_n | q_{n-1}) = π_{q_1} · ∏_{n=2}^N a_{q_{n-1} q_n}
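Plugging the '• k a t •' transition matrix above into this product gives, for example (a minimal sketch; the chosen state path is an arbitrary legal one):

```python
# Transition probabilities a_ij from the '• k a t •' example
# (current state, next state); the chain starts in '•' with probability 1.
a = {('•', 'k'): 1.0,
     ('k', 'k'): 0.9, ('k', 'a'): 0.1,
     ('a', 'a'): 0.9, ('a', 't'): 0.1,
     ('t', 't'): 0.9, ('t', '•'): 0.1}

def p_state_sequence(Q):
    """p(Q|M) = p(q_1) * prod_{n>=2} a_{q_{n-1} q_n}; here p(q_1 = '•') = 1."""
    p = 1.0
    for prev, cur in zip(Q, Q[1:]):
        p *= a.get((prev, cur), 0.0)   # disallowed transitions have probability 0
    return p

Q = ['•', 'k', 'k', 'a', 'a', 't', 't', '•']
print(p_state_sequence(Q))   # 1.0 × .9 × .1 × .9 × .1 × .9 × .1 ≈ 0.000729
```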
p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)
• For HMMs:
  p(X | Q) = ∏_{n=1}^N b_{q_n}(x_n)
  p(Q | M) = π_{q_1} · ∏_{n=2}^N a_{q_{n-1} q_n}
[State diagram: S → A (0.9), S → B (0.1), A → A (0.7), A → B (0.2), A → E (0.1), B → B (0.8), B → E (0.2); time steps n = 0..4]
All possible 3-emission paths Q_k from S to E:

  q_0 q_1 q_2 q_3 q_4   p(Q|M) = Π_n p(q_n|q_{n-1})   p(X|Q,M) = Π_n p(x_n|q_n)   p(X,Q|M)
  S   A   A   A   E     .9 × .7 × .7 × .1 = 0.0441    2.5 × 0.2 × 0.1 = 0.05      0.0022
  S   A   A   B   E     .9 × .7 × .2 × .2 = 0.0252    2.5 × 0.2 × 2.3 = 1.15      0.0290
  S   A   B   B   E     .9 × .2 × .8 × .2 = 0.0288    2.5 × 2.2 × 2.3 = 12.65     0.3643
  S   B   B   B   E     .1 × .8 × .8 × .2 = 0.0128    0.1 × 2.2 × 2.3 = 0.506     0.0065
                                  Σ = 0.1109                        Σ = p(X|M) = 0.4020
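The table can be reproduced by brute-force enumeration; the transition and emission values below are read off the diagram and table above (paths with a disallowed transition simply contribute zero):

```python
from itertools import product

# Transition probabilities from the S/A/B/E example above
a = {('S', 'A'): 0.9, ('S', 'B'): 0.1,
     ('A', 'A'): 0.7, ('A', 'B'): 0.2, ('A', 'E'): 0.1,
     ('B', 'B'): 0.8, ('B', 'E'): 0.2}
# Emission likelihoods b_q(x_n) for the three observations, as in the table
b = {'A': [2.5, 0.2, 0.1], 'B': [0.1, 2.2, 2.3]}

p_X = 0.0
for q in product('AB', repeat=3):           # every 3-emission path S..E
    path = ('S',) + q + ('E',)
    pQ = 1.0                                # p(Q|M): product of transitions
    for prev, cur in zip(path, path[1:]):
        pQ *= a.get((prev, cur), 0.0)
    pXQ = 1.0                               # p(X|Q,M): product of emissions
    for n, state in enumerate(q):
        pXQ *= b[state][n]
    p_X += pQ * pXQ                         # accumulate p(X,Q|M)

print(round(p_X, 4))                        # → 0.402, matching the table's p(X|M)
```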
• Forward recursion:
  α_{n+1}(j) = [ Σ_i α_n(i) · a_ij ] · b_j(x_{n+1})
[Figure: trellis over time steps n showing each α_n(i) feeding α_{n+1}(j) via a_ij]
• Then total probability p(X_1^N | M) = Σ_{i=1}^S α_N(i)
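The same total arrives via the forward recursion without enumerating paths; the model values are those of the S/A/B/E example (a sketch, with the exit transitions into E folded in at the end):

```python
import numpy as np

a = np.array([[0.7, 0.2],        # A→A, A→B
              [0.0, 0.8]])       # B→A, B→B
start = np.array([0.9, 0.1])     # S→A, S→B
exits = np.array([0.1, 0.2])     # A→E, B→E
b = np.array([[2.5, 0.2, 0.1],   # b_A(x_1), b_A(x_2), b_A(x_3)
              [0.1, 2.2, 2.3]])  # b_B(x_1), b_B(x_2), b_B(x_3)

alpha = start * b[:, 0]          # α_1(i)
for n in range(1, 3):
    alpha = (alpha @ a) * b[:, n]   # α_{n+1}(j) = [Σ_i α_n(i)·a_ij]·b_j(x_{n+1})
p_X = alpha @ exits              # sum over final states into E
print(round(p_X, 4))             # → 0.402
```

The work is linear in the number of time steps rather than exponential in path count.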
[Figure: observation sequence x_n (time steps 0–30) scored against each model as p(X|M_1)·p(M_1), p(X|M_2)·p(M_2), ...; choose the best.
 Viterbi labels: AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]
[Diagram: input scored against each word model — M_2 ("t uw"): p(X|M_2)·p(M_2) = ..., M_3 ("th r iy"): p(X|M_3)·p(M_3) = ...]
• Continuous speech
- Viterbi decoding of one large HMM gives words
[Diagram: one large HMM — word models "w ah n", "t uw", "th r iy" entered with priors p(M_1), p(M_2), p(M_3) and linked through silence (sil)]
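A minimal Viterbi sketch (the forward recursion with max in place of the sum, in the log domain; the model values reuse the earlier S/A/B/E example, and the exit probabilities are omitted for brevity):

```python
import numpy as np

def viterbi(start, a, b_like):
    """Most likely state path; same recursion shape as the forward pass
    with max replacing the sum. b_like[i, n] = b_i(x_n)."""
    S, N = b_like.shape
    with np.errstate(divide='ignore'):          # log(0) → -inf for missing arcs
        log_a = np.log(a)
        delta = np.log(start) + np.log(b_like[:, 0])
        back = np.zeros((N, S), dtype=int)
        for n in range(1, N):
            scores = delta[:, None] + log_a     # score of arriving i → j
            back[n] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + np.log(b_like[:, n])
    q = [int(delta.argmax())]                   # best final state
    for n in range(N - 1, 0, -1):               # trace the back-pointers
        q.append(int(back[n][q[-1]]))
    return q[::-1]

a = np.array([[0.7, 0.2], [0.0, 0.8]])
start = np.array([0.9, 0.1])
b_like = np.array([[2.5, 0.2, 0.1], [0.1, 2.2, 2.3]])
print(viterbi(start, a, b_like))   # → [0, 1, 1], i.e. A B B as in the table
```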
- emission distributions:
[Figure: per-state densities p(x|q) over x (0–4)]
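One common concrete choice for b_i(x) is a Gaussian per state; the per-state means and variances below are hypothetical, for illustration only:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2), a typical emission model."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical per-state (mean, stddev) parameters
params = {'k': (1.0, 0.5), 'a': (2.0, 0.5), 't': (3.0, 0.5)}

x = 2.1
likelihoods = {q: gauss_pdf(x, mu, sg) for q, (mu, sg) in params.items()}
print(max(likelihoods, key=likelihoods.get))   # prints "a": x lies nearest its mean
```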
• Model M_2: log p(X|M) = -47.0
[Figure: Viterbi state alignment Q* over time steps n = 0–18, with per-step log transition probabilities and log observation likelihoods;
 log p(X,Q*|M) = -47.5, log p(X|Q*,M) = -26.0]
• Model M'_1: log p(X|M) = -32.2
[Figure: state alignment with log p(X,Q*|M) = -33.6, log p(Q*|M) = -7.6, and per-step log transition probabilities]
- independence of successive x_n's:
  p(X | Q) = ∏_n p(x_n | q_n) ?
[Figure: each observation x_n drawn independently from p(x_n | q_i)]
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model)]