
EE E6820: Speech & Audio Processing & Recognition

Lecture 10:
ASR: Sequence Recognition

1 Signal template matching

2 Statistical sequence recognition

3 Acoustic modeling

4 The Hidden Markov Model (HMM)

Dan Ellis <dpwe@ee.columbia.edu>


http://www.ee.columbia.edu/~dpwe/e6820/



1 Signal template matching
• Framewise comparison of unknown word and
stored templates:
[Figure: frame-by-frame distance matrix between the test utterance (horizontal axis, time in frames) and stored reference templates ONE, TWO, THREE, FOUR, FIVE stacked on the vertical axis]

- distance metric?
- comparison between templates?
- constraints?
Dynamic Time Warp (DTW)
• Find lowest-cost constrained path:
- matrix d(i,j) of distances between input frame fi
and reference frame rj
- allowable predecessors & transition costs Txy
Lowest cost to (i,j):

  D(i,j) = d(i,j) + min{ D(i-1,j) + T10,  D(i,j-1) + T01,  D(i-1,j-1) + T11 }

[Figure: local portion of the DTW grid, reference frames rj (vertical) vs. input frames fi (horizontal), showing the local match cost d(i,j), the allowable predecessors with their transition costs, and the best predecessor (including transition cost) stored for traceback]

• Best path via traceback from final state


- have to store predecessors for (almost) every (i,j)
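As a concrete illustration of the recursion above, here is a minimal Python sketch of DTW with traceback; the Euclidean frame distance and the transition-cost values are illustrative assumptions, not choices made in the lecture.

```python
import numpy as np

def dtw(test, ref, T10=1.0, T01=1.0, T11=0.0):
    """Lowest-cost alignment of test frames f_i to reference frames r_j.

    test: (N, d) array of input feature frames
    ref:  (M, d) array of reference template frames
    T10, T01, T11: transition costs (illustrative values only)
    Returns the total cost and the best (i, j) path from traceback.
    """
    N, M = len(test), len(ref)
    # local match cost d(i, j) -- Euclidean distance, one common (assumed) choice
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=-1)

    D = np.full((N, M), np.inf)             # lowest cost to reach (i, j)
    back = np.zeros((N, M, 2), dtype=int)   # best predecessor for traceback
    D[0, 0] = d[0, 0]
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            # allowable predecessors and their transition costs
            cands = []
            if i > 0:            cands.append((D[i-1, j] + T10, (i-1, j)))
            if j > 0:            cands.append((D[i, j-1] + T01, (i, j-1)))
            if i > 0 and j > 0:  cands.append((D[i-1, j-1] + T11, (i-1, j-1)))
            best, pred = min(cands, key=lambda c: c[0])
            D[i, j] = d[i, j] + best
            back[i, j] = pred

    # traceback from the final state
    path, (i, j) = [], (N - 1, M - 1)
    while (i, j) != (0, 0):
        path.append((i, j))
        i, j = back[i, j]
    path.append((0, 0))
    return D[-1, -1], path[::-1]
```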



DTW-based recognition
• Reference templates for each possible word
• Isolated word:
- mark endpoints of input word
- calculate scores through each template (+prune)
- choose best
• Continuous speech
- one matrix of template slices;
special-case constraints at word ends
[Figure: continuous-speech DTW grid with reference templates ONE, TWO, THREE, FOUR stacked on the vertical axis against input frames on the horizontal axis, with transitions allowed between word ends and word starts]
DTW-based recognition (2)
+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost
- Distance metric?
- pseudo-Euclidean space?
- Warp penalties?
- How to choose templates?
- several templates per word?
- choose ‘most representative’?
- align and average?
→ need a rigorous foundation...



Outline
1 Signal template matching

2 Statistical sequence recognition


- state-based modeling

3 Acoustic modeling

4 The Hidden Markov Model (HMM)



2 Statistical sequence recognition
• DTW limited because it’s hard to optimize
- interpretation of distance, transition costs?
• Need a theoretical foundation: Probability
• Formulate as MAP choice among models:
  M* = argmax_{Mj} p(Mj | X, Θ)
- X = observed features
- Mj = word-sequence models
- Θ = all current parameters



Statistical formulation (2)
• Can rearrange via Bayes’ rule (& drop p(X) ):
  M* = argmax_{Mj} p(Mj | X, Θ)
     = argmax_{Mj} p(X | Mj, ΘA) · p(Mj | ΘL)
Mj
- p(X | Mj) = likelihood of the observations under the model
- p(Mj) = prior probability of model
- ΘA = acoustics-related model parameters
- ΘL = language-related model parameters

• Questions:
- what form of model to use for p(X | Mj, ΘA)?
- how to find ΘA (training)?
- how to solve for Mj (decoding)?



State-based modeling
• Assume discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:
Model Mj:
  states Qk :        q1  q2  q3  q4  q5  q6  ...   (one state per time frame)
  observations X_1^N : x1  x2  x3  x4  x5  x6  ...  (observed feature vectors)

• Probability of observations given model is:

  p(X | Mj) = Σ_{all Qk} p(X_1^N | Qk, Mj) · p(Qk | Mj)

- sum over all possible state sequences Qk

• How do observations depend on states?
• How do state sequences depend on the model?



The speech recognition chain
• After frame-level classification, we still have the problem of
recognizing the sequence of frames:

sound → Feature calculation → feature vectors → Acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling

• Questions
- what to use for the acoustic classifier?
- how to represent ‘model’ sequences?
- how to score matches?
Outline
1 Signal template matching

2 Statistical sequence recognition

3 Acoustic modeling
- defining targets
- neural networks & Gaussian models

4 The Hidden Markov Model (HMM)



3 Acoustic Modeling
• Goal: Convert features into probabilities of
particular labels:
i.e. find p(q^i | X_n) over some state set {q^i}

- conventional statistical classification problem


• Classifier construction is data-driven
- assume we can get examples of known good Xs
for each of the qis
- calculate model parameters by standard training
scheme
• Various classifiers can be used
- GMMs model distribution under each state
- Neural Nets directly estimate posteriors
• Different classifiers have different properties
- features, labels limit ultimate performance



Defining classifier targets
• Choice of {qi} can make a big difference
- must support recognition task
- must be a practical classification task
• Hand-labeling is one source...
- ‘experts’ mark spectrogram boundaries
• ...Forced alignment is another
- ‘best guess’ with existing classifiers, given words
• Result is targets for each training frame:

[Figure: feature vectors over time with per-frame training targets, e.g. the phone labels g, w, eh, n]



Forced alignment
• Best labeling given existing classifier
constrained by known word sequence

[Figure: forced-alignment loop — feature vectors → existing classifier → phone posterior probabilities (ow th r iy n s); known word sequence + dictionary (ow th r iy ...) → constrained alignment → training targets (ow th r iy) → classifier training]



Gaussian Mixture Models vs. Neural Nets
• GMMs fit distribution of features under states:
- separate ‘likelihood’ model for each state qi
  p(x | q^k) = 1 / ((2π)^{d/2} |Σk|^{1/2}) · exp( -1/2 (x - µk)^T Σk^{-1} (x - µk) )
- match any distribution given enough data
• Neural nets estimate posteriors directly
  p(q^k | x) = F[ Σ_j w_jk · F( Σ_i w_ij · x_i ) ]
- parameters set to discriminate classes
• Posteriors & likelihoods related by Bayes’ rule:
  p(q^k | x) = p(x | q^k) · Pr(q^k) / Σ_j p(x | q^j) · Pr(q^j)
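As a small numerical sketch of the relationship above, the snippet below computes single-Gaussian state likelihoods with the density formula and converts them to posteriors via Bayes' rule; the means, covariances, priors and the test vector are made-up illustration values, not data from the lecture.

```python
import numpy as np

def gaussian_likelihood(x, mu, cov):
    """p(x | q^k) for a single full-covariance Gaussian (one component)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Illustrative 2-D models for three states (assumed values, not from the lecture)
states = ['A', 'B', 'C']
mus    = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([4.0, -1.0])]
covs   = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]
priors = np.array([0.5, 0.3, 0.2])          # Pr(q^k)

x = np.array([1.5, 0.5])                    # one observed feature vector
likes = np.array([gaussian_likelihood(x, m, c) for m, c in zip(mus, covs)])

# Bayes' rule: p(q^k | x) = p(x | q^k) Pr(q^k) / sum_j p(x | q^j) Pr(q^j)
posteriors = likes * priors / np.sum(likes * priors)
print(dict(zip(states, posteriors)))        # posteriors sum to 1
```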



Outline
1 Signal template matching

2 Statistical sequence recognition

3 Acoustic modeling

4 The Hidden Markov Model (HMM)


- generative Markov models
- hidden Markov models
- model fit likelihood
- HMM examples



4 Markov models
• A (first order) Markov model
is a finite-state system
whose behavior depends
only on the current state
• E.g. generative Markov model:
  Transition probabilities p(q_{n+1} | q_n):
           S    A    B    C    E
    S      0    1    0    0    0
    A      0   .8   .1   .1    0
    B      0   .1   .8   .1    0
    C      0   .1   .1   .7   .1
    E      0    0    0    0    1
  Example generated state sequence: SAAAAAAAABBBBBBBBBCCCCBBBBBBCE
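A short sketch that samples state sequences from this generative model using the transition table above; each run gives a different string, of the same flavour as the SAAAA...E example.

```python
import numpy as np

states = ['S', 'A', 'B', 'C', 'E']
# p(q_{n+1} | q_n): rows = current state, columns = next state (from the slide)
P = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # S
    [0.0, 0.8, 0.1, 0.1, 0.0],   # A
    [0.0, 0.1, 0.8, 0.1, 0.0],   # B
    [0.0, 0.1, 0.1, 0.7, 0.1],   # C
    [0.0, 0.0, 0.0, 0.0, 1.0],   # E (absorbing end state)
])

def sample_sequence(rng=np.random.default_rng()):
    seq, q = ['S'], 0
    while states[q] != 'E':
        q = rng.choice(len(states), p=P[q])   # draw the next state
        seq.append(states[q])
    return ''.join(seq)

print(sample_sequence())   # e.g. 'SAAAABBBBCCCBBE'
```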



Hidden Markov models
• Markov models where state sequence Q = {qn}
is not directly observable (= ‘hidden’)
• But, observations X do depend on Q:
- xn is a random variable that depends on the current state: p(x | q)
[Figure: hidden state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC, per-state emission distributions p(x | q) for q = A, B, C, and the resulting observation sequence xn over time steps n]
- can still tell something about state seq...
(Generative) Markov models (2)
• HMM is specified by:
- transition probabilities p(q_n = q^j | q_{n-1} = q^i) ≡ a_ij
- (initial state probabilities p(q_1 = q^i) ≡ π_i)
- emission distributions p(x | q^i) ≡ b_i(x)
• Example:
- states q^i: • k a t •
- transition probabilities a_ij:
            k     a     t     •
      •    1.0   0.0   0.0   0.0
      k    0.9   0.1   0.0   0.0
      a    0.0   0.9   0.1   0.0
      t    0.0   0.0   0.9   0.1
- emission distributions b_i(x)
  [Figure: example emission distributions p(x | q) over x for each state]



Markov models for speech
• Speech models Mj
- typically left-to-right HMMs (sequence constraint)
- observation & evolution are conditionally
independent of rest given (hidden) state qn

[Figure: left-to-right phone model S → ae1 → ae2 → ae3 → E with self-loops, and the corresponding graphical model of hidden states q1..q5 each emitting an observation x1..x5]

- self-loops for time dilation



Markov models for sequence recognition
• Independence of observations:
- observation xn depends only on the current state qn:
  p(X | Q) = p(x1, x2, ..., xN | q1, q2, ..., qN)
           = p(x1 | q1) · p(x2 | q2) · ... · p(xN | qN)
           = ∏_{n=1}^{N} p(xn | qn) = ∏_{n=1}^{N} b_{qn}(xn)

• Markov transitions:
- the transition to the next state q_{n+1} depends only on q_n:
  p(Q | M) = p(q1, q2, ..., qN | M)
           = p(qN | q1...q_{N-1}) · p(q_{N-1} | q1...q_{N-2}) · ... · p(q2 | q1) · p(q1)
           = p(qN | q_{N-1}) · p(q_{N-1} | q_{N-2}) · ... · p(q2 | q1) · p(q1)
           = p(q1) ∏_{n=2}^{N} p(qn | q_{n-1}) = π_{q1} ∏_{n=2}^{N} a_{q_{n-1} q_n}



Model fit calculation
• From ‘state-based modeling’:


  p(X | Mj) = Σ_{all Qk} p(X_1^N | Qk, Mj) · p(Qk | Mj)

• For HMMs:
  p(X | Q) = ∏_{n=1}^{N} b_{qn}(xn)

  p(Q | M) = π_{q1} · ∏_{n=2}^{N} a_{q_{n-1} q_n}

• Hence, solve for M* :


- calculate p(X | Mj) for each available model,
  scale by prior p(Mj) → p(Mj | X)

• Sum over all Qk ???



Summing over all paths
Model M1: states S, A, B, E
  Transition probabilities p(q_{n+1} | q_n) ("•" = not allowed):
           S     A     B     E
    S      •    0.9   0.1    •
    A      •    0.7   0.2   0.1
    B      •     •    0.8   0.2
    E      •     •     •     1
  Observation likelihoods for the observations x1, x2, x3:
             x1    x2    x3
    p(x|A)   2.5   0.2   0.1
    p(x|B)   0.1   2.2   2.3
[Figure: model diagram for M1, emission distributions p(x|A) and p(x|B), and the trellis of possible paths over time steps n = 0..4]

All possible 3-emission paths Qk from S to E:
  q0 q1 q2 q3 q4   p(Q | M) = Πn p(qn|qn-1)      p(X | Q,M) = Πn p(xn|qn)    p(X,Q | M)
  S  A  A  A  E    .9 × .7 × .7 × .1 = 0.0441    2.5 × 0.2 × 0.1 = 0.05      0.0022
  S  A  A  B  E    .9 × .7 × .2 × .2 = 0.0252    2.5 × 0.2 × 2.3 = 1.15      0.0290
  S  A  B  B  E    .9 × .2 × .8 × .2 = 0.0288    2.5 × 2.2 × 2.3 = 12.65     0.3643
  S  B  B  B  E    .1 × .8 × .8 × .2 = 0.0128    0.1 × 2.2 × 2.3 = 0.506     0.0065
                                    Σ = 0.1109                Σ = p(X | M) = 0.4020
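A brute-force sketch that enumerates every length-3 emitting path for this toy model and reproduces the table: 0.3643 for the best path and p(X | M) ≈ 0.4020 in total.

```python
from itertools import product

# Transition probabilities p(q_{n+1} | q_n) for model M1 (from the slide)
trans = {('S','A'): 0.9, ('S','B'): 0.1,
         ('A','A'): 0.7, ('A','B'): 0.2, ('A','E'): 0.1,
         ('B','B'): 0.8, ('B','E'): 0.2}
# Observation likelihoods p(x_n | q) for x1, x2, x3
emit = {'A': [2.5, 0.2, 0.1], 'B': [0.1, 2.2, 2.3]}

total = 0.0
for q1, q2, q3 in product('AB', repeat=3):
    path = ['S', q1, q2, q3, 'E']
    p_Q = 1.0
    for a, b in zip(path[:-1], path[1:]):
        p_Q *= trans.get((a, b), 0.0)     # impossible transitions contribute 0
    p_X_given_Q = emit[q1][0] * emit[q2][1] * emit[q3][2]
    joint = p_Q * p_X_given_Q
    if joint > 0:
        print(''.join(path), round(p_Q, 4), round(p_X_given_Q, 4), round(joint, 4))
    total += joint

print('p(X | M) =', round(total, 4))      # 0.402, matching the table
```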



The ‘forward recursion’
• Dynamic-programming-like technique to
calculate sum over all Qk

• Define αn(i) as the probability of getting to state q^i at time step n (by any path):

  αn(i) = p(x1, x2, ..., xn, qn = q^i) ≡ p(X_1^n, qn^i)

• Then αn+1(j) can be calculated recursively:

  αn+1(j) = [ Σ_{i=1}^{S} αn(i) · aij ] · bj(xn+1)

[Figure: trellis of model states q^i against time steps n, showing αn(i) combined through the transition probabilities aij and scaled by the emission likelihood bj(xn+1)]
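A small Python sketch of the forward recursion, applied to the same toy model as the path-summing slide; the non-emitting S and E states are folded into entry and exit probability vectors (an implementation choice, not spelled out on the slide). It recovers p(X | M) ≈ 0.4020 without enumerating paths.

```python
import numpy as np

# Toy model from the path-summing example: emitting states A and B
init   = np.array([0.9, 0.1])                 # p(first emitting state | S)
A      = np.array([[0.7, 0.2],                # a_ij = p(q_{n+1}=j | q_n=i)
                   [0.0, 0.8]])
exit_p = np.array([0.1, 0.2])                 # p(E | q_N)
B      = np.array([[2.5, 0.2, 0.1],           # b_i(x_n) observation likelihoods
                   [0.1, 2.2, 2.3]])

def forward(init, A, B, exit_p):
    N = B.shape[1]
    alpha = init * B[:, 0]                    # α_1(i) = π_i · b_i(x_1)
    for n in range(1, N):
        alpha = (alpha @ A) * B[:, n]         # α_{n+1}(j) = [Σ_i α_n(i) a_ij] · b_j(x_{n+1})
    return np.sum(alpha * exit_p)             # total probability p(X | M)

print(forward(init, A, B, exit_p))            # ≈ 0.4020, as in the full path sum
```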



Forward recursion (2)
• Initialize α1(i) = πi · bi(x1)

• Then total probability p(X_1^N | M) = Σ_{i=1}^{S} αN(i)

→ Practical way to solve for p(X | Mj) and hence


perform recognition

[Figure: observations X scored against each model in parallel; compute p(X | M1)·p(M1), p(X | M2)·p(M2), ... and choose the best]


Optimal path
• May be interested in actual qn assignments
- which state was ‘active’ at each time frame
- e.g. phone labelling (for training?)
• Total probability is over all paths...
• ... but can also solve for single best path
= “Viterbi” state sequence
• Probability along the best path to state q^j at time n+1:

  α*_{n+1}(j) = max_i { α*_n(i) · aij } · bj(xn+1)
- backtrack from final state to get best path
- final probability is product only (no sum)
→ log-domain calculation just summation
• Total probability often dominated by best path:
  p(X, Q* | M) ≈ p(X | M)
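A companion sketch of the Viterbi recursion in the log domain for the same toy model; it returns the best-path score (≈ 0.3643, the S A B B E path from the earlier table) and the state labels recovered by traceback.

```python
import numpy as np

def viterbi(init, A, B, exit_p):
    """Best state path. init/exit_p: entry/exit probs; A: transitions; B[i, n] = b_i(x_n)."""
    with np.errstate(divide='ignore'):        # log(0) = -inf for forbidden transitions
        logA, logB = np.log(A), np.log(B)
        log_init, log_exit = np.log(init), np.log(exit_p)
    S, N = B.shape
    delta = log_init + logB[:, 0]             # log α*_1(i)
    back = np.zeros((N, S), dtype=int)        # best predecessor per state and frame
    for n in range(1, N):
        cand = delta[:, None] + logA          # cand[i, j] = log α*_n(i) + log a_ij
        back[n] = np.argmax(cand, axis=0)
        delta = np.max(cand, axis=0) + logB[:, n]
    delta = delta + log_exit
    # traceback from the best final state
    q = [int(np.argmax(delta))]
    for n in range(N - 1, 0, -1):
        q.append(int(back[n, q[-1]]))
    return np.max(delta), q[::-1]

# Same toy model as the forward-recursion sketch (states A=0, B=1)
init, exit_p = np.array([0.9, 0.1]), np.array([0.1, 0.2])
A = np.array([[0.7, 0.2], [0.0, 0.8]])
B = np.array([[2.5, 0.2, 0.1], [0.1, 2.2, 2.3]])
score, path = viterbi(init, A, B, exit_p)
print(np.exp(score), path)                    # ≈ 0.3643, path [0, 1, 1] i.e. A B B
```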
Interpreting the Viterbi path
• Viterbi path assigns each xn to a state qi
- performing classification based on bi(x)
- ... at the same time as applying transition
constraints aij
[Figure: observation sequence xn over time steps n with the inferred classification; Viterbi labels: AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]

• Can be used for segmentation


- train an HMM with ‘garbage’ and ‘target’ states
- decode on new data to find ‘targets’, boundaries
• Can use for (heuristic) training
- e.g. train classifiers based on labels...



Recognition with HMMs
• Isolated word
- choose best p(M | X) ∝ p(X | M) · p(M)
  Model M1 (w ah n):   p(X | M1)·p(M1) = ...
  Model M2 (t uw):     p(X | M2)·p(M2) = ...
  Model M3 (th r iy):  p(X | M3)·p(M3) = ...
  (the input is scored against each model in turn)

• Continuous speech
- Viterbi decoding of one large HMM gives words

[Figure: one large HMM formed from the word models (w ah n), (t uw), (th r iy) in parallel, joined through a silence (sil) state, with language-model priors p(M1), p(M2), p(M3) on the word entries]



HMM example:
Different state sequences

Model M1: S → K → A → T → E, with self-loop probabilities 0.8 (K), 0.9 (A), 0.8 (T) and corresponding exit probabilities 0.2, 0.1, 0.2

Model M2: S → K → O → T → E, with the same transition structure

[Figure: emission distributions p(x | q) for states K, O, A, T over observation value x]


Model inference:
Emission probabilities
[Figure: observation sequence xn over time steps n = 0..18 (values near the K, O/A, and T emission means), with each model's Viterbi state alignment, per-frame log transition probabilities, and per-frame log observation likelihoods]

Model M1: log p(X | M) = -32.1,  log p(X, Q* | M) = -33.5,
          log p(Q* | M) = -7.5,  log p(X | Q*, M) = -26.0
Model M2: log p(X | M) = -47.0,  log p(X, Q* | M) = -47.5,
          log p(Q* | M) = -8.3,  log p(X | Q*, M) = -39.2


Model inference:
Transition probabilities
Models M'1 and M'2 share the same states (S, K, A, O, T, E) and emission distributions but differ in transition probabilities: both allow K → A or K → O before T, with M'1 favouring A (K→A = 0.18, K→O = 0.02) and M'2 favouring O (K→A = 0.05, K→O = 0.15).

[Figure: shared Viterbi state alignment and per-frame log observation likelihood, log p(X | Q*, M) = -26.0, over time steps n = 0..18]

Model M'1: log p(X | M) = -32.2,  log p(X, Q* | M) = -33.6,  log p(Q* | M) = -7.6
Model M'2: log p(X | M) = -33.5,  log p(X, Q* | M) = -34.9,  log p(Q* | M) = -8.9


Validity of HMM assumptions
• Key assumption is conditional independence:
Given qi, future evolution & obs. distribution
are independent of previous events
- duration behavior: a self-loop with probability γ implies an exponentially decaying (geometric) duration distribution, p(N = n) = γ^(n-1)·(1 − γ)
- independence of successive xn's:

  p(X) = ∏_n p(xn | q^i) ?

[Figure: emission distribution p(xn | q^i) and an example trajectory of xn over time steps n]


Recap: Recognizer Structure

sound → Feature calculation → feature vectors → Acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling

• Know how to execute each stage
• ... training HMMs?
• ... language/word models?



Summary
• Speech is modeled as a sequence of features
- need temporal aspect to recognition
- best time-alignment of templates = DTW
• Hidden Markov models are rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations

