
EE E6820: Speech & Audio Processing & Recognition

Lecture 10:
ASR: Sequence Recognition

1 Signal template matching

2 Statistical sequence recognition

3 Acoustic modeling

4 The Hidden Markov Model (HMM)

Dan Ellis <dpwe@ee.columbia.edu>


http://www.ee.columbia.edu/~dpwe/e6820/



1 Signal template matching
• Framewise comparison of unknown word and
stored templates:
[Figure: frame-by-frame distance matrix between the test utterance (horizontal axis, time in frames) and stored reference templates ONE, TWO, THREE, FOUR, FIVE stacked on the vertical axis]

- distance metric?
- comparison between templates?
- constraints?
Dynamic Time Warp (DTW)
• Find lowest-cost constrained path:
- matrix d(i,j) of distances between input frame fi
and reference frame rj
- allowable predecessors & transition costs Txy
Lowest cost to (i,j):

  D(i,j) = d(i,j) + min{ D(i-1,j) + T10,  D(i,j-1) + T01,  D(i-1,j-1) + T11 }

[Figure: local portion of the DTW grid, reference frames rj (vertical) vs. input frames fi (horizontal), showing the local match cost d(i,j), the allowable predecessors with their transition costs, and the best predecessor (including transition cost) stored for traceback]

• Best path via traceback from final state


- have to store predecessors for (almost) every (i,j)
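As a concrete illustration of the recursion above, here is a minimal Python sketch of DTW with traceback; the Euclidean frame distance and the transition-cost values are illustrative assumptions, not choices made in the lecture.

```python
import numpy as np

def dtw(test, ref, T10=1.0, T01=1.0, T11=0.0):
    """Lowest-cost alignment of test frames f_i to reference frames r_j.

    test: (N, d) array of input feature frames
    ref:  (M, d) array of reference template frames
    T10, T01, T11: transition costs (illustrative values only)
    Returns the total cost and the best (i, j) path from traceback.
    """
    N, M = len(test), len(ref)
    # local match cost d(i, j) -- Euclidean distance, one common (assumed) choice
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=-1)

    D = np.full((N, M), np.inf)             # lowest cost to reach (i, j)
    back = np.zeros((N, M, 2), dtype=int)   # best predecessor for traceback
    D[0, 0] = d[0, 0]
    for i in range(N):
        for j in range(M):
            if i == 0 and j == 0:
                continue
            # allowable predecessors and their transition costs
            cands = []
            if i > 0:            cands.append((D[i-1, j] + T10, (i-1, j)))
            if j > 0:            cands.append((D[i, j-1] + T01, (i, j-1)))
            if i > 0 and j > 0:  cands.append((D[i-1, j-1] + T11, (i-1, j-1)))
            best, pred = min(cands, key=lambda c: c[0])
            D[i, j] = d[i, j] + best
            back[i, j] = pred

    # traceback from the final state
    path, (i, j) = [], (N - 1, M - 1)
    while (i, j) != (0, 0):
        path.append((i, j))
        i, j = back[i, j]
    path.append((0, 0))
    return D[-1, -1], path[::-1]
```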



DTW-based recognition
• Reference templates for each possible word
• Isolated word:
- mark endpoints of input word
- calculate scores through each template (+prune)
- choose best
• Continuous speech
- one matrix of template slices;
special-case constraints at word ends
[Figure: continuous-speech DTW grid with reference templates ONE, TWO, THREE, FOUR stacked on the vertical axis against input frames on the horizontal axis, with transitions allowed between word ends and word starts]
DTW-based recognition (2)
+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost
- Distance metric?
- pseudo-Euclidean space?
- Warp penalties?
- How to choose templates?
- several templates per word?
- choose ‘most representative’?
- align and average?
→ need a rigorous foundation...



Outline
1 Signal template matching

2 Statistical sequence recognition


- state-based modeling

3 Acoustic modeling

4 The Hidden Markov Model (HMM)



2 Statistical sequence recognition
• DTW limited because it’s hard to optimize
- interpretation of distance, transition costs?
• Need a theoretical foundation: Probability
• Formulate as MAP choice among models:
  M* = argmax_{Mj} p(Mj | X, Θ)
- X = observed features
- Mj = word-sequence models
- Θ = all current parameters



Statistical formulation (2)
• Can rearrange via Bayes’ rule (& drop p(X) ):
  M* = argmax_{Mj} p(Mj | X, Θ)
     = argmax_{Mj} p(X | Mj, ΘA) · p(Mj | ΘL)
Mj
- p(X | Mj) = likelihood of the observations under the model
- p(Mj) = prior probability of model
- ΘA = acoustics-related model parameters
- ΘL = language-related model parameters

• Questions:
- what form of model to use for p(X | Mj, ΘA)?
- how to find ΘA (training)?
- how to solve for Mj (decoding)?



State-based modeling
• Assume discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:
Model Mj:
  states Qk :        q1  q2  q3  q4  q5  q6  ...   (one state per time frame)
  observations X_1^N : x1  x2  x3  x4  x5  x6  ...  (observed feature vectors)

• Probability of observations given model is:

  p(X | Mj) = Σ_{all Qk} p(X_1^N | Qk, Mj) · p(Qk | Mj)

- sum over all possible state sequences Qk

• How do observations depend on states?
• How do state sequences depend on the model?



The speech recognition chain
• After frame-level classification, we still have the problem of
recognizing the sequence of frames:

sound → Feature calculation → feature vectors → Acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling

• Questions
- what to use for the acoustic classifier?
- how to represent ‘model’ sequences?
- how to score matches?
Outline
1 Signal template matching

2 Statistical sequence recognition

3 Acoustic modeling
- defining targets
- neural networks & Gaussian models

4 The Hidden Markov Model (HMM)



3 Acoustic Modeling
• Goal: Convert features into probabilities of
particular labels:
i.e. find p(q^i | X_n) over some state set {q^i}

- conventional statistical classification problem


• Classifier construction is data-driven
- assume we can get examples of known good Xs
for each of the qis
- calculate model parameters by standard training
scheme
• Various classifiers can be used
- GMMs model distribution under each state
- Neural Nets directly estimate posteriors
• Different classifiers have different properties
- features, labels limit ultimate performance



Defining classifier targets
• Choice of {qi} can make a big difference
- must support recognition task
- must be a practical classification task
• Hand-labeling is one source...
- ‘experts’ mark spectrogram boundaries
• ...Forced alignment is another
- ‘best guess’ with existing classifiers, given words
• Result is targets for each training frame:

[Figure: feature vectors over time with per-frame training targets, e.g. the phone labels g, w, eh, n]



Forced alignment
• Best labeling given existing classifier
constrained by known word sequence

[Figure: forced-alignment loop — feature vectors → existing classifier → phone posterior probabilities (ow th r iy n s); known word sequence + dictionary (ow th r iy ...) → constrained alignment → training targets (ow th r iy) → classifier training]



Gaussian Mixture Models vs. Neural Nets
• GMMs fit distribution of features under states:
- separate ‘likelihood’ model for each state qi
  p(x | q^k) = 1 / ((2π)^{d/2} |Σk|^{1/2}) · exp( -1/2 (x - µk)^T Σk^{-1} (x - µk) )
- match any distribution given enough data
• Neural nets estimate posteriors directly
  p(q^k | x) = F[ Σ_j w_jk · F( Σ_i w_ij · x_i ) ]
- parameters set to discriminate classes
• Posteriors & likelihoods related by Bayes’ rule:
  p(q^k | x) = p(x | q^k) · Pr(q^k) / Σ_j p(x | q^j) · Pr(q^j)
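As a small numerical sketch of the relationship above, the snippet below computes single-Gaussian state likelihoods with the density formula and converts them to posteriors via Bayes' rule; the means, covariances, priors and the test vector are made-up illustration values, not data from the lecture.

```python
import numpy as np

def gaussian_likelihood(x, mu, cov):
    """p(x | q^k) for a single full-covariance Gaussian (one component)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Illustrative 2-D models for three states (assumed values, not from the lecture)
states = ['A', 'B', 'C']
mus    = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([4.0, -1.0])]
covs   = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]
priors = np.array([0.5, 0.3, 0.2])          # Pr(q^k)

x = np.array([1.5, 0.5])                    # one observed feature vector
likes = np.array([gaussian_likelihood(x, m, c) for m, c in zip(mus, covs)])

# Bayes' rule: p(q^k | x) = p(x | q^k) Pr(q^k) / sum_j p(x | q^j) Pr(q^j)
posteriors = likes * priors / np.sum(likes * priors)
print(dict(zip(states, posteriors)))        # posteriors sum to 1
```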



Outline
1 Signal template matching

2 Statistical sequence recognition

3 Acoustic modeling

4 The Hidden Markov Model (HMM)


- generative Markov models
- hidden Markov models
- model fit likelihood
- HMM examples



4 Markov models
• A (first order) Markov model
is a finite-state system
whose behavior depends
only on the current state
• E.g. generative Markov model:
  Transition probabilities p(q_{n+1} | q_n):
           S    A    B    C    E
    S      0    1    0    0    0
    A      0   .8   .1   .1    0
    B      0   .1   .8   .1    0
    C      0   .1   .1   .7   .1
    E      0    0    0    0    1
  Example generated state sequence: SAAAAAAAABBBBBBBBBCCCCBBBBBBCE
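A short sketch that samples state sequences from this generative model using the transition table above; each run gives a different string, of the same flavour as the SAAAA...E example.

```python
import numpy as np

states = ['S', 'A', 'B', 'C', 'E']
# p(q_{n+1} | q_n): rows = current state, columns = next state (from the slide)
P = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # S
    [0.0, 0.8, 0.1, 0.1, 0.0],   # A
    [0.0, 0.1, 0.8, 0.1, 0.0],   # B
    [0.0, 0.1, 0.1, 0.7, 0.1],   # C
    [0.0, 0.0, 0.0, 0.0, 1.0],   # E (absorbing end state)
])

def sample_sequence(rng=np.random.default_rng()):
    seq, q = ['S'], 0
    while states[q] != 'E':
        q = rng.choice(len(states), p=P[q])   # draw the next state
        seq.append(states[q])
    return ''.join(seq)

print(sample_sequence())   # e.g. 'SAAAABBBBCCCBBE'
```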



Hidden Markov models
• Markov models where state sequence Q = {qn}
is not directly observable (= ‘hidden’)
• But, observations X do depend on Q:
- xn is a random variable that depends on the current state: p(x | q)
[Figure: hidden state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC, per-state emission distributions p(x | q) for q = A, B, C, and the resulting observation sequence xn over time steps n]
- can still tell something about state seq...
(Generative) Markov models (2)
• HMM is specified by:
- transition probabilities p(q_n = q^j | q_{n-1} = q^i) ≡ a_ij
- (initial state probabilities p(q_1 = q^i) ≡ π_i)
- emission distributions p(x | q^i) ≡ b_i(x)
• Example:
- states q^i: • k a t •
- transition probabilities a_ij:
            k     a     t     •
      •    1.0   0.0   0.0   0.0
      k    0.9   0.1   0.0   0.0
      a    0.0   0.9   0.1   0.0
      t    0.0   0.0   0.9   0.1
- emission distributions b_i(x)
  [Figure: example emission distributions p(x | q) over x for each state]



Markov models for speech
• Speech models Mj
- typically left-to-right HMMs (sequence constraint)
- observation & evolution are conditionally
independent of rest given (hidden) state qn

[Figure: left-to-right phone model S → ae1 → ae2 → ae3 → E with self-loops, and the corresponding graphical model of hidden states q1..q5 each emitting an observation x1..x5]

- self-loops for time dilation



Markov models for sequence recognition
• Independence of observations:
- observation xn depends only on the current state qn:
  p(X | Q) = p(x1, x2, ..., xN | q1, q2, ..., qN)
           = p(x1 | q1) · p(x2 | q2) · ... · p(xN | qN)
           = ∏_{n=1}^{N} p(xn | qn) = ∏_{n=1}^{N} b_{qn}(xn)

• Markov transitions:
- the transition to the next state q_{n+1} depends only on q_n:
  p(Q | M) = p(q1, q2, ..., qN | M)
           = p(qN | q1...q_{N-1}) · p(q_{N-1} | q1...q_{N-2}) · ... · p(q2 | q1) · p(q1)
           = p(qN | q_{N-1}) · p(q_{N-1} | q_{N-2}) · ... · p(q2 | q1) · p(q1)
           = p(q1) ∏_{n=2}^{N} p(qn | q_{n-1}) = π_{q1} ∏_{n=2}^{N} a_{q_{n-1} q_n}



Model fit calculation
• From ‘state-based modeling’:


  p(X | Mj) = Σ_{all Qk} p(X_1^N | Qk, Mj) · p(Qk | Mj)

• For HMMs:
  p(X | Q) = ∏_{n=1}^{N} b_{qn}(xn)

  p(Q | M) = π_{q1} · ∏_{n=2}^{N} a_{q_{n-1} q_n}

• Hence, solve for M* :


- calculate p(X | Mj) for each available model,
  scale by prior p(Mj) → p(Mj | X)

• Sum over all Qk ???



Summing over all paths
Model M1: states S, A, B, E
  Transition probabilities p(q_{n+1} | q_n) ("•" = not allowed):
           S     A     B     E
    S      •    0.9   0.1    •
    A      •    0.7   0.2   0.1
    B      •     •    0.8   0.2
    E      •     •     •     1
  Observation likelihoods for the observations x1, x2, x3:
             x1    x2    x3
    p(x|A)   2.5   0.2   0.1
    p(x|B)   0.1   2.2   2.3
[Figure: model diagram for M1, emission distributions p(x|A) and p(x|B), and the trellis of possible paths over time steps n = 0..4]

All possible 3-emission paths Qk from S to E:
  q0 q1 q2 q3 q4   p(Q | M) = Πn p(qn|qn-1)      p(X | Q,M) = Πn p(xn|qn)    p(X,Q | M)
  S  A  A  A  E    .9 × .7 × .7 × .1 = 0.0441    2.5 × 0.2 × 0.1 = 0.05      0.0022
  S  A  A  B  E    .9 × .7 × .2 × .2 = 0.0252    2.5 × 0.2 × 2.3 = 1.15      0.0290
  S  A  B  B  E    .9 × .2 × .8 × .2 = 0.0288    2.5 × 2.2 × 2.3 = 12.65     0.3643
  S  B  B  B  E    .1 × .8 × .8 × .2 = 0.0128    0.1 × 2.2 × 2.3 = 0.506     0.0065
                                    Σ = 0.1109                Σ = p(X | M) = 0.4020
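A brute-force sketch that enumerates every length-3 emitting path for this toy model and reproduces the table: 0.3643 for the best path and p(X | M) ≈ 0.4020 in total.

```python
from itertools import product

# Transition probabilities p(q_{n+1} | q_n) for model M1 (from the slide)
trans = {('S','A'): 0.9, ('S','B'): 0.1,
         ('A','A'): 0.7, ('A','B'): 0.2, ('A','E'): 0.1,
         ('B','B'): 0.8, ('B','E'): 0.2}
# Observation likelihoods p(x_n | q) for x1, x2, x3
emit = {'A': [2.5, 0.2, 0.1], 'B': [0.1, 2.2, 2.3]}

total = 0.0
for q1, q2, q3 in product('AB', repeat=3):
    path = ['S', q1, q2, q3, 'E']
    p_Q = 1.0
    for a, b in zip(path[:-1], path[1:]):
        p_Q *= trans.get((a, b), 0.0)     # impossible transitions contribute 0
    p_X_given_Q = emit[q1][0] * emit[q2][1] * emit[q3][2]
    joint = p_Q * p_X_given_Q
    if joint > 0:
        print(''.join(path), round(p_Q, 4), round(p_X_given_Q, 4), round(joint, 4))
    total += joint

print('p(X | M) =', round(total, 4))      # 0.402, matching the table
```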



The ‘forward recursion’
• Dynamic-programming-like technique to
calculate sum over all Qk

• Define αn(i) as the probability of getting to state q^i at time step n (by any path):

  αn(i) = p(x1, x2, ..., xn, qn = q^i) ≡ p(X_1^n, qn^i)

• Then αn+1(j) can be calculated recursively:

  αn+1(j) = [ Σ_{i=1}^{S} αn(i) · aij ] · bj(xn+1)

[Figure: trellis of model states q^i against time steps n, showing αn(i) combined through the transition probabilities aij and scaled by the emission likelihood bj(xn+1)]
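A small Python sketch of the forward recursion, applied to the same toy model as the path-summing slide; the non-emitting S and E states are folded into entry and exit probability vectors (an implementation choice, not spelled out on the slide). It recovers p(X | M) ≈ 0.4020 without enumerating paths.

```python
import numpy as np

# Toy model from the path-summing example: emitting states A and B
init   = np.array([0.9, 0.1])                 # p(first emitting state | S)
A      = np.array([[0.7, 0.2],                # a_ij = p(q_{n+1}=j | q_n=i)
                   [0.0, 0.8]])
exit_p = np.array([0.1, 0.2])                 # p(E | q_N)
B      = np.array([[2.5, 0.2, 0.1],           # b_i(x_n) observation likelihoods
                   [0.1, 2.2, 2.3]])

def forward(init, A, B, exit_p):
    N = B.shape[1]
    alpha = init * B[:, 0]                    # α_1(i) = π_i · b_i(x_1)
    for n in range(1, N):
        alpha = (alpha @ A) * B[:, n]         # α_{n+1}(j) = [Σ_i α_n(i) a_ij] · b_j(x_{n+1})
    return np.sum(alpha * exit_p)             # total probability p(X | M)

print(forward(init, A, B, exit_p))            # ≈ 0.4020, as in the full path sum
```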



Forward recursion (2)
• Initialize α1(i) = πi · bi(x1)

• Then total probability p(X_1^N | M) = Σ_{i=1}^{S} αN(i)

→ Practical way to solve for p(X | Mj) and hence


perform recognition

[Figure: observations X scored against each model in parallel; compute p(X | M1)·p(M1), p(X | M2)·p(M2), ... and choose the best]


Optimal path
• May be interested in actual qn assignments
- which state was ‘active’ at each time frame
- e.g. phone labelling (for training?)
• Total probability is over all paths...
• ... but can also solve for single best path
= “Viterbi” state sequence
• Probability along the best path to state q^j at time n+1:

  α*_{n+1}(j) = max_i { α*_n(i) · aij } · bj(xn+1)
- backtrack from final state to get best path
- final probability is product only (no sum)
→ log-domain calculation just summation
• Total probability often dominated by best path:
  p(X, Q* | M) ≈ p(X | M)
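A companion sketch of the Viterbi recursion in the log domain for the same toy model; it returns the best-path score (≈ 0.3643, the S A B B E path from the earlier table) and the state labels recovered by traceback.

```python
import numpy as np

def viterbi(init, A, B, exit_p):
    """Best state path. init/exit_p: entry/exit probs; A: transitions; B[i, n] = b_i(x_n)."""
    with np.errstate(divide='ignore'):        # log(0) = -inf for forbidden transitions
        logA, logB = np.log(A), np.log(B)
        log_init, log_exit = np.log(init), np.log(exit_p)
    S, N = B.shape
    delta = log_init + logB[:, 0]             # log α*_1(i)
    back = np.zeros((N, S), dtype=int)        # best predecessor per state and frame
    for n in range(1, N):
        cand = delta[:, None] + logA          # cand[i, j] = log α*_n(i) + log a_ij
        back[n] = np.argmax(cand, axis=0)
        delta = np.max(cand, axis=0) + logB[:, n]
    delta = delta + log_exit
    # traceback from the best final state
    q = [int(np.argmax(delta))]
    for n in range(N - 1, 0, -1):
        q.append(int(back[n, q[-1]]))
    return np.max(delta), q[::-1]

# Same toy model as the forward-recursion sketch (states A=0, B=1)
init, exit_p = np.array([0.9, 0.1]), np.array([0.1, 0.2])
A = np.array([[0.7, 0.2], [0.0, 0.8]])
B = np.array([[2.5, 0.2, 0.1], [0.1, 2.2, 2.3]])
score, path = viterbi(init, A, B, exit_p)
print(np.exp(score), path)                    # ≈ 0.3643, path [0, 1, 1] i.e. A B B
```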
Interpreting the Viterbi path
• Viterbi path assigns each xn to a state qi
- performing classification based on bi(x)
- ... at the same time as applying transition
constraints aij
[Figure: observation sequence xn over time steps n with the inferred classification; Viterbi labels: AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]

• Can be used for segmentation


- train an HMM with ‘garbage’ and ‘target’ states
- decode on new data to find ‘targets’, boundaries
• Can use for (heuristic) training
- e.g. train classifiers based on labels...



Recognition with HMMs
• Isolated word
- choose best p(M | X) ∝ p(X | M) · p(M)
  Model M1 (w ah n):   p(X | M1)·p(M1) = ...
  Model M2 (t uw):     p(X | M2)·p(M2) = ...
  Model M3 (th r iy):  p(X | M3)·p(M3) = ...
  (the input is scored against each model in turn)

• Continuous speech
- Viterbi decoding of one large HMM gives words

[Figure: one large HMM formed from the word models (w ah n), (t uw), (th r iy) in parallel, joined through a silence (sil) state, with language-model priors p(M1), p(M2), p(M3) on the word entries]



HMM example:
Different state sequences

Model M1: S → K → A → T → E, with self-loop probabilities 0.8 (K), 0.9 (A), 0.8 (T) and corresponding exit probabilities 0.2, 0.1, 0.2

Model M2: S → K → O → T → E, with the same transition structure

[Figure: emission distributions p(x | q) for states K, O, A, T over observation value x]


Model inference:
Emission probabilities
[Figure: observation sequence xn over time steps n = 0..18 (values near the K, O/A, and T emission means), with each model's Viterbi state alignment, per-frame log transition probabilities, and per-frame log observation likelihoods]

Model M1: log p(X | M) = -32.1,  log p(X, Q* | M) = -33.5,
          log p(Q* | M) = -7.5,  log p(X | Q*, M) = -26.0
Model M2: log p(X | M) = -47.0,  log p(X, Q* | M) = -47.5,
          log p(Q* | M) = -8.3,  log p(X | Q*, M) = -39.2


Model inference:
Transition probabilities
Models M'1 and M'2 share the same states (S, K, A, O, T, E) and emission distributions but differ in transition probabilities: both allow K → A or K → O before T, with M'1 favouring A (K→A = 0.18, K→O = 0.02) and M'2 favouring O (K→A = 0.05, K→O = 0.15).

[Figure: shared Viterbi state alignment and per-frame log observation likelihood, log p(X | Q*, M) = -26.0, over time steps n = 0..18]

Model M'1: log p(X | M) = -32.2,  log p(X, Q* | M) = -33.6,  log p(Q* | M) = -7.6
Model M'2: log p(X | M) = -33.5,  log p(X, Q* | M) = -34.9,  log p(Q* | M) = -8.9


Validity of HMM assumptions
• Key assumption is conditional independence:
Given qi, future evolution & obs. distribution
are independent of previous events
- duration behavior: a self-loop with probability γ implies an exponentially decaying (geometric) duration distribution, p(N = n) = γ^(n-1)·(1 − γ)
- independence of successive xn's:

  p(X) = ∏_n p(xn | q^i) ?

[Figure: emission distribution p(xn | q^i) and an example trajectory of xn over time steps n]


Recap: Recognizer Structure

sound → Feature calculation → feature vectors → Acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model) → phone & word labeling

• Know how to execute each stage
• ... training HMMs?
• ... language/word models?



Summary
• Speech is modeled as a sequence of features
- need temporal aspect to recognition
- best time-alignment of templates = DTW
• Hidden Markov models are rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations

