Speech Recognition & Synthesis
Dan Jurafsky
IP Notice: Some of these slides were derived from Andrew Ng’s CS 229 notes,
as well as lecture notes from Chen, Picheny et al., and Bryan Pellom. I’ll try to
accurately give credit on each slide.
Learning
EM for HMMs (the “Baum-Welch Algorithm”)
Embedded Training
Training mixture Gaussians
Disfluencies
We want:
    ξ_t(i,j) = P(q_t = i, q_{t+1} = j | O, λ)
We’ve got:
    not-quite-ξ_t(i,j) = P(q_t = i, q_{t+1} = j, O | λ)
Which we compute from the forward and backward probabilities as:
    not-quite-ξ_t(i,j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)
Since P(X | Y, Z) = P(X, Y | Z) / P(Y | Z), we need:
    ξ_t(i,j) = not-quite-ξ_t(i,j) / P(O | λ)
LSA 352 Summer 2007
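The ξ computation above can be sketched with a toy discrete HMM; the π, A, B values and observation sequence below are made up for illustration, not taken from the slides:

```python
import numpy as np

# Toy HMM (all values are illustrative)
pi = np.array([0.6, 0.4])                  # initial state probabilities
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # transition probabilities a_ij
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # B[j, k] = b_j(symbol k)
obs = [0, 1, 0, 0]                         # observation sequence O
N, T = len(pi), len(obs)

# Forward pass: alpha[j, t] = P(o_1..o_t, q_t = j | lambda)
alpha = np.zeros((N, T))
alpha[:, 0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, obs[t]]

# Backward pass: beta[i, t] = P(o_{t+1}..o_T | q_t = i, lambda)
beta = np.ones((N, T))
for t in range(T - 2, -1, -1):
    beta[:, t] = A @ (B[:, obs[t + 1]] * beta[:, t + 1])

p_O = alpha[:, -1].sum()                   # P(O | lambda)

# not-quite-xi_t(i,j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
xi = np.zeros((T - 1, N, N))
for t in range(T - 1):
    xi[t] = alpha[:, t, None] * A * B[:, obs[t + 1]] * beta[:, t + 1]
xi /= p_O                                  # divide by P(O | lambda) to condition on O
```

Since ξ_t(i,j) is a proper conditional distribution over state pairs, each time slice sums to 1, which makes a handy sanity check during debugging.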
From not-quite-ξ to ξ
Now, the probability of being in state j at time t with the observation accounted for by mixture component m:
    ξ_tm(j) = [ Σ_{i=1..N} α_{t−1}(i) a_ij c_jm b_jm(o_t) β_t(j) ] / α_F(T)
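The same normalization by α_F(T) applies per mixture component. Below is a shape-level sketch in which every array (state count, mixture count, weights, likelihoods) is a made-up placeholder; only the algebra mirrors the formula:

```python
import numpy as np

# Placeholder quantities (illustrative only, not from the slides)
N, M = 2, 3                                # states, mixture components
rng = np.random.default_rng(1)
alpha_prev = rng.random(N)                 # alpha_{t-1}(i)
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # a_ij
c = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3]])            # mixture weights c_jm (rows sum to 1)
b_mix = rng.random((N, M))                 # b_jm(o_t), per-component likelihoods
beta_t = rng.random(N)                     # beta_t(j)
alpha_F = 1.0                              # stand-in for alpha_F(T) = P(O | lambda)

# xi_tm(j) = [sum_i alpha_{t-1}(i) a_ij] c_jm b_jm(o_t) beta_t(j) / alpha_F(T)
xi_tm = (alpha_prev @ A)[:, None] * c * b_mix * beta_t[:, None] / alpha_F

# Summing over m recovers the state-level occupancy, because
# b_j(o_t) = sum_m c_jm b_jm(o_t)
state_level = (alpha_prev @ A) * (c * b_mix).sum(axis=1) * beta_t / alpha_F
```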
Instead:
We’ll train each phone HMM embedded in an entire sentence
We’ll do word/phone segmentation and alignment automatically as part of the training process
Transition probabilities:
    Set to zero any that you want to be “structurally zero”
    – The re-estimation formula includes the previous value of a_ij, so if it’s zero it will never change
    Set the rest to identical values
Likelihoods:
    Initialize the mean μ and variance σ² of each state to the global mean and variance of all training data
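A flat-start initialization along these lines might look as follows; the 3-state left-to-right topology and 13-dimensional feature frames are assumptions made for the sketch, not values from the slides:

```python
import numpy as np

N = 3                                      # states per phone HMM (assumed)

# Transition matrix: structural zeros stay zero under re-estimation,
# so only allow self-loops and left-to-right advances, split evenly.
A = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        if j == i or j == i + 1:           # allowed arcs in a left-to-right HMM
            A[i, j] = 1.0
A /= A.sum(axis=1, keepdims=True)          # identical values among allowed arcs

# Likelihoods: every state starts at the global mean/variance of the data.
frames = np.random.default_rng(2).random((1000, 13))   # stand-in training frames
mu0, var0 = frames.mean(axis=0), frames.var(axis=0)
means = np.tile(mu0, (N, 1))               # identical across states at flat start
variances = np.tile(var0, (N, 1))
```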
Viterbi training (an approximation to Baum-Welch)
For all pairs of emitting states, 1 <= i, j <= N:
    â_ij = n_ij / n_i
    b̂_j(v_k) = n_j(s.t. o_t = v_k) / n_j
Where n_ij is the number of frames with a transition from i to j in the best path
And n_j is the number of frames where state j is occupied
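The count-based updates can be computed directly from a single Viterbi alignment; the state path and observation symbols below are invented toy data:

```python
from collections import Counter

# Toy Viterbi alignment: one state per frame, one discrete symbol per frame
best_path = [0, 0, 1, 1, 1, 2, 2]
obs       = ['a', 'a', 'b', 'a', 'b', 'b', 'b']

n_ij = Counter(zip(best_path, best_path[1:]))    # transition counts in best path
n_i  = Counter(best_path[:-1])                   # frames with an outgoing transition
a_hat = {(i, j): cnt / n_i[i] for (i, j), cnt in n_ij.items()}

n_jk = Counter(zip(best_path, obs))              # (state, symbol) co-occurrence counts
n_j  = Counter(best_path)                        # frames where state j is occupied
b_hat = {(j, k): cnt / n_j[j] for (j, k), cnt in n_jk.items()}

print(a_hat[(0, 1)])   # 1 of the 2 frames in state 0 transitions to state 1 -> 0.5
```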
    σ̂_i² = (1 / N_i) Σ_{t=1..T, s.t. q_t = i} (o_t − μ_i)²
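For the Gaussian parameters the same idea applies: average only over the frames the best path assigns to state i. A one-dimensional toy example (observations and alignment are made up):

```python
import numpy as np

# Toy 1-D observations and a made-up Viterbi alignment
obs = np.array([1.0, 1.2, 0.8, 3.0, 3.1])   # o_t
path = np.array([0, 0, 0, 1, 1])            # q_t from the best path

i = 0
frames_i = obs[path == i]                   # frames such that q_t = i
mu_i = frames_i.mean()                      # mean of state i's frames
var_i = ((frames_i - mu_i) ** 2).mean()     # (1/N_i) * sum (o_t - mu_i)^2
```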
Viterbi training for mixture Gaussians is more complex; generally we just assign each observation to one mixture component
Disfluencies
Characteristics of disfluencies
Detecting disfluencies
MDE bakeoff
Fragments
Experiment 2:
Parts of SWBD hand-annotated for clauses
Build FP-model by deleting only medial FPs
Now prediction of the post-FP word (perplexity) improves greatly!
Siu and Ostendorf found the same with “you know”
Syntax only: Precision 96%
Syntax and semantics: Correction 70%
(Liu et al. 2003)
Kinds of disfluencies
Repetitions: I * I like it
Revisions: We * I like it
Restarts (false starts): It’s also * I like it
Conventions:
./ for statement (SU) boundaries
<> for fillers
[] for edit words
* for IP (interruption point) inside edits
Example: And <uh> <you know> wash your clothes wherever you are ./ and [ you ] * you really get used to the outdoors ./
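These markup conventions are regular enough to pull apart with a couple of regular expressions; the sketch below uses the slide’s own example utterance:

```python
import re

# Annotated utterance: <> marks fillers, [] edit words, * the interruption
# point, and ./ an SU (statement unit) boundary.
utt = ("And <uh> <you know> wash your clothes wherever you are ./ "
       "and [ you ] * you really get used to the outdoors ./")

fillers = re.findall(r'<([^>]+)>', utt)          # filled pauses / discourse markers
edits   = re.findall(r'\[([^\]]+)\]', utt)       # edit (reparandum) words
sus     = [s.strip() for s in utt.split('./') if s.strip()]  # SU-bounded units

print(fillers)   # ['uh', 'you know']
print(len(sus))  # 2
```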
                      CTS    BN
Training set (words)  484K   182K
Two stages:
Decision tree plus N-gram LM to decide boundary
Second maxent classifier to decide subtype
Current error rates:
Finding boundaries
– 40-60% using ASR
– 26-47% using transcripts
           Count   %
Function   12      4%
1          149     52%
2          25      9%
3          1       0.3%
            hypothesis
            complete   fragment
fragment    43         101
Feature                                                       %
Jitter                                                        .272
Energy slope difference between current and following word    .241
Ratio between F0 before and after boundary                    .238
Average OQ                                                    .147
Position of current turn                                      .084
Pause duration                                                .018
但我﹣我問的是
I- I was asking…
他是﹣卻很﹣活的很好
He very- lived very well