So far

• Historical introduction
• Mathematical background (e.g., pattern classification, acoustics)
• Feature extraction for speech recognition (and some neural processing)
• What sound units are typically defined
• Audio signal processing topics (pitch extraction, perceptual audio coding, source separation, music analysis)
• Now: back to pattern recognition, but include time

Deterministic Sequence Recognition

Sequence recognition for ASR
• ASR = static pattern classification + sequence recognition
• Deterministic sequence recognition: template matching
• Templates are typically word-based; don’t need phonetic sound units per se
• Still need to put together local distances into something global (per word or utterance)

Front end analysis
• Basic approach the same for deterministic, statistical:
– 25 ms windows (e.g., Hamming), 10 ms steps (a frame)
– Some kind of cepstral analysis (e.g., MFCC or PLP)
– Cepstral vector at time n called x_n
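The framing step above can be sketched as follows. This is a minimal illustration, assuming a 16 kHz sample rate and a synthetic test tone (neither is from the slides); `frame_signal` is a hypothetical helper name, and the cepstral analysis itself (MFCC or PLP) is not shown.

```python
import math

def frame_signal(samples, sample_rate, win_ms=25.0, step_ms=10.0):
    """Split a signal into Hamming-windowed frames (25 ms window, 10 ms step)."""
    win_len = int(round(sample_rate * win_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    # Hamming window: 0.54 - 0.46 * cos(2*pi*k / (N - 1))
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (win_len - 1))
              for k in range(win_len)]
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + win_len <= len(samples):
        frames.append([samples[start + k] * window[k] for k in range(win_len)])
        start += step
    return frames

# Assumed test input: 1 second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
signal = [math.sin(2.0 * math.pi * 440.0 * t / sr) for t in range(sr)]
frames = frame_signal(signal, sr)
```

Each entry of `frames` would then go through cepstral analysis to give one vector x_n per 10 ms step.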

Speech sound categories
• Words, phones most common
• For template-based ASR, mostly words
• For template-based ASR, local distances based on examples (reference frames) versus input frames

From Frames to Sequence
• Easy if local matches are all correct (never happens!)
• Local matches are unreliable
• Need measure of goodness of fit
• Need to integrate into global measure
• Need to consider all possible sequences


Templates: Isolated Word Example
• Matrix for comparison between frames
• Word template = multiple feature vectors
• Reference template = X_k^ref
• Input template = X^in
• Need to find D(X_k^ref, X^in)

Template Matching Problems
• Time normalization
• Which references to use
• Defining distances/costs
• Endpoints for input templates

Time Normalization
• Linear Time Normalization
• Nonlinear Time Normalization
– Dynamic Time Warp (DTW)


Linear Time Normalization: Limitations
• Speech sounds stretch/compress differently
• Stop consonants versus vowels
• Need to normalize differently
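For contrast with DTW, linear time normalization can be sketched as a uniform resampling of the frame sequence. The interpolation scheme and the name `linear_time_normalize` are assumptions for illustration; the slides do not specify how the linear mapping is done.

```python
def linear_time_normalize(frames, target_len):
    """Stretch or compress a frame sequence to target_len frames
    by linear interpolation along the time axis."""
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        # Map output index t to a fractional position in the input.
        pos = t * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[hi])])
    return out

# Three 1-dimensional frames stretched to five frames.
stretched = linear_time_normalize([[0.0], [2.0], [4.0]], 5)
```

Every part of the word is stretched by the same factor here, which is exactly the limitation listed above: a stop consonant and a vowel get the same treatment.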


Generalized Time Warping
• Permit many more variations
• Ideally, compare all possible time warpings
• Vintsyuk (1968): use dynamic programming

Dynamic programming
• Bellman optimality principle (1962): optimal policy given optimal policies from subproblems
• Best path through grid: if best path goes through grid point, best path includes best partial path to grid point
• Classic example: knapsack problem

Knapsack problem
• Stuffing a sack with items, different value
• Goal: maximize value in sack
• Key point 1: If max size is 10, and we know values of solutions for max size of 9, we can compute the final answer knowing the value of adding items.
• Key point 2: Point 1 sounds recursive, but can be made efficiently nonrecursive by building a table
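Key point 2 can be sketched as a table indexed by sack size. This sketch assumes the unbounded variant (each item type may be used more than once) and integer sizes; the item list is made up for illustration.

```python
def knapsack_max_value(items, max_size):
    """items: list of (size, value) pairs with positive integer sizes.
    Returns the best total value for a sack of capacity max_size,
    assuming each item type can be used any number of times."""
    best = [0] * (max_size + 1)          # best[c] = best value for capacity c
    for cap in range(1, max_size + 1):
        best[cap] = best[cap - 1]        # leaving one unit empty is allowed
        for size, value in items:
            if size <= cap:
                # Reuse the already-tabulated answer for the smaller sack.
                best[cap] = max(best[cap], best[cap - size] + value)
    return best[max_size]

# Hypothetical items as (size, value).
example_items = [(3, 4), (4, 5), (2, 3)]
best_value = knapsack_max_value(example_items, 10)
```

Each table entry is filled once from smaller entries, so no recursion is needed.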

[Figure: basic DTW step with simple local constraints. Each (i,j) cell has a local distance d and a cumulative distortion D; the equation in the figure shows the basic computational step.]

Dynamic Time Warp (DTW)
• Apply DP to ASR: Vintsyuk, Bridle, Sakoe
• Let d(i,j) = local distance between frame i in input and frame j in reference
• Let D(i,j) = total distortion up to frame i in input and frame j in reference
• Let p(i,j) = set of possible predecessors to frame i in input and frame j in reference
• D(i,j) = d(i,j) + min_{p(i,j)} D(p(i,j))
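The recurrence can be sketched directly from these definitions. The predecessor set used here, (i-1,j), (i-1,j-1) and (i,j-1), is an assumed choice of simple local constraints, and the corner cell (0,0) has no predecessors, so this particular sketch anchors the path there.

```python
def predecessors(i, j):
    """p(i, j) under an assumed set of simple local constraints."""
    candidates = [(i - 1, j), (i - 1, j - 1), (i, j - 1)]
    return [(a, b) for a, b in candidates if a >= 0 and b >= 0]

def cumulative_distortion(d):
    """d[i][j] is the local distance between input frame i and reference
    frame j. Returns the matrix D with
    D(i,j) = d(i,j) + min over p(i,j) of D."""
    n_in, n_ref = len(d), len(d[0])
    D = [[0.0] * n_ref for _ in range(n_in)]
    for i in range(n_in):
        for j in range(n_ref):
            preds = predecessors(i, j)
            # The corner cell has no predecessors and just takes d(0,0).
            best_prev = min(D[a][b] for a, b in preds) if preds else 0.0
            D[i][j] = d[i][j] + best_prev
    return D

# Toy 2x2 local distance matrix (made-up values).
D_example = cumulative_distortion([[1, 2], [3, 1]])
```

The loop order guarantees that every predecessor of a cell is filled in before the cell itself.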

DTW steps
(1) Compute local distance d in 1st column (1st frame of input) for each reference template. Let D(0,j) = d(0,j) for each cell in each template
(2) For i=1 (2nd column), j=0, compute d(i,j), add to min of all possible predecessor values of D to get local value of D; repeat for each frame in each template.
(3) Repeat (2) for each column to the end of input
(4) For each template, find best D in last column of input
(5) Choose the word for the template with smallest D
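Steps (1) to (5) can be sketched as below, keeping only the previous and current columns for each template. The city-block local distance, the predecessor set and the toy templates are assumptions for illustration. The sketch follows the steps literally: the first column is set to d(0,j) and the best value in the last column is taken, so the path is not forced to start or end at the template's first or last frame. Many implementations pin both endpoints instead.

```python
def local_distance(x, y):
    """City-block distance between two feature vectors (illustrative choice)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def template_distortion(input_frames, ref_frames):
    """Steps (1)-(4) for one template, storing two columns only."""
    n_ref = len(ref_frames)
    # Step 1: first input frame, D(0, j) = d(0, j).
    prev = [local_distance(input_frames[0], r) for r in ref_frames]
    # Steps 2-3: one column per remaining input frame.
    for x in input_frames[1:]:
        cur = [0.0] * n_ref
        for j in range(n_ref):
            candidates = [prev[j]]                # predecessor (i-1, j)
            if j > 0:
                candidates.append(prev[j - 1])    # predecessor (i-1, j-1)
                candidates.append(cur[j - 1])     # predecessor (i, j-1)
            cur[j] = local_distance(x, ref_frames[j]) + min(candidates)
        prev = cur
    # Step 4: best D in the last column.
    return min(prev)

def recognize(input_frames, templates):
    """Step 5: templates maps word -> list of reference frames."""
    scores = {word: template_distortion(input_frames, ref)
              for word, ref in templates.items()}
    return min(scores, key=scores.get), scores

# Toy 1-dimensional templates and input (made-up values).
toy_templates = {"yes": [[0], [1], [2], [3]], "no": [[5], [5], [4], [4]]}
best_word, word_scores = recognize([[0], [1], [1], [2], [3]], toy_templates)
```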

DTW Complexity
• O(Nframes_ref · Nframes_in · Ntemplates)
• Storage, though, can just be O(Nframes_ref · Ntemplates) (store current column and previous column)
• Constant reduction: global constraints
• Constant reduction: local constraints

[Figure: typical global slope constraints for dynamic programming]



Which reference templates?
• All examples?
• Prototypes?
• DTW-based global distances permit clustering

DTW-based K-means
• (1) Initialize (how many, where)
• (2) Assign examples to closest center (DTW distance)
• (3) For each cluster, find template with minimum value for maximum distance, call it the center
• (4) Repeat (2) and (3) until some stopping criterion is reached
• (5) Use center templates as references for ASR
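The clustering loop above can be sketched as follows. The assumptions here: scalar features, a DTW with pinned endpoints, initial centers supplied by the caller, and "assignments stop changing" as the stopping criterion. All names are hypothetical.

```python
def dtw(a, b):
    """DTW distortion between two sequences of scalars (endpoints pinned)."""
    inf = float("inf")
    prev = None
    for i, x in enumerate(a):
        cur = []
        for j, y in enumerate(b):
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    prev[j] if i > 0 else inf,
                    prev[j - 1] if i > 0 and j > 0 else inf,
                    cur[j - 1] if j > 0 else inf,
                )
            cur.append(abs(x - y) + best)
        prev = cur
    return prev[-1]

def minimax_center(cluster):
    """Step 3: the member whose largest distance to any member is smallest."""
    return min(cluster, key=lambda c: max(dtw(c, other) for other in cluster))

def dtw_kmeans(examples, initial_centers, max_iters=20):
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iters):
        # Step 2: assign each example to the closest center by DTW distance.
        new_assignment = [
            min(range(len(centers)), key=lambda k: dtw(ex, centers[k]))
            for ex in examples
        ]
        if new_assignment == assignment:   # Step 4: stop when nothing changes.
            break
        assignment = new_assignment
        # Step 3: re-pick each center from the members of its cluster.
        for k in range(len(centers)):
            members = [ex for ex, a in zip(examples, assignment) if a == k]
            if members:
                centers[k] = minimax_center(members)
    return centers, assignment

# Made-up examples of two "words" spoken at different rates.
examples = [[0, 0, 1], [0, 1, 1], [0, 1], [5, 6, 6], [5, 5, 6], [5, 6]]
centers, assignment = dtw_kmeans(examples, [examples[0], examples[3]])
```

The centers are always actual examples, so they can be used directly as reference templates in step (5).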

Defining local distance
• Normalizing for scale
• Cepstral weighting
• Perceptual weighting, e.g., JND
• Learning distances, e.g., with ANN, statistics

Endpoint detection: big problem!
• Sounds easy
• Hard in practice (noise, reverb, gain issues)
• Simple systems use energy, time thresholds
• More complex ones also use spectrum
• Can be tuned
• Not robust
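A simple energy-and-time-threshold detector of the kind mentioned above might look like this. The threshold values, the minimum duration and the function name are all assumptions; a real system would need tuning and, as noted, would still not be robust.

```python
def detect_endpoints(frame_energies, energy_threshold, min_speech_frames=5):
    """Return (start, end) frame indices of the first run of at least
    min_speech_frames consecutive frames above energy_threshold,
    or None if there is no such run."""
    run_start = None
    for i, energy in enumerate(frame_energies):
        if energy > energy_threshold:
            if run_start is None:
                run_start = i
        else:
            # A run just ended; accept it only if it lasted long enough.
            if run_start is not None and i - run_start >= min_speech_frames:
                return run_start, i - 1
            run_start = None
    if run_start is not None and len(frame_energies) - run_start >= min_speech_frames:
        return run_start, len(frame_energies) - 1
    return None

# Made-up energies: silence, a 2-frame click, silence, 20 frames of speech, silence.
energies = [0.1] * 10 + [5.0] * 2 + [0.1] * 5 + [4.0] * 20 + [0.1] * 10
endpoints = detect_endpoints(energies, energy_threshold=1.0)
```

The time threshold is what rejects the short click; with added noise or gain changes a fixed energy threshold fails quickly.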


Connected Word ASR by DTW
• Time normalization
• Recognition
• Segmentation
• Can’t have templates for all utterances
• DP to the rescue

DP for Connected Word ASR by DTW
• Vintsyuk, Bridle, Sakoe
• Sakoe: 2-level algorithm
• Vintsyuk, Bridle: one stage
• Ney explanation: Ney, H., “The use of a one-stage dynamic programming algorithm for connected word recognition,” IEEE Trans. Acoust. Speech Signal Process., 32: 263-271, 1984

Connected Algorithm
• In principle: one big distortion matrix (for 20,000 words, 50 frames/word, 1000 frame input [10 seconds] would be 20,000 × 50 × 1000 = 10^9 cells!)
• Also required, backtracking matrix (since word segmentation not known)
• Get best distortion
• Backtrack to get words
• Fundamental principle: find best segmentation and classification as part of the same process, not as sequential steps

[Figure: DTW path for connected words]

DTW for connected words
• In principle, backtracking matrix points back to best previous cell
• Mostly just need backtrack to end of previous word
• Simplifications possible

Storage efficiency
• Distortion matrix -> 2 columns
• Backtracking matrix -> 2 rows
• “From template” points to template with lowest cost ending here
• “From frame” points to end frame of previous word
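A one-stage connected-word search with this reduced storage might be sketched as follows. This is an illustration under stated assumptions, not the algorithm as given by Ney or Bridle: features are scalars, within-word predecessors are (i-1,j) and (i-1,j-1), a new word may start right after any word ends, and there are no grammar or transition costs. `from_template` and `from_frame` play the roles of the two backtracking rows.

```python
def connected_word_dtw(input_frames, templates):
    """templates maps word -> list of scalar reference frames.
    Returns (word sequence, total distortion)."""
    inf = float("inf")
    words = list(templates)
    n_in = len(input_frames)
    # Previous column per template: distortion, and the input frame at
    # which the path through that cell entered the word.
    prev_D = {w: [inf] * len(templates[w]) for w in words}
    prev_start = {w: [0] * len(templates[w]) for w in words}
    best_end_cost = [inf] * n_in     # best cost of any word ending at frame i
    from_template = [None] * n_in    # which word ends best at frame i
    from_frame = [None] * n_in       # end frame of the word before it
    for i, x in enumerate(input_frames):
        # Cost of the best word sequence ending just before frame i.
        entry_cost = 0.0 if i == 0 else best_end_cost[i - 1]
        cur_D, cur_start = {}, {}
        for w in words:
            ref = templates[w]
            D = [inf] * len(ref)
            S = [0] * len(ref)
            for j, r in enumerate(ref):
                cands = [(prev_D[w][j], prev_start[w][j])]       # stay in frame j
                if j > 0:
                    cands.append((prev_D[w][j - 1], prev_start[w][j - 1]))
                else:
                    cands.append((entry_cost, i))                # start a new word
                best, start = min(cands, key=lambda c: c[0])
                D[j] = abs(x - r) + best
                S[j] = start
            cur_D[w], cur_start[w] = D, S
            if D[-1] < best_end_cost[i]:
                best_end_cost[i] = D[-1]
                from_template[i] = w
                from_frame[i] = S[-1] - 1    # -1 means start of input
        prev_D, prev_start = cur_D, cur_start
    # Backtrack from the last input frame, one word at a time.
    result = []
    i = n_in - 1
    while i >= 0 and from_template[i] is not None:
        result.append(from_template[i])
        i = from_frame[i]
    result.reverse()
    return result, best_end_cost[-1]

# Made-up templates and input.
toy = {"a": [1, 2, 3], "b": [7, 7], "c": [4, 4, 0]}
decoded, total = connected_word_dtw([1, 2, 2, 3, 7, 7, 7, 1, 2, 3], toy)
```

Segmentation and classification come out of the same pass: the word boundaries are only known after backtracking.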


More on connected templates
• “Within word” local constraints
• “Between word” local constraints
• Grammars
• Transition costs

Knowledge-based segmentation
• DTW combines segmentation, time norm, recognition; all segmentations considered
• Same feature vectors used everywhere
• Could segment separately, using acoustic-phonetic features cleverly
• Example: FEATURE, Ron Cole (1983)

Limitations of DTW approach
• No structure from subword units
• Average or exemplar values only
• Cross-word pronunciation effects not handled
• Limited flexibility for distance/distortion
• Limited mathematical basis
• -> Statistics!

Epilog: “episodic” ASR
• Having examples can get interesting again when there are many of them
• Potentially an augmentation of stat methods
• Recent experiments show decent results
• Somewhat different properties -> combination

The rest of the course
• Statistical ASR
• Speech synthesis
• Speaker recognition
• Speaker diarization
• Oral presentations on your projects
• Written report on your project

Class project timing
• Week of April 30: no class Monday, double class Wednesday May 2 (is that what people want?)
• 8 oral presentations by individuals, 12 minutes each + 3 minutes for questions
• 2 oral presentations by pairs, 17 minutes each + 3 minutes for questions
• 3:10 PM to 6 PM with a 10 minute mid-session break
• Written report due Wednesday May 9, no late submissions (email attachment is fine)
