
LING 285: Final Project 3

Due uploaded to Blackboard: Monday 5/6, 8:00pm

The purpose of the Final Project is to give you an in-depth understanding of how speech
synthesis and speech recognition systems work by using all of the skills we’ve learned in class to
work, step-by-step, through the process of synthesis and recognition. It’s kind of like a big lab
project. You may work in groups if you wish, but please note that each individual must submit a
separate Final Project to Blackboard that is their own work.

Part 3: Recognition

1. Below is a spectrogram of an unknown word that is the input to a speech recognition
program you’ll be building through the next few questions. (6 pts)

a. Convert the spectrogram of the unknown word into a matrix with 10 columns and 15
rows. Use a scale of 1-9, where 1 represents the minimum amplitude and 9 represents
the maximum. This should be a little different for each person in the class. A grid
has already been provided over the spectrogram.
i. This is given in the first matrix of my script.
b. How many Hertz are represented by each number in your matrix?
i. 3000 Hz across 15 rows means each number represents 200 Hz.
c. How many milliseconds are represented by each number in your matrix?
i. 500 ms across 10 time slots means each number represents 50 ms.
2. Below are three matrices that each have 10 columns and 15 rows. Write a MATLAB script that
can determine, using Template Matching Speech Recognition, whether the Unknown matrix
is more likely the word “forward” or “back.” (10 pts)
a. Your script will require for-loops.
b. Your script should center all three matrices and then find the inner product.
3.
% This line establishes a matrix with data from the unknown spectrogram.
k = [4 5 4 3 3 3 3 3 3 6;
     4 5 6 4 3 3 2 3 3 5;
     4 5 4 5 3 3 2 3 4 4;
     4 5 4 6 6 4 4 5 4 4;
     4 5 4 6 7 5 5 7 4 4;
     3 4 4 6 7 4 5 7 4 5;
     3 4 5 6 7 5 5 6 3 4;
     2 3 5 7 8 6 5 4 2 3;
     1 2 5 8 7 5 5 4 3 3;
     1 2 6 7 6 5 5 5 3 3;
     1 2 6 7 7 6 5 5 4 3;
     1 2 5 7 7 5 5 5 4 3;
     1 2 5 7 7 6 5 6 4 3;
     3 3 5 6 6 5 5 5 4 3;
     3 3 4 4 4 4 4 5 4 3];

% This line begins a for-loop, one iteration per time slot.
for a = [1:10]
    % This line centers the data, subtracting an average from each number.
    u(:,a) = k(:,a) -(sum(k(:,a))/15)
% This line ends the for-loop.
end

% This line creates a matrix for the data from the word “forward”.
m = [2 4 5 3 2 2 2 2 2 6;
     2 4 6 3 2 2 2 2 2 5;
     2 4 5 4 2 2 2 3 3 4;
     2 4 5 5 6 4 4 4 3 3;
     2 4 5 6 7 4 4 5 3 5;
     2 3 5 5 6 3 4 6 4 5;
     2 3 5 4 6 3 4 7 3 3;
     2 2 4 6 8 3 5 7 2 2;
     1 2 3 9 9 7 6 6 1 2;
     1 2 4 9 8 7 7 3 1 2;
     1 2 7 8 7 6 4 4 1 2;
     1 1 6 8 7 7 7 5 4 2;
     1 1 7 8 7 7 7 5 4 2;
     2 2 2 2 3 4 4 4 3 2;
     2 2 2 2 3 4 3 3 2 2];

% This line begins a for-loop, one iteration per time slot.
for q = [1:10]
    % This line centers the data from “forward”, subtracting an average from each number.
    f(:,q) = m(:,q) -(sum(m(:,q))/15)
% This line ends the for-loop.
end

% This line creates a matrix for the word “back”.
c = [9 9 8 7 7 5 1 5 4 4;
     8 9 8 7 8 5 1 5 3 3;
     8 8 7 7 7 5 1 5 3 3;
     8 8 8 8 7 5 1 6 4 3;
     9 9 9 8 8 6 1 6 5 4;
     9 9 9 8 8 6 1 6 4 4;
     7 9 9 8 7 6 1 5 3 3;
     6 8 7 7 6 5 1 5 2 2;
     6 8 8 8 6 4 1 4 1 1;
     7 9 9 8 7 3 1 3 1 1;
     8 9 8 7 5 1 1 2 1 1;
     9 8 7 5 2 1 1 2 1 1;
     5 6 5 4 2 1 1 2 1 1;
     4 5 4 3 3 2 1 2 2 2;
     4 4 3 2 2 2 2 2 2 2];

% This line begins a for-loop, one iteration per time slot.
for v = [1:10]
    % This line centers the data from “back”, subtracting an average from each number.
    b(:,v) = c(:,v) -(sum(c(:,v))/15)
% This line ends the for-loop.
end

% This line begins a for-loop for finding the inner product.
for h = (1:10)
    % This line finds the inner product of the unknown data and “forward”.
    sum(u(:,h).*f(:,h))
% This line ends the for-loop.
end

% This line begins a for-loop for finding the inner product.
for y = (1:10)
    % This line finds the inner product of the unknown data and “back”.
    sum(u(:,y).*b(:,y))
% This line ends the for-loop.
end
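
The two inner-product loops in the script print one value per time slot, so the ten values for each template still have to be added together before the two words can be compared. A minimal sketch of that last step, assuming the same three matrices as the script; the names scoreForward and scoreBack are introduced here for illustration and are not part of the original script:

```matlab
% Same data as the script, written three rows per line.
k = [4 5 4 3 3 3 3 3 3 6; 4 5 6 4 3 3 2 3 3 5; 4 5 4 5 3 3 2 3 4 4; ...
     4 5 4 6 6 4 4 5 4 4; 4 5 4 6 7 5 5 7 4 4; 3 4 4 6 7 4 5 7 4 5; ...
     3 4 5 6 7 5 5 6 3 4; 2 3 5 7 8 6 5 4 2 3; 1 2 5 8 7 5 5 4 3 3; ...
     1 2 6 7 6 5 5 5 3 3; 1 2 6 7 7 6 5 5 4 3; 1 2 5 7 7 5 5 5 4 3; ...
     1 2 5 7 7 6 5 6 4 3; 3 3 5 6 6 5 5 5 4 3; 3 3 4 4 4 4 4 5 4 3];   % unknown
m = [2 4 5 3 2 2 2 2 2 6; 2 4 6 3 2 2 2 2 2 5; 2 4 5 4 2 2 2 3 3 4; ...
     2 4 5 5 6 4 4 4 3 3; 2 4 5 6 7 4 4 5 3 5; 2 3 5 5 6 3 4 6 4 5; ...
     2 3 5 4 6 3 4 7 3 3; 2 2 4 6 8 3 5 7 2 2; 1 2 3 9 9 7 6 6 1 2; ...
     1 2 4 9 8 7 7 3 1 2; 1 2 7 8 7 6 4 4 1 2; 1 1 6 8 7 7 7 5 4 2; ...
     1 1 7 8 7 7 7 5 4 2; 2 2 2 2 3 4 4 4 3 2; 2 2 2 2 3 4 3 3 2 2];   % "forward"
c = [9 9 8 7 7 5 1 5 4 4; 8 9 8 7 8 5 1 5 3 3; 8 8 7 7 7 5 1 5 3 3; ...
     8 8 8 8 7 5 1 6 4 3; 9 9 9 8 8 6 1 6 5 4; 9 9 9 8 8 6 1 6 4 4; ...
     7 9 9 8 7 6 1 5 3 3; 6 8 7 7 6 5 1 5 2 2; 6 8 8 8 6 4 1 4 1 1; ...
     7 9 9 8 7 3 1 3 1 1; 8 9 8 7 5 1 1 2 1 1; 9 8 7 5 2 1 1 2 1 1; ...
     5 6 5 4 2 1 1 2 1 1; 4 5 4 3 3 2 1 2 2 2; 4 4 3 2 2 2 2 2 2 2];   % "back"

% Preallocate the centered matrices.
u = zeros(15, 10);
f = zeros(15, 10);
b = zeros(15, 10);
for a = 1:10
    % Center each time slot by subtracting its column average.
    u(:,a) = k(:,a) - sum(k(:,a))/15;
    f(:,a) = m(:,a) - sum(m(:,a))/15;
    b(:,a) = c(:,a) - sum(c(:,a))/15;
end

% Hypothetical accumulators (not in the original script): one total per template.
scoreForward = 0;
scoreBack = 0;
for h = 1:10
    scoreForward = scoreForward + sum(u(:,h) .* f(:,h));
    scoreBack    = scoreBack    + sum(u(:,h) .* b(:,h));
end

% The template with the larger total is the closer match.
if scoreForward > scoreBack
    disp('forward');
else
    disp('back');
end
```

The semicolons after each assignment only suppress the printed output; the arithmetic is the same as in the script. With these matrices the “forward” total should come out larger, consistent with the answer given to question 4.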

4. Which word is the Unknown? Explain how centering and inner product provide a
number for how similar a template is to an unknown input. (10 pts)
a. Your answer should be no more than 250 words long.
According to my MATLAB script, the unknown word is “forward”. After each spectrogram is
converted to numeric data, centering subtracts each column’s average from every number in
that column, so each value is expressed relative to the other values in its time slot, with zero
as the center value. This lets the words be compared fairly despite differences in overall
loudness or in how each person scored the amplitudes. After centering, the inner product
quantifies the similarity of two matrices by multiplying each number in the first matrix by the
number in the same position in the second matrix, and then adding up the products for each
column. If the values at matching positions are similar, their product is more likely to be
positive, because two positive numbers multiply to a positive number, as do two negative
numbers. A positive value paired with a negative value gives a negative product, which signals
a mismatch and lowers the total. Consequently, similar columns have higher totals, while
dissimilar columns have lower totals.
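
A small worked example may make the role of centering concrete. This is a minimal sketch with made-up numbers, not data from the spectrograms: x stands for one column of an unknown word, tSame is a template column with the same shape as x but slightly louder, and tFlat is a loud template column with no shape at all. All names are hypothetical.

```matlab
x     = [1; 2; 5; 2];   % unknown column (made-up values)
tSame = [2; 3; 6; 3];   % same shape as x, shifted up by 1
tFlat = [9; 9; 9; 9];   % loud but flat

% Without centering, the loud flat template gets the larger inner product.
rawSame = sum(x .* tSame);   % 2 + 6 + 30 + 6 = 44
rawFlat = sum(x .* tFlat);   % 9 + 18 + 45 + 18 = 90

% Centering subtracts each vector's own average.
center = @(v) v - mean(v);

% After centering, only the shape matters: the flat template scores zero.
cenSame = sum(center(x) .* center(tSame));   % 2.25 + 0.25 + 6.25 + 0.25 = 9
cenFlat = sum(center(x) .* center(tFlat));   % 0

fprintf('raw:      same = %g, flat = %g\n', rawSame, rawFlat);
fprintf('centered: same = %g, flat = %g\n', cenSame, cenFlat);
```

Before centering, overall loudness dominates the score; after centering, the template whose values rise and fall in the same places as the unknown gets the higher total.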

5. An HMM Speech Recognition System is designed to recognize the words “forward” and
“back.” A user of the system inputs 4 acoustic signals. All 4 could be the word
“forward”; all 4 could be the word “back”; or they might be a combination of the two
words. Below is a Hidden Markov Model showing the probability of “forward” and
“back” being input, as well as the probability of the two words manifesting as the input
acoustics. What sequence of words is the most likely input, and what is the probability of
that sequence? (22 pts)
a. You’ll need to create a Viterbi Trellis and then use Dynamic Programming to find
the best path.
b. Be sure to show your work. We need to see that you used Dynamic
Programming: meaning that it needs to be clear that you re-used values.
c. Your answer must be drawn/typed into the document you upload. Do NOT take a
picture of work you did by hand. (You will lose 5 pts per image if you do.)
Stage 5
Back 4 (Utterance 3) → End
.5(1) = .5
Forward 4 (Utterance 1 or 2) → End
.3(1) = .3

Stage 4
Back 3 (Utterance 3) → Back 4 (Utterance 3)
.5(.4) * .5 = .1
Back 3 (Utterance 3) → Forward 4 (Utterance 1 or 2)
.5(.6) * .3 = .09
Forward 3 (Utterance 1 or 2) → Back 4 (Utterance 3)
.3(.7) * .5 = .105
Forward 3 (Utterance 1 or 2) → Forward 4 (Utterance 1 or 2)
.3(.3) * .3 = .027

Stage 3
Back 2 (Utterance 3) → Back 3 (Utterance 3)
.5(.4) * .1 = .02
Back 2 (Utterance 3) → Forward 3 (Utterance 1 or 2)
.5(.6) * .105 = .0315
Forward 2 (Utterance 1 or 2) → Back 3 (Utterance 3)
.3(.7) * .1 = .021
Forward 2 (Utterance 1 or 2) → Forward 3 (Utterance 1 or 2)
.3(.3) * .105 = .00945

Stage 2
Back 1 (Utterance 3) → Back 2 (Utterance 3)
.5(.4) * .0315 = .0063
Back 1 (Utterance 3) → Forward 2 (Utterance 1 or 2)
.5(.6) * .021 = .0063
Forward 1 (Utterance 1 or 2) → Back 2 (Utterance 3)
.3(.7) * .0315 = .006615
Forward 1 (Utterance 1 or 2) → Forward 2 (Utterance 1 or 2)
.3(.3) * .021 = .00189

Stage 1
Start → Back 1 (Utterance 3)
.4 * .0063 = .00252
Start → Forward 1 (Utterance 1 or 2)
.6 * .006615 = .003969
As shown, the most probable sequence of words is:

Start → Forward 1 (Utterance 1 or 2) → Back 2 (Utterance 3) → Forward 3 (Utterance 1 or 2) → Back 4 (Utterance 3) → End

The probability of this sequence is .003969, or about .397%.
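
The stage-by-stage work above can be cross-checked with a short script. This is a minimal sketch, assuming the probabilities that appear in the worked steps (start probabilities of .4 for Back and .6 for Forward, transitions of .4/.6 out of Back and .7/.3 out of Forward, and acoustic probabilities of .5 for Back and .3 for Forward at every step). It works backward from End, as the stages do, and all variable names are illustrative.

```matlab
% States: 1 = Back, 2 = Forward. Values are read off the worked steps.
names    = {'Back', 'Forward'};
startP   = [0.4 0.6];        % Start -> Back, Start -> Forward
transP   = [0.4 0.6;         % Back -> Back,    Back -> Forward
            0.7 0.3];        % Forward -> Back, Forward -> Forward
acoustic = [0.5 0.3];        % acoustic probability used for Back and Forward
T = 4;                       % number of input signals

best = zeros(T, 2);          % best(t,s): best probability from state s at step t to End
nxt  = zeros(T, 2);          % nxt(t,s): state at step t+1 on that best path
best(T, :) = acoustic * 1;   % last step: acoustic probability times transition to End (1)

for t = T-1:-1:1
    for s = 1:2
        % Reuse the values already computed for step t+1 (dynamic programming).
        [p, j] = max(transP(s, :) .* best(t+1, :));
        best(t, s) = acoustic(s) * p;
        nxt(t, s)  = j;
    end
end

% Pick the best first word, then follow the stored pointers forward.
[prob, s] = max(startP .* best(1, :));
seq = zeros(1, T);
seq(1) = s;
for t = 1:T-1
    seq(t+1) = nxt(t, seq(t));
end

fprintf('%s\n', strjoin(names(seq), ' -> '));
fprintf('probability = %.6f\n', prob);
```

Each best(t, s) entry is computed once and reused by both states at the previous step, which is the value re-use the question asks to see.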
