
Speech Synthesis and Recognition

Automatic Speech Recognition


Lecture 10

Scott Moisik

The Stormiest Seas of NLP

You will be challenged!


Survive! Learn!

I’m the captain, and I’ve plotted the best (easiest) course possible!
Speech Synthesis and Recognition
• Lecture 10: Automatic Speech Recognition

– Speech recognition by humans


– ASR overview and evaluation (WER)
– ASR architecture
• More practice with KlattGrid
YouTube Closed-Caption Fail

"gang ... killed" instead of "game ... killed"; "Chuckle World" instead of "Chacoan World"
YouTube Closed-Caption Fail

"chocolate" instead of "Chacoan"


Speech Recognition by Humans
Humans are about five times better than machines at speech recognition
(although machines are continuously getting better)
• Automatic speech recognition (ASR) is similar to human speech recognition
– Some aspects of the human auditory system have been integrated into
speech recognition systems
• The processing of the speech signal in ASR systems employs several
techniques which are based on how human hearing works (i.e., in
relation to speech perception), especially the non-linear mapping
of intensity and frequency
[Figure: examples of linear vs. nonlinear functions, illustrating the nonlinear mapping of intensity and frequency]
Speech Recognition by Humans
Humans are about five times better than machines at speech recognition
(although machines are continuously getting better)
• Automatic speech recognition (ASR) is similar to human speech recognition
– The nature of lexical access (retrieving words from the mental lexicon)
• Word frequency: More frequent words are recognized faster and
more robustly (e.g., in noisy environments), and from less stimulus information
ASR uses frequency information in N-gram language models

• Parallelism: Multiple competing words activated at the same time


ASR recognizes words and word strings by maintaining many
competing hypotheses about the correct utterance and
choosing the most probable of these
Speech Recognition by Humans
Humans are about five times better than machines at speech recognition
(although machines are continuously getting better)
• Automatic speech recognition (ASR) is similar to human speech recognition
– The nature of lexical access (retrieving words from the mental lexicon)
• Cue-based processing: Information from many sensory modalities is
integrated in word retrieval, making it more robust
– The phoneme restoration effect makes it possible to detect
words with missing phonemes (e.g., hi<cough>tory → history)
– The McGurk effect shows that visual input is integrated (and
can even interfere) with speech perception
ASR incorporates multiple sources of information into the process
of recognition, which makes it more robust (although not many
systems integrate visual information yet)
The McGurk Effect
Is he saying [bɑ bɑ]… or [dɑ dɑ]… or [gɑ gɑ]?

https://www.youtube.com/watch?v=aFPtc8BVdJk&hl=en-GB&gl=SG
Video of [gɑ] with audio of [bɑ]: an intermediate sound, [dɑ] (or perhaps [ɖɑ]), tends to be perceived
because the visual and auditory signals are integrated
Speech Recognition by Humans
Humans are about five times better than machines at speech recognition
(although machines are continuously getting better)
• Automatic speech recognition (ASR) is similar to human speech recognition
– The nature of lexical access (retrieving words from the mental lexicon)
• Word association: Word access is faster if a semantically related
word was recently heard (e.g., fishing primes bass in the sentence I
went fishing for bass)
• Repetition priming: Word access is faster if that word was heard
recently

ASR systems have been developed which attempt to model both of these aspects of human lexical access, to good effect
Speech Recognition by Humans
Humans are about five times better than machines at speech recognition
(although machines are continuously getting better)
• But ASR also differs from human speech recognition in several ways
– Processing style: ASR systems emphasize optimization over the entire
utterance but human listeners emphasize on-line processing
• This means that we incrementally process incoming speech,
segmenting it and assigning interpretations as the speech is
received
Shadow Activity
Find a partner and take turns trying to shadow (copy) each other’s speech as
fast as you can (take turns: one leader, one shadow)
• Apparently we can shadow another person’s speech with as little as a
250ms delay
– Errors made by shadowers are appropriate to the syntactic and
semantic contexts, indicating interpretation happens within this 250ms
Speech Recognition by Humans
Humans are about five times better than machines at speech recognition
(although machines are continuously getting better)
• But ASR also differs from human speech recognition in several ways
– Processing style: ASR systems emphasize optimization over the entire
utterance but human listeners emphasize on-line processing
• The TRACE neural network model attempted to account for online
processing as parallel competing activation across three levels of
linguistic structure (features, phonemes, and words)
[Figure: the TRACE interactive-activation network]
Detection of a feature excites a phoneme “hypothesis”, which in turn excites a word “hypothesis”
Activation of a word inhibits the other words
([p] & [t] voiced? Wah? Evidently, not phoneticians!)
McClelland, J.L., & Elman, J.L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.
Speech Recognition by Humans
Humans are about five times better than machines at speech recognition
(although machines are continuously getting better)
• But ASR also differs from human speech recognition in several ways
– Neighbourhood effects: Human word perception is influenced by a
word’s neighbourhood (the words which are phonologically similar to it)
• Words with large neighbourhoods are accessed more slowly than
those with fewer neighbours
– Prosodic knowledge: ASR systems do not tend to model prosodic
structure although it has been shown that humans do integrate this
type of information
Incorporation of prosodic knowledge could greatly improve ASR
and is thus an important future direction of research
Speech Synthesis and Recognition
• Lecture 10: Automatic Speech Recognition

– Speech recognition by humans


– ASR overview and evaluation (WER)
– ASR architecture
• More practice with KlattGrid
Automatic Speech Recognition
There are two broad goals…

1) Automatic speech recognition (ASR): receiving an acoustic signal and mapping it to a string of words
We will focus on this topic

2) Automatic speech understanding (ASU): attempting to derive the meaning of the sentence underlying the string of words
This is an advanced topic left for future study, but the foundations for it are laid in this course
Automatic Speech Recognition
ASR is not a “solved problem” in general, but limited task-domain applications
have been successful
• Applications: mainly human-computer interaction (alternative means of input
besides the keyboard, mouse, and touch screen)
– Hands- or eyes-busy tasks: Operating machinery or manipulating objects
(or handling babies!)
– Telephony: User verbally responds to prompts for information (and then
curses at the system)
– Disabilities: Text of programs with audio presented on screen (i.e., closed
captioning)
– Dictation: Transcription of a monologue by a single speaker (popular
amongst lawyers)
– Spying: many agencies run ASR on phone conversations and search for
keywords
– Data processing: Indexing audio data
– Research: Automatic speech segmentation for phonetic research
Four Task Parameters of ASR
1) Vocabulary size

• Easy: Limited task domains, e.g., 2-word (yes and no) or 10-word (digits)
• Difficult: open-ended vocabularies (20K+ words)
2) Fluency of speech
• Easy: each word uttered in isolation (and separated from others by
pauses); tends to be how humans speak to computers (simplified speech)
• Difficult: continuous speech where one word flows naturally into another
(i.e., at internal open junctures)
Four Task Parameters of ASR
3) Noise

• Easy: clean recording with good quality equipment and room acoustics
• Difficult: poor recording and loud background noise with other sources of
speech, chickens, etc.
4) Speaker variation
• Easy: recognizing speech close to that which the system was trained
on (dialect and accent, physical characteristics)
• Difficult: recognizing the speech of foreign accents and children (small
vocal tracts with resonances that differ greatly from adults)
Our Goal
Our goal in this course is to describe the architecture of a large-vocabulary,
continuous, speaker-independent speech recognition system
• Large vocabulary: 20K to 60K words
• Continuous: words flow together naturally
• Speaker-independent: system can recognize the speech of individuals it
was not trained on
Steps in an ASR System
Automatic Speech Recognition (ASR) systems operate according to the
following steps:
1) Digital sampling of speech (recording the speech signal)
2) Acoustic signal processing (converting the speech samples into particular
measurable units)
3) Recognition of sounds, groups of sounds, and words
• This step may or may not use more sophisticated analysis of the
utterance to help (e.g., a [t] might sound like a [d], and so word
information might be needed)
Types of ASR Systems
There are different types of system that differ in terms of an accuracy-robustness tradeoff:
• Accuracy: The system correctly recognizes the words spoken
• Robustness: The system performs well under different circumstances (e.g.,
it can handle speaker and recording variability)
Types of ASR Systems
There are different types of system that differ in terms of an accuracy-robustness tradeoff:
• Speaker dependent: Works for a single speaker (e.g., dictation software)
– This type is accurate for the speaker it was trained on but not robust
because it cannot generalize to other speakers
• Speaker independent: Works for a speaker of a given variety of language
(e.g., American English, Singaporean Mandarin, etc.)
– This type is robust as it can operate with many different speakers but
it may have lower accuracy than speaker dependent systems
• Speaker adaptive: The system starts general but has machine learning
algorithms which allow it to improve accuracy over time with more
exposure to data and feedback
– The system may identify traits of the speaker and adapt by using
different models for that speaker class, making it also more robust
Types of ASR Systems
There are different types of system that differ in terms of an accuracy-robustness tradeoff:
• ASR systems have differing sizes and types of vocabularies
– They range from tens of words to tens of thousands of words
• The more words, the more robust but the less accurate
– They are normally very domain-specific, e.g., flight vocabulary
• The more specific, the more accurate but the less robust
Types of ASR Systems
There are different types of system that differ in terms of an accuracy-robustness tradeoff:
• Continuous speech vs. isolated-word systems:
– Continuous speech systems: Words are connected together and not
separated by pauses
• These are less accurate because natural juncture and reduction
processes make finding word boundaries difficult, but they can
handle more variation in what is said, so they are more robust
– Isolated-word systems: Single words recognized at a time, requiring
pauses to be inserted between words
• These are accurate because it is easier for the system to find word
boundaries but they are not robust because of the limitation to
isolated words (the user must accommodate her speech for the
system)
Word Error Rate in ASR
Word Error Rate (WER) is very important in Natural Language Processing in
general, as it enables us to objectively evaluate systems
• It works by comparing the system output to a target reference
– First, the number of substitutions (S), deletions (D), and insertions (I)
required to make the output and reference match is determined
(using an algorithm called Minimum Edit Distance)
– Then normalize by dividing by the length (N) of the reference (and
interpret as percentage, if you like)
WER = (S + D + I) / N
Reference: I want to recognize speech today
System: I want wreck a nice peach today
Evaluation: S S S I (3 substitutions, 0 deletions, 1 insertion, under a minimal alignment)

WER = (3 + 0 + 1) / 6 × 100 ≈ 67%
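
To make the computation concrete, here is a minimal word-level WER sketch in Python. It uses the standard minimum-edit-distance dynamic program with unit costs for substitutions, deletions, and insertions; the function name and layout are illustrative rather than taken from any particular toolkit.

def wer(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("I want to recognize speech today",
          "I want wreck a nice peach today"))   # 4 edits / 6 reference words ≈ 0.67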
Word Error Rate in ASR
Word Error Rate (WER) is very important in Natural Language Processing in
general, as it enables us to objectively evaluate systems
• WER correlates with the difficulty of the task
– Easy Task: Dictation ~1-10% WER
• System adapted to the speaker through training
• Low noise environment (e.g., quiet office or laboratory space)
– Hard task: Noisy room, multiple speakers 50%+ WER
• Reducing WER is always a good thing
– A WER of 0% means perfect results (assuming a correct reference)
• Competitions were held to see who could get the lowest WER
– Speech Recognition had 10 years of rapid improvement
– It has slowed down now
Word Error Rate in ASR
Word Error Rate (WER) is very important in Natural Language Processing in
general, as it enables us to objectively evaluate systems
• Results of various DARPA (Defense Advanced Research Projects Agency, US
Department of Defense) competitions

Task                             Vocab    WER (%)   WER (%) adapted
Digits                           11       0.4       0.2
Dialogue (travel)                21,000   10.9      —
Dictation (Wall Street Journal)  5,000    3.9       3.0
Dictation (Wall Street Journal)  20,000   10.0      8.6
Dialogue (noisy, army)           3,000    42.2      31.0
Phone Conversations              4,000    41.9      31.0
Why is ASR So Difficult?
Several factors conspire to make ASR difficult

• Speaker variability (physical and language differences)


– Gender, accent (both dialectal and foreign), dialect (especially lexical
differences), individual differences, idiolect
• Speech and language consist of many rare events
• 300 out of 2000 diphones in the core set for the AT&T NextGen
system occur only once in a 2-hour speech database
• In an English newswire text corpus with 197,214 lines of
text, we can collect character 4-grams and 5-grams (a small counting sketch follows this list)
– 7% of lines contain at least one 4-gram that only occurs once
in 10,000,000 4-grams
– 21% of lines contain at least one 5-gram that only occurs once
in 10,000,000 5-grams
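
As a rough illustration of how such rare-event statistics can be gathered (not the actual study), here is a small Python sketch that counts character 4-grams line by line and checks which lines contain a 4-gram seen only once; the toy corpus lines are placeholders, not the newswire data.

from collections import Counter

def char_ngrams(line, n=4):
    # all overlapping character n-grams in one line of text
    return [line[i:i + n] for i in range(len(line) - n + 1)]

# placeholder lines standing in for the 197,214-line newswire corpus
corpus = ["police said the suspect fled the scene",
          "markets rallied sharply on the news"]

counts = Counter(g for line in corpus for g in char_ngrams(line))
singletons = {g for g, c in counts.items() if c == 1}

# fraction of lines containing at least one 4-gram that occurs only once overall
fraction = sum(any(g in singletons for g in char_ngrams(line))
               for line in corpus) / len(corpus)
print(fraction)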
Eleven!

https://www.youtube.com/watch?v=5FFRoYhTJQQ
Speech Synthesis and Recognition
• Lecture 10: Automatic Speech Recognition

– Speech recognition by humans


– ASR overview and evaluation (WER)
– ASR architecture
• More practice with KlattGrid
ASR Approach
Given a received waveform (from an audio recording):

• Assume that it is the output of a process and that behind this is a target
string of words, the source, which has been passed through a “noisy-
channel”
• The “noise” makes it difficult to recover the original string of words
• Build a model of the channel so that the distortion it causes to any source can
be applied to all possible candidate strings of words
• Choose the string of words which most closely matches the actual output
Noisy-channel Metaphor
[Figure: the noisy-channel metaphor]

source sentence ("I like rabbits") → input → noise → output → received sentence ("???")

decoder: run every candidate source sentence through the channel model
rhubarb is nice → noisy 1
a hole in one → noisy 2
...
I like rabbits → noisy n
best guess at source: "I like rabbits"

What is the most likely sentence out of all sentences in the
language L (e.g., English) given some acoustic input A?
Decoding in ASR
The term “decoding” is used in ASR in a special way:

• Essentially, the acoustic speech product is an encoding of the underlying words
• This is just like how a number cypher can be used to encode letters of the
alphabet to give a code
– Given 1265119 and assuming a simple numeric encoding of the
alphabet (A = 1, B = 2, ...), we need to “decode” this into a message
and there are many possible parses
• 1 2 6 5 1 1 9
• 12 6 5 11 9
• 1 26 5 1 19
• ...

The Viterbi decoding algorithm allows us to choose the most probable underlying sequence that was encoded into the observable units
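
A toy Python sketch of the cipher analogy: it enumerates every legal parse of the digit string into letters (A = 1, ..., Z = 26). Real ASR decoding faces the same kind of ambiguity, but Viterbi search scores the hypotheses with probabilities instead of listing them all.

def decodings(digits):
    # every way to split the digit string into letters A=1 ... Z=26
    if not digits:
        return [""]
    parses = []
    for k in (1, 2):                       # consume one or two digits
        if len(digits) < k:
            continue
        head = digits[:k]
        if not head.startswith("0") and 1 <= int(head) <= 26:
            letter = chr(ord("A") + int(head) - 1)
            parses += [(letter + " " + rest).strip()
                       for rest in decodings(digits[k:])]
    return parses

for parse in decodings("1265119"):
    print(parse)        # "A B F E A A I", "L F E K I", "A Z E A S", ...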
Noisy-channel Model
To create the noisy-channel model, we need to specify a metric for
determining the best match to the decoder input
• Probability is used, specifically Bayesian inference
• We cannot literally search through all possible sentences (the set of all
English sentences is too big!)
– Use an efficient algorithm to search the space of sentences,
specifically the Viterbi decoding algorithm
Find the Best Match to the Input
We have the following ingredients then:

• Treat the acoustic input O as a sequence of t observations:
– O = o1, o2, o3, ..., ot
• Treat any given sentence W as a string of n words:
– W = w1, w2, w3, ..., wn
The best match to the acoustic input O is the sentence W* such that

W* = argmax_{W∈L} P(W|O)

i.e., find the sequence W which maximizes the probability P(W|O)

How do we compute P(W|O)? Not easy, since we’ve never seen O before!

We use Bayes’ rule to simplify this expression into something easier to work with
Bayes’ Rule to find P(W|O)
Any probability P(x|y) can be reframed using Bayes’ rule to help us compute
the probabilities
Bayes’ Rule (general): P(x|y) = P(y|x) P(x) / P(y)

Bayes’ Rule applied to P(W|O): P(W|O) = P(O|W) P(W) / P(O)

P(O|W): probability of the acoustic observation sequence O given the word sequence W
P(W): probability of the word sequence W
P(O): probability of the acoustic observation sequence O

W* = argmax_{W∈L} P(O|W) P(W) / P(O)
Simplifying
Luckily, P(O) does not change: for each potential sentence W we have the
same acoustic observation O, so it tells us nothing about maximizing
the function, and we can drop it

W* = argmax_{W∈L} P(O|W) P(W)

likelihood P(O|W): how probable is the observed data given the model?
prior P(W): how probable is the given sentence?
P(W), Prior Probability
P(W), prior probability: computed by language model L

• An example of a language model would be an N-gram model

W* = argmax_{W∈L} P(O|W) P(W)

the prior P(W) is supplied by the language model L
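
As a small illustration (with invented counts, not from any real corpus), P(W) under a bigram language model is just a product of conditional word probabilities estimated from corpus counts:

from collections import defaultdict

# toy bigram language model: P(W) = product of P(w_i | w_{i-1}), invented counts
bigram_counts = {("<s>", "I"): 900, ("I", "like"): 300, ("like", "rabbits"): 4,
                 ("<s>", "eye"): 2, ("eye", "like"): 1}
unigram_counts = defaultdict(int, {"<s>": 1000, "I": 950, "like": 400, "eye": 5})

def bigram_prob(prev, word):
    # maximum-likelihood estimate; real LMs would smooth these counts
    return bigram_counts.get((prev, word), 0) / max(unigram_counts[prev], 1)

def sentence_prob(words):
    prob = 1.0
    for prev, word in zip(["<s>"] + words, words):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob("I like rabbits".split()))    # ≈ 0.0028
print(sentence_prob("eye like rabbits".split()))  # much smaller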
P(O|W), Observation Likelihood
P(O|W), observation likelihood: computed by the acoustic model

• The acoustic model is constructed in terms of HMMs, which can be used to compute this probability

W* = argmax_{W∈L} P(O|W) P(W)

the likelihood P(O|W) is supplied by the acoustic model
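
As a toy illustration of this argmax (not a real recogniser), the Python sketch below scores a handful of invented candidate sentences by combining an invented acoustic likelihood P(O|W) with an invented prior P(W) in log space and picks the best; a real system gets these terms from the HMM acoustic model and an N-gram language model, and searches with Viterbi rather than enumerating candidates.

import math

# candidate word strings with made-up probabilities:  P(O|W)  P(W)
candidates = {
    "I like rabbits":   (1e-4, 1e-6),
    "eye like rabbits": (1e-4, 1e-9),
    "I like rab bits":  (2e-4, 1e-11),
}

def score(word_string):
    acoustic, prior = candidates[word_string]
    # work in log space so that tiny probabilities do not underflow
    return math.log(acoustic) + math.log(prior)

best = max(candidates, key=score)
print(best)      # "I like rabbits"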
The Decoding Architecture
[Figure: the decoding architecture]

speech input ("I like rabbits")
→ cepstral feature extraction → MFCC feature vectors O
→ Gaussian acoustic model → phone likelihoods P(O|W)
→ Viterbi decoder, which combines the phone likelihoods with the HMM lexicon and the N-gram language model P(W)
→ best word string W* ("I like rabbits")

W* = argmax_{W∈L} P(O|W) P(W)
Hidden Markov Model and Speech
What is hidden from us: the sequence of words W
What we have: an acoustic observation sequence O
Goal: Map from the acoustic observations O (known) to the
sequence of words W (unknown)

What is the nature of the acoustic observations?


• Each observation is stored as a feature vector containing a
number of different pieces of useful acoustic information
• Observations are typically sampled every ~10 ms (so we have
100 feature vectors per second)

More on the acoustic observations next lecture
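
For a feel of what one observation every ~10 ms means in practice, here is a rough NumPy sketch that slices a waveform into overlapping analysis frames (a ~25 ms window advanced by 10 ms, so about 100 frames per second). The window and shift values are typical choices rather than anything prescribed here, and each frame would then be turned into a feature vector (e.g., MFCCs, next lecture).

import numpy as np

def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    # chop the waveform into overlapping analysis frames
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = [samples[start:start + frame_len]
              for start in range(0, len(samples) - frame_len + 1, shift)]
    return np.array(frames)

fs = 16000                                  # 16 kHz sampling rate
one_second = np.random.randn(fs)            # stand-in for one second of real speech
frames = frame_signal(one_second, fs)
print(frames.shape)                         # ~ (98, 400): roughly 100 frames per second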


Hidden Markov Model and Speech
Nature of the hidden states

• Application dependent
– In simple tasks these could be whole words: Yes|No, 0|1|2|...|9
– In complex tasks these are probably phones (not whole words)
• A word is then a concatenation of the HMM states (phones)

the word six modeled as a left-to-right HMM, one state per phone:

S0 → s1 → ih2 → k3 → s4 → E5
(forward transitions a01, a12, a23, a34, a45; self-loop transitions a11, a22, a33, a44 on the phone states)
Hidden Markov Model and Speech
Nature of the hidden states

• No arbitrary transitions (i.e., cannot go from any state to any other state,
cannot go to earlier states)
• Transitions constrained to be sequential like phones in speech
• Loops allow phones to repeat and this accounts for variation in phone
duration

the same left-to-right HMM for six: transitions only move forward, and the self-loops a11, a22, a33, a44 let each phone state repeat

S0 → s1 → ih2 → k3 → s4 → E5
Hidden Markov Model and Speech
Nature of the hidden states

• Finer-grained representation of phones as having internal states or subphones is desirable

a single phone (e.g., [s]) with internal temporal structure:

S0 → s1 → s2 → s3 → E4
(forward transitions a01, a12, a23, a34; self-loops a11, a22, a33 on the subphone states)

s1 = onset: initial movement into the constriction
s2 = sustain: held part of the constriction
s3 = release: movement away from the constriction
Hidden Markov Model and Speech
Nature of the hidden states

• Finer-grained representation of phones as having internal states or subphones is desirable

the word sit modeled with three phones, each built from three subphone states:

S0 → s0 s1 s2 → ih0 ih1 ih2 → t0 t1 t2 → E5


Hidden Markov Model and Speech
Nature of the hidden states

• Finer-grained representation of phones as having internal states or subphones is desirable

the words sit and zit modeled with three phones, each built from three subphone states, with silence (sil) at the word edges:

S0 → sil → s0 s1 s2 → ih0 ih1 ih2 → t0 t1 t2 → sil → E5 (sit)
S0 → sil → z0 z1 z2 → ih0 ih1 ih2 → t0 t1 t2 → sil → E5 (zit)

sil = silence
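
To make the decoding over such state graphs concrete, below is a minimal Viterbi sketch in Python over a left-to-right HMM like the six model above. All transition and emission probabilities are invented for illustration; a real system would score MFCC frames with Gaussian (or neural) acoustic models.

import numpy as np

states = ["s", "ih", "k", "s2"]
log_trans = np.log(np.array([            # row: from-state, column: to-state
    [0.6, 0.4, 0.0, 0.0],                # self-loop or move one state right only
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]) + 1e-12)                              # avoid log(0)

# log P(o_t | state) for a short observation sequence (invented numbers)
log_emit = np.log(np.array([
    [0.80, 0.10, 0.05, 0.05],            # t = 0 looks like [s]
    [0.20, 0.60, 0.10, 0.10],            # t = 1 looks like [ih]
    [0.05, 0.15, 0.70, 0.10],            # t = 2 looks like [k]
    [0.10, 0.10, 0.20, 0.60],            # t = 3 looks like the final [s]
]))

T, N = log_emit.shape
delta = np.full((T, N), -np.inf)         # best log score ending in each state
backptr = np.zeros((T, N), dtype=int)
delta[0, 0] = log_emit[0, 0]             # must start in the first state

for t in range(1, T):
    for j in range(N):
        scores = delta[t - 1] + log_trans[:, j]
        backptr[t, j] = int(np.argmax(scores))
        delta[t, j] = scores[backptr[t, j]] + log_emit[t, j]

# backtrace from the best final state
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(backptr[t, path[-1]])
print([states[i] for i in reversed(path)])   # ['s', 'ih', 'k', 's2']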
The Decoding Architecture
[Recap: the decoding architecture]

speech input ("I like rabbits")
→ cepstral feature extraction → MFCC feature vectors O
→ Gaussian acoustic model → phone likelihoods P(O|W)
→ Viterbi decoder, which combines the phone likelihoods with the HMM lexicon and the N-gram language model P(W)
→ best word string W* ("I like rabbits")

W* = argmax_{W∈L} P(O|W) P(W)
