Scott Moisik
The Stormiest Seas of NLP
I’m the captain, and I’ve plotted the best (easiest) course possible!
Speech Synthesis and Recognition
• Lecture 10: Automatic Speech Recognition
YouTube Close-Caption Fail
gang ... killed instead of game ... killed; Chuckle World instead of Chacoan World
https://www.youtube.com/watch?v=aFPtc8BVdJk&hl=en-GB&gl=SG
Video of [gɑ] with audio of [bɑ]: an intermediate, [dɑ] (or [ɖɑ]?), tends to be
perceived because the visual and auditory signals are integrated (the McGurk effect)
Speech Recognition by Humans
Humans are about five times better than machines at speech recognition
(although machines are continuously getting better)
• Automatic speech recognition (ASR) is similar to human speech recognition
– The nature of lexical access (retrieving words from the mental lexicon)
• Word association: Word access is faster if a semantically related
word was recently heard (e.g., fishing primes bass in the sentence I
went fishing for bass)
• Repetition priming: Word access is faster if that word was heard
recently
Four Task Parameters of ASR
1) Vocabulary size
• Easy: limited task domains, e.g., 2-word (yes and no) or 10-word (digits)
• Difficult: open-ended vocabularies (20K+ words)
2) Fluency of speech
• Easy: each word uttered in isolation (and separated from others by
pauses); tends to be how humans speak to computers (simplified speech)
• Difficult: continuous speech where one word flows naturally into another
(i.e., at internal open junctures)
Four Task Parameters of ASR
3) Noise
• Easy: clean recording with good quality equipment and room acoustics
• Difficult: poor recording and loud background noise with other sources of
speech, chickens, etc.
4) Speaker variation
• Easy: recognizing the speech close to that which the system was trained
on (dialect and accent, physical characteristics)
• Difficult: recognizing the speech of foreign accents and children (small
vocal tracts with resonances that differ greatly from adults)
Our Goal
Our goal in this course is to describe the architecture of a large-vocabulary,
continuous, speaker-independent speech recognition system
• Large vocabulary: 20K to 60K words
• Continuous: words flow together naturally
• Speaker-independent: system can recognize the speech of individuals it
was not trained on
Steps in an ASR System
Automatic Speech Recognition (ASR) systems operate according to the
following steps:
1) Digital sampling of speech (recording the speech signal)
2) Acoustic signal processing (converting the speech samples into particular
measurable units)
3) Recognition of sounds, groups of sounds, and words
• This step may or may not use more sophisticated analysis of the
utterance to help (e.g., a [t] might sound like a [d], and so word
information might be needed)
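Steps 1 and 2 can be sketched in plain Python: pre-emphasis followed by splitting the samples into overlapping frames. The pre-emphasis coefficient, window size, and hop size below are common illustrative choices, not values specified in the lecture.

```python
# Sketch of basic acoustic signal processing (step 2), assuming
# 16 kHz samples, a 25 ms window (400 samples), and a 10 ms hop
# (160 samples) -- illustrative values only.

def preemphasize(samples, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def frame(samples, frame_len=400, hop=160):
    """Split the signal into overlapping analysis frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

signal = [float(n % 50) for n in range(1600)]  # 100 ms of fake audio
emphasized = preemphasize(signal)
frames = frame(emphasized)
print(len(frames), len(frames[0]))  # 8 400
```

Each frame would then be converted into measurable units (e.g., spectral features) before recognition.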
Types of ASR Systems
There are different types of systems that differ in terms of an
accuracy-robustness tradeoff:
• Accuracy: The system correctly recognizes the words spoken
• Robustness: The system performs well under different circumstances (e.g.,
it can handle speaker and recording variability)
• Speaker dependent: Works for a single speaker (e.g., dictation software)
– This type is accurate for the speaker it was trained on but not robust
because it cannot generalize to other speakers
• Speaker independent: Works for a speaker of a given variety of language
(e.g., American English, Singaporean Mandarin, etc.)
– This type is robust as it can operate with many different speakers but
it may have lower accuracy than speaker dependent systems
• Speaker adaptive: The system starts general but has machine learning
algorithms which allow it to improve accuracy over time with more
exposure to data and feedback
– The system may identify traits of the speaker and adapt by using
different models for that speaker class, making it also more robust
• ASR systems have differing sizes and types of vocabularies
– They range from tens of words to tens of thousands of words
• The more words, the more robust but the less accurate
– They are normally very domain-specific, e.g., flight vocabulary
• The more specific, the more accurate but the less robust
• Continuous speech vs. isolated-word systems:
– Continuous speech systems: Words are connected together and not
separated by pauses
• These are less accurate because natural juncture and reduction
processes make finding word boundaries difficult, but they can
handle more variation in what is said, so they are more robust
– Isolated-word systems: Single words recognized at a time, requiring
pauses to be inserted between words
• These are accurate because it is easier for the system to find word
boundaries but they are not robust because of the limitation to
isolated words (the user must adapt their speech to the system)
Word Error Rate in ASR
Word Error Rate (WER) is very important in Natural Language Processing in
general because it enables us to test systems objectively
• It works by comparing the system output to a target reference
– First, determine the number of substitutions (S), deletions (D), and
insertions (I) required to make the output and the reference match
(using an algorithm called Minimum Edit Distance...)
– Then normalize by dividing by the length (N) of the reference (and
interpret as a percentage, if you like)

WER = (S + D + I) / N
Reference: I want to recognize speech today
System: I want wreck a nice peach today
Evaluation: 2 substitutions (S), 1 deletion (D), 2 insertions (I)

WER = (2 + 1 + 2) / 6 × 100 = 83%
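The computation can be sketched as a standard word-level minimum edit distance (Levenshtein) dynamic program. One caveat worth noting: the hand alignment above (5 edits, 83%) is illustrative, while the dynamic program always finds the cheapest alignment, which for this sentence pair is 4 edits (WER ≈ 67%).

```python
# Minimal sketch of WER via word-level minimum edit distance
# (dynamic programming); not taken from any particular library.

def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)]

def wer(ref, hyp):
    """WER = (S + D + I) / N, where N is the reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = "I want to recognize speech today".lower().split()
hyp = "I want wreck a nice peach today".lower().split()
print(edit_distance(ref, hyp), round(wer(ref, hyp), 2))  # 4 0.67
```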
• WER correlates well with task difficulty
– Easy Task: Dictation ~1-10% WER
• System adapted to the speaker through training
• Low noise environment (e.g., quiet office or laboratory space)
– Hard task: Noisy room, multiple speakers 50%+ WER
• Reducing WER is always a good thing
– A WER of 0% means perfect results (assuming a correct reference)
• Competitions were held to see who could get the lowest WER
– Speech Recognition had 10 years of rapid improvement
– It has slowed down now
• Results of various DARPA (Defense Advanced Research Projects Agency, US
Department of Defense) competitions
https://www.youtube.com/watch?v=5FFRoYhTJQQ
Speech Synthesis and Recognition
• Lecture 10: Automatic Speech Recognition
The Noisy-Channel Model
• Assume that the acoustic input is the output of a process, and that behind
it is a target string of words, the source, which has been passed through a
“noisy channel”
• The “noise” makes it difficult to recover the original string of words
• Build a model of the channel so the distortion it causes to any source can
be applied to all possible sets of strings of words
• Choose the string of words which most closely matches the actual output
Noisy-channel Metaphor
[Diagram: a source sentence passes through the noisy channel to become the
received sentence; the decoder tries candidate sources (rhubarb is nice →
noisy 1, a hole in one → noisy 2, ..., I like rabbits → noisy n) and outputs
its best guess at the source: I like rabbits]
What is the most likely sentence out of all sentences in the
language L (e.g., English) given some acoustic input A?
Decoding in ASR
The term “decoding” is used in ASR in a special way:

Bayes' rule: P(x|y) = P(y|x) P(x) / P(y)

Applied to ASR: P(W|O) = P(O|W) P(W) / P(O)

W* = argmax_{W ∈ L} P(O|W) P(W) / P(O)

where O is the acoustic observation sequence and W is a candidate word string
Simplifying
Luckily, P(O) does not change: for each potential sentence W we have the same
acoustic observation O, so it does not tell us anything about maximizing the
function, and thus we can drop it:

W* = argmax_{W ∈ L} P(O|W) P(W)

• Likelihood, P(O|W): how probable is the observed data given the model?
• Prior, P(W): how probable is the given sentence?
P(W), Prior Probability
P(W), prior probability: computed by the language model L

W* = argmax_{W ∈ L} P(O|W) P(W)
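A minimal sketch of where P(W) could come from, using a bigram language model; the tiny corpus below is invented for illustration (real language models are trained on large corpora and smoothed):

```python
from collections import defaultdict

# Toy bigram language model: P(W) is the product of P(word | previous
# word), estimated from counts. Corpus invented for illustration.

corpus = [["<s>", "i", "like", "rabbits", "</s>"],
          ["<s>", "i", "like", "speech", "</s>"],
          ["<s>", "rabbits", "like", "speech", "</s>"]]

bigram = defaultdict(lambda: defaultdict(int))
context = defaultdict(int)
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        bigram[prev][word] += 1
        context[prev] += 1

def p(word, prev):
    """Maximum-likelihood bigram estimate P(word | prev)."""
    return bigram[prev][word] / context[prev]

# P("i like rabbits") = P(i|<s>) P(like|i) P(rabbits|like) P(</s>|rabbits)
prob = p("i", "<s>") * p("like", "i") * p("rabbits", "like") * p("</s>", "rabbits")
print(prob)
```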
P(O|W), Observation Likelihood
P(O|W), observation likelihood: computed by the acoustic model

W* = argmax_{W ∈ L} P(O|W) P(W)
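The basic scoring step of a Gaussian acoustic model is evaluating a feature vector under a Gaussian density; a minimal diagonal-covariance sketch, with made-up means, variances, and a fake 3-dimensional “MFCC” frame:

```python
import math

# Log-likelihood of a feature vector under a diagonal-covariance
# Gaussian. All parameter values below are invented for illustration.

def diag_gauss_loglik(x, mean, var):
    """Sum of per-dimension Gaussian log-densities."""
    ll = 0.0
    for xi, mu, v in zip(x, mean, var):
        ll += -0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
    return ll

frame = [0.0, 1.0, -0.5]        # one fake 3-d feature frame
state_mean = [0.0, 0.8, -0.4]   # one HMM state's Gaussian mean
state_var = [1.0, 1.0, 1.0]     # and its (diagonal) variances
print(diag_gauss_loglik(frame, state_mean, state_var))
```

Frames close to a state's mean score higher, so the decoder prefers word strings whose states fit the observed features.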
The Decoding Architecture
[Diagram: the speech input is converted to MFCC feature vectors O; a Gaussian
acoustic model supplies P(O|W) and a language model supplies P(W); the decoder
outputs the best word string, e.g., “I like rabbits”]
HMM States in the Acoustic Model
• Application dependent
– In simple tasks these could be whole words: Yes|No, 0|1|2|...|9
– In complex tasks these are probably phones (not whole words)
• A word is then a concatenation of the HMM states (phones)
• No arbitrary transitions (i.e., cannot go from any state to any other state,
cannot go to earlier states)
• Transitions constrained to be sequential like phones in speech
• Loops allow phones to repeat and this accounts for variation in phone
duration
[Diagram: left-to-right HMM for the phone sequence [z ih t], with three states
per phone (z0 z1 z2, ih0 ih1 ih2, t0 t1 t2), flanked by silence (sil) states]
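The constrained topology described above can be sketched as a transition table: each state may loop to itself (accounting for variable phone duration) or step to the next state, with no backward or arbitrary jumps. The probabilities are invented for illustration, and a full model would also give the final state an exit transition:

```python
# Left-to-right HMM topology sketch: self-loop or advance only.
# State names follow the three-states-per-phone figure; the 0.5
# self-loop probability is an invented illustrative value.

states = ["z0", "z1", "z2", "ih0", "ih1", "ih2", "t0", "t1", "t2"]

def allowed_transitions(states, self_loop=0.5):
    """Each state: loop to itself or move to the next state only."""
    trans = {}
    for i, s in enumerate(states):
        trans[s] = {s: self_loop}              # loop: phone lasts longer
        if i + 1 < len(states):
            trans[s][states[i + 1]] = 1.0 - self_loop
    return trans

A = allowed_transitions(states)
print(sorted(A["ih1"]))  # only itself and the next state
```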
The Decoding Architecture
[Diagram repeated: MFCC feature vectors O → Gaussian acoustic model P(O|W);
language model P(W); decoder output “I like rabbits”]