
Speech Synthesis and Recognition

Introduction
Lecture 1

Scott Moisik
August 17, 2017

Administrative Stuff
Key details:

• Coordinator: Scott Moisik, scott.moisik@ntu.edu.sg


• Lecture Time: 10:30 – 13:30, Fridays
• Office Hours: HSS-03-51, 15:30 – 17:30, Thursdays
• Venue: Tutorial Room +109 (Block SS4, SS4-01-37)
• Readings: To be posted on NTUlearn
Outline and Schedule

• Available on NTUlearn (in the course outline)


Assessment
Continuous Assessment (100%):
• Quiz 1 (20%)
– Mainly short answer
• Quiz 2 (20%)
– Mainly short answer
• Assignment 1 (25%): Synthesis
– Using Praat and VocalTractLab to create synthetic speech
• Assignment 2 (25%): Recognition
– Designing a speech recognition system “on paper”
• Class participation (10%)
– Attendance of lectures
– Attempting in-class activities
Course Requirements
• Bring your laptop computer to class for working on certain in-class exercises
• Also bring your stylish cell phone (aka “hand phone”)
Lecture Notes Policy
My policy is to provide the lecture notes after the lecture (usually on the same day if possible)
• There are several reasons for this
– It encourages you to take notes
– I have more time to fix mistakes before they propagate
• Note taking hint
– Do not write everything on my slides down
– Write the slide number down and take notes about any verbal
comments I make that are not on the slides (if you find them worth
noting)
Course Continuity
HG4071, The Meat of Speech (Anatomy & Physiology)
• Goes deep into the structure and function of the vocal tract and the
hearing apparatus
And we do sculpting!
Course Continuity
HG4070: Experimental Phonetics

• Focuses on experimental design, data collection, processing, and (statistical) analysis (in Praat and R/R Studio)
• Become a mad (phonetic-phonological) scientist by developing an
experiment
My Background
Canadian!!!
My Background
Saskatchewan, the most boring province
My Background
Newfoundland; Vancouver Island, BC


My Background
Manitoba; Ontario
My Background
Saskatchewan, not a good tattoo

small bend!
My Background
Singapore, climate data

Country with the least amount of variation in temperature on earth?

Singapore page, Wikipedia


My Background
Regina, Saskatchewan: Canada’s Siberia

−59.2 ºC!!!

Regina page, Wikipedia


My Background
Why the palm trees?

Fargo, 1996, Polygram Filmed Entertainment


Palm Tree Appreciation
My Background

• Not a computer scientist

• Not a mathematician
• Not an engineer
• Not a physicist
• I’m an artist! Then a linguist!
– Life takes you on a strange path
• Yet my work in linguistics has led me to need to know something about all
of these things
My Line of Work
• Phonetics and phonology

• Larynx
• Phonetic instrumentation
• Vocal tract and speech modeling
• Biomechanics and motor control
– ArtiSynth: Talk to me if you are interested
• Language and Genetics
– G[ɜ]bils project: Also looking for students
My Line of Work
Where it gets technical:

• Computer graphics and 3D rendering


• Approximate physical systems with models
• Acoustic models (e.g., transmission lines)
• Numerical methods for differential equations
• Image analysis: Optical flow, active shape
• Morphology: Geometric Morphometrics, etc.
Moisik @ Work
• Live laryngoscopy: looking at my larynx!
One of my Favourite Words
Amis, Austronesian

[ʡɪsʡɪs] ‘to cut hair’


An Early 3D Larynx Model
Computer Modeling: [ʡ]
The ArtiSynth Larynx
MRI of [!] clicks

Data from ArtiVarK, a project of G[ɜ]bils, Dan Dediu (PI), MPI for Psycholinguistics
Computer Modeling: [!]

created with ArtiSynth, www.artisynth.org, UBC


First, a bit of comedy...

https://www.youtube.com/watch?v=ias31By60N8
Speech Synthesis and Recognition
Lecture 1:

• Computer speech: science fiction and science fact


• Natural Language Processing (NLP)
• Challenges for machine speech
The Warning
Real language can get dirty sometimes (and I like to acknowledge and
sometimes play with this fact)
• But, I respect your values, too, so I will warn you
Materials
• Quizzes and exams: All materials are in the lectures (unless otherwise
announced), no surprises: quick, easy, painless

Quick, Easy, and Painless

OK, maybe not so quick...


Sailing the Seas of NLP
Natural Language Processing
Captain’s Philosophy:

• We are explorers
• We are ALL learners
• We are here to have fun (life is short!!!)
• Be brave: Some of these ideas might be challenging
• Some of these ideas are really damn cool: I hope I can inspire you to feel
the same way
Sowing Ideas Like Seeds
You will encounter some difficult concepts in this course and even some equations, but take heed: my primary objective is to sow an idea, like a seed, in your mind!

Take a Zen approach:


Try to relax and just be
one with the equation

This is not a math class: no calculators will be required!
Linear Algebra
Linear algebra underlies almost all of the math in speech synthesis and
recognition but also pays huge dividends elsewhere (e.g., statistical modeling)
High school made me hate math
Gilbert Strang helped me see its beauty...

Do not miss catching Gilbert Strang’s feverish passion for linear algebra

The Fundamental Theorem of Linear Algebra


or...
Gilbert Strang’s favourite picture

Watch Gilbert Strang’s lectures here (MIT OpenCourseWare):


https://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/index.htm
Bring Your Laptop
Be sure to bring your laptop to lecture because I will try to give everyone time
to build their skills in class
Praat
Praat is powerful and free software for doing acoustic phonetic analysis

• Praat means “talk” or “speak” in Dutch (e.g., “Praat Nederlands met me”, “Speak Dutch with me”)
• http://www.fon.hum.uva.nl/praat/

We will make extensive use of Praat in this course


VocalTractLab
The best articulatory speech synthesis application available

• Free to download: http://www.vocaltractlab.de/


• Developed by Peter Birkholz (University of Dresden)
– I have obtained a course registration to unlock all of the features
(more on this later)
Peter Birkholz,
a friendly chap!
Speech Synthesis and Recognition
Lecture 1:

• Computer speech: science fiction and science fact


• Natural Language Processing (NLP)
• Challenges for machine speech
Consider Robots
Consider Robots

robots holding a conversation: epic fail


Science Fiction and Fact
Winston thought for a moment, then pulled the speakwrite towards him and began
dictating in Big Brother’s familiar style: a style at once military and pedantic, and, because
of a trick of asking questions and then promptly answering them [..], easy to imitate.
George Orwell, Nineteen Eighty-Four, 1949.

Dictation software is definitely science fact


But where is the line between science fact and fiction now for speech processing?
Is it a solved problem?
Recognition vs. Understanding
Luckily (?) or unluckily, our dictation software does not understand what is
being dictated to it: it only recognizes speech
• If the software did understand what was being said, recognition might
improve because competing hypotheses as to what was said could be
ruled out by using the contextual information associated with what the
document is about
Is it “recognize speech” or “wreck a nice beach”?
• Spoken Language Understanding is much more complex and involves
modeling of the higher-order structure of language (syntax, semantics,
pragmatics, and general aspects of knowledge required for
comprehension)
– Getting to this level involves the study of Natural Language Processing
(NLP) and is not the primary focus of this course
• If you are curious, you might want to look into other courses under
the Language and Technology concentration
Science Fiction
Computers and robots in science fiction almost always seem to process
language effortlessly
• The user never seems to need to repeat an utterance or to get frustrated that there is a misunderstanding

A robot character that had trouble with speech recognition would not make for a very menacing villain!
Science Fiction
Do you read me, Hal? Sorry, Dave, I do not read you. You are not a book.

https://www.youtube.com/watch?v=UgkyrW2NiwM
2001: A Space Odyssey, 1968, Metro-Goldwyn-Mayer
Science Fiction
Real-life systems are much more inflexible and highly constrained to limited
conditions such as speaker, accent, style, content, and environment
• Variations in these can result in a massive decline in performance

The problem is that systems are designed with many assumptions in mind about what is important to model (too much “built-in” knowledge): many of these assumptions could be inadequate or incorrect

If systems could develop their own internal representation of the problem, performance could be improved
Science Fiction

Talk about speech recognition: Did you hear the Ewok say “That guy’s wise”?

Return of the Jedi (Star Wars: Episode VI), 1983, Lucas Film
Science Fiction

https://www.youtube.com/watch?v=J5TAnU7gHws

Be Right Back (Episode 1, Series 2), Black Mirror, 2013, Zeppotron




Or Science Fact?

https://www.youtube.com/watch?v=Mp9VqigoVg8

Beyond Death, The Story of God with Morgan Freeman, Revelations Entertainment
Science Fiction
The character Data from Star Trek: TNG is a fascinating case where, although
speech recognition is basically flawless, there are still difficulties associated
with world knowledge, pragmatics, and socializing

https://www.youtube.com/watch?v=HiIlJaSDPaA
All Good Things (Season 7, Episode 25), Star Trek: The Next Generation, 1994
Science Fiction
The character Data from Star Trek: TNG is a fascinating case where, although
speech recognition is basically flawless, there are still difficulties associated
with world knowledge, pragmatics, and socializing

https://www.youtube.com/watch?v=9FqFm_vmVnE
Starship Mine (Season 6, Episode 18), Star Trek: The Next Generation, 1993
Science Fact

https://www.youtube.com/watch?v=gSz7WU1nH50
Erica, Osaka University and Kyoto University
Science Fiction

A New Hope (Star Wars: Episode IV), 1977, Lucas Film


Science Fact

https://www.youtube.com/watch?v=Bht96voReEo

Talking Robot Mouth, Sawada Group, Kagawa University


Science Fact

Waseda Talker, Takanishi Lab “/sasisuseso/”


http://www.takanishi.mech.waseda.ac.jp/top/research/voice/index.htm#4th
Science Fact

Waseda Talker, Takanishi Lab


http://www.takanishi.mech.waseda.ac.jp/top/research/voice/index.htm#4th
Science Fact
Complete with a physical vocal fold model

modal voice, breathy voice, creaky voice

Waseda Talker, Takanishi Lab


http://www.takanishi.mech.waseda.ac.jp/top/research/voice/index.htm#4th
Computers and Speech Synthesis
It is curious that C-3PO would communicate in a human language (English) to R2-D2 when wireless signal transmission would be much more efficient (the reason is conveying the plot to the audience)
• Computers tend to communicate visually through graphics (windows,
cursors, buttons, lists, menus, etc.)
• Speech would not necessarily be more convenient (but is used for the
visually impaired)
• Speech is used in certain types of communication scenarios, and, when it
is used, it is constrained to emulate how humans use speech
– Humans would have trouble adapting to a computer-friendly signaling
system, and we would be limited to our own sensory modalities (so
ultrasound and microwaves are out)
Computers and Speech Synthesis
It is curious that C-3PO would communicate in a human language (English) to R2-D2 when wireless signal transmission would be much more efficient (the reason is conveying the plot to the audience)
• Computers and robots do not necessarily have vocal tracts, but the
generation of sounds by such systems must proceed as if they did because
of the way humans perceive speech sounds
– Both the types of speech sounds and their dynamics are constrained
by the vocal tract and its neurophysiology and our perception systems
are tuned to the sorts of variation arising from these factors (e.g.,
gestural undershoot and coarticulation)
– Without realistic limitations on the range of possibilities, computer-
generated speech tends to sound inhuman (and is harder to process)

To talk like a human is to have the limitations of a human vocal tract


Computers and Speech Synthesis
There are three ways for a computer to generate speech from scratch

• Be equipped with physical vocal tract hardware that it can control (or a real human vocal tract!)
• Be equipped with a software simulation of the vocal tract
• Be equipped with an acoustic model of speech (formant synthesizers)
– This is the most removed from how humans actually produce speech
None of these techniques produces speech that sounds indistinguishable from that of a human
• The problems lie in gaps in our knowledge of speech dynamics, acoustics, and aerodynamics, and in the computational problems associated with simulating these things
To this day, the most natural-sounding synthesis is based on splicing together pre-recorded speech in novel ways
Computers and Speech Synthesis
To use the human voice is to be human:

• There are functional reasons why we might want a robot to have a human
voice: it seems important in many jobs (could a robot nurse or counsellor
be as effective as a human in helping you heal?)
• But if we can ever make a computer sound like a human, there is also the
problem that people will impute certain mental attributes to the machine,
such as consciousness, emotions, thoughts, desires, and so forth
– This includes emulating voice attributes associated with mental and
physiological states humans experience (but would not be experienced
by the robot, if robots could be given emotions in the first place)
• There is no reason why a machine should have the same worldview,
beliefs, or psychological states as a human
Computers and Speech Synthesis
To use the human voice is to be human:

• People may be led by the qualities in the synthetic voice to make invalid inferences about its intelligence, understanding, motivation, and emotional state
• This could lead to various problems (e.g., humans trusting the machine,
falling in love with it, etc.)

Fortunately, these ethical considerations are not something we need to worry about immediately: most synthetic speech involves translating text into speech (text-to-speech, TTS) or has mainly functional aspects, such as information retrieval or accessing certain features on a phone (Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa, Google’s Google Assistant)
Science Fact
Henn-na Hotel (“Weird Hotel”)
(Hotel staffed by robots)

https://www.theguardian.com/world/2015/jul/16/japans-robot-hotel-a-dinosaur-at-reception-a-machine-for-room-service
Grim Reality (for Robots)

“The Henn na Hotel in Japan, translated as Strange Hotel, found that robots annoyed the guests and would
often break down. Guests complained their robot room assistants thought snoring sounds were
commands and would wake them up repeatedly during the night. Meanwhile, the robot at the front desk
could not answer basic questions. Human staff ended up working overtime to repair robots that stopped
working. One staff member said it is easier now that they are not being frequently called by guests to help
with problems with the robots, reports the Mirror.”

https://www.hotelmanagement.net/tech/japan-s-henn-na-hotel-fires-half-its-robot-workforce
Modern Text-to-Speech Systems
Text-to-speech systems automatically produce intelligible synthetic speech from input text
• There are several stages (sketched in code below):
– Pre-processing to clean up the text (convert abbreviations, acronyms,
numbers, and so forth into pronounceable words)
– Splitting the text into prosodic phrases
– Marking to identify prosodic prominence and intonation contour
– Pronunciation of words
– Timing of phonetic elements
– Signal generation

The output can sound more or less natural depending on how the signal is
generated, but it always seems to sound as if it were read by someone who
does not understand the material
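
To make these stages concrete, here is a minimal sketch of how such a front-end pipeline might be chained together. Every function name and the tiny lexicon are invented for illustration; a real TTS front-end is far more elaborate.

```python
# A toy TTS front-end pipeline; all names and rules here are hypothetical.
import re

def normalize(text: str) -> str:
    """Pre-processing: expand abbreviations and numbers (toy rules only)."""
    expansions = {"Dr.": "Doctor", "10": "ten"}
    for raw, spoken in expansions.items():
        text = text.replace(raw, spoken)
    return text

def split_phrases(text: str) -> list[str]:
    """Split the text into rough prosodic phrases at punctuation."""
    return [p.strip() for p in re.split(r"[,.;:!?]", text) if p.strip()]

def to_phones(word: str) -> list[str]:
    """Pronunciation: look the word up in a (toy) dictionary."""
    lexicon = {"doctor": ["d", "ɒ", "k", "t", "ə"], "ten": ["t", "ɛ", "n"]}
    return lexicon.get(word.lower(), list(word))  # crude letter fallback

def front_end(text: str) -> list[list[str]]:
    """Chain the stages; a real system would then mark prominence,
    assign timings, and generate the waveform."""
    return [[ph for w in phrase.split() for ph in to_phones(w)]
            for phrase in split_phrases(normalize(text))]

print(front_end("Dr. Smith arrives at 10."))
```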
Modern Text-to-Speech Systems
Naturalness significantly breaks down at the level of prosodic structure

• The reason is that this information is not specified explicitly in the text but
rather relates to the meaning and function of utterances in the text
• This structure cannot be determined by processing words in isolation
– For example, the word “toe” in “The toe nails on the other hand...”
needs to receive the main accent (“The TOE nails..”), otherwise the
phrase does not make sense (try it: e.g., “The toe NAILS on the other
hand...”?)
• For natural speech, you need to specify information structure
– You need to know what information is old, which is new, which is
important, which is contradicting previous knowledge, and so forth
Science Fact

https://www.youtube.com/watch?v=z2bTymnb1uE
Science Fact

https://www.theverge.com/2013/9/17/4596374/machine-language-how-siri-found-its-voice
Speech Synthesis and Recognition
Lecture 1:

• Computer speech: science fiction and science fact


• Natural Language Processing (NLP)
• Challenges for machine speech
Natural Language Processing
Natural Language Processing (NLP) studies how to get computers to do useful
things with natural languages (broad):
• Natural Language Understanding and Generation
Computational linguistics is doing linguistics (analyzing language) using the
methods of NLP (narrow)
Natural Language Processing
NLP draws on research in:

• Linguistics
• Theoretical Computer Science
• Artificial Intelligence
• Mathematics and Statistics
• Psychology
• Cognitive Science, etc.
What and Where is NLP
The goals of NLP can be very far-reaching

• Reasoning and decision-making from text


• Real-time spoken dialog

Or very down-to-earth
• Searching the Web
• Context-sensitive spelling correction
• Analyzing reading-level or authorship statistically
• Extracting company names and locations from news articles
Why is NLP Difficult
Complexity:

• Language has many levels, each with its own set of rules, constraints, and variation
• Language is not uniform, but varies from individual to individual and from
place to place
Why is NLP Difficult
Complexity

Imperfection:
• Language is full of disfluencies, corrections, omissions, and errors
• Language is conveyed in the presence of various sources of noise
Why is NLP Difficult
Complexity

Imperfection
Context:
• Much of the meaning of language depends on context (pragmatics)
– The president of the United States (said now or 50 years ago)
– You can give me your homework tomorrow but you must give me your
homework today!
Why is NLP Difficult
Complexity

Imperfection
Context
Ambiguity: Multiple interpretations are present at all levels of the signal
• Writing:
– Character recognition: Was that an I or an L?
– Homographs: Is that the word lead (type of metal) or lead (verb)?
• Speech:
– Segmentation problem: Was that those low clouds or those slow clouds? Sin tax or syntax? Mairzy Doats?
– (Near) Homophony: What did you mean when you said you want to
take a pea? Did you say 16 or 60?
Why is NLP Difficult
Complexity

Imperfection
Context
Ambiguity: Multiple interpretations are present at all levels of the signal
• Writing
• Speech
• Grammar:
– Syntactic ambiguity:
• I once shot an elephant in my pyjamas
– Semantic:
• Lexical: I like to hang out at the bank
• Quantifier: Every student likes a dog (i.e., one specific dog or a
unique dog for each student?)
According to Bondo
Natural language is:

• highly ambiguous at all levels


• complex and subtle
• fuzzy, probabilistic
• its interpretation involves combining evidence
• it involves reasoning about the world
• it is embedded in a social system of people interacting
– persuading, insulting and amusing each other
– changes over time
Grammar and Meaning
To model meaning and information structure in an utterance, we need to model not only how meaning is constructed through grammar, but also the very
nature of knowledge representation itself: you need to know what the words
in the utterance relate to in the world
• A major problem is the extreme degree to which language is ambiguous

• The knowledge needed to disambiguate is not found in the utterance itself but in the link between the utterance and the context of utterance
Grammar and Meaning
The sentence “time flies like an arrow” is said to have upwards of 100 interpretations
• time can be a noun (natural phenomenon or herb?) or verb
• flies can be an insect, a zipper on pants, a verb, etc.
• So which meaning is intended?
– “The nature of time is that it moves in a straight line like an arrow”?
– “A certain type of flying insect displays fondness for arrows”
– etc.

Providing context helps to disambiguate the meaning...
Ambiguity of take
2001: A Space Odyssey (Screenplay), 1968:

• HAL: I’m sorry Frank. I think you missed it. Queen to bishop 3, bishop takes
queen, knight takes bishop, mate.
take meaning “capture” or “get possession of”

• HAL: We can certainly afford to be out of communication for the short time
it will take to replace it.
take meaning “to last over a specified length of time”

• HAL: I honestly think you ought to sit down calmly, take a stress pill, and
think things over.
take meaning “consume” or “ingest”
Pragmatics
Baley had sat down during the course of his last speech and now he tried to rise again,
but a combination of weariness and the depth of the chair defeated him. He held out his
hand petulantly. ‘Give me a hand, will you, Daneel?’
Daneel stared at his own hand. ‘I beg your pardon, Partner Elijah?’
Baley silently swore at the other’s literal mind and said, ‘Help me out of the chair.’
Isaac Asimov, The Naked Sun, 1957

• Not only do computers and robots need to understand the meaning of language, but they also need to understand its use: that is, they need to be able to infer the intentions behind an utterance – this is linguistic pragmatics

The problem is that human language is often very indirect in the way we express meaning
We say one thing but have a subtext or implied meaning, called an implicature

R(obot) Daneel Olivaw


Pragmatics
Even just a simple utterance such as “this shirt is frayed” relies on a large amount of understanding about human nature and the nature of the world
• A robot might take this to simply mean that “some shirt X has the property
of being frayed”
• A human might utter this with the implicature “we need to order new
shirts” or “you are not doing enough shopping for me” or “we are poor”
– But to understand these implicatures much knowledge is required
• Frayed shirts are unsightly, frayed shirts give an impression of
poverty, frayed shirts may indicate that the wearer is a bachelor,
this man is concerned that not enough money is being spent on
his clothes, this man is unhappy that no one cares about him
enough to be concerned for his appearance...
• And furthermore: why is he worried about his appearance? Why hasn't he done anything about it himself?
Understanding even a simple sentence is no simple matter and depends
on a sophisticated model of the world from a human perspective
Musical Interlude: Gopher Tuna

https://www.youtube.com/watch?v=nIwrgAnx6Q8
Chatbots
Chatbots (or chatterbots) are systems designed to simulate natural human
conversation
• They are designed with the intent to convince the user that the system is
actually a real human
• One goal is to advance artificial intelligence along the lines specified by
Alan Turing’s “imitation game” or the “Turing test”

In the Turing test, an interrogator (C) attempts to guess which of two other players (A and B) is a machine using only text-based communication

Failure to reliably discern that the computer player (here A) is indeed a computer means that it passes the test and therefore displays intelligent behaviour indistinguishable from that of a human
Eliza: An early chatbot
Eliza is an early piece of NLP software
It imitates a psychotherapist by using a simple pattern-matching approach (e.g., identifying and replacing pronouns appropriately), as in the sketch below
• You say: “My head hurts”
• Eliza responds: “Why do you say your head hurts”

http://cyberpsych.org/eliza/#.WYgzpYh96Uk

Google: “cyberpsych eliza”

ELIZA, Joseph Weizenbaum, MIT, 1964-1966
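
A minimal sketch of the Eliza idea, using only Python's standard re module; the rules and pronoun table below are invented for illustration and are not Weizenbaum's originals.

```python
# Toy Eliza-style pattern matching with pronoun swapping.
import re

PRONOUN_SWAPS = {"my": "your", "i": "you", "me": "you", "am": "are"}

def reflect(fragment: str) -> str:
    """Swap first- and second-person forms so the echo reads naturally."""
    return " ".join(PRONOUN_SWAPS.get(w, w) for w in fragment.lower().split())

RULES = [
    (re.compile(r"my (.+) hurts", re.I), "Why do you say your {0} hurts?"),
    (re.compile(r"i went (.+) with my (\w+)", re.I),
     "Tell me more about your {1}."),
    (re.compile(r"i feel (.+)", re.I), "Why do you feel {0}?"),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        m = pattern.search(utterance)
        if m:
            return template.format(*(reflect(g) for g in m.groups()))
    return "Please go on."  # generic fallback keeps the conversation moving

print(respond("My head hurts"))
print(respond("I went to the mall with my father"))
```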


Eviebot Chatbot

https://www.eviebot.com/en/

My dad’s name is Dave. What’s my dad’s name, Evie?
Created by Rollo Carpenter and Existor
Other Chatbot Avatars

chimbot, boibot, pewdiebot


Chatbot Tricks
Chatbots use a lot of tricks to convince users that they are talking to a human
and not a piece of software
• Echoing questions or replying with a question
– Human: “Who is the president of Brazil?”
– System: “I don’t know, do you?”
• The designers study common input patterns and modify the system accordingly with pattern-matching rules that facilitate interaction
– For example, Eliza will pick out key words in an utterance and build an
utterance around that
• Human: “I went to the mall with my father”
• System: “Tell me more about your father”
• More advanced systems will learn from user input over time, allow for
reference to previous terms, and handle anaphoric expressions (e.g.,
pronouns)
Chatbot Tricks
Ultimately, chatbots only simulate the flow of a conversation and not the
deeper understanding of a conversation
• They thus often degenerate into trivial and contentless exchanges of fixed expressions
• There is no real exchange of information, and none of the various features of human conversation associated with it (such as problem solving, persuasion, consolation, etc.)
Conversation
Even with successful speech production and recognition, linguistic processing,
and pragmatic analysis, machines still need to demonstrate understanding
through appropriate action
• There are two sides to this
– Functional responses to conversation need to be made, such as
performing a certain task when requested (like Siri)
– Meta-communicative responses to indicate that the conversation is
successful or not
• These include verbal (“I see”, “uh-huh”, “sure”, “wow!”, etc.) and
non-verbal cues that indicate attention and understanding (head-
nodding, eye-contact, body posture, etc.)
• Turn-taking is also carefully regulated by various phonetic means
(changes in pitch, rate, clicks, ...), but conversations can also
operate with overlap and the interlocutors finishing each other’s
sentences
Conversation
Even with successful speech production and recognition, linguistic processing,
and pragmatic analysis, machines still need to demonstrate understanding
through appropriate action
• Given the challenges of building a general-purpose conversation system, commercial systems are often restricted to very narrow parameters in regard to what is said to the user and what the user can say
– Such systems are used especially for the general public when there is
no opportunity to “enrol” speakers and the variation of speech
encountered is very large and of poor quality (e.g., over the
telephone)
– For example, speech enquiry systems (for airplane tickets, movie
show times, etc.) provide users with an interactive means to retrieve
certain types of information
Ambiguity in Language
• think-pair-share: how many interpretations?

I made her duck


Ambiguity in Language
• I made her duck:

– “I cooked waterfowl for her”


– “I cooked waterfowl belonging to her”
– “I created the (plaster?) duck she owns”
– “I caused her to quickly lower her head or body”
– “I waved my magic wand and turned her into undifferentiated
waterfowl”
– ...
Ambiguity in Language
• think-pair-share: What do you think it means?

I saw the kid with the cat


[I saw a kid with a cat]1: saw = ? kid = ? with = ?
[I saw a kid with a cat]2: saw = ? kid = ? with = ?
[I saw a kid with a cat]3: saw = ? kid = ? with = ?
[I saw a kid with a cat]4: saw = ? kid = ? with = ?
I saw a kid with a cat
• saw

– to cut?
– see (past tense)?
• kid
– young human?
– young goat?
• with
– together?
– instrumental?
Ambiguity in Language
• Miners refuse to work after death
• Stolen painting found by tree
• Milk drinkers are turning to powder
• Drunk gets nine months in violin case
• Panda mating fails; Veterinarian takes over
• Astronaut takes blame for gas in space craft
• Grandmother of eight makes hole in one
• Lack of brains hinders research
• Iraqi Head Seeks Arms
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• British Left Waffles on Falkland Islands
• Ban on Nude Dancing on Governor’s Desk
Ambiguity in Computers
• I found a bug in this computer program

actual bug (a moth) trapped in a relay, found by Grace Hopper in the Mark II computer
Probabilistic Models of Language
To handle this ambiguity and to integrate evidence from multiple levels we
turn to the tools of probability:
• Bayesian Classifiers (not rules)
• Hidden Markov Models (not Deterministic Finite Automata)
• Probabilistic Context-Free Grammars
• Ranking Models
• . . . other tools of Machine Learning, AI, Statistics
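
As a small taste of the Bayesian approach, here is a minimal sense-disambiguation sketch: pick the interpretation that maximizes P(interpretation) × P(evidence | interpretation). The counts and priors are invented, and real systems use far richer models.

```python
# Toy Bayesian disambiguation of "bank" with add-one smoothing.
import math

counts = {  # invented context-word counts for each sense
    "bank=river":   {"water": 30, "fish": 20, "money": 1},
    "bank=finance": {"money": 40, "loan": 25, "water": 2},
}
priors = {"bank=river": 0.4, "bank=finance": 0.6}

def log_posterior(sense: str, context: list[str]) -> float:
    total = sum(counts[sense].values())
    vocab = {w for c in counts.values() for w in c}
    score = math.log(priors[sense])
    for word in context:
        # add-one smoothing so an unseen word does not zero the probability
        score += math.log((counts[sense].get(word, 0) + 1)
                          / (total + len(vocab)))
    return score

context = ["money", "loan"]
print(max(counts, key=lambda s: log_posterior(s, context)))  # bank=finance
```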
Speech Synthesis and Recognition
Lecture 1:

• Computer speech: science fiction and science fact


• Natural Language Processing (NLP)
• Challenges for machine speech
Challenges for Machine Speech
Take the example of dictation software again:

• You need to train the software in an “enrolment” session (e.g., reading


100 given sentences)
• The system may create a language model of your word usage and writing
style based on existing documents
• The systems can learn but you need to provide them with feedback
• Even after training and learning, you need to modify how you speak: too
much variation in articulation, rate, and voice quality impairs performance
• Your dictation style may not match the style used in training the system
during its original design
• Recognition is influenced by how well the words match the system’s statistical language model (sketched below)

But with patience and dedication, low word error rates (~5%) can be achieved
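
A minimal sketch of the kind of statistical language model mentioned above; the bigram counts come from an invented toy corpus, but the principle (score word sequences by how often they were seen in training) is the real one.

```python
# Toy bigram language model estimated from a tiny "enrolment" corpus.
from collections import defaultdict

training = "please recognize speech . please recognize my words .".split()

bigrams = defaultdict(lambda: defaultdict(int))
unigrams = defaultdict(int)
for prev, nxt in zip(training, training[1:]):
    bigrams[prev][nxt] += 1
    unigrams[prev] += 1

def p_next(prev: str, nxt: str) -> float:
    """Maximum-likelihood bigram probability (no smoothing)."""
    return bigrams[prev][nxt] / unigrams[prev] if unigrams[prev] else 0.0

# Acoustically similar hypotheses can differ sharply in language-model score:
print(p_next("recognize", "speech"))  # 0.5
print(p_next("wreck", "a"))           # 0.0 -- never seen in training
```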
Machines vs. Humans
Can we be more precise about just how good machines are (or were) at speech recognition tasks?
Machines vs. Humans
Comparisons between the error rate of machines and humans on a range of
tasks (Lippmann 1997):
• Clean digits: machines 7 times worse than humans (0.72% vs 0.1% error)
• Alphabetic letters: machines 3 times worse (5% vs. 1.6% error)
• 1000 word vocabulary task: machines 8 times worse (17% vs. 2% error)
• 5000 word task in quiet: machines 8 times worse (7.2% vs. 0.9% error)
– but at a signal-to-noise ratio of 10dB, machines 11 times worse than
humans (12.8% vs. 1.1% error)
• Phrases extracted from spontaneous speech: machines 11 times worse
(43% vs. 4% error)

Machines are worse than humans by an order of magnitude (a factor of 10)!

Lippmann, R. P. (1997). Speech recognition by machines and humans. Speech Communication, 22(1), 1–15. https://doi.org/10.1016/S0167-6393(97)00021-6
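
The word error rates behind these comparisons are computed as an edit distance: the minimum number of substitutions, insertions, and deletions needed to turn the recognizer output into the reference transcript, divided by the reference length. A minimal sketch:

```python
# Word error rate via dynamic-programming edit distance.
def wer(reference: list[str], hypothesis: list[str]) -> float:
    n, m = len(reference), len(hypothesis)
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m] / n

ref = "those slow clouds drift by".split()
hyp = "those low clouds drift".split()
print(wer(ref, hyp))  # 2 errors / 5 reference words = 0.4
```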
Challenges for Machines
Some challenges we face working with machine speech include:

• Signal capture
• Acoustic feature vectors
• Acoustic-phonetic modeling
• Phonetic-phonological modeling
• Phonological structure of words
• Pronunciation variability
• Use of the lexicon
Challenges for Machines
Signal capture:

• Audio recording is arguably better than human hearing in some ways


– Higher signal-to-noise ratio, better frequency response, linearity (e.g.,
representation of frequency not dependent on frequency)

Nice and flat German microphone engineering


Challenges for Machines
Signal capture:

• But humans have distinct advantages


– Hearing system is tuned to speech (resonances of the outer ear –
pinna and concha, external auditory meatus – provide gain in the
upper-mid frequency range useful for speech)
Challenges for Machines
Signal capture:

• But humans have distinct advantages


– Devices often do not have binaural hearing and a moveable head to adapt to different environments (e.g., cocktail parties), nor eyes to see the speaker’s face and use visual cues in recognition

And of course, moving one’s head should not create so much noise either!
https://www.youtube.com/watch?v=l394k-h2lAc
Challenges for Machines
Acoustic feature vectors:

• The field has converged on the use of cepstral coefficients (more on these
later!) obtained from observations made over the speech signal and
stored as acoustic feature vectors
• But this assumes that speech is best broken into frames, or windows, and
that, within a frame, the speech is effectively stationary (see the sketch below)

A vector in a three-dimensional space
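
A minimal sketch of the framing step, assuming only NumPy; the 25 ms frame and 10 ms hop are common textbook defaults, not requirements.

```python
# Chop a signal into short overlapping windows, treating each as stationary.
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Return a (num_frames, frame_len) array of Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len] * window
                     for i in range(num_frames)])

# One second of a 100 Hz tone at 16 kHz; each 25 ms frame would then be fed
# to a cepstral analysis to yield one acoustic feature vector.
fs = 16000
t = np.arange(fs) / fs
frames = frame_signal(np.sin(2 * np.pi * 100 * t), fs)
print(frames.shape)  # (98, 400)
```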
Challenges for Machines
Acoustic feature vectors:

• Speech has much finer structure to it, and it seems humans make use of this
– Harmonic structure: The way acoustic analysis is performed works
very well to capture source and filter characteristics, but this results in
neglecting a lot of finer details conveyed in the harmonic spectrum
(e.g., phonatory quality)
– Fine temporal structure: e.g., transients in time and frequency, such as
plosive bursts, pitch changes in contour tones, diphthongs or
“monophthongs” with unstable formants, are all relevant here
Challenges for Machines
Acoustic feature vectors:

• This detail gets disregarded as noise, but if we could parameterize it correctly (model it well) it could help with recognition
– Can acoustic feature vectors be optimized for specific speakers or even
languages?
– Listener hearing is adapted to improve discriminability relative to the
phonological system; can this be incorporated?
– How can noisy signals be dealt with effectively?
• Do we try to model environment noise and speech?
– How about modeling properties of voice quality (in the sense of the
long-term quality of speech, not just the phonatory quality)?
Challenges for Machines
Acoustic-phonetic modeling:

• Most systems use a pattern-matching system to map between acoustics and phonetic labels (a toy decoding sketch follows below)
– Hidden Markov Models (HMMs) (more on these later)
– Recurrent Neural Networks (RNNs) (not in this course)

A Hidden Markov Model helps map observations (O) to hidden states (H) that produce the observations

A Recurrent Neural Network has connections that “recur” to a previous layer
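
A minimal sketch of HMM decoding: given per-frame emission scores, the Viterbi algorithm recovers the most probable hidden phone sequence. Every probability below is invented for illustration.

```python
# Toy Viterbi decoding over a strict left-to-right phone model for "pit".
import math

states = ["p", "ɪ", "t"]
start = {"p": 1.0, "ɪ": 0.0, "t": 0.0}
trans = {"p": {"p": 0.5, "ɪ": 0.5, "t": 0.0},
         "ɪ": {"p": 0.0, "ɪ": 0.7, "t": 0.3},
         "t": {"p": 0.0, "ɪ": 0.0, "t": 1.0}}
# P(observed frame | state) for four frames (stand-ins for acoustic scores)
emit = [{"p": 0.8, "ɪ": 0.1, "t": 0.1},
        {"p": 0.2, "ɪ": 0.7, "t": 0.1},
        {"p": 0.1, "ɪ": 0.6, "t": 0.3},
        {"p": 0.1, "ɪ": 0.1, "t": 0.8}]

def viterbi():
    # best[s] = (log prob of the best path ending in s, that path)
    best = {s: (math.log(start[s] or 1e-12) + math.log(emit[0][s]), [s])
            for s in states}
    for frame in emit[1:]:
        best = {s: max(((lp + math.log(trans[prev][s] or 1e-12)
                         + math.log(frame[s]), path + [s])
                        for prev, (lp, path) in best.items()),
                       key=lambda x: x[0])
                for s in states}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi())  # ['p', 'ɪ', 'ɪ', 't']
```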
Challenges for Machines
Acoustic-phonetic modeling:

• But this has weaknesses...


– What are the best phonetic units to use?
• Too many units reduce the accuracy in fitting a given acoustic
feature vector to the model
• Too few units may mean useful information in discriminating
words is lost
– Frames are not independent since their spectral content is highly
correlated and governed by the movements of the articulators
• Phonetic units are highly context dependent, so their frames will
not be context independent and normally distributed
• This can also be an advantage since segmental identity is
distributed across frames
– Models are trained on multiple speakers so they cannot exploit
correlations across segments associated with speaker characteristics
(vocal tract length, spectral tilt tendency)
Challenges for Machines
Phonetic-phonological modeling:
• It is assumed that the recognized units are realizations of the underlying phonological structure of each word
• The modeling is often just based on “table substitution” (with no accounting
of alternatives or substitution probabilities)
• Often phonetic units are just “phone-in-context” models (see the sketch after this list)
– For example, the /ɪ/ in /pɪt/ is modeled as the type of [ɪ] in the /#p_t/
context, while a different [ɪ] is used for the /#b_t/ context
– But this causes confusion when the [ɪ] in the /#p_t/ context is applied to
the vowel in /spɪt/, even though the /p/ here is realized as unaspirated
and thus better fits with the [ɪ] used for the /#b_t/ context
• There is no account made of simplifications occurring when words are placed
into utterance context
– For example, assimilation, elision, epenthesis, vowel reduction, and
lenition
• There is a bias towards segmental modeling but phonetic variation relates to
prosodic and lexical factors, too
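
A minimal sketch of the phone-in-context idea, relabeling each phone by its neighbours so that context-specific acoustic models can be selected; the left-phone+right notation is illustrative rather than tied to any particular toolkit.

```python
# Expand a phone string into context-dependent ("triphone") labels.
def to_triphones(phones: list[str]) -> list[str]:
    padded = ["#"] + phones + ["#"]  # '#' marks a word boundary
    return [f"{l}-{p}+{r}"
            for l, p, r in zip(padded, padded[1:], padded[2:])]

print(to_triphones(["p", "ɪ", "t"]))
# ['#-p+ɪ', 'p-ɪ+t', 'ɪ-t+#']
print(to_triphones(["s", "p", "ɪ", "t"]))
# ['#-s+p', 's-p+ɪ', 'p-ɪ+t', 'ɪ-t+#'] -- the vowel gets the same 'p-ɪ+t'
# model in both words, even though /p/ is unaspirated after /s/
```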
Challenges for Machines
Phonological structure of words:
• Recognizers assume a segmental analysis of words (e.g., /pɪt/ is
decomposable into segmental units, /p/, /ɪ/, and /t/) which focuses
on the notion of contrast
• If we could automatically determine the segmental boundaries
from the speech stream, then all we would need is a classifier to
distinguish one segment from the next (e.g., /p/ vs. /b/ in /pɪt/ vs.
/bɪt/)
Challenges for Machines
Pronunciation variability:

• Speech is rife with variation


• The more variation, the worse the performance
– One known speaker is easier than many unknown speakers
– One accent group is easier than multiple groups
– Utterance context, speech style, and rate are all variable
– Environment (e.g., ambient noise or multiple speakers in the
background)
• Performance can be poor even when information about variation can be
supplied
• Often the variation is lumped together in the modeling or treated as noise
rather than as useful systematic and predictable information
Challenges for Machines
Use of the lexicon:

• Machines are forced to go from acoustics to phonetics to phonology before words can be identified
• But the human brain can bring in a lot of top-down information from the
lexicon to aid in recognition decoding
• Recognizers could be designed to operate on the fact that words share phonological structure, bypassing phonetic representation and instead activating word hypotheses
• Such a system would benefit from treating acoustic variability in terms of
lexical neighbourhoods (e.g., words with few lexical neighbours are more
variable than those with more)
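
A minimal sketch of computing a lexical neighbourhood: the words reachable from a target by one substitution, insertion, or deletion. The toy lexicon uses spellings as stand-ins for phone strings.

```python
# Find the one-edit neighbours of a word in a toy lexicon.
def one_edit_apart(a: str, b: str) -> bool:
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter string
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] != b[j]:
            edits += 1
            if edits > 1:
                return False
            if len(a) == len(b):
                i += 1  # a substitution consumes a symbol from both
        else:
            i += 1
        j += 1
    return edits + (len(b) - j) <= 1

lexicon = ["pit", "bit", "pat", "spit", "pits", "dog"]
print([w for w in lexicon if w != "pit" and one_edit_apart("pit", w)])
# ['bit', 'pat', 'spit', 'pits'] -- "pit" lives in a dense neighbourhood
```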
Areas for Improvement
Modeling variability of segmental phonological units

• Recognizers model the segmental structure of words, formed using a finite phonetic inventory and dictionary-based pronunciations
• A data-driven approach could replace this:
1) Establish the bottom-up acoustic components relevant for
describing speech
2) Model how the components vary across words
3) Predict the phonetic framework of a word from its phonological
structure, its speaker, accent, style, context, and so forth
Areas for Improvement
Computing word probability from its phonetic elements

• A recognizer needs to compute the probability of a word from the acoustic evidence
• The current approach computes the overall word probability from the
phone probabilities
• Models can integrate properties of words such as their prosodic structure
or lexical predictability to improve on this
– For example, the reduction of library to [laɪbɹi] (with elision of [ɹɛ]) is
possible but one cannot reduce remark in the same way (i.e.,
pronouncing it as [maɹk])
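
A minimal sketch of the current approach described above: score each word hypothesis by a word prior plus the sum of per-phone log probabilities (products become sums in log space). All numbers are invented stand-ins for acoustic model output.

```python
# Score competing word hypotheses from per-phone probabilities.
import math

def word_log_prob(phone_probs: list[float], word_prior: float) -> float:
    """log P(word) + sum over phones of log P(phone | acoustics)."""
    return math.log(word_prior) + sum(math.log(p) for p in phone_probs)

# Two competing word hypotheses for the same stretch of speech
hypotheses = {
    "library": ([0.9, 0.7, 0.6, 0.2, 0.3, 0.8, 0.9], 0.001),
    "remark":  ([0.2, 0.3, 0.5, 0.4, 0.6], 0.002),
}
print(max(hypotheses, key=lambda w: word_log_prob(*hypotheses[w])))
```

A model enriched with prosodic structure or lexical predictability could, for example, assign the reduced pronunciation [laɪbɹi] a sensible probability for library while ruling out the parallel reduction of remark.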
Areas for Improvement
Modeling extraneous variability

• Speech varies by speaker, context, environment and occasion


• Humans can recognize these details (e.g., X was spoken by a speaker of
Canadian English in a formal style in a barn surrounded by several cows)
– Machines should be given this ability
• The tendency is to lump this variation together as noise
• This variation can be modeled with dedicated subsystems
– Word recognition would be based on the word that maximizes the
probability of the word combined with its speaker-type, accent, and
environment categories (and so forth)
Summary
There is a large gap between science fiction and science fact

• Major problems in speech synthesis include:


– How to generate prosodic structure suitable to the information
structure and communication goals of an utterance;
– How to produce speech using a representation of the vocal tract
rather than pre-recorded human speech
• Major problems in speech recognition include:
– Recognizing speech in everyday situations and under suboptimal
conditions (e.g., casual speech in a noisy environment)
• Grammar and meaning: ambiguity is very difficult
• Pragmatics: problems with determining the implications beyond literal and
direct meanings, which requires deeper understanding of the speaker’s
intentions and knowledge about the world

Human language is alien to machines

It may be useful to think of it as a vast cultural difference


References
• Huckvale, M. (2003, April). Conversational Machines in Science Fiction and
Science Fact. University College London.
• Huckvale, M. (2003, May). Why are machines less proficient than humans
at recognising words? University College London.

These will be posted on NTUlearn for your reading enjoyment
