HG3052 SpeechSynthesisAndRecognition Lecture 1 Update2019-20
Introduction
Lecture 1
Scott Moisik
August 17, 2017
Administrative Stuff
Key details:
Manitoba Ontario
My Background
Saskatchewan, not a good tattoo
small bend!
My Background
Singapore, climate data
−59.2 ºC!!!
• Not a mathematician
• Not an engineer
• Not a physicist
• I’m an artist! Then a linguist!
– Life takes you on a strange path
• Yet my work in linguistics has led me to need to know something about all
of these things
My Line of Work
• Phonetics and phonology
• Larynx
• Phonetic instrumentation
• Vocal tract and speech modeling
• Biomechanics and motor control
– ArtiSynth: Talk to me if you are interested
• Language and Genetics
– G[ɜ]bils project: Also looking for students
My Line of Work
Where it gets technical:
Data from ArtiVarK, a project of G[ɜ]bils, Dan Dediu (PI), MPI for Psycholinguistics
Computer Modeling:
https://www.youtube.com/watch?v=ias31By60N8
Speech Synthesis and Recognition
Lecture 1:
• We are explorers
• We are ALL learners
• We are here to have fun (life is short!!!)
• Be brave: Some of these ideas might be challenging
• Some of these ideas are really damn cool: I hope I can inspire you to feel
the same way
Sowing Ideas Like Seeds
You will encounter some difficult concepts in this course, and even some equations – but take heed: my primary objective is to sow an idea, like a seed, in your mind!
This is not a math class: no calculator will be required!
Linear Algebra
Linear algebra underlies almost all of the math in speech synthesis and recognition, but it also pays huge dividends elsewhere (e.g., in statistical modeling)
High school made me hate math
Gilbert Strang helped me see its beauty...
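To give a tiny taste of why it matters here, a minimal sketch in plain Python (the numbers are made up for illustration): treating observations as vectors lets us compare them with operations as simple as the dot product, which linear algebra generalizes to huge dimensions.

```python
# Two hypothetical 3-D feature vectors, compared with a dot product
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v1 = [1.0, 2.0, 0.5]  # invented feature vector
v2 = [0.9, 2.1, 0.4]  # a similar one
print(dot(v1, v2))  # larger values mean more alike (for comparable norms)
```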
• Praat means “talk” or “speak” (e.g., “Praat Nederlands met me”) in Dutch
• http://www.fon.hum.uva.nl/praat/
https://www.youtube.com/watch?v=UgkyrW2NiwM
2001: A Space Odyssey, 1968, Metro-Goldwyn-Mayer
Science Fiction
Real-life systems are far less flexible, being highly constrained to limited conditions such as speaker, accent, style, content, and environment
• Variation in any of these can result in a massive decline in performance
Talk about speech recognition: Did you hear the Ewok say “That guy’s wise”?
Return of the Jedi (Star Wars: Episode VI), 1983, Lucas Film
Science Fiction
https://www.youtube.com/watch?v=J5TAnU7gHws
https://www.youtube.com/watch?v=Mp9VqigoVg8
Beyond Death, The Story of God with Morgan Freeman, Revelations Entertainment
Science Fiction
The character Data from Star Trek: TNG is a fascinating case: although speech recognition is basically flawless, there are still difficulties associated with world knowledge, pragmatics, and socializing
https://www.youtube.com/watch?v=HiIlJaSDPaA
All Good Things (Season 7, Episode 25), Star Trek: The Next Generation, 1994
Science Fiction
https://www.youtube.com/watch?v=9FqFm_vmVnE
Starship Mine (Season 6, episode 18), Star Trek: The Next Generation, 1993
Science Fact
https://www.youtube.com/watch?v=gSz7WU1nH50
Erica, Osaka University and Kyoto University
Science Fiction
https://www.youtube.com/watch?v=Bht96voReEo
To produce speech, a machine could:
• Be equipped with physical vocal tract hardware that it can control (or a real human vocal tract!)
• Be equipped with a software simulation of the vocal tract
• Be equipped with an acoustic model of speech (formant synthesizers)
– This is the most removed from how humans actually produce speech
None of these techniques produce speech that sounds
indistinguishable from that of a human
• The problems lie in gaps in our knowledge of speech dynamics, acoustics, and aerodynamics, and in the computational challenges of simulating these things
To this day, the most natural-sounding synthesis is based on splicing together pre-recorded speech in novel ways
Computers and Speech Synthesis
To use the human voice is to be human:
• There are functional reasons why we might want a robot to have a human
voice: it seems important in many jobs (could a robot nurse or counsellor
be as effective as a human in helping you heal?)
• But if we can ever make a computer sound like a human, there is also the
problem that people will impute certain mental attributes to the machine,
such as consciousness, emotions, thoughts, desires, and so forth
– This includes emulating voice attributes associated with mental and physiological states that humans experience (but that the robot itself would not, if robots could be given emotions in the first place)
• There is no reason why a machine should have the same worldview,
beliefs, or psychological states as a human
Computers and Speech Synthesis
To use the human voice is to be human:
• People may be led by the qualities of the synthetic voice to make invalid inferences about its intelligence, understanding, motivation, and emotional state
• This could lead to various problems (e.g., humans trusting the machine,
falling in love with it, etc.)
https://www.theguardian.com/world/2015/jul/16/japans-robot-hotel-a-dinosaur-at-reception-a-machine-for-room-service
Grim Reality (for Robots)
“The Henn na Hotel in Japan, translated as Strange Hotel, found that robots annoyed the guests and would
often break down. Guests complained their robot room assistants thought snoring sounds were
commands and would wake them up repeatedly during the night. Meanwhile, the robot at the front desk
could not answer basic questions. Human staff ended up working overtime to repair robots that stopped
working. One staff member said it is easier now that they are not being frequently called by guests to help
with problems with the robots, reports the Mirror.”
https://www.hotelmanagement.net/tech/japan-s-henn-na-hotel-fires-half-its-robot-workforce
Modern Text-to-Speech Systems
Text-to-speech systems automatically produce intelligible synthetic speech from input text
• There are several stages:
– Pre-processing to clean up the text (converting abbreviations, acronyms, numbers, and so forth into pronounceable words)
– Splitting the text into prosodic phrases
– Marking to identify prosodic prominence and intonation contour
– Pronunciation of words
– Timing of phonetic elements
– Signal generation
The output can sound more or less natural depending on how the signal is
generated, but it always seems to sound as if it were read by someone who
does not understand the material
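The stages above can be sketched as a toy pipeline. This is a minimal illustration, not a real TTS system: the function names, the tiny normalization table, and the toy lexicon are all invented, and the timing and signal-generation stages are omitted entirely.

```python
import re

def normalize(text):
    # Pre-processing: expand abbreviations, numbers, etc. into
    # pronounceable words (tiny invented table; real systems use
    # large rule sets or trained models)
    table = {"Dr.": "Doctor", "10": "ten"}
    return " ".join(table.get(tok, tok) for tok in text.split())

def split_phrases(text):
    # Prosodic phrasing, crudely approximated by punctuation
    return [p.strip() for p in re.split(r"[,.!?]", text) if p.strip()]

def to_phonemes(word):
    # Pronunciation: toy lexicon lookup; real systems combine a
    # lexicon with grapheme-to-phoneme rules
    lexicon = {"doctor": "D AA K T ER", "smith": "S M IH TH"}
    return lexicon.get(word.lower(), word.lower())

def synthesize(text):
    # Timing and signal generation omitted
    phrases = split_phrases(normalize(text))
    return [[to_phonemes(w) for w in phrase.split()] for phrase in phrases]

print(synthesize("Dr. Smith, hello!"))
# [['D AA K T ER', 'S M IH TH'], ['hello']]
```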
Modern Text-to-Speech Systems
Naturalness breaks down significantly at the level of prosodic structure
• The reason is that this information is not specified explicitly in the text but
rather relates to the meaning and function of utterances in the text
• This structure cannot be determined by processing words in isolation
– For example, the word “toe” in “The toe nails on the other hand...”
needs to receive the main accent (“The TOE nails..”), otherwise the
phrase does not make sense (try it: e.g., “The toe NAILS on the other
hand...”?)
• For natural speech, you need to specify information structure
– You need to know what information is old, which is new, which is
important, which is contradicting previous knowledge, and so forth
Science Fact
https://www.youtube.com/watch?v=z2bTymnb1uE
Science Fact
https://www.theverge.com/2013/9/17/4596374/machine-language-how-siri-found-its-voice
Speech Synthesis and Recognition
Lecture 1:
• Linguistics
• Theoretical Computer Science
• Artificial Intelligence
• Mathematics and Statistics
• Psychology
• Cognitive Science, etc.
What and Where is NLP
The goals of NLP can be very far-reaching
Or very down-to-earth
• Searching the Web
• Context-sensitive spelling correction
• Analyzing reading-level or authorship statistically
• Extracting company names and locations from news articles
Why is NLP Difficult
Complexity:
• Language has many levels, each with its own set of rules, constraints, and variation
• Language is not uniform, but varies from individual to individual and from
place to place
Why is NLP Difficult
Complexity
Imperfection:
• Language is full of disfluencies, corrections, omissions, and errors
• Language is conveyed in the presence of various sources of noise
Why is NLP Difficult
Complexity
Imperfection
Context:
• Much of the meaning of language depends on context (pragmatics)
– The president of the United States (said now or 50 years ago)
– You can give me your homework tomorrow but you must give me your
homework today!
Why is NLP Difficult
Complexity
Imperfection
Context
Ambiguity: Multiple interpretations are present at all levels of the signal
• Writing:
– Character recognition: Was that an I or an L?
– Homographs: Is that the word lead (type of metal) or lead (verb)?
• Speech:
– Segmentation problem: Was that those low clouds or those slow clouds? Sin tax or syntax? Mairzy Doats?
– (Near) Homophony: What did you mean when you said you want to
take a pea? Did you say 16 or 60?
Why is NLP Difficult
Complexity
Imperfection
Context
Ambiguity: Multiple interpretations are present at all levels of the signal
• Writing
• Speech
• Grammar:
– Syntactic ambiguity:
• I once shot an elephant in my pyjamas
– Semantic:
• Lexical: I like to hang out at the bank
• Quantifier: Every student likes a dog (i.e., one specific dog or a
unique dog for each student?)
According to Bondo
Natural language is:
• HAL: I’m sorry Frank. I think you missed it. Queen to bishop 3, bishop takes
queen, knight takes bishop, mate.
take meaning “capture” or “get possession of”
• HAL: We can certainly afford to be out of communication for the short time
it will take to replace it.
take meaning “to last over a specified length of time”
• HAL: I honestly think you ought to sit down calmly, take a stress pill, and
think things over.
take meaning “consume” or “ingest”
Pragmatics
Baley had sat down during the course of his last speech and now he tried to rise again,
but a combination of weariness and the depth of the chair defeated him. He held out his
hand petulantly. ‘Give me a hand, will you, Daneel?’
Daneel stared at his own hand. ‘I beg your pardon, Partner Elijah?’
Baley silently swore at the other’s literal mind and said, ‘Help me out of the chair.’
Isaac Asimov, The Naked Sun, 1957
https://www.youtube.com/watch?v=nIwrgAnx6Q8
Chatbots
Chatbots (or chatterbots) are systems designed to simulate natural human conversation
• They are designed with the intent to convince the user that the system is
actually a real human
• One goal is to advance artificial intelligence along the lines specified by
Alan Turing’s “imitation game” or the “Turing test”
http://cyberpsych.org/eliza/#.WYgzpYh96Uk
https://www.eviebot.com/en/
Demo: “My dad’s name is Dave. What’s my dad’s name, Evie?” (Evie, created by Rollo Carpenter and Existor)
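A minimal sketch of the rule-based, ELIZA-style approach such systems began with. The patterns and responses here are invented for illustration; the real ELIZA used a much richer script.

```python
import re

# Invented pattern/response pairs; first matching rule wins
rules = [
    (r"\bI am (.*)", "Why do you say you are {0}?"),
    (r"\bmy (\w+)\b", "Tell me more about your {0}."),
    (r".*", "Please go on."),  # fallback when nothing else matches
]

def respond(utterance):
    for pattern, template in rules:
        m = re.search(pattern, utterance, re.IGNORECASE)
        if m:
            return template.format(*m.groups())

print(respond("I am worried about my homework"))
# Why do you say you are worried about my homework?
```

The literal-mindedness on display in the Asimov passage above is exactly what such pattern matching cannot escape: there is no understanding behind the reply.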
Other Chatbot Avatars
• saw
– to cut?
– see (past tense)?
• kid
– young human?
– young goat?
• with
– together?
– instrumental?
Ambiguity in Language
• Miners refuse to work after death
• Stolen painting found by tree
• Milk drinkers are turning to powder
• Drunk gets nine months in violin case
• Panda mating fails; Veterinarian takes over
• Astronaut takes blame for gas in space craft
• Grandmother of eight makes hole in one
• Lack of brains hinders research
• Iraqi Head Seeks Arms
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• British Left Waffles on Falkland Islands
• Ban on Nude Dancing on Governor’s Desk
Ambiguity in Computers
• I found a bug in this computer program
An actual bug (a moth) trapped in a relay, found by Grace Hopper's team in the Harvard Mark II computer
Probabilistic Models of Language
To handle this ambiguity, and to integrate evidence from multiple levels, we turn to the tools of probability:
• Bayesian Classifiers (not rules)
• Hidden Markov Models (not Deterministic Finite Automata)
• Probabilistic Context-Free Grammars
• Ranking Models
• . . . other tools of Machine Learning, AI, Statistics
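As a toy illustration of the probabilistic approach, here is a Bayes-style decision between the homophones “sixteen” and “sixty” given one context word. The counts and probabilities are invented for illustration; real systems estimate them from corpora.

```python
# Invented corpus counts standing in for P(word)
priors = {"sixteen": 40, "sixty": 60}

# Invented co-occurrence estimates standing in for P(context | word)
likelihoods = {
    ("sixteen", "age"): 0.30, ("sixty", "age"): 0.05,
    ("sixteen", "percent"): 0.02, ("sixty", "percent"): 0.25,
}

def disambiguate(context):
    # Bayes' rule with the constant P(context) dropped:
    # argmax over P(word) * P(context | word)
    scores = {w: priors[w] * likelihoods.get((w, context), 1e-6)
              for w in priors}
    return max(scores, key=scores.get)

print(disambiguate("age"))      # sixteen (40 * 0.30 beats 60 * 0.05)
print(disambiguate("percent"))  # sixty
```

The point is not the toy numbers but the shape of the decision: evidence is weighed, not matched against rigid rules.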
Speech Synthesis and Recognition
Lecture 1:
And of course, moving one’s head should not create so much noise either!
https://www.youtube.com/watch?v=l394k-h2lAc
Challenges for Machines
Acoustic feature vectors:
• The field has converged on the use of cepstral coefficients (more on these
later!) obtained from observations made over the speech signal and
stored as acoustic feature vectors
• But this assumes that speech is best broken into frames, or windows, and
that, within a frame, the speech is effectively stationary
[Figure: a vector in a three-dimensional space]
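The framing assumption can be sketched in a few lines: slide a fixed-length window along the signal and treat each frame as quasi-stationary. This is a minimal illustration with a toy sinusoid, not a production front end (no pre-emphasis, no window function, no cepstral analysis).

```python
import math

def frames(signal, frame_len, hop):
    # Slide a frame_len-sample window forward by hop samples each step;
    # everything inside one frame is assumed stationary
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

# A toy 100-sample sinusoid standing in for a speech signal
signal = [math.sin(2 * math.pi * 0.05 * n) for n in range(100)]
windows = frames(signal, frame_len=25, hop=10)
print(len(windows), len(windows[0]))  # 8 overlapping frames of 25 samples
```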
Challenges for Machines
Acoustic feature vectors:
• Speech has much finer structure to it, and it seems humans make use of this
– Harmonic structure: The way acoustic analysis is performed works
very well to capture source and filter characteristics, but this results in
neglecting a lot of finer details conveyed in the harmonic spectrum
(e.g., phonatory quality)
– Fine temporal structure: e.g., transients in time and frequency, such as
plosive bursts, pitch changes in contour tones, diphthongs or
“monophthongs” with unstable formants, are all relevant here
Challenges for Machines
Acoustic feature vectors: