
EEM.ssr: Speaker & Speech Recognition

Speech Communication

by
Dr Philip Jackson
lecturer in speech & audio

Centre for Vision, Speech & Signal Processing,
Department of Electronic Engineering.

http://www.ee.surrey.ac.uk/Teaching/Courses/eem.ssr
World of speech technologies
[Diagram: speech technology at the centre, surrounded by:]
• Automatic speech recognition
• Spoken dialogue processing
• Spoken language understanding
• Spoken language generation
• Emotion in speech
• Speaker recognition
• Speech perception
• Phonetics
• Speech enhancement
• Speech production
• Speech coding
• Speech modification
• Speech analysis
• Speech synthesis
Speech-related disciplines
[Diagram: speech science at the centre, drawing on:]
• Psychology
• Linguistics
• Maths & stats
• Phonetics
• Acoustics
• Computer science
• Signal processing
• Electronics
The speech chain
[Figure: the speech chain, from SPEAKER to LISTENER. Labels: motor nerves, vocal muscles, sound waves, ear, sensory nerves, feedback link. Levels, in order: Linguistic, Physiological, Acoustic, Physiological, Linguistic.]


It comes as naturally as breathing
• Speech is man’s preferred modality
• Can use natural language for interacting
with complex systems
• Hands-free
• Eyes-free
• Small footprint
• Requires no training
Ideas and language
• Ideas are concepts or abstract notions
• Language has a grammar and syntax,
and is made up of words
• Develop our understanding of the world
at the same time as we are learning to
talk:
– Many of our thoughts are framed in terms of
words
– Language (and culture) affect the way we
think
Written vs. spoken language
• Written language
– discrete words separated by spaces
– usually complete, correct spelling
– opportunity to skip, skim or re-read

• Spoken language
– continuous sequence of sounds, usually
without spaces
– often damaged, interrupted, parts mumbled
Speech is not acoustic text
Sounds and words
• Phonetics
– How speech sounds are produced
– Acoustic result of speech articulation

• Phonology
– How sounds are used to make words
– The functions of the sounds within a
particular language
Acoustic signal
• Sound produced by vibration of vocal
cords
• Sound modified by resonances of the
vocal tract

• International Phonetic Alphabet (IPA)
– phoneme = smallest unit of speech whose
substitution could change meaning
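The first two bullets above (a sound source at the vocal cords, shaped by vocal-tract resonances) can be sketched as a source followed by a filter. The sketch below is illustrative only: the pulse-train source, the two-pole resonators, and the pitch, formant and bandwidth values are all assumptions chosen for the example, not figures from these slides.

```python
import math

def impulse_train(f0_hz, duration_s, sample_rate):
    """Idealised voiced source: one unit pulse per pitch period."""
    n_samples = round(duration_s * sample_rate)
    period = int(round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, centre_hz, bandwidth_hz, sample_rate):
    """Two-pole filter standing in for a single vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius
    theta = 2.0 * math.pi * centre_hz / sample_rate       # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesise_vowel(f0_hz=120.0,
                     formants=((700.0, 80.0), (1200.0, 90.0)),  # assumed values
                     duration_s=0.05, sample_rate=8000):
    """Pass the source through one resonator per (centre, bandwidth) pair."""
    wave = impulse_train(f0_hz, duration_s, sample_rate)
    for centre_hz, bandwidth_hz in formants:
        wave = resonator(wave, centre_hz, bandwidth_hz, sample_rate)
    return wave

wave = synthesise_vowel()
print(len(wave))  # 400 samples: 0.05 s at 8000 Hz
```

Changing `f0_hz` alters the pulse spacing (the source), while changing `formants` alters the ringing between pulses (the filter); the two are set independently in this simplified picture.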
Speech sounds
• Speech production
– Articulators: how do they affect the speech waveform?
• Phonemes
– What are they, why are they useful?
– Phonemes are speech sounds in an ideal world.
• Phonetics
– How are phonemes actually realized?
– Phones are speech sounds in the real world.
– Allophones are different types of realisation.
• The wider context
– Language, accent,
– Speaker differences,
– Effect of external factors.
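One way to see why phonemes are useful is the minimal-pair test: two words whose phoneme sequences differ in exactly one position show that the two sounds contrast. The sketch below assumes a small hypothetical lexicon with broad transcriptions; the words and the helper names are invented for illustration.

```python
from itertools import combinations

# Hypothetical toy lexicon: word -> phoneme sequence (broad transcription).
LEXICON = {
    "pat": ("p", "æ", "t"),
    "bat": ("b", "æ", "t"),
    "pit": ("p", "ɪ", "t"),
    "bad": ("b", "æ", "d"),
    "spat": ("s", "p", "æ", "t"),
}

def is_minimal_pair(a, b):
    """True if the sequences have equal length and differ in one phoneme."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def minimal_pairs(lexicon):
    """List every pair of words that differ by a single phoneme substitution."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(sorted(lexicon.items()), 2):
        if is_minimal_pair(p1, p2):
            pairs.append((w1, w2))
    return pairs

print(minimal_pairs(LEXICON))
# [('bad', 'bat'), ('bat', 'pat'), ('pat', 'pit')]
```

Allophones would not show up in such a test: swapping one realisation of a phoneme for another changes how the word sounds but not which word it is.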
Vowels, consonants and syllables
• Vowels
– Vibrating vocal cords in larynx with clear vocal tract
– Produced using slower extrinsic muscles
• Consonants
– Usually some occlusion of the vocal tract
– Sound source can be from larynx, click or hiss
– Produced using faster intrinsic muscles
• Syllables
– All languages have CV syllables
– Basic unit of articulation
– Consonant clusters
Phonetics vs. orthography
• Letter-phoneme mapping is not 1-to-1:
• Some sounds require several letters
– e.g., “sh”, “ph”
• Some letters have several pronunciations
– e.g., “g”, “c”
• Some sounds have several transcriptions
– e.g., /f/: “f” and “ph”
• Some letters produce several sounds
– e.g., “x” /ks/
• Some combinations have complex relations
– e.g., “-ough-”
• Different accents use different phonemes
– e.g., “bath”
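The many-to-many relation listed above can be made concrete with a small lookup table. This is only a fragment restating a few of the examples; the table layout and the helper names are assumptions for illustration, not a complete account of English spelling.

```python
# Spelling -> list of possible pronunciations, each a tuple of phonemes.
SPELLING_TO_SOUNDS = {
    "f":  [("f",)],
    "ph": [("f",)],
    "sh": [("ʃ",)],
    "x":  [("k", "s")],
    "g":  [("g",), ("dʒ",)],   # as in "get" vs. "gem"
    "c":  [("k",), ("s",)],    # as in "cat" vs. "cent"
}

def spellings_for(sound, table):
    """Spellings that can produce the given phoneme sequence."""
    return sorted(s for s, sounds in table.items() if sound in sounds)

def ambiguous_spellings(table):
    """Spellings with more than one pronunciation."""
    return sorted(s for s, sounds in table.items() if len(sounds) > 1)

def multi_sound_spellings(table):
    """Spellings with a pronunciation containing more than one phoneme."""
    return sorted(s for s, sounds in table.items()
                  if any(len(p) > 1 for p in sounds))

print(spellings_for(("f",), SPELLING_TO_SOUNDS))   # ['f', 'ph']
print(ambiguous_spellings(SPELLING_TO_SOUNDS))     # ['c', 'g']
print(multi_sound_spellings(SPELLING_TO_SOUNDS))   # ['x']
```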
Prosody
• Pitch
– Corresponds to the frequency of vibration of
the vocal cords
– (Has phonetic significance in tonal languages)
• Intensity
– How loud a particular word or syllable is
• Timing
– Durations depend on the phrasing
(punctuation), context (cf. “league”, “leek”),
etc.
– Stress timed vs. syllable timed languages
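Pitch and intensity can both be estimated directly from a short frame of the waveform. The sketch below uses a plain autocorrelation peak search for pitch and an RMS level in decibels for intensity; the method, the search range of 50 to 400 Hz and the synthetic test frame are assumptions chosen for the example, not techniques prescribed by these slides.

```python
import math

def synthetic_voiced_frame(f0_hz, n_samples, sample_rate):
    """Test signal: a fundamental plus a weaker second harmonic."""
    return [math.sin(2 * math.pi * f0_hz * n / sample_rate)
            + 0.5 * math.sin(2 * math.pi * 2 * f0_hz * n / sample_rate)
            for n in range(n_samples)]

def autocorrelation_pitch(samples, sample_rate, f_min=50.0, f_max=400.0):
    """Return the frequency whose lag gives the largest autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = None, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def rms_level_db(samples):
    """Intensity as RMS level in dB relative to an amplitude of 1."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)

frame = synthetic_voiced_frame(200.0, 800, 8000)
print(autocorrelation_pitch(frame, 8000))  # 200.0
```

Real speech is less tidy than this synthetic frame, so practical pitch trackers add voicing decisions and smoothing over time on top of a basic estimate like this one.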
Language, accent and dialect
• Language
– A system of communication with a
vocabulary of words, grammar and syntax
– Different languages have different phonetic
contrasts (“right”, “light”)
• Accent
– Pronunciation variations that do not affect
meaning of spoken utterance (“good”, “food”)
– Intelligible to native speakers
• Dialect
– Variations in vocabulary, and possibly other
aspects, for a distinct population
Non-acoustic signals
• Many other sources of information from other
senses: face, body, gesture, touch,…
– can make you “hear” things differently
• Lip reading
– Information about articulation can be derived from
(peripherally) observing lips
– Major cue for hearing impaired
– Significant effect for normal hearers (the McGurk effect)
• Para-linguistic information
– Facial mood and emotion
– Culturally-grounded gestures
– Modifying gestures
– Body language
– Stress and emphasis
Complexity demands intelligence
• Speech is very complex
– requires fusion of many sources of knowledge

• Humans have developed large brains and
supreme intelligence in the animal
kingdom to deal with it:
– very large number of neurons, in parallel
Summary of speech comm.
• Speech is man's natural modality for
interacting with machines
– Ideas and language
– Written vs. spoken language (phonology)
– Continuous acoustic signal (phonetics)
• Phonemes, phones and allophones
– Vowels and consonants
– Phonetic vs. orthographic transcriptions
– Intelligible to native speakers
• Para-linguistic information
– Prosody: intensity, pitch and timing
– Language, accent and dialect
– Visual, haptic and contextual information
Speech recognition
What is speech recognition?
• Types of spoken language processing:
– Automatic speech recognition (ASR)
– Spoken language understanding
– Dialogue systems
– Paralinguistic speech processing
– Speech verification
– Speech coding, enhancement & modification
– Speech synthesis
– Spoken language generation
– Speaker recognition: identification and
authentication/verification
Speech recognition problem
• The dream and reality
– Intelligent machines?
– Size of vocabulary: 50, 1000, 20000 words
– Speaker-dependent vs. speaker-independent
• Discovering our ignorance
– How does the ear work?
– How does the brain process sounds to perceive
concepts?
• Circumventing our ignorance
– Ad-hoc rules vs. pattern matching techniques
– Most successful based on stochastic modelling
– Recent advances in neural network approaches
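As a concrete instance of pattern matching, the sketch below recognises an isolated word by comparing it with stored templates using dynamic time warping (DTW), which tolerates differences in speaking rate. DTW is chosen here as one classical example; the slides do not name it. The one-dimensional feature sequences and the template values are hypothetical.

```python
def dtw_distance(a, b):
    """Cost of the best time alignment between two feature sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or step both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def recognise(utterance, templates):
    """Return the word whose template is closest to the utterance."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

# Hypothetical templates: one feature value per frame.
TEMPLATES = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 2]}
slow_yes = [1, 1, 3, 3, 4, 3, 1]       # same shape as "yes", spoken slower
print(recognise(slow_yes, TEMPLATES))  # yes
```

A template matcher like this is naturally speaker-dependent and suited to small vocabularies, which is one reason the statistical approaches in the following bullets scale better.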
Dimensions of difficulty
• Speaker dependency
• Vocabulary size
• Isolated words vs. continuous speech
• Language constraints and knowledge
sources
• Acoustic ambiguity
• Noise robustness
Speech recognition summary
• Dream and reality
– Speech-to-text machines
– Vocabulary size and speaker-dependency trade off
against recognition accuracy
• Incomplete specification
– Of language, the human ear, the auditory nerves,
and how the cortex processes speech to derive
meaning
• An engineering solution
– Use pattern matching techniques
– Most successful based on Hidden Markov Models
– Recent advances in HMM/ANN hybrids
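To give a flavour of how a Hidden Markov Model is used, the sketch below runs the Viterbi algorithm on a toy two-state model to find the most likely hidden state sequence for a short observation sequence. The states, the "low"/"high" energy observations and every probability are invented for illustration; practical recognisers use far larger models and work with log probabilities to avoid underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state path, probability of that path)."""
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this time step.
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical toy model.
STATES = ("silence", "speech")
START = {"silence": 0.8, "speech": 0.2}
TRANS = {"silence": {"silence": 0.7, "speech": 0.3},
         "speech": {"silence": 0.2, "speech": 0.8}}
EMIT = {"silence": {"low": 0.9, "high": 0.1},
        "speech": {"low": 0.2, "high": 0.8}}

path, prob = viterbi(["low", "high", "high", "low"], STATES, START, TRANS, EMIT)
print(path)  # ['silence', 'speech', 'speech', 'silence']
```

The same dynamic-programming idea extends to word recognition, where the hidden states represent parts of phonemes or words rather than silence and speech.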