
Is a speech recognition system an intelligent machine?

Philip JB Jackson

Centre for Vision, Speech & Signal Processing, UniS


INTRODUCTION

Qualities of intelligence
1. sophisticated perceptual abilities
2. complex behaviour
3. understanding of the world
4. resilience to error
5. self-awareness
6. adaptability
INTELLIGENCE

1. Perception
• feature extraction
• active sensing
• multi-modal integration
INTELLIGENCE

2. Behaviour
• complex actions
• inference and reasoning
• intention
• autonomy
INTELLIGENCE

3. World model
• objects
– with attributes
• actions
– can act on certain objects
• community (relations)
• environment
INTELLIGENCE

4. Robustness
• flexibility
• strategies that allow for mistakes
– e.g., trial and error
• redundancy
INTELLIGENCE

5. Self model
• present position
• present activity
• fitness/ability
• bill of health
• identity
INTELLIGENCE

6. Ability to adapt
• learning
– new experience
• adaptation
– change in context/environment
• update
– new knowledge or structure
• feedback
– perception of own performance
INTELLIGENCE

Attributes of an intelligent entity


1. Sensory input
2. Memory, processor, actuator
3. Models
4. Robust architecture
5. States
6. Learning, adaptation, update & feedback
RECOGNIZER

Automatic speech recognition


[Figure: block diagram of a recognizer. Input: speech signal. Output: transcription. Blocks: front-end processing, low-level models, decoder, syntax/grammar, language models, speaker models.]
RECOGNIZER

Automatic speech recognition


1. Audio(/video) input processing
2. As an expert system, relating to: transcription, database entries, commands, natural language
3. Hidden Markov models
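4. Training, and noise and speaker adaptation
– training: expectation-maximisation (EM) using the Baum-Welch algorithm
– noise adaptation: parallel model combination (PMC), spectral subtraction
– speaker adaptation: maximum likelihood linear regression (MLLR), formant normalisation
5. Decoding
– Viterbi algorithm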
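Spectral subtraction, named above as a noise-adaptation technique, can be sketched in a few lines. This is a minimal illustration, not the method of any particular recognizer: it assumes magnitude spectra are already available as plain lists, that some frames are known to contain noise only, and the names `estimate_noise`, `spectral_subtract` and the `floor` parameter are hypothetical.

```python
def estimate_noise(noise_frames):
    """Average the magnitude in each frequency bin over noise-only frames."""
    n_frames = len(noise_frames)
    return [sum(bin_values) / n_frames for bin_values in zip(*noise_frames)]


def spectral_subtract(noisy_mag, noise_mag, floor=0.01):
    """Subtract a noise magnitude estimate from one frame's magnitude spectrum.

    `floor` (an assumed tuning parameter) keeps a small fraction of the noisy
    magnitude in each bin so that no bin becomes negative.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(y - n, floor * y) for y, n in zip(noisy_mag, noise_mag)]


# Two noise-only frames, then one noisy speech frame (two bins each).
noise = estimate_noise([[1.0, 2.0], [3.0, 2.0]])
cleaned = spectral_subtract([5.0, 1.0], noise, floor=0.1)
```

The floor is one common way to avoid negative magnitudes; other choices (for example, half-wave rectification to zero) would also fit the same outline.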
RECOGNIZER

Language understanding
• Symbols
– description
• Word network
– syntax
• Language model
– priors
• transcript
– output
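To illustrate how a language model supplies priors over word sequences, here is a small sketch of a bigram model. It is an assumption-laden toy: the slide does not say which kind of language model is meant, the add-one smoothing is chosen only for simplicity, and the names `train_bigram` and `log_prior` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count word pairs in a list of tokenised sentences."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]  # sentence start/end markers
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts, len(vocab)


def log_prior(words, model):
    """Log prior probability of a word sequence under the bigram model."""
    history_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing (an assumption) so unseen pairs keep non-zero mass.
        p = (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + vocab_size)
        total += math.log(p)
    return total


model = train_bigram([["switch", "on", "the", "light"],
                      ["switch", "off", "the", "light"]])
likely = log_prior(["switch", "on", "the", "light"], model)
unlikely = log_prior(["light", "the", "on", "switch"], model)
```

A word order seen in training receives a higher prior than a scrambled one, which is the information a decoder can combine with acoustic scores.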
RECOGNIZER

Understanding of the world


• models
– refinement of perception during training, speaker adaptation (off-/on-line)
• symbol grounding
– correlations of sounds and experience
• e.g., onomatopoeia, pop, bang, whizz, whisper
– correlations of words and experience
• e.g., semantic context, light/dark, noisy
– developmental
• social context
– sympathy, altruism & social behaviour
RECOGNIZER

Details of front end


• Auditory analogy
• noise robustness
• speaker models and adaptation
RECOGNIZER

Decoding with HMMs


• “What is the most likely interpretation of what I just heard?”
– Hypotheses
– Pruning
– Traceback
• Gaussian pdfs are a natural choice of building block for emission probabilities
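The three steps listed above (hypotheses, pruning, traceback) can be sketched as a small Viterbi decoder with Gaussian emissions. This is a minimal illustration under stated assumptions: observations are one-dimensional, each state has a single Gaussian, pruning is a simple beam on log scores, and the names `log_gauss`, `viterbi` and `beam` are hypothetical rather than taken from any toolkit.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi(obs, log_init, log_trans, means, variances, beam=float("inf")):
    """Return the most likely state path and its log score."""
    n_states = len(log_init)
    # Hypotheses: score[j] is the best log score of any path ending in state j.
    score = [log_init[j] + log_gauss(obs[0], means[j], variances[j])
             for j in range(n_states)]
    backptr = []
    for x in obs[1:]:
        best = max(score)
        # Pruning: only states within `beam` of the best score are extended.
        active = [i for i in range(n_states) if score[i] >= best - beam]
        new_score, pointers = [], []
        for j in range(n_states):
            prev = max(active, key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[prev] + log_trans[prev][j]
                             + log_gauss(x, means[j], variances[j]))
            pointers.append(prev)
        score = new_score
        backptr.append(pointers)
    # Traceback: follow the stored pointers from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, max(score)


log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
path, score = viterbi([0.1, -0.2, 4.9, 5.2], log_init, log_trans,
                      means=[0.0, 5.0], variances=[1.0, 1.0], beam=5.0)
```

Working in the log domain avoids numerical underflow on long utterances; a practical recognizer would use multivariate mixtures and a far larger state space, which this sketch does not attempt.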
RECOGNIZER

Training process
• Initialisation
– monophone models
– alignment
• Refinement
– triphones
– clustering
– tied mixture models
• multi-modal techniques
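As a small illustration of the refinement from monophones to triphones, the sketch below expands a phone sequence into context-dependent labels. The `left-centre+right` label notation is an assumption borrowed from common toolkit conventions, not something stated on the slide, and `to_triphones` is a hypothetical helper.

```python
def to_triphones(phones):
    """Expand a monophone sequence into 'left-centre+right' context labels."""
    labels = []
    for i, centre in enumerate(phones):
        label = centre
        if i > 0:
            label = phones[i - 1] + "-" + label   # add left context
        if i < len(phones) - 1:
            label = label + "+" + phones[i + 1]   # add right context
        labels.append(label)
    return labels


triphones = to_triphones(["k", "ae", "t"])
```

Because the number of distinct triphones grows quickly with the phone set, models for similar contexts are typically clustered and their parameters tied, which is the step the slide lists next.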
SUMMARY
Features of current systems
• Speaker ID
• Speaker normalisation
• Tracking of speech dynamics
• Integration of multimodal stimuli
• Auditory scene analysis
• Source separation (cocktail party)
• Parallel model combination
SUMMARY
For the future
• Auditory/vocal behaviour of locust/bird/chimp/?
• Better treatment of fluent speech
– coarticulation/dialogue language models/accents
• Training for a well-rounded artefact
– defence/business news/your own offspring!
• The Apple and the Cross