Is a speech recognition system an intelligent machine?

Philip JB Jackson

Centre for Vision, Speech & Signal Processing, UniS

INTRODUCTION

Qualities of intelligence
1. sophisticated perceptual abilities
2. complex behaviour
3. understanding of the world
4. resilience to error
5. self-awareness
6. adaptability

INTELLIGENCE

1. Perception
• feature extraction
• active sensing
• multi-modal integration

2. Behaviour
• complex actions
• inference and reasoning
• intention
• autonomy

3. World model
• objects
– with attributes

• actions
– can act on certain objects

• community (relations)

• environment

4. Robustness
• flexibility

• strategies that allow for mistakes
– e.g., trial and error

• redundancy

5. Self model
• present position
• present activity
• fitness/ability
• bill of health
• identity

6. Ability to adapt
• learning
– new experience

• adaptation
– change in context/environment

• update
– new knowledge or structure

• feedback
– perception of own performance

Attributes of an intelligent entity
1. Sensory input
2. Memory, processor, actuator
3. Models
4. Robust architecture
5. States
6. Learning, adaptation, update & feedback

RECOGNIZER

Automatic speech recognition
[Block diagram. Signal path: speech signal → front-end processing → decoder → transcription. Knowledge sources shown: low-level models, syntax/grammar, language models, speaker models.]

Automatic speech recognition
1. Audio(/video) input processing

2. As an expert system, relating to:
– transcription, database entries, commands, natural language

3. Hidden Markov models

4. Training, and noise and speaker adaptation
– EM using the Baum-Welch algorithm, PMC/spectral subtraction, MLLR/formant normalisation

5. Decoding
– Viterbi algorithm
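
As a rough illustration of the spectral subtraction named under item 4, the sketch below subtracts a noise magnitude estimate from a noisy magnitude spectrum, bin by bin, with a spectral floor to avoid negative values. The function name and the parameters over_subtraction and floor are hypothetical choices for this example, and the noise estimate is assumed to come from a speech-free stretch of the recording.

```python
def spectral_subtract(noisy_mag, noise_mag, over_subtraction=1.0, floor=0.01):
    """Subtract a noise magnitude estimate from a noisy magnitude spectrum.

    noisy_mag and noise_mag are equal-length lists of non-negative
    magnitudes, one per frequency bin, for a single analysis frame.
    over_subtraction and floor are illustrative tuning parameters.
    """
    cleaned = []
    for y, n in zip(noisy_mag, noise_mag):
        estimate = y - over_subtraction * n
        # Clamp to a small fraction of the noisy magnitude so that
        # no bin ends up negative after subtraction.
        cleaned.append(max(estimate, floor * y))
    return cleaned


frame = [1.0, 0.5, 0.2]   # noisy magnitudes for three bins
noise = [0.3, 0.3, 0.3]   # assumed noise estimate for the same bins
cleaned_frame = spectral_subtract(frame, noise)
```

In the third bin the noise estimate exceeds the observed magnitude, so the floor takes over.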

Language understanding
• Symbols
– description

• Word network
– syntax

• Language model
– priors

• Transcript
– output
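
To make "language model as priors" concrete, here is a minimal sketch of a bigram model that assigns a prior probability to each word given the previous one. The add-one smoothing, the sentence markers <s> and </s>, and the function names are assumptions made for this example, not details taken from the slides.

```python
from collections import Counter


def train_bigram(sentences):
    """Count word-pair statistics from a list of tokenised sentences."""
    context_counts = Counter()   # how often each word appears as a context
    bigram_counts = Counter()    # how often each (previous, current) pair occurs
    vocab = set()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, cur, context_counts, bigram_counts, vocab):
    """P(cur | prev) with add-one smoothing over the vocabulary."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))


corpus = [["switch", "the", "light", "on"],
          ["switch", "the", "light", "off"]]
contexts, bigrams, vocab = train_bigram(corpus)
p_on = bigram_prob("light", "on", contexts, bigrams, vocab)
p_the = bigram_prob("light", "the", contexts, bigrams, vocab)
```

Here p_on exceeds p_the because "light on" was seen in training and "light the" was not; a decoder would use such priors to prefer plausible word sequences.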

Understanding of the world
• models
– refinement of perception during training, speaker adaptation (off-/on-line)

• symbol grounding
– correlations of sounds and experience
• e.g., onomatopoeia, pop, bang, whizz, whisper

– correlations of words and experience
• e.g., semantic context, light/dark, noisy

– developmental

• social context
– sympathy, altruism & social behaviour

Details of front end
• Auditory analogy
• noise robustness
• speaker models and adaptation

Decoding with HMMs
• “What is the most likely interpretation of what I just heard?”
– Hypotheses
– Pruning
– Traceback

• Gaussian pdfs are a natural choice of building block for emission probabilities.
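
The decoding question above can be sketched as a small Viterbi search over an HMM with one-dimensional Gaussian emissions. This is a toy illustration under stated assumptions: a single Gaussian per state, scalar observations, and no pruning (a practical decoder would discard low-scoring hypotheses at each frame). The function names and the two-state model are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi(observations, log_init, log_trans, means, variances):
    """Return the most likely state sequence and its log probability."""
    n_states = len(log_init)
    # scores[j]: best log probability of any hypothesis ending in state j
    scores = [log_init[j] + gaussian_logpdf(observations[0], means[j], variances[j])
              for j in range(n_states)]
    backpointers = []
    for x in observations[1:]:
        new_scores, pointers = [], []
        for j in range(n_states):
            # Keep only the best predecessor for each state.
            best_i = max(range(n_states), key=lambda i: scores[i] + log_trans[i][j])
            new_scores.append(scores[best_i] + log_trans[best_i][j]
                              + gaussian_logpdf(x, means[j], variances[j]))
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Traceback from the best final state.
    state = max(range(n_states), key=lambda j: scores[j])
    best_score = scores[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score


# Toy two-state model: state 0 emits values near 0, state 1 values near 5.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
path, score = viterbi([0.1, -0.2, 4.9, 5.2], log_init, log_trans,
                      means=[0.0, 5.0], variances=[1.0, 1.0])
# path -> [0, 0, 1, 1]
```

Working in the log domain keeps the products of many small probabilities from underflowing.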

Training process
• Initialisation
– monophone models
– alignment

• Refinement
– triphones
– clustering
– tied mixture models

• multi-modal techniques

SUMMARY

Features of current systems
• Speaker ID
• Speaker normalisation
• Tracking of speech dynamics
• Integration of multimodal stimuli
• Auditory scene analysis
• Source separation (cocktail party)
• Parallel model combination

For the future

• Auditory/vocal behaviour of locust/bird/chimp/?

• Better treatment of fluent speech
– coarticulation/dialogue language models/accents

• Training for a well-rounded artefact
– defence/business news/your own offspring!

• The Apple and the Cross