School of Computing and Mathematical Sciences

CMSCD1008 Introduction to Multimedia Technology

Lecture 6: Speech recognition and synthesis
Chris Wren

In this session...

What is speech recognition?

Types of speech recognition
How does it work?
What are its uses?

What is speech synthesis?

How does it work?
What are its uses?

Speech recognition

Speech recognition is the process of recognising and understanding human speech and interpreting this inside a computer.
It is a very demanding task that requires a huge amount of processing power.
Bear in mind that it takes humans many years to fully understand spoken instructions!

It has recently become more widespread due to:

Reductions in the cost of CPU chips
Increases in CPU processing power

Types of speech recognition

The user gives the computer simple spoken commands, e.g. Start Word

Discrete recognition
The user speaks single words separated by distinct pauses to construct a given sentence or phrase
e.g. This is HND Multimedia

Continuous recognition
The user speaks using natural language with no pauses (i.e. they use normal conversation)
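The difference between discrete and continuous recognition can be illustrated with a minimal Python sketch. It assumes the speech has already been reduced to a list of loudness values, one per short time frame, and it finds word boundaries by looking for pauses. The function name, the threshold and the minimum pause length are all hypothetical values chosen for illustration, not taken from any real recogniser.

```python
def split_on_pauses(energies, threshold=0.1, min_pause=3):
    """Return (start, end) frame pairs for each word-like segment.

    energies  -- per-frame loudness values (hypothetical input)
    threshold -- frames quieter than this count as silence (assumed value)
    min_pause -- consecutive silent frames needed to end a word (assumed value)
    The end index of each pair is exclusive.
    """
    segments = []
    start = None      # frame where the current word began, or None
    silent_run = 0    # consecutive silent frames seen inside a word

    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause:
                # The word ended just before this run of silence began.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0

    if start is not None:
        # Speech ran up to the end of the input (ignoring trailing silence).
        segments.append((start, len(energies) - silent_run))
    return segments


# Two words separated by a distinct pause (discrete speech).
discrete = [0, 0, 0.5, 0.6, 0, 0, 0, 0.7, 0.8, 0.9, 0, 0]
print(split_on_pauses(discrete))

# No pauses at all (continuous speech): only one segment is found.
continuous = [0.5] * 6
print(split_on_pauses(continuous))
```

With distinct pauses each word comes out as its own segment. With continuous speech the whole utterance arrives as one segment, so the recogniser has to find the word boundaries some other way.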

Characteristics of speech

Phonemes are the fundamental elements of pronunciation in a language

There are about 40 to 45 phonemes in the English language (the exact number depends on the accent) from which all words are constructed
Normal conversation requires the recognition of 10 to 15 phonemes per second

Syllables are composed of phonemes
Words are made up of syllables
Phrases and sentences are composed of words

Understanding speech
I would like an ice-cream
This is the sentence that has been spoken by the human.

Eye wood like a nice scream
The first attempt at recognition will try to identify words that it understands.

I would like a nice scream
I would like an ice-cream
The recogniser then uses language- and grammar-specific rules to determine whether the sentence actually makes sense.
Some words have common pairings in general text and the recogniser will try many different combinations.
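The idea of common word pairings can be sketched in a few lines of Python. This is only an illustration: the scores in the table are invented, and the names PAIR_SCORES, UNSEEN, sentence_score and best_candidate are hypothetical. A real recogniser would learn its pairings from a very large amount of text.

```python
# Hypothetical scores for how often one word follows another in general text.
# These numbers are invented purely for illustration.
PAIR_SCORES = {
    ("i", "would"): 0.9,
    ("would", "like"): 0.8,
    ("like", "an"): 0.5,
    ("like", "a"): 0.6,
    ("an", "ice-cream"): 0.7,
    ("a", "nice"): 0.4,
    ("nice", "scream"): 0.01,
}
UNSEEN = 0.001  # assumed score for any pairing not in the table


def sentence_score(sentence):
    """Multiply together the scores of every adjacent pair of words."""
    words = sentence.lower().split()
    score = 1.0
    for pair in zip(words, words[1:]):
        score *= PAIR_SCORES.get(pair, UNSEEN)
    return score


def best_candidate(candidates):
    """Return the candidate sentence with the highest pairing score."""
    return max(candidates, key=sentence_score)


candidates = [
    "Eye wood like a nice scream",
    "I would like a nice scream",
    "I would like an ice-cream",
]
print(best_candidate(candidates))
```

With these made-up scores the sketch picks "I would like an ice-cream", because "nice scream" is a much rarer pairing than "an ice-cream".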

Speaker recognition

Speaker-independent recognition
The software can recognise any user
Is generally pre-trained by a lot of different users
Difficult to develop and expensive to build

Speaker-dependent recognition
The software can only recognise one user
Is generally trained by that user
Can be made to recognise new words

Problems in recognising speech

Background noise
Differences in microphones

Headsets, lapel microphones and speaker phones

Children's voices as well as adult voices
Accents and dialects
Foreign languages
Specialist vocabularies
e.g. legal or medical terminology

Recognition rates

The recognition rate is the percentage of words that a recogniser recognises correctly.
This can be calculated as follows:

Recognition rate = (Number of correct words / Number of test words) × 100

Current rates are around 90% to 95%.
To improve this rate you have to train the recogniser to recognise your voice.
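As a worked example, the recognition rate calculation can be written as a short Python function. The function name and the test figures (186 correct words out of 200) are hypothetical and chosen only to show the arithmetic.

```python
def recognition_rate(correct_words, test_words):
    """Return the percentage of test words that were recognised correctly."""
    if test_words <= 0:
        raise ValueError("test_words must be greater than zero")
    return correct_words * 100 / test_words


# Hypothetical test: 186 words recognised correctly out of 200 spoken.
print(recognition_rate(186, 200))  # 93.0
```

A rate of 93% still means that roughly one word in every fourteen has to be corrected by hand.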

Uses for speech recognition

Hands-free typing
Dictation (no need for a secretary)
Language translation

Voice print identification

e.g. accessing a cash machine using your voice

Devices that are difficult to use with your hands

In a car, e.g. the AutoPC

Improved accessibility for disabled people
Telephone support

Eliminating "press hash now"-type prompts

Speech synthesis

Speech synthesis is the conversion of electronic text into spoken output
Sometimes known as Text-To-Speech (TTS)
Has a reputation for sounding like a robot
Listen to Stephen Hawking's speech synthesiser!

Modern TTS synthesisers have very realistic-sounding voices for general text
Some can even be made to sing and whistle!


Types of speech synthesiser

Formant synthesis
This models the human vocal system from scratch using a frequency analysis of real speech
It then recreates these frequencies using a sound synthesiser

Concatenative synthesis
This constructs the speech by joining together small samples of the basic phonemes that make up real speech
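The joining step in concatenative synthesis can be sketched in Python as follows. This is a toy illustration under stated assumptions: the "recordings" are short sine tones standing in for real phoneme samples, and the phoneme labels, sample rate and function names are all hypothetical.

```python
import math

SAMPLE_RATE = 8000           # samples per second (assumed value)
SAMPLES_PER_PHONEME = 400    # length of each stand-in recording (assumed value)


def placeholder_sample(frequency):
    """Stand-in for a recorded phoneme: a short sine tone, not real speech."""
    return [
        math.sin(2 * math.pi * frequency * n / SAMPLE_RATE)
        for n in range(SAMPLES_PER_PHONEME)
    ]


# Hypothetical phoneme inventory; the labels and tones are placeholders only.
PHONEME_SAMPLES = {
    "k": placeholder_sample(300),
    "ae": placeholder_sample(500),
    "t": placeholder_sample(700),
}


def synthesise(phonemes):
    """Join the stored sample for each phoneme, in order, into one waveform."""
    output = []
    for phoneme in phonemes:
        output.extend(PHONEME_SAMPLES[phoneme])
    return output


waveform = synthesise(["k", "ae", "t"])
print(len(waveform))  # 1200 samples: three phonemes of 400 samples each
```

Simply butting the samples together like this leaves audible joins. A practical synthesiser would also need to smooth the transitions between samples.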


Problems with speech synthesis

Speech synthesis is very challenging
The main area where speech synthesisers are weak is in their simulation of prosody
The changes in rhythm, intonation and stress as we speak

To generate realistic prosody, the computer must understand the meaning of the text
Otherwise it sounds lifeless and electronic


Uses for speech synthesis

Proof-reading of written text

Documents, reports, etc.
Email

Improved accessibility for disabled people

Screen readers

Automated telephone help systems
User interface enhancements

Synthesised voice can be more friendly than text
Voice does not take up any screen space


Speech recognition and speech synthesis require a large amount of processing power to be effective
They also require large amounts of memory in which to process this data
Successful use of a recognition package requires extensive training, which is a continuous process
Recognition rates are currently around 90% to 95%

Next session...

