
SCIENCE & TECHNOLOGY

CONVERSING WITH COMPUTERS - NATURALLY

by Professor Marcel Tatham, Department of Language and Linguistics, University of Essex

For the past 20 years or so, a great deal of research has been
aimed at enabling users of computers to communicate with their
machines by voice only, instead of by keyboards and visual
display units. What at first appeared to be relatively easy, getting a
computer to talk with a human-like voice and making it able to
respond appropriately when spoken to, has turned out to be
extremely difficult. It is so difficult that we can now confidently
predict that it will be several decades before fully natural, free-
flowing conversation can take place between people and
machines. Only now are we beginning to develop systems that
perform the task in an acceptable way, though in certain restricted
areas conversation with computers is already with us.

The scenario of a conversational system with a machine requires us to picture using a computer to replace one of a pair of human beings speaking to one another, that is, using the machine to simulate the behaviour involved in carrying out one side of the conversation. So we can begin by examining in general terms what it is that a person does during the communication process.

On the surface, what happens looks simple enough: when someone is spoken to, the immediate response is to speak back, enabling information to flow back and forth between the two speakers. We can model this behaviour as consisting of three separate components: (1) hearing the message or question; (2) thinking about how to respond; (3) speaking the response.

Hearing is in fact a two-stage process. The speech signal enters the ear and almost immediately a complex acoustic analysis takes place in the inner ear. This is a passive process involving no thought on the part of the listener. The results of the analysis are sent to the brain where, in a second stage, cognitive processing takes place to further refine the analysis and to 'label' the data as particular speech sounds or words.

Before any response to the perceived signal can be generated, cognitive processes must take place that result in the listener's understanding of what has just been heard. Only when the message has been understood can the listener compose a suitable response.

Two stages are involved in speaking the response. First, the message needs to be encoded linguistically: thought is turned into language as preparations are made to adjust the speech organs to produce the appropriate acoustic signal. Second, a quite complex process of neuromuscular control is brought into play to make sure that lungs, vocal cords, tongue, lips and so forth are organized to produce the right sounds at the right time.

The processes are repeated over and over by each participant in the conversation as the exchange of information unrolls.
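The three-component model of conversation can be pictured as a simple loop. The sketch below is purely illustrative, with invented function names and a one-entry reply table standing in for the enormously complex processing the article describes; it shows only the shape of the hear-think-speak cycle.

```python
# Illustrative sketch of the three-stage model: (1) hear, (2) think, (3) speak.
# All names and the toy reply table are invented for illustration.

def hear(utterance: str) -> str:
    """Stage 1: analyse the incoming signal and label it as words."""
    return utterance.lower().strip("?!.")

def think(message: str) -> str:
    """Stage 2: cognitive processing - compose a suitable response."""
    replies = {"how are you": "very well, thank you"}
    return replies.get(message, "i did not understand")

def speak(response: str) -> str:
    """Stage 3: encode the response and produce the outgoing signal."""
    return response.capitalize() + "."

def converse(utterance: str) -> str:
    """One turn of the conversation: the three stages applied in order."""
    return speak(think(hear(utterance)))

print(converse("How are you?"))   # -> Very well, thank you.
```

In a real system each of these three functions hides years of research; the point of the sketch is only that the stages are separable and run in a fixed order, which is what makes it plausible to simulate them one program at a time.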

Simulating conversation
Fig. 1. Acoustic signal of the sentence "It's a black cat." It is easy to spot the individual words.

The basic idea of replacing one human being by a machine is simple enough: programs replace in turn the three stages of the human process being simulated. The first stage is parallel to the hearing process. A microphone picks up the speech signal and a program simulates the ear's analysis of the acoustic signal. The results of this analysis are processed by a second program which conducts the labelling task: the acoustic signal is identified as containing particular speech sounds that in combination represent individual words.
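The ear's passive acoustic analysis is commonly simulated with a short-time Fourier transform, which turns the raw waveform into a time-frequency "map" of the kind shown in Fig. 2. The sketch below is a minimal version of that idea on a synthetic signal; the frame and hop sizes are arbitrary choices, not values from the article.

```python
import numpy as np

# Minimal short-time Fourier analysis: slice the waveform into overlapping
# windowed frames and take the magnitude spectrum of each frame.
fs = 8000                                   # sample rate in Hz (assumed)
t = np.arange(0, 1.0, 1 / fs)
# Synthetic "speech": a vowel-like tone switched on and off, standing in
# for the alternation of loud and quiet segments in real speech.
signal = np.sin(2 * np.pi * 200 * t) * (np.sin(2 * np.pi * 3 * t) > 0)

frame, hop = 256, 128                       # analysis window and step, in samples
frames = [signal[i:i + frame] * np.hanning(frame)
          for i in range(0, len(signal) - frame, hop)]
spectrogram = np.abs(np.fft.rfft(frames, axis=1))   # one spectrum per frame

print(spectrogram.shape)    # (number of frames, frequency bins)
```

Each row of `spectrogram` is a snapshot of the signal's frequency content at one moment; stacked over time, the rows form the map in which a labelling program looks for the patterns of particular speech sounds.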
ELEKTOR ELECTRONICS JANUARY 1990

Fig. 2. Acoustic analysis by computer of the sentence in Fig. 1. In this, a kind of map of the inner details of the signal, the boundaries between the words are still easily detected.

This stage is usually called 'automatic speech recognition'.

The next stage in the simulation is to copy the human process of understanding the message content of the speech just received. We call this 'speech understanding', and it is by far the most difficult part of the system in which to achieve satisfactory results. People understand speech in the context of their accumulated experience of the world. Obviously, to put such knowledge into a computer is a vast, if not impossible, task. In practice ways are found of getting around the problem, usually by restricting the area of conversation as much as possible, so that the contexts the computer must know to interpret the information are narrowed down to a manageable size. The processes involved are studied within the field of artificial intelligence, which here means simulating the cognitive behaviour of people.

Once the computer has understood what has been said to it, it is able to formulate a response which must then be spoken. Speech synthesis forms the final stage of the process. A program takes the linguistically encoded response and, mimicking the way a human being produces an acoustic signal, produces speech through a loudspeaker.

Fig. 3. Waveform of the sentence "How are you?" Because the consonants are spoken in much the same way as we speak vowels, the words are blurred together.

Experience and labelling

The most difficult part of getting computers to hold conversations with people is the simulation of the cognitive processing involved in understanding what the person has said and in formulating an appropriate response. The reason for this is that speech signals do not contain within themselves all the information needed for their understanding. Take the simple sentence "It's warm today." Spoken during the winter this might mean that the temperature has risen to 10 degrees, but spoken at the height of summer it might well mean that the temperature has reached 25 degrees. Human beings talking to one another know whether it is winter or summer; the computer does not. All of us have spent our lifetimes acquiring such information, which we use every time we interpret what is said to us. The problem is how to give the computer just the right 'experience' to be able to put what it hears into context and reach an understanding of the message.
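The winter/summer example can be made concrete with a toy interpreter. The sketch below is invented for illustration, not a real system: it hard-codes one sentence and two seasonal readings, which is exactly the kind of domain restriction the article describes.

```python
# Toy illustration of context-dependent interpretation in a deliberately
# restricted domain (weather talk). Readings are invented for illustration.

def interpret(sentence: str, season: str) -> str:
    """Map a sentence to a reading, using the season as 'experience'."""
    if sentence == "it's warm today":
        if season == "winter":
            return "temperature has risen to about 10 degrees"
        if season == "summer":
            return "temperature has reached about 25 degrees"
    return "outside the system's restricted domain"

print(interpret("it's warm today", "winter"))
print(interpret("it's warm today", "summer"))
```

The same words yield different meanings because the context is supplied alongside the sentence; scaling this up from one hard-coded fact to a lifetime's worth of world knowledge is the vast task the article refers to.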



Fig. 4. Analysed version of the waveform in Fig. 3. For this phrase the segmentation of individual sounds and words is by no means obvious.

But even if we are able to limit the topic of conversation in such a way that the computer stands some chance of understanding what is said to it, there is also the difficult task of converting the sound it hears into the sentences it must understand. Figure 1 shows the acoustic signal of the spoken sentence "It's a black cat." In the waveform, time runs from left to right and it is easy to spot, by looking at the way in which the amplitude of the tracing changes, the individual words in the sentence. Labelling this signal is not at all difficult because it is naturally segmented. It consists of alternating consonants (without loud vocal cord vibration) and vowels (with loud vocal cord vibration). It is the alternation that results in the widening and narrowing of the trace. Figure 2 shows the acoustic analysis the computer performs on the same sentence, producing a kind of map of the inner details of the signal. Once again, the boundaries between the words are easily detected.

Sentences are rarely this easy to segment, however. In Fig. 3 we see the waveform of the sentence "How are you?" Although in the spelling of this sentence we see orthographic consonants and vowels alternating, the particular consonants here (w, r, y) are spoken in much the same way as we speak vowels. The result is a blurring together of the words, making it almost impossible to spot the boundaries between them. If we look at Fig. 4, which is the analysed version of the same sentence, we can see that the segmentation of the individual sounds and words that make up the phrase is by no means obvious.

Synthesizing the response

The easiest component of the system is the synthesis of the message the computer has generated in response to what it has heard. Assuming the problems of the earlier stages have been overcome and that the computer has formulated what it wants to say, the response must now be spoken. In some sense the task here is the opposite of the labelling one: the computer has generated the right labels and arranged the words to form a sentence, and now the acoustic signal has to be generated. The building blocks used to form the spoken sentence are individual sounds but, as we saw above, people do not speak sequences of isolated sounds. They run them together. Blurring the boundaries between speech segments forms the basis of good speech synthesis, and success depends on accurately generating the different types of blurring, as shown by Figures 1 and 3.

Conversations with computers will be a success only when everything is as natural as between two human beings. Each of the stages described above can now be accomplished with varying degrees of success, and we have complete systems in operation in the laboratory. But they are unnatural in that, for example, the computer cannot detect the subtleties hidden within human speech nor reproduce them when it speaks. These subtleties largely communicate people's attitudes, feelings or emotions. We can say the simple word "hello" in such a way that it communicates how pleased we are to see someone; or how relieved we are they have shown up; or how angry we are; or how surprised, and so on. Research is well under way to determine just how a person communicates such emotive subtleties in speech, and how these can be detected and reproduced by the computer. Preliminary results of incorporating this research into our conversational simulations show naturalness to be considerably improved.

It will be some time before holding what seems a quite natural conversation with a computer will be a common everyday occurrence. But already, computers that understand us and can respond by speaking back are beginning to make their appearance. In certain restricted areas, such as information services available over the telephone, experimental systems are already in use. Banking transactions and airline time-table services are the first of such systems becoming available, and as we improve the conversational abilities of the computer we shall see a rapid expansion of interaction between human beings and computers, not using keyboards and screens but natural-sounding speech.

