Bridging the Gap: Academic and Industrial Research in Dialog Technologies Workshop Proceedings, pages 89–96,
NAACL-HLT, Rochester, NY, April 2007. © 2007 Association for Computational Linguistics
implemented by Wallace in 1995. ALICE knowledge about English conversation patterns is stored in AIML files. AIML, or Artificial Intelligence Mark-up Language, is a derivative of Extensible Mark-up Language (XML). It was developed by Wallace and the Alicebot free software community during 1995-2000 to enable people to input dialogue pattern knowledge into chatbots based on the A.L.I.C.E. open-source software technology.

In this paper we present other methods to evaluate chatbot systems. The ALICE chatbot system was used for this purpose: a Java program was developed to read from a corpus and convert the text to the AIML format. The Corpus of Spoken Afrikaans (Korpus Gesproke Afrikaans, KGA), the corpus of the holy book of Islam (the Qur'an), and the FAQ of the School of Computing at the University of Leeds (http://www.comp.leeds.ac.uk) were used to produce the KGA prototype, the Qur'an prototype, and the FAQchat prototype respectively.

Section 2 presents the Loebner Prize contest, and section 3 illustrates the ALICE/AIML architecture. The evaluation techniques of the KGA prototype, the Qur'an prototype, and the FAQchat prototype are discussed in sections 4, 5, and 6 respectively. The conclusion is presented in section 7.

2 The Loebner Prize Competition

The story began with the "imitation game" presented in Alan Turing's paper "Can Machines Think?" (Turing 1950). The imitation game has a human observer who tries to guess the sex of two players, one of whom is a man and the other a woman, while screened from telling which is which by voice or appearance. Turing suggested putting a machine in the place of one of the humans and essentially playing the same game. If the observer cannot tell which is the machine and which is the human, this can be taken as strong evidence that the machine can think.

Turing's proposal provided the inspiration for the Loebner Prize competition, which was an attempt to implement the Turing test. The first contest, organized by Dr. Robert Epstein, was held in 1991 in Boston's Computer Museum. In this incarnation the test was known as the Loebner contest, as Dr. Hugh Loebner pledged a $100,000 grand prize for the first computer program to pass the test. At the beginning it was decided to limit the topic, in order to limit the amount of language the contestant programs must be able to cope with, and to limit the tenor. Ten agents were used, of which 6 were computer programs. Ten judges would converse with the agents for fifteen minutes and rank the terminals in order from the apparently least human to the most human. The computer with the highest median rank wins that year's prize. Joseph Weintraub won the first, second, and third Loebner Prizes in 1991, 1992, and 1993 for his chatbots PC Therapist, PC Professor, which discusses men versus women, and PC Politician, which discusses Liberals versus Conservatives. In 1994 Thomas Whalen (Whalen 2003) won the prize for his program TIPS, which provides information on a particular topic. TIPS provides ways to store, organize, and search the important parts of sentences collected and analysed during system tests.

However, there are sceptics who doubt the effectiveness of the Turing Test and/or the Loebner Competition: Block, who thought that "the Turing test is a sorely inadequate test of intelligence because it relies solely on the ability to fool people"; and Shieber (1994), who argued that intelligence is not determinable simply by surface behavior. Shieber claimed that the reason Turing chose natural language as the behavioral definition of human intelligence is "exactly its open-ended, free-wheeling nature", which was lost when the topic was restricted during the Loebner Prize. Epstein (1992) admitted that they had trouble with the topic restriction, and they agreed that "every fifth year or so … we would hold an open-ended test - one with no topic restriction." They decided that the winner of a restricted test would receive a small cash prize, while the winner of the unrestricted test would receive the full $100,000.

Loebner, in his responses to these arguments, believed that the unrestricted test is simpler, less expensive, and the best way to conduct the Turing Test. Loebner presented three goals when constructing the Loebner Prize (Loebner 1994):
• "No one was doing anything about the Turing Test, not AI." The initial Loebner Prize contest was the first time that the Turing Test had ever been formally tried.
• Increasing the public understanding of AI is a laudable goal of the Loebner Prize. "I believe that this contest will advance AI and
serve as a tool to measure the state of the art."
• Performing a social experiment.

The first open-ended implementation of the Turing Test was applied in the 1995 contest, and the prize was granted to Weintraub for the fourth time. Details of the other winners over the years can be found on the Loebner Prize webpage (http://www.loebner.net/Prizef/loebner-prize.html).

In this paper, we advocate alternative evaluation methods, more appropriate to practical information systems applications. We have investigated methods to train and adapt ALICE to a specific user's language use or application, via a user-supplied training corpus. Our evaluation takes account of open-ended trials by real users, rather than controlled 10-minute trials.

3 The ALICE/AIML chatbot architecture

AIML consists of data objects called AIML objects, which are made up of units called topics and categories. The topic is an optional top-level element; it has a name attribute and a set of categories related to that topic. Categories are the basic units of knowledge in AIML. Each category is a rule for matching an input and converting it to an output: it consists of a pattern, which is matched against the user input, and a template, which is used to generate the ALICE chatbot's answer. The format structure of AIML is shown in figure 1.

<aiml version="1.0">
<topic name="the topic">
<category>
<pattern>PATTERN</pattern>
<that>THAT</that>
<template>Template</template>
</category>
..
..
</topic>
</aiml>

The <that> tag is optional and means that the current pattern depends on a previous bot output.
Figure 1. AIML format

The AIML pattern is simple, consisting only of words, spaces, and the wildcard symbols _ and *. The words may consist of letters and numerals, but no other characters. Words are separated by a single space, and the wildcard characters function like words. The pattern language is case invariant. The idea of the pattern-matching technique is to find the best, longest pattern match. Three types of AIML categories are used. Atomic categories are those whose patterns contain no wildcard symbols _ or *. Default categories are those whose patterns contain the wildcard symbols * or _; the wildcards match any input but can differ in their alphabetical order. For example, given the input 'hello robot', if ALICE does not find a category with an exactly matching atomic pattern, it will try to find a category with a default pattern. The third type, recursive categories, are those whose templates contain <srai> and <sr> tags, which refer to simply recursive artificial intelligence and symbolic reduction. Recursive categories have many applications: symbolic reduction, which reduces complex grammatical forms to simpler ones; divide and conquer, which splits an input into two or more subparts and combines the responses to each; and dealing with synonyms, by mapping different ways of saying the same thing to the same reply.

The knowledge bases of almost all chatbots are edited manually, which restricts users to specific languages and domains. We developed a Java program to read text from a machine-readable corpus and convert it to AIML format. The chatbot-training program was built to be general; generality in this respect implies no restrictions on a specific language, domain, or structure. Different languages were tested: English, Arabic, Afrikaans, French, and Spanish. We also trained with a range of different corpus genres and structures, including dialogue, monologue, and the structured text found in the Qur'an and FAQ websites.

The chatbot-training program is composed of four phases as follows:
• A reading module, which reads the dialogue text from the basic corpus and inserts it into a list.
• A text preprocessing module, where all corpus and linguistic annotations such as overlapping, fillers, and others are filtered out.
• A converter module, where the preprocessed text is passed to the converter to consider the first turn as a pattern and the
second as a template. All punctuation is removed from the patterns, and the patterns are transformed to upper case.
• Producing the AIML files by copying the generated categories from the list to the AIML file.

An example of a sequence of two utterances from an English spoken corpus is:

<u who=F72PS002>
<s n="32"><w ITJ>Hello<c PUN>.
</u>
<u who=PS000>
<s n="33"><w ITJ>Hello <w NP0>Donald<c PUN>.
</u>

After the reading and text preprocessing phases, the text becomes:

F72PS002: Hello
PS000: Hello Donald

The corresponding AIML atomic category that is generated by the converter module looks like:

<category>
<pattern>HELLO</pattern>
<template>Hello Donald</template>
</category>

As a result, different prototypes were developed; in each prototype different machine-learning techniques were used and a new chatbot was tested. The machine-learning techniques ranged from a primitive simple technique like single-word matching to more complicated ones like matching the least frequent words. Building atomic categories and comparing the input with all atomic patterns to find a match is an instance-based learning technique. However, the learning approach does not stop at this level; it improves the matching process by using the most significant words (least frequent words). This increases the ability to find a nearest match by extending the knowledge base used during the matching process. Three prototypes will be discussed in this paper, as listed below:
• The KGA prototype, which is trained on a corpus of spoken Afrikaans. In this prototype two learning approaches were adopted: the first-word approach and the most-significant-word (least frequent word) approach;
• The Qur'an prototype, which is trained on the holy book of Islam (the Qur'an), where in addition to the first-word approach, two significant-word approaches (least frequent words) were used, and the system was adapted to deal with the Arabic language and the non-conversational nature of the Qur'an, as shown in section 5;
• The FAQchat prototype, which is used for the FAQ of the School of Computing at the University of Leeds. The same learning techniques were used, where the question represents the pattern and the answer represents the template. Instead of chatting for just 10 minutes as suggested by the Loebner Prize, we advocate alternative evaluation methods more attuned and appropriate to practical information systems applications. Our evaluation takes account of open-ended trials by real users, rather than artificial 10-minute trials, as illustrated in the following sections.

The aims of the different evaluation methodologies are as follows:
• Evaluate the success of the learning techniques in giving answers, based on dialogue efficiency, quality, and users' satisfaction, applied to the KGA.
• Evaluate the ability to use the chatbot as a tool to access an information source, and a useful application for this, which was applied to the Qur'an corpus.
• Evaluate the ability to use the chatbot as an information retrieval system by comparing it with a search engine, which was applied to FAQchat.

4 Evaluation of the KGA prototype

We developed two versions of ALICE that speak Afrikaans: Afrikaana, which speaks only Afrikaans, and AVRA, which speaks English and Afrikaans; this was inspired by our observation that the Korpus Gesproke Afrikaans actually includes some English, as Afrikaans speakers are generally bilingual and "code-switch" comfortably. We mounted prototypes of the chatbots on websites using the Pandorabot service (http://www.pandorabots.com/pandora), and encouraged
open-ended testing and feedback from remote users in South Africa; this allowed us to refine the system more effectively.

We adopted three evaluation metrics:
• Dialogue efficiency, in terms of matching type.
• Dialogue quality metrics, based on response type.
• Users' satisfaction assessment, based on an open-ended request for feedback.

4.1 Dialogue efficiency metric

We measured the efficiency of 4 sample dialogues in terms of atomic match, first-word match, most-significant match, and no match. We wanted to measure the efficiency of the adopted learning mechanisms to see if they increase the ability to find answers to general user input, as shown in table 1.

Matching Type      D1  D2  D3  D4
Atomic              1   3   6   3
First word          9  15  23   4
Most significant   13   2  19   9
No match            0   1   3   1
Number of turns    23  21  51  17
Table 1. Response type frequency

The frequency of each type in each dialogue generated between the user and the Afrikaans chatbot was calculated; in Figure 2, these absolute frequencies are normalised to relative probabilities. No significance test was applied; this approach to evaluation via dialogue efficiency metrics illustrates that the first-word and most-significant-word approaches increase the ability to generate answers to users and let the conversation continue.

4.2 Dialogue quality metric

In order to measure the quality of each response, we wanted to classify responses according to an independent human evaluation of "reasonableness": reasonable reply, weird but understandable, or nonsensical reply. We gave the transcript to an Afrikaans-speaking teacher and asked her to mark each response according to these classes. The number of turns in each dialogue and the frequencies of each response type were estimated. Figure 3 shows the frequencies normalised to relative probabilities of each of the three categories for each sample dialogue. For this evaluator, it seems that "nonsensical" responses are more likely than reasonable or understandable-but-weird answers.

4.3 Users' satisfaction

The first prototypes were based only on literal pattern matching against corpus utterances: we had not implemented the first-word and least-frequent-word approaches to add "wildcard" default categories. Our Afrikaans-speaking evaluators found these first prototypes disappointing and frustrating: it turned out that few of their attempts at conversation found exact matches in the training corpus, so Afrikaana replied with a default "ja" most of the time. However, expanding the AIML pattern matching using the first-word and least-frequent-word approaches yielded more favorable feedback. Our evaluators found the conversations less repetitive and more interesting. We measure user satisfaction based on this kind of informal user feedback.
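The normalisation behind Figures 2 and 3 is a simple count-to-probability conversion. A minimal sketch in Python (the dictionary literals restate the Table 1 counts; the function name is our own, not part of the original Java implementation):

```python
# Match-type counts per dialogue, restating Table 1.
counts = {
    "D1": {"Atomic": 1, "First word": 9, "Most significant": 13, "No match": 0},
    "D2": {"Atomic": 3, "First word": 15, "Most significant": 2, "No match": 1},
    "D3": {"Atomic": 6, "First word": 23, "Most significant": 19, "No match": 3},
    "D4": {"Atomic": 3, "First word": 4, "Most significant": 9, "No match": 1},
}

def relative_probabilities(dialogue):
    """Normalise one dialogue's absolute counts to relative probabilities."""
    turns = sum(dialogue.values())  # equals the "Number of turns" row of Table 1
    return {match_type: count / turns for match_type, count in dialogue.items()}

for name, dialogue in counts.items():
    print(name, relative_probabilities(dialogue))
```

Note that the four counts in each column of Table 1 sum exactly to the "Number of turns" row, so the normalised values for each dialogue sum to 1.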
[Figure 2. Matching types: relative probabilities of atomic, first-word, most-significant, and no-match responses in each of the four sample dialogues.]
[Figure 3. Response types: relative probabilities of reasonable, weird, and nonsensical responses in each of the four sample dialogues.]
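The layered matching strategy described above, an exact atomic match first and then a fallback keyed on the most significant (least frequent) word, can be sketched as follows. This is an illustrative Python sketch, not the actual Java training program; the three training pairs and all function names are invented for the example:

```python
import re
from collections import Counter

# Hypothetical two-turn training pairs: the first turn becomes the pattern
# (punctuation stripped, upper-cased, as in the converter module), and the
# second turn becomes the template.
pairs = [
    ("Hello.", "Hello Donald."),
    ("How are you?", "Fine, thanks."),
    ("Where do you pray?", "At the mosque."),
]

def normalise(text):
    """Strip punctuation and upper-case, as the converter does for patterns."""
    return re.sub(r"[^\w\s]", "", text).upper()

atomic = {normalise(pattern): template for pattern, template in pairs}

# Word frequencies over all patterns: the least frequent corpus word in an
# input is treated as its "most significant" word.
frequency = Counter(word for pattern in atomic for word in pattern.split())

def reply(utterance):
    """Try an exact atomic match, then fall back to the most significant word."""
    pattern = normalise(utterance)
    if pattern in atomic:                          # atomic category: exact match
        return atomic[pattern]
    known = [w for w in pattern.split() if w in frequency]
    if not known:                                  # no corpus word in the input
        return "I have no answer for that"
    significant = min(known, key=frequency.get)    # least frequent corpus word
    for corpus_pattern, template in atomic.items():
        if significant in corpus_pattern.split():  # default-category style match
            return template
    return "I have no answer for that"
```

For instance, reply("Pray?") finds no atomic match, picks PRAY as the least frequent known word, and returns the template paired with the pattern containing it.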
5 Evaluation of the Qur'an prototype

In this prototype a parallel English/Arabic corpus of the holy book of Islam was used. The aim of the Qur'an prototype is to explore the problems of using the Arabic language and of using a text which is not conversational in nature, like the Qur'an. The Qur'an is composed of 114 sooras (chapters), and each soora is composed of a different number of verses. The same learning techniques as in the KGA prototype were applied, where in this case if an input is a whole verse, the response will be the next verse of the same soora; and if an input is a question or a statement, the output will be all verses which seem appropriate based on the significant word. To measure the quality of the answers of the Qur'an chatbot version, the following approach was applied:
1. Random sentences from Islamic sites were selected and used as inputs to the English/Arabic version of the Qur'an prototype.
2. The resulting transcripts, which have 67 turns, were given to 5 Muslim and 6 non-Muslim students, who were asked to label each turn as:
• Related (R), in case the answer was correct and on the same topic as the input.
• Partially related (PR), in case the answer was not correct, but on the same topic.
• Not related (NR), in case the answer was not correct and on a different topic.
Proportions of each label and each class of users (Muslims and non-Muslims) were calculated as …
• The English translation of the Qur'an is not enough to judge whether a verse is related or not, especially given that non-Muslims do not have the background knowledge of the Qur'an.

Using chatting to access the Qur'an looks like the use of a standard Qur'an search tool. In fact it is totally different: a search tool usually matches words, not statements. For example, if the input is "How shall I pray?", using chatting the robot will give you all ayyas where the word "pray" is found, because it is the most significant word, whereas a search tool will not give you any match. If the input is just the word "pray", chatting will give you the same answer as before, while the search tool will provide all ayyas that have "pray" as a string or substring, so words such as "praying", "prayed", etc. will match.

Another important difference is that in the search tool there is a link between any word and the document it is in, but in the chatting system there is a link just for the most significant words, so if the input statement involves a significant word (or words), a match will be found; otherwise the chatbot's answer will be: "I have no answer for that".

6 Evaluation of the FAQchat prototype

… members of the staff and students. Users were asked to try using the system, and state whether they were able to find answers using the FAQchat …

[Figure. Answer types: proportions of answers found using FAQchat by staff, by students, and in total.]
• Users who preferred Google justified their preference for two reasons:
1. Prior familiarity with using Google.
2. FAQchat seemed harder to steer with carefully chosen keywords, but more often did well on the first try. This happens because FAQchat gives answers if the keyword matches a significant word. The same will occur if you reformulate the question and FAQchat matches the same word. However, Google may give different answers in this case.

To test the reliability of these results, the t-test was applied; the outcomes confirm the previous results.

7 Conclusion

The Loebner Prize Competition has been used to evaluate the ability of chatbots to fool people that they are speaking to humans. Comparing the dialogues generated from ALICE, which won the Loebner Prize, with real human dialogues shows that ALICE tries to use explicit dialogue-act linguistic expressions more than usual to reinforce the impression that users are speaking to a human. Our general conclusion is that we should NOT adopt an evaluation methodology just because a standard has been established, such as the Loebner Prize evaluation methodology adopted by most chatbot developers. Instead, evaluation should be adapted to the application and to user needs. If the chatbot is meant to be adapted to provide a specific service for users, then the best evaluation is based on whether it achieves that service or task.

References

Abu Shawar B. and Atwell E. 2003. Using dialogue corpora to retrain a chatbot system. In Proceedings of the Corpus Linguistics 2003 conference, Lancaster University, UK, pp. 681-690.

Batacharia B., Levy D., Catizone R., Krotov A. and Wilks Y. 1999. CONVERSE: a conversational companion. In Wilks, Y. (ed.), Machine Conversations. Kluwer, Boston/Dordrecht/London, pp. 205-215.

Colby K. 1999a. Comments on human-computer conversation. In Wilks, Y. (ed.), Machine Conversations. Kluwer, Boston/Dordrecht/London, pp. 5-8.

Colby K. 1999b. Human-computer conversation in a cognitive therapy program. In Wilks, Y. (ed.), Machine Conversations. Kluwer, Boston/Dordrecht/London, pp. 9-19.

Epstein R. 1992. Can Machines Think?. AI Magazine, Vol. 13, No. 2, pp. 80-95.

Garner R. 1994. The idea of RED. [Online], http://www.alma.gq.nu/docs/ideafred_garner.htm

Hirschman L. 1995. The roles of language processing in a spoken language interface. In Voice Communication Between Humans and Machines, D. Roe and J. Wilpon (Eds.), National Academy Press, Washington, DC, pp. 217-237.

Hutchens J. 1996. How to pass the Turing test by cheating. [Online], http://ciips.ee.uwa.edu.au/Papers/

Hutchens T. and Alder M. 1998. Introducing MegaHAL. [Online], http://cnts.uia.ac.be/conll98/pdf/271274hu.pdf

Loebner H. 1994. In response to lessons from a restricted Turing Test. [Online], http://www.loebner.net/Prizef/In-response.html

Maier E., Mast M. and LuperFoy S. 1996. Overview. In Elisabeth Maier, Marion Mast, and Susan LuperFoy (Eds.), Dialogue Processing in Spoken Language Systems, Springer, Berlin, pp. 1-13.

McTear M. 2002. Spoken dialogue technology: enabling the conversational user interface. ACM Computing Surveys, Vol. 34, No. 1, pp. 90-169.

Shieber S. 1994. Lessons from a restricted Turing Test. Communications of the Association for Computing Machinery, Vol. 37, No. 6, pp. 70-78.

Turing A. 1950. Computing machinery and intelligence. Mind, Vol. 59, No. 236, pp. 433-460.

Weizenbaum J. 1966. ELIZA - a computer program for the study of natural language communication between man and machine. Communications of the ACM, Vol. 9, No. 1, pp. 36-45.

Whalen T. 2003. My experience with the 1994 Loebner competition. [Online], http://hps.elte.hu/~gk/Loebner/story94.htm