Bridging the Gap: Academic and Industrial Research in Dialog Technologies Workshop Proceedings, pages 89–96,
NAACL-HLT, Rochester, NY, April 2007. © 2007 Association for Computational Linguistics
implemented by Wallace in 1995. ALICE knowledge about English conversation patterns is stored in AIML files. AIML, or Artificial Intelligence Mark-up Language, is a derivative of Extensible Mark-up Language (XML). It was developed by Wallace and the Alicebot free software community during 1995-2000 to enable people to input dialogue pattern knowledge into chatbots based on the A.L.I.C.E. open-source software technology.

In this paper we present other methods to evaluate chatbot systems. The ALICE chatbot system was used for this purpose: a Java program was developed to read from a corpus and convert the text to the AIML format. The Corpus of Spoken Afrikaans (Korpus Gesproke Afrikaans, KGA), the corpus of the holy book of Islam (the Qur'an), and the FAQ of the School of Computing at the University of Leeds (http://www.comp.leeds.ac.uk) were used to produce the KGA prototype, the Qur'an prototype, and the FAQchat prototype respectively.

Section 2 presents the Loebner Prize contest, and section 3 illustrates the ALICE/AIML architecture. The evaluation techniques of the KGA prototype, the Qur'an prototype, and the FAQchat prototype are discussed in sections 4, 5, and 6 respectively. The conclusion is presented in section 7.

2 The Loebner Prize Competition

The story began with the "imitation game" presented in Alan Turing's paper "Can Machines Think?" (Turing 1950). The imitation game has a human observer who tries to guess the sex of two players, one of whom is a man and the other a woman, while screened from telling which is which by voice or appearance. Turing suggested putting a machine in the place of one of the humans and essentially playing the same game. If the observer cannot tell which is the machine and which is the human, this can be taken as strong evidence that the machine can think.

Turing's proposal provided the inspiration for the Loebner Prize competition, which was an attempt to implement the Turing test. The first contest, organized by Dr. Robert Epstein, was held in 1991 in Boston's Computer Museum. In this incarnation the test was known as the Loebner contest, as Dr. Hugh Loebner pledged a $100,000 grand prize for the first computer program to pass the test. At the beginning it was decided to limit the topic, in order to limit the amount of language the contestant programs must be able to cope with, and to limit the tenor. Ten agents were used, of which 6 were computer programs. Ten judges would converse with the agents for fifteen minutes and rank the terminals in order from the apparently least human to the most human. The computer with the highest median rank wins that year's prize. Joseph Weintraub won the first, second, and third Loebner Prizes in 1991, 1992, and 1993 for his chatbots PC Therapist, PC Professor, which discusses men versus women, and PC Politician, which discusses Liberals versus Conservatives. In 1994 Thomas Whalen (Whalen 2003) won the prize for his program TIPS, which provides information on a particular topic. TIPS provides ways to store, organize, and search the important parts of sentences collected and analysed during system tests.

However, there are sceptics who doubt the effectiveness of the Turing Test and/or the Loebner Competition: Block, who thought that "the Turing test is a sorely inadequate test of intelligence because it relies solely on the ability to fool people"; and Shieber (1994), who argued that intelligence is not determinable simply by surface behavior. Shieber claimed that the reason Turing chose natural language as the behavioral definition of human intelligence is "exactly its open-ended, free-wheeling nature", which was lost when the topic was restricted during the Loebner Prize. Epstein (1992) admitted that they had trouble with the topic restriction, and they agreed that "every fifth year or so … we would hold an open-ended test - one with no topic restriction." They decided that the winner of a restricted test would receive a small cash prize, while the winner of the unrestricted test would receive the full $100,000.

Loebner, in his responses to these arguments, believed that the unrestricted test is simpler, less expensive, and the best way to conduct the Turing Test. Loebner presented three goals when constructing the Loebner Prize (Loebner 1994):
• "No one was doing anything about the Turing Test, not AI." The initial Loebner Prize contest was the first time that the Turing Test had ever been formally tried.
• Increasing the public understanding of AI is a laudable goal of the Loebner Prize. "I believe that this contest will advance AI and
serve as a tool to measure the state of the art."
• Performing a social experiment.

The first open-ended implementation of the Turing Test was applied in the 1995 contest, and the prize was granted to Weintraub for the fourth time. Details of the other winners over the years can be found on the Loebner Prize webpage (http://www.loebner.net/Prizef/loebner-prize.html).

In this paper, we advocate alternative evaluation methods, more appropriate to practical information systems applications. We have investigated methods to train and adapt ALICE to a specific user's language use or application, via a user-supplied training corpus. Our evaluation takes account of open-ended trials by real users, rather than controlled 10-minute trials.

3 The ALICE/AIML chatbot architecture

AIML consists of data objects called AIML objects, which are made up of units called topics and categories. The topic is an optional top-level element; it has a name attribute and a set of categories related to that topic. Categories are the basic units of knowledge in AIML. Each category is a rule for matching an input and converting it to an output: it consists of a pattern, which is matched against the user input, and a template, which is used to generate the ALICE chatbot's answer. The format structure of AIML is shown in figure 1.

<aiml version="1.0">
<topic name="the topic">
<category>
<pattern>PATTERN</pattern>
<that>THAT</that>
<template>Template</template>
</category>
..
..
</topic>
</aiml>

The <that> tag is optional and means that the current pattern depends on a previous bot output.
Figure 1. AIML format

The AIML pattern is simple, consisting only of words, spaces, and the wildcard symbols _ and *. The words may consist of letters and numerals, but no other characters. Words are separated by a single space, and the wildcard characters function like words. The pattern language is case invariant. The idea of the pattern-matching technique is to find the best, longest pattern match. Three types of AIML categories are used. Atomic categories are those whose patterns contain no wildcard symbols _ or *. Default categories are those whose patterns contain the wildcard symbols * or _; the wildcards match any input but can differ in their alphabetical order. For example, given the input 'hello robot', if ALICE does not find a category with an exactly matching atomic pattern, it will try to find a category with a default pattern. The third type, recursive categories, are those whose templates contain <srai> and <sr> tags, which refer to simply recursive artificial intelligence and symbolic reduction. Recursive categories have many applications: symbolic reduction, which reduces complex grammatical forms to simpler ones; divide and conquer, which splits an input into two or more subparts and combines the responses to each; and dealing with synonyms, by mapping different ways of saying the same thing to the same reply.

The knowledge bases of almost all chatbots are edited manually, which restricts users to specific languages and domains. We developed a Java program to read text from a machine-readable corpus and convert it to AIML format. The chatbot-training program was built to be general; generality in this respect implies no restrictions on a specific language, domain, or structure. Different languages were tested: English, Arabic, Afrikaans, French, and Spanish. We also trained with a range of different corpus genres and structures, including dialogue, monologue, and the structured text found in the Qur'an and FAQ websites.

The chatbot-training program is composed of four phases as follows:
• A reading module, which reads the dialogue text from the basic corpus and inserts it into a list.
• A text preprocessing module, where all corpus and linguistic annotations such as overlapping, fillers, and others are filtered out.
• A converter module, where the preprocessed text is passed to the converter to consider the first turn as a pattern and the
second as a template. All punctuation is removed from the patterns, and the patterns are transformed to upper case.
• Producing the AIML files by copying the generated categories from the list to the AIML file.

An example of a sequence of two utterances from an English spoken corpus is:

<u who=F72PS002>
<s n="32"><w ITJ>Hello<c PUN>.
</u>
<u who=PS000>
<s n="33"><w ITJ>Hello <w NP0>Donald<c PUN>.
</u>

After the reading and text preprocessing phases, the text becomes:

F72PS002: Hello
PS000: Hello Donald

The corresponding AIML atomic category that is generated by the converter module looks like:

<category>
<pattern>HELLO</pattern>
<template>Hello Donald</template>
</category>

As a result, different prototypes were developed; in each prototype different machine-learning techniques were used and a new chatbot was tested. The machine-learning techniques ranged from a primitive simple technique like single-word matching to more complicated ones like matching the least frequent words. Building atomic categories and comparing the input with all atomic patterns to find a match is an instance-based learning technique. However, the learning approach does not stop at this level; it improves the matching process by using the most significant words (least frequent words). This increases the ability to find a nearest match by extending the knowledge base used during the matching process. Three prototypes will be discussed in this paper, as listed below:
• The KGA prototype, which is trained on a corpus of spoken Afrikaans. In this prototype two learning approaches were adopted: the first-word approach and the most-significant-word (least frequent word) approach;
• The Qur'an prototype, which is trained on the holy book of Islam (the Qur'an), where in addition to the first-word approach, two significant-word approaches (least frequent words) were used, and the system was adapted to deal with the Arabic language and the non-conversational nature of the Qur'an, as shown in section 5;
• The FAQchat prototype, which is used for the FAQ of the School of Computing at the University of Leeds. The same learning techniques were used, where the question represents the pattern and the answer represents the template. Instead of chatting for just 10 minutes as suggested by the Loebner Prize, we advocate alternative evaluation methods more attuned and appropriate to practical information systems applications. Our evaluation takes account of open-ended trials by real users, rather than artificial 10-minute trials, as illustrated in the following sections.

The aims of the different evaluation methodologies are as follows:
• Evaluate the success of the learning techniques in giving answers, based on dialogue efficiency, quality, and users' satisfaction, applied to the KGA.
• Evaluate the ability to use the chatbot as a tool to access an information source, and a useful application for this, which was applied to the Qur'an corpus.
• Evaluate the ability to use the chatbot as an information retrieval system by comparing it with a search engine, which was applied to FAQchat.

4 Evaluation of the KGA prototype

We developed two versions of ALICE that speak Afrikaans: Afrikaana, which speaks only Afrikaans, and AVRA, which speaks English and Afrikaans; this was inspired by our observation that the Korpus Gesproke Afrikaans actually includes some English, as Afrikaans speakers are generally bilingual and "code-switch" comfortably. We mounted prototypes of the chatbots on websites using the Pandorabot service (http://www.pandorabots.com/pandora), and encouraged
open-ended testing and feedback from remote users in South Africa; this allowed us to refine the system more effectively.

We adopted three evaluation metrics:
• Dialogue efficiency, in terms of matching type.
• Dialogue quality metrics, based on response type.
• Users' satisfaction assessment, based on an open-ended request for feedback.

4.1 Dialogue efficiency metric

We measured the efficiency of 4 sample dialogues in terms of atomic match, first-word match, most-significant match, and no match. We wanted to measure the efficiency of the adopted learning mechanisms to see if they increase the ability to find answers to general user input, as shown in table 1.

Matching Type      D1  D2  D3  D4
Atomic              1   3   6   3
First word          9  15  23   4
Most significant   13   2  19   9
No match            0   1   3   1
Number of turns    23  21  51  17
Table 1. Response type frequency

The frequency of each type in each dialogue generated between the user and the Afrikaans chatbot was calculated; in Figure 2, these absolute frequencies are normalised to relative probabilities. No significance test was applied; this approach to evaluation via dialogue efficiency metrics illustrates that the first-word and most-significant-word approaches increase the ability to generate answers to users and let the conversation continue.

4.2 Dialogue quality metric

In order to measure the quality of each response, we wanted to classify responses according to an independent human evaluation of "reasonableness": reasonable reply, weird but understandable, or nonsensical reply. We gave the transcript to an Afrikaans-speaking teacher and asked her to mark each response according to these classes. The number of turns in each dialogue and the frequencies of each response type were estimated. Figure 3 shows the frequencies normalised to relative probabilities of each of the three categories for each sample dialogue. For this evaluator, it seems that "nonsensical" responses are more likely than reasonable or understandable-but-weird answers.

4.3 Users' satisfaction

The first prototypes were based only on literal pattern matching against corpus utterances: we had not implemented the first-word and least-frequent-word approaches to add "wildcard" default categories. Our Afrikaans-speaking evaluators found these first prototypes disappointing and frustrating: it turned out that few of their attempts at conversation found exact matches in the training corpus, so Afrikaana replied with a default "ja" most of the time. However, expanding the AIML pattern matching using the first-word and least-frequent-word approaches yielded more favorable feedback. Our evaluators found the conversations less repetitive and more interesting. We measure user satisfaction based on this kind of informal user feedback.
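The normalisation behind Figures 2 and 3 is a simple count-to-probability conversion. A minimal sketch in Python (the dictionary literals restate the Table 1 counts; the function name is our own, not part of the original Java implementation):

```python
# Match-type counts per dialogue, restating Table 1.
counts = {
    "D1": {"Atomic": 1, "First word": 9, "Most significant": 13, "No match": 0},
    "D2": {"Atomic": 3, "First word": 15, "Most significant": 2, "No match": 1},
    "D3": {"Atomic": 6, "First word": 23, "Most significant": 19, "No match": 3},
    "D4": {"Atomic": 3, "First word": 4, "Most significant": 9, "No match": 1},
}

def relative_probabilities(dialogue):
    """Normalise one dialogue's absolute counts to relative probabilities."""
    turns = sum(dialogue.values())  # equals the "Number of turns" row of Table 1
    return {match_type: count / turns for match_type, count in dialogue.items()}

for name, dialogue in counts.items():
    print(name, relative_probabilities(dialogue))
```

Note that the four counts in each column of Table 1 sum exactly to the "Number of turns" row, so the normalised values for each dialogue sum to 1.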
[Figure 2. Matching types: relative probabilities of atomic, first-word, most-significant, and no-match responses in each of the four sample dialogues.]
[Figure 3. Response types: relative probabilities of reasonable, weird, and nonsensical responses in each of the four sample dialogues.]
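The layered matching strategy described above, an exact atomic match first and then a fallback keyed on the most significant (least frequent) word, can be sketched as follows. This is an illustrative Python sketch, not the actual Java training program; the three training pairs and all function names are invented for the example:

```python
import re
from collections import Counter

# Hypothetical two-turn training pairs: the first turn becomes the pattern
# (punctuation stripped, upper-cased, as in the converter module), and the
# second turn becomes the template.
pairs = [
    ("Hello.", "Hello Donald."),
    ("How are you?", "Fine, thanks."),
    ("Where do you pray?", "At the mosque."),
]

def normalise(text):
    """Strip punctuation and upper-case, as the converter does for patterns."""
    return re.sub(r"[^\w\s]", "", text).upper()

atomic = {normalise(pattern): template for pattern, template in pairs}

# Word frequencies over all patterns: the least frequent corpus word in an
# input is treated as its "most significant" word.
frequency = Counter(word for pattern in atomic for word in pattern.split())

def reply(utterance):
    """Try an exact atomic match, then fall back to the most significant word."""
    pattern = normalise(utterance)
    if pattern in atomic:                          # atomic category: exact match
        return atomic[pattern]
    known = [w for w in pattern.split() if w in frequency]
    if not known:                                  # no corpus word in the input
        return "I have no answer for that"
    significant = min(known, key=frequency.get)    # least frequent corpus word
    for corpus_pattern, template in atomic.items():
        if significant in corpus_pattern.split():  # default-category style match
            return template
    return "I have no answer for that"
```

For instance, reply("Pray?") finds no atomic match, picks PRAY as the least frequent known word, and returns the template paired with the pattern containing it.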
5 Evaluation of the Qur'an prototype

In this prototype a parallel English/Arabic corpus of the holy book of Islam was used. The aim of the Qur'an prototype is to explore the problems of using the Arabic language and of using a text which is not conversational in nature, like the Qur'an. The Qur'an is composed of 114 sooras (chapters), and each soora is composed of a different number of verses. The same learning techniques as in the KGA prototype were applied, where in this case if an input is a whole verse, the response will be the next verse of the same soora; and if an input is a question or a statement, the output will be all verses which seem appropriate based on the significant word. To measure the quality of the answers of the Qur'an chatbot version, the following approach was applied:
1. Random sentences from Islamic sites were selected and used as inputs to the English/Arabic version of the Qur'an prototype.
2. The resulting transcripts, which have 67 turns, were given to 5 Muslim and 6 non-Muslim students, who were asked to label each turn as:
• Related (R), in case the answer was correct and on the same topic as the input.
• Partially related (PR), in case the answer was not correct, but on the same topic.
• Not related (NR), in case the answer was not correct and on a different topic.
Proportions of each label and each class of users (Muslims and non-Muslims) were calculated as …
• The English translation of the Qur'an is not enough to judge whether a verse is related or not, especially given that non-Muslims do not have the background knowledge of the Qur'an.

Using chatting to access the Qur'an looks like the use of a standard Qur'an search tool. In fact it is totally different: a search tool usually matches words, not statements. For example, if the input is "How shall I pray?", using chatting the robot will give you all ayyas where the word "pray" is found, because it is the most significant word, whereas a search tool will not give you any match. If the input is just the word "pray", chatting will give you the same answer as before, while the search tool will provide all ayyas that have "pray" as a string or substring, so words such as "praying", "prayed", etc. will match.

Another important difference is that in the search tool there is a link between any word and the document it is in, but in the chatting system there is a link just for the most significant words, so if the input statement involves a significant word (or words), a match will be found; otherwise the chatbot's answer will be: "I have no answer for that".

6 Evaluation of the FAQchat prototype

… members of the staff and students. Users were asked to try using the system, and state whether they were able to find answers using the FAQchat …

[Figure. Answer types: proportions of answers found using FAQchat by staff, by students, and in total.]
• Users who preferred Google justified their preference for two reasons:
1. Prior familiarity with using Google.
2. FAQchat seemed harder to steer with carefully chosen keywords, but more often did well on the first try. This happens because FAQchat gives answers if the keyword matches a significant word. The same will occur if you reformulate the question and FAQchat matches the same word. However, Google may give different answers in this case.

To test the reliability of these results, the t-test was applied; the outcomes confirm the previous results.

7 Conclusion

The Loebner Prize Competition has been used to evaluate the ability of chatbots to fool people that they are speaking to humans. Comparing the dialogues generated from ALICE, which won the Loebner Prize, with real human dialogues shows that ALICE tries to use explicit dialogue-act linguistic expressions more than usual to reinforce the impression that users are speaking to a human. Our general conclusion is that we should NOT adopt an evaluation methodology just because a standard has been established, such as the Loebner Prize evaluation methodology adopted by most chatbot developers. Instead, evaluation should be adapted to the application and to user needs. If the chatbot is meant to be adapted to provide a specific service for users, then the best evaluation is based on whether it achieves that service or task.

References

Abu Shawar B. and Atwell E. 2003. Using dialogue corpora to retrain a chatbot system. In Proceedings of the Corpus Linguistics 2003 conference, Lancaster University, UK, pp. 681-690.

Batacharia B., Levy D., Catizone R., Krotov A. and Wilks Y. 1999. CONVERSE: a conversational companion. In Wilks, Y. (ed.), Machine Conversations. Kluwer, Boston/Dordrecht/London, pp. 205-215.

Colby K. 1999a. Comments on human-computer conversation. In Wilks, Y. (ed.), Machine Conversations. Kluwer, Boston/Dordrecht/London, pp. 5-8.

Colby K. 1999b. Human-computer conversation in a cognitive therapy program. In Wilks, Y. (ed.), Machine Conversations. Kluwer, Boston/Dordrecht/London, pp. 9-19.

Epstein R. 1992. Can Machines Think?. AI Magazine, Vol. 13, No. 2, pp. 80-95.

Garner R. 1994. The idea of RED. [Online], http://www.alma.gq.nu/docs/ideafred_garner.htm

Hirschman L. 1995. The roles of language processing in a spoken language interface. In Voice Communication Between Humans and Machines, D. Roe and J. Wilpon (Eds.), National Academy Press, Washington, DC, pp. 217-237.

Hutchens J. 1996. How to pass the Turing test by cheating. [Online], http://ciips.ee.uwa.edu.au/Papers/

Hutchens T. and Alder M. 1998. Introducing MegaHAL. [Online], http://cnts.uia.ac.be/conll98/pdf/271274hu.pdf

Loebner H. 1994. In response to lessons from a restricted Turing Test. [Online], http://www.loebner.net/Prizef/In-response.html

Maier E., Mast M. and LuperFoy S. 1996. Overview. In Elisabeth Maier, Marion Mast, and Susan LuperFoy (Eds.), Dialogue Processing in Spoken Language Systems, Springer, Berlin, pp. 1-13.

McTear M. 2002. Spoken dialogue technology: enabling the conversational user interface. ACM Computing Surveys, Vol. 34, No. 1, pp. 90-169.

Shieber S. 1994. Lessons from a restricted Turing Test. Communications of the Association for Computing Machinery, Vol. 37, No. 6, pp. 70-78.

Turing A. 1950. Computing machinery and intelligence. Mind, Vol. 59, No. 236, pp. 433-460.

Weizenbaum J. 1966. ELIZA - a computer program for the study of natural language communication between man and machine. Communications of the ACM, Vol. 9, No. 1, pp. 36-45.

Whalen T. 2003. My experience with the 1994 Loebner competition. [Online], http://hps.elte.hu/~gk/Loebner/story94.htm