A Framework for Evaluating the Usability of Spoken Language Dialog Systems (SLDSs)

Wonkyu Park1, Sung H. Han1, Yong S. Park1, Jungchul Park1, and Huichul Yang2

1 Department of Industrial and Management Engineering, POSTECH, San 31, Hyoja, Pohang, 790-784, South Korea
{p09plus1,shan,drastle,mozart}@postech.ac.kr
2 Samsung Electronics, Seoul, South Korea
huicul.yang@samsung.com

Abstract. Usability evaluation is now considered an essential procedure in developing a spoken language dialogue system (SLDS). This paper proposes a systematic framework for evaluating the usability of SLDSs. The framework consists of what to evaluate and how to evaluate. What to evaluate includes components, evaluation criteria, and usability measures covering various aspects of SLDSs. With respect to how to evaluate, a procedure for developing scenarios and scenario-based evaluation methods are introduced. In addition, a case study, in which the usability of an SLDS was evaluated, was conducted to validate the proposed framework. The results of the case study successfully revealed the usability level, usability problems, and design implications for further development. The framework proposed in this study can be practically applied to the usability evaluation of SLDSs.

1 Introduction
During the last two decades or so, many studies have been conducted to improve the performance of spoken language dialogue systems (SLDSs). However, most studies focused on recognition performance, while only a few investigated human factors issues such as user models, linguistic behavior, and user satisfaction [1]. Human factors issues play an important role in an SLDS because enhanced usability can partially compensate for imperfect recognition accuracy of the system. When it comes to natural dialogues between users and an SLDS, the value of the SLDS clearly depends on its usability, which is critical to making the system commercially successful [2]. Usability evaluation in the development process is essential because it provides current usability levels and reveals potential usability problems. Although a variety of studies have conducted usability evaluations [3, 4, 5, 6, 7], only a few have proposed systematic evaluation frameworks or methodologies for SLDSs [1, 8]. Walker et al. developed PARADISE (Paradigm for Dialogue System Evaluation), a framework for evaluating SLDSs [8]. It provides a quantitative usability index (i.e. user satisfaction) considering task success and costs. Dybkjær and Bernsen developed an evaluation template for SLDSs that consisted of 10 entries such as 'what is being evaluated',
'system part evaluated', 'type of evaluation', 'symptoms to look for', etc. [1]. However, these studies are not easy for practitioners to apply to usability evaluation because they do not deal with specific data collection methods. This paper aims to propose an evaluation framework for SLDSs. The framework identifies usability measures for various aspects of SLDSs. It also proposes scenario-based methods to effectively evaluate usability in terms of both performance and satisfaction. In addition, a case study is conducted to validate the proposed framework. An SLDS providing the user with information about schedules, contacts, weather, etc. in a home environment was developed for the case study.

2 Usability Evaluation Framework for SLDSs
The usability evaluation framework proposed in this study consists of what to evaluate and how to evaluate. Fig. 1 depicts the details of the proposed framework.
[Figure 1 shows the framework as two linked parts. 'What to evaluate' covers the components (system, user, interaction), evaluation criteria derived through task analysis and process elaboration, and usability measures obtained by measure gathering, modification, and selection. 'How to evaluate' covers scenario generation using the five-step procedure and the development of evaluation methods: pre-determined dialogues, realistic dialogues only, and realistic dialogues after pre-determined dialogues.]

Fig. 1. A framework for evaluating the usability of SLDSs

What to evaluate includes components, evaluation criteria, and usability measures. The framework has three components, i.e. user, system, and the interaction between the two. Evaluation criteria are the functions and characteristics of each component that affect the usability of an SLDS; usability experts identify them using a task analysis technique. The usability of a system can be quantitatively measured by employing performance and satisfaction measures [9]. Relevant measures were surveyed from the existing literature, and from these, usability experts selected the ones appropriate to each criterion by considering ease of measurement and relevance to usability.

How to evaluate introduces the procedure to create scenarios for SLDS evaluation and provides two methods for collecting usability measures. The framework proposes scenario-based evaluation by real users, which enables researchers to find usability problems that originate from mismatches between what the user needs and what the system provides [10, 11].

2.1 'What to Evaluate'

The framework considers components, evaluation criteria, and usability measures with respect to what to evaluate. Components in this study are classified into user, system,
and interaction. Various aspects of SLDSs can be evaluated by considering these three components, whereas previous studies mainly evaluated the system only [1]. Criteria are developed to evaluate each component. The criteria are made by elaborating the processes shown in a modified job process chart. A job process chart, reported by [12], is a specific type of partitioned operational sequence diagram. With the modified job process chart, practitioners are able to identify system-user interaction processes and the information transmitted between them. An example of the chart is shown in Fig. 2, from which the evaluation criteria for the case study were derived. For example, the evaluation criterion 'recognition performance' is elaborated from 'recognize input'. Another example is 'user behavior', which comes from 'construct utterance' when the system fails to provide the information that the user requests.
[Figure 2 is a three-column chart (System, Interaction, User). The user constructs an utterance and speaks into a microphone; the input utterance is passed to the system, which recognizes the input and tests output relevance. If the output is not relevant, the system displays a feedback message, which the user reads before constructing a new utterance; otherwise the system generates a response (an error, additional information, or the requested response) as an output message. The user reads the output message and judges whether the information is adequate, ending the task or continuing the dialogue.]

Fig. 2. A modified job process chart of an SLDS used in the case study
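To make the elaboration step more concrete, the following is a minimal sketch (not an artifact from the original study) of how a chart like Fig. 2 could be represented as a simple data structure and scanned for candidate evaluation criteria. The two process-to-criterion mappings follow the examples given in the text; the process labels are taken from the figure, and everything else (names and the third mapping) is an illustrative assumption.

    # Illustrative sketch: a modified job process chart as a simple data structure.
    # Process labels follow Fig. 2; the criteria mapping is an assumption based on
    # the two examples given in the text, not an exhaustive list from the paper.

    job_process_chart = {
        "System": ["recognize input", "test output relevance", "generate response",
                   "display feedback"],
        "Interaction": ["input utterance", "feedback message", "output message"],
        "User": ["construct utterance", "speak at a microphone",
                 "read feedback message", "read output message"],
    }

    # Candidate evaluation criteria elaborated from individual processes.
    candidate_criteria = {
        "recognize input": "recognition performance",    # example from the text
        "construct utterance": "user behavior",          # example from the text
        "generate response": "dialogue model / system output",  # assumed mapping
    }

    for component, processes in job_process_chart.items():
        for process in processes:
            criterion = candidate_criteria.get(process)
            if criterion:
                print(f"{component}: '{process}' -> criterion '{criterion}'")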

A variety of usability measures were collected from previous studies [2, 3, 4, 5, 6, 7]. Some measures (e.g. number of barge-ins and number of SLDS help requests) were SLDS-specific, while others (e.g. task completion time and number of errors) could be used for general usability evaluations. The latter measures may need to be modified to fit SLDSs. For example, the number of errors is modified into the number of unrecognized words/utterances and the number of utterance construction errors. Usability measures appropriate to each criterion were selected based on ease of measurement and relevance to usability. For example, word recognition rate was selected to evaluate the
‘recognition performance’. Table 1 shows components, evaluation criteria and usability measures developed for the case study.
Table 1. Evaluation criteria and measures of each component for the case study

Components   Criteria                  Measures
System       Recognition performance   Sentence recognition rate
                                       Word recognition rate
                                       Recognition error frequency
             Dialogue model            Adequacy of reasoning function
                                       Utterance construction error (frequency and types)
             System output             Correct response rate
                                       User satisfaction on system response
Interaction  Task                      Task completion time
                                       Frequency of failed tasks
             User satisfaction         Overall user satisfaction on the SLDS
User         User behavior             Users' response pattern to various system errors
                                       Patterns of utterance construction
             Learning                  Utterance variation
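As an illustration of how the recognition-performance measures in Table 1 might be computed from experiment logs, here is a minimal sketch assuming a hypothetical log of (reference utterance, recognized transcript) pairs; the paper does not prescribe a log format or scoring code, and a production evaluation would typically use edit-distance-based word accuracy rather than the crude positional match shown here.

    # Minimal sketch (assumed log format): compute sentence and word recognition
    # rates from paired (reference, recognized) transcripts.

    trials = [
        ("any e-mail from mom this afternoon", "any e-mail from mom this afternoon"),
        ("contents of the e-mail", "continents of the e-mail"),  # one word misrecognized
    ]

    def sentence_recognition_rate(trials):
        """Fraction of utterances recognized exactly (case-insensitive)."""
        correct = sum(ref.lower() == hyp.lower() for ref, hyp in trials)
        return correct / len(trials)

    def word_recognition_rate(trials):
        """Fraction of reference words matched at the same position in the
        recognized transcript (a crude positional match, not full word accuracy)."""
        total, matched = 0, 0
        for ref, hyp in trials:
            ref_words, hyp_words = ref.lower().split(), hyp.lower().split()
            total += len(ref_words)
            matched += sum(r == h for r, h in zip(ref_words, hyp_words))
        return matched / total

    print(f"Sentence recognition rate: {sentence_recognition_rate(trials):.1%}")
    print(f"Word recognition rate: {word_recognition_rate(trials):.1%}")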

2.2 'How to Evaluate'

A scenario-based method can be used to evaluate an SLDS in a realistic situation [13]. A variety of situations should be considered in evaluation scenarios. However, few studies, except for [14], have systematically developed scenarios reflecting various situations. Park et al. proposed a scenario development procedure that consists of five steps [14]: 1) identifying the functions and information that the system can provide, 2) analyzing sentence structures appropriate to the system functions and information, 3) analyzing proper words for the sentence structures, 4) creating scenario structures by mapping words into sentence structures, and 5) developing detailed scenarios. This study uses this procedure when creating scenarios.

The framework proposes two scenario-based evaluation methods. The first one uses pre-determined dialogues. It provides utterances that the system can handle, so the system can always come up with an answer unless it fails to recognize the speech pronounced by the user. This method is mainly appropriate for measuring the recognition performance of an SLDS. Pre-determined dialogues are developed through the entire five-step procedure explained above. The second method uses realistic dialogues to evaluate an SLDS's overall usability in realistic situations. Given a situation and the information to be queried, the user asks the system using his/her own expressions. In addition to the recognition performance measured by the first method, the discrepancy between the dialogue model
hypothesized by the developer and the user's actual utterance pattern can be analyzed. Realistic dialogues can be developed using only step 1 of the procedure described above. Table 2 shows examples of the two types of dialogues for the same task. The effect of previous experience with the pre-determined dialogues on the user's utterance pattern is also investigated by comparing two user groups: one group of users who conduct the realistic dialogues only, and another group who conduct the realistic dialogues after experiencing the pre-determined ones.
Table 2. Examples of two dialogue types

Pre-determined dialogues (performed through two transactions):
  1: Any e-mail from mom this afternoon?
  2: Contents of the e-mail?

Realistic dialogues:
  I heard mom sent me an e-mail this afternoon. So I would like to know contents of the e-mail.
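The sketch below illustrates steps 4 and 5 of the procedure above, i.e. mapping words into sentence structures and expanding them into detailed scenario sentences of the pre-determined kind shown in Table 2. The templates and word lists are illustrative assumptions, not the actual scenario set used in [14] or in the case study.

    # Illustrative sketch of steps 4-5: create scenario structures by mapping
    # words into sentence templates, then expand them into detailed scenarios.
    from itertools import product

    # Assumed outputs of steps 2-3: sentence structures and proper word lists.
    sentence_templates = [
        "Any {item} from {person} this {time}?",
        "Contents of the {item}?",
    ]
    word_slots = {
        "item": ["e-mail", "voice message"],
        "person": ["mom", "Dr. Kim"],
        "time": ["morning", "afternoon"],
    }

    def expand(template, slots):
        """Yield every detailed scenario sentence obtainable from one template."""
        names = [n for n in slots if "{" + n + "}" in template]
        for values in product(*(slots[n] for n in names)):
            yield template.format(**dict(zip(names, values)))

    for template in sentence_templates:
        for sentence in expand(template, word_slots):
            print(sentence)

In practice the word slots would come from the sentence-structure and word analyses of steps 2 and 3, so the generated set grows combinatorially and is usually pruned to a manageable number of evaluation scenarios.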

3 Validation of the Proposed Framework
A total of 84 subjects who speak Korean participated in the case study. The participants were randomly assigned to one of three different experiments: 60 subjects conducted pre-determined dialogues (Experiment 1), 12 conducted realistic dialogues (Experiment 2), and the other 12 conducted realistic dialogues after the pre-determined dialogues (Experiment 3). A larger number of participants was assigned to Experiment 1 because the SLDS was in the early stage of the development process, in which the developers needed to focus on recognition performance. A total of 24 scenarios were developed for the experiments: twelve were pre-determined dialogues, while the other twelve were realistic dialogues. Evaluation criteria and usability measures for the case study were developed according to the proposed framework (see Section 2.1) and are shown in Table 1.

The evaluation results reveal design problems, usability levels, and valuable design implications for the SLDS. This paper describes sentence recognition rates and correct response rates only. The average values of these measures for the three experiments are depicted in Fig. 3. Based on the results of the usability evaluation, design implications for further development were made. Firstly, the recognition algorithm needs improvement to effectively process users' utterances. The sentence recognition rate of 50% might be too low for a commercial SLDS. When significant improvement is difficult to achieve, introducing auxiliary input devices such as a keyboard and mouse would be a good support for better usability. Secondly, help documents or training programs should be provided with the SLDS. Experiment 3, which included short training before the main experiment, showed better performance than Experiment 2 in both measures. This implies that system help may make it easier for users to use the SLDS. Information describing what to do and how to interact with the system should be provided for better usability.


Finally, the developers should improve the reasoning function that enables the system to identify the user's intention from what it has recognized. This is important when, as in this case, the system's recognition performance is poor. Note that the correct response rates are higher than the sentence recognition rates in all three experiments. The reasoning function can be improved by refining the dialogue model based on the utterance patterns that users employ in their daily lives.
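To illustrate why the correct response rate can exceed the sentence recognition rate, the following is a minimal sketch, under assumed data, of a keyword-based reasoning step that recovers the intended query from a partially misrecognized utterance; the actual reasoning function of the case-study SLDS is not described at this level of detail in the paper.

    # Minimal sketch (assumed): even if the full sentence is misrecognized,
    # a keyword-based reasoning step may still infer the intended query,
    # so the correct response rate can exceed the sentence recognition rate.

    INTENT_KEYWORDS = {
        "check_new_email": {"e-mail", "mom"},
        "read_email_contents": {"contents", "e-mail"},
    }

    def infer_intent(recognized_words):
        """Return the intent whose keywords overlap most with the recognized words."""
        words = set(recognized_words)
        best = max(INTENT_KEYWORDS, key=lambda i: len(INTENT_KEYWORDS[i] & words))
        return best if INTENT_KEYWORDS[best] & words else None

    # 'afternoon' is misrecognized, but the keywords still identify the intent.
    recognized = ["any", "e-mail", "from", "mom", "this", "after", "moon"]
    print(infer_intent(recognized))  # -> check_new_email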
[Figure 3 is a bar chart of the sentence recognition rate and the correct response rate (in percent) for Experiments 1-3; the plotted values range from 50 to 85.9%.]

Fig. 3. Sentence recognition rates and correct response rates for the three experiments

4 Conclusion
A usability evaluation framework for SLDSs was proposed. It addresses both what to evaluate and how to evaluate. Usability measures are systematically defined to evaluate SLDSs, and evaluation criteria that could affect the usability of SLDSs are identified from a modified job process chart. In addition, this study proposes two types of scenario-based evaluation methods, each of which can be used for a different purpose. In a case study, an SLDS was evaluated using the proposed framework. The case study revealed the usability level, usability problems, and design implications for better usability. The framework described in this study can be practically applied to evaluating the usability of SLDSs.

References
1. Dybkjær, L., Bernsen, N.O.: Usability Issues in Spoken Language Dialogue Systems. Natural Language Processing 6, 243–272 (2000)
2. Kwahk, J.: A methodology for evaluating the usability of audiovisual consumer electronic products. Unpublished Ph.D. dissertation, Pohang University of Science and Technology, Pohang, South Korea (1999)
3. Danieli, M., Gerbino, E.: Metrics for evaluating dialogue strategies in a spoken language system. In: The 1995 AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pp. 34–39 (1995)
4. Dybkjær, L., Bernsen, N.O., Dybkjær, H.: Evaluation of spoken dialogues: user test with a simulated speech recogniser. CPK – Center for PersonKommunikation, Aalborg University, 9a & 9b (1996)
5. Litman, D.J., Pan, S.: Designing and Evaluating an Adaptive Spoken Dialogue System. User Modeling and User-Adapted Interaction 12, 111–137 (2002)
6. Polifroni, J., Hirschman, L., Seneff, S., Zue, V.: Experiments in evaluating interactive spoken language systems. In: The DARPA Speech and Natural Language Workshop, pp. 28–33 (1992)
7. Simpson, A., Fraser, N.A.: Black Box and Glass Box Evaluation of the SUNDIAL System. In: EUROSPEECH: European Conference on Speech Processing, Berlin, pp. 1423–1426 (1993)
8. Walker, M.A., Litman, D.J., Kamm, C.A., Abella, A.: PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In: The 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), Madrid, Spain, pp. 271–280 (1997)
9. Han, S.H., Yun, M.H., Kwahk, J., Hong, S.W.: Usability of consumer electronic products. International Journal of Industrial Ergonomics 28, 143–151 (2001)
10. Dybkjær, L., Bernsen, N.O.: Usability evaluation in spoken language dialogue systems. In: Proceedings of the Workshop on Evaluation for Language and Dialogue Systems, Toulouse, France (2001)
11. Park, Y.S., Han, S.H., Yang, H., Park, W.: Usability evaluation of conversational interface using scenario-based approach. In: The 2005 ESK Spring Conference (2005)
12. Tanish, M.A.: Job process charts and man-computer interaction within naval command systems. Ergonomics 28, 555–565 (1985)
13. Dybkjær, L., Bernsen, N.O., Dybkjær, H.: Scenario design for spoken language dialogue systems development. In: The ESCA Workshop on Spoken Dialogue Systems, pp. 93–96 (1995)
14. Park, W., Han, S.H., Yang, H., Park, Y.S., Cho, Y.: A methodology of analyzing user input scenarios for a conversational interface. In: The 2005 ESK Spring Conference (2005)
