Designing Language Technology Applications:

A Wizard of Oz Driven Prototyping Framework

S. Schlögl
MCI Management Center Innsbruck
Management, Communication & IT
Innsbruck, AUSTRIA

P. Milhorat∗, G. Chollet∗, J. Boudy†
Institut Mines-Télécom
∗Télécom ParisTech & †Télécom SudParis
Paris, FRANCE

Abstract

Wizard of Oz (WOZ) prototyping employs a human wizard to simulate anticipated functions of a future system. In Natural Language Processing this method is usually used to obtain early feedback on dialogue designs, to collect language corpora, or to explore interaction strategies. Yet, existing tools often require complex client-server configurations and setup routines, or suffer from compatibility problems with different platforms. Integrated solutions, which may also be used by designers and researchers without technical background, are missing. In this paper we present a framework for multi-lingual dialog research, which combines speech recognition and synthesis with WOZ. All components are open source and adaptable to different application scenarios.

1 Introduction

In recent years Language Technologies (LT) such as Automatic Speech Recognition (ASR), Machine Translation (MT) and Text-to-Speech Synthesis (TTS) have found their way into an increasing number of products and services. Technological advances in the field have created new possibilities, and ubiquitous access to modern technology (e.g. smartphones and tablet computers) has inspired novel solutions in multiple application areas. Still, the technology at hand is not perfect, and typically substantial engineering effort (gathering of corpora, training, tuning) is needed before prototypes involving such technologies can deliver a user experience robust enough to allow potential applications to be evaluated with real users. For graphical interfaces, well-known prototyping methods like sketching and wire-framing allow for obtaining early impressions and initial user feedback. These low-fidelity prototyping techniques do not, however, work well with speech and natural language. The Wizard of Oz (WOZ) method can be employed to address this shortcoming. By using a human ‘wizard’ to mimic the functionality of a system, either completely or in part, WOZ supports the evaluation of potential user experiences and interaction strategies without the need for building a fully functional product first (Gould et al., 1983). It furthermore supports the collection of domain-specific language corpora and the easy exploration of varying dialog designs (Wirén et al., 2007). WOZ tools, however, are often application dependent and built for very specific experimental setups. They are rarely re-used or adapted to other application scenarios. Also, when used in combination with existing technology components such as ASR or TTS, they usually require complex software installations and server-client configurations. Thus, we see a need for an easy ‘out-of-the-box’ solution: a tool that does not require great technical experience and may therefore be used by researchers and designers outside the typical NLP research and development community. This demo is the result of our recent efforts aimed at building such an integrated prototyping tool.

We present a fully installed and configured server image that offers multi-lingual (i.e. English, German, French, Italian) ASR and TTS integrated with a web-based WOZ platform. All components are open-source (i.e. adaptable and extendable) and connected via a messaging server and a number of Java programs. Starting the framework requires executing only a single script (there is a separate script for each language, so that the components are started with the right parameters), which launches a WOZ driven system environment. With such a pre-configured setup we believe that non-NLP experts are also able to successfully conduct extended user studies for language technology applications.

Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 85–88, Gothenburg, Sweden, April 26-30 2014. © 2014 Association for Computational Linguistics
2 Existing Comparable Tools

Following the literature, existing tools and frameworks that support prototyping of language technology applications can be separated into two categories. The first category consists of so-called Dialogue Management (DM) tools, which focus on the evaluation of Language Technologies (LTs) and whose primary application lies in the areas of NLP and machine learning. Two well-known examples are the CSLU toolkit (Sutton et al., 1998) and the Olympus dialogue framework (Bohus et al., 2007). Others include the Jaspis dialogue management system (Turunen and Hakulinen, 2000) and the EPFL dialogue platform (Cenek et al., 2005). DM tools explore the language-based interaction between a human and a machine and aim at improving this dialogue. They usually provide an application development interface that integrates different LTs such as ASR and TTS, which is then used by an experimenter to specify a pre-defined dialogue flow. Once the dialogue is designed, it can be tested with human participants. The main focus of these tools lies on testing and improving the quality of the employed technology components and their interplay. Unlike DM tools, representatives from the second category, hereinafter referred to as WOZ tools, tend to rely entirely on human simulation. This makes them more interesting for early feedback, as they better support the aspects of low-fidelity prototyping. While these applications often offer more flexibility, they rarely integrate actual working LTs. Instead, a human mimics the functions of the machine, which allows for a less restrictive dialogue design and facilitates the testing of user experiences that are not yet supported by existing technologies. Most WOZ tools, however, should be categorized as throwaway applications, i.e. they are built for one scenario and only rarely re-used in other settings. Two examples that allow for a more generic application are SUEDE (Klemmer et al., 2000) and Richard Breuer’s WOZ tool.

While both DM and WOZ tools incorporate useful features, neither type provides a full range of support for low-fidelity prototyping of LT applications. DM tools lack the flexibility of exploring aspects that are currently not supported by technology, and pure WOZ applications often depend too much on the actions of the wizard, which can lead to unrealistic human-like behaviour and inconsistencies, with a possible bias on evaluation results. A combination of both types of tools can offset their deficiencies and furthermore allow for supporting different stages of prototyping. That is, a wizard might complement existing technology on a continuum by first taking on the role of a ‘controller’ who simulates technology. Then, in a second stage, the wizard could act as a ‘monitor’ who approves technology output, before finally moving on to being a ‘supervisor’ who only overrides output in cases where it is needed (Dow et al., 2005). However, to allow for such variation an architecture is required that on the one hand supports a flexible use of technology components and on the other hand offers an interface for real-time human intervention.

3 Integrated Prototyping Framework

In order to offer a flexible and easy to use prototyping framework for language technology applications we have integrated a number of existing technology components using an Apache ActiveMQ messaging server and several Java programs. Our framework consists of the Julius Large Vocabulary Continuous Speech Recognition engine, an implementation of the Google Speech API, the WebWOZ Wizard of Oz Prototyping Platform and the MARY Text-to-Speech Synthesis Platform. All components are fully installed and connected, running on a VirtualBox server image (i.e. Ubuntu 12.04 LTS Linux Server). Using this configuration we offer a platform that supports real-time speech recognition as well as speech synthesis in English, French, German and Italian. Natural Language Understanding (NLU), Dialog Management (DM), and Natural Language Generation (NLG) are currently performed by the human ‘wizard’. Respective technology components may, however, be integrated in future versions of the framework. The following sections describe the different components in some more detail and elaborate on how they are connected.
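As a rough illustration of the per-language setup (one start script per language, so that the components are started with the right parameters), the four supported languages can be pictured as a small lookup from language to startup parameters. The sketch below is an assumption and not the framework's actual scripts: the enum FrameworkLanguage and its methods are hypothetical names introduced here, and only the MARY voice names are taken from Section 3.2.

```java
import java.util.Locale;

/**
 * Hypothetical lookup of per-language startup parameters.
 * Only the MARY voice names come from the paper (Section 3.2);
 * the enum and its methods are illustrative assumptions.
 */
enum FrameworkLanguage {
    ENGLISH("cmu-slt-hsmm"),
    GERMAN("dfki-pavoque-neutral"),
    FRENCH("enst-dennys-hsmm"),
    ITALIAN("istc-lucia-hsmm");

    private final String maryVoice;

    FrameworkLanguage(String maryVoice) {
        this.maryVoice = maryVoice;
    }

    /** MARY TTS voice installed for this language. */
    String maryVoice() {
        return maryVoice;
    }

    /** Parses a language name case-insensitively, e.g. "german". */
    static FrameworkLanguage fromName(String name) {
        return valueOf(name.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Pick the language the way a per-language start script might.
        FrameworkLanguage lang = fromName(args.length > 0 ? args[0] : "english");
        System.out.println(lang + " -> MARY voice " + lang.maryVoice());
    }
}
```

Keeping the parameters in one table means that adding a language would only require a new entry, plus the corresponding ASR resources and MARY voice.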

3.1 Automatic Speech Recognition

The Julius open-source Large Vocabulary Continuous Speech Recognition engine (LVCSR) uses n-grams and context-dependent Hidden Markov Models (HMM) to transform acoustic input into text output (Lee et al., 2008). Its recognition performance depends on the availability of language-dependent resources, i.e. acoustic models, language models, and language dictionaries. Our framework includes basic language resources for English, German, Italian and French. As those resources are still very limited we have also integrated online speech recognition for these four languages using the Google Speech API. This allows for conducting experiments with users while at the same time collecting the necessary data for augmenting and filling in Julius language resources.

3.2 Text-to-Speech Synthesis

MARY TTS is a state-of-the-art, open source speech synthesis platform supporting a variety of different languages and accents (Schröder and Trouvain, 2003). For the multi-lingual prototyping framework presented here we have installed synthesized voices for US English (cmu-slt-hsmm), Italian (istc-lucia-hsmm), German (dfki-pavoque-neutral) as well as French (enst-dennys-hsmm). Additional voices can be downloaded and added through the MARY component installer.

3.3 Wizard of Oz

WebWOZ is a web-based prototyping platform for WOZ experiments that allows for a flexible integration of existing LTs (Schlögl et al., 2010). It was implemented using modern web technologies (i.e. Java, HTML, CSS) and therefore runs in any current web browser. It usually uses web services to integrate a set of pre-configured LT components (i.e. ASR, MT, TTS). For the presented prototyping framework, however, we have integrated WebWOZ with our ASR solution (i.e. the combined Google/Julius engine) and MARY TTS. Consequently, ASR output is displayed in the top area of the wizard interface. A wizard is then able to select an appropriate response from a set of previously defined utterances or use a free-text field to compose a response on the fly. In both cases the utterance is sent to the MARY TTS server and spoken by the system.

3.4 Messaging Server and Gluing Programs

In order to achieve the integration of ASR, WOZ and TTS presented above we use an Apache ActiveMQ messaging server and a number of Java programs. One of these programs takes the output from our ASR component and inserts it into the WebWOZ input stream. In addition it publishes this output to a specific ASR ActiveMQ queue so that other components (e.g. potentially an NLU component) may also be able to process it. Once an ASR result is available within WebWOZ, it is up to the human wizard to respond. WebWOZ was slightly modified so that wizard responses are not only sent to the internal WebWOZ log, but also to a WIZARD ActiveMQ queue. A second Java program then takes the wizard responses from the WIZARD queue and pushes them to a separate MARY queue. While it may seem unnecessary to first take responses from one queue just to publish them to another queue, it allows for the easy integration of additional components. For example, we have also experimented with a distinct NLG component. Putting this component between the WIZARD and the MARY queue, we were able to conduct experiments where a wizard, instead of sending an entire text utterance, would send text-based semantic frames (i.e. a semantically unified representation of a user’s input). This shows the flexibility of the described queue architecture. Finally, we use a third Java program to take text published to the MARY queue (i.e. either coming directly from the wizard or produced by an NLG component, as in one of our experimental settings) and send it to the MARY TTS server. Figure 1 illustrates the different framework components and how they are connected to each other.

4 Demo Setup

The optimal setup for the demo uses two computer stations, one for a wizard and one for a test user. The stations need to be connected via a LAN. The test user station runs the prototyping framework, which is a fully installed and configured VirtualBox software image (Note: any computer capable of running VirtualBox can serve as a test user station). The wizard station only requires a modern web browser to interact with the test user station. A large screen (e.g. 17 inch) is recommended for the wizard, as it eases his/her task. Both stations will be provided by the authors.

Figure 1: Prototyping Framework Components.
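The queue flow described in Section 3.4 and illustrated in Figure 1 can be sketched in a few lines of Java. This is a minimal in-memory sketch under stated assumptions, not the authors' gluing programs: plain BlockingQueue objects stand in for the ASR, WIZARD and MARY ActiveMQ queues, no speech is recognized or synthesized, and the class QueuePipelineSketch, its method names and the toyNlg helper are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

/** In-memory sketch; BlockingQueues stand in for the ActiveMQ queues. */
final class QueuePipelineSketch {
    final BlockingQueue<String> asrQueue = new LinkedBlockingQueue<>();
    final BlockingQueue<String> wizardQueue = new LinkedBlockingQueue<>();
    final BlockingQueue<String> maryQueue = new LinkedBlockingQueue<>();
    /** Stand-in for the WebWOZ input stream shown to the wizard. */
    final List<String> webWozInput = new ArrayList<>();
    /** Optional step between WIZARD and MARY queue, e.g. an NLG component. */
    private final UnaryOperator<String> wizardToMaryStep;

    /** Direct path: wizard text is forwarded to the MARY queue unchanged. */
    QueuePipelineSketch() {
        this(UnaryOperator.identity());
    }

    QueuePipelineSketch(UnaryOperator<String> wizardToMaryStep) {
        this.wizardToMaryStep = wizardToMaryStep;
    }

    /** First program: show ASR output to the wizard and publish it to the ASR queue. */
    void onAsrResult(String recognizedText) {
        webWozInput.add(recognizedText);
        asrQueue.add(recognizedText);
    }

    /** Modified WebWOZ: a wizard response is also published to the WIZARD queue. */
    void onWizardResponse(String response) {
        wizardQueue.add(response);
    }

    /** Second program: move one message from the WIZARD queue to the MARY queue. */
    boolean forwardWizardToMary() {
        String message = wizardQueue.poll();
        if (message == null) {
            return false; // nothing to forward
        }
        maryQueue.add(wizardToMaryStep.apply(message));
        return true;
    }

    /** Third program: take the next text for TTS (returned here instead of synthesized). */
    String nextTextForTts() {
        return maryQueue.poll();
    }

    /** Toy NLG (assumption): maps one example semantic frame to a sentence. */
    static String toyNlg(String frame) {
        return "inform(weather=sunny)".equals(frame) ? "It is sunny today." : frame;
    }

    public static void main(String[] args) {
        QueuePipelineSketch pipeline = new QueuePipelineSketch(QueuePipelineSketch::toyNlg);
        pipeline.onAsrResult("what is the weather like");
        pipeline.onWizardResponse("inform(weather=sunny)"); // wizard sends a frame
        pipeline.forwardWizardToMary();
        System.out.println(pipeline.nextTextForTts());
    }
}
```

Because the forwarding step between the WIZARD and MARY queues is a pluggable function, the same flow covers both the direct path (identity) and the experimental setting with an NLG component in between, which mirrors the motivation given in the text for using two separate queues.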

5 Summary and Future Work

This demo presents an integrated prototyping framework for running WOZ driven language technology application scenarios. Gluing together existing tools for ASR, WOZ and TTS, we have created an easy to use environment for spoken dialog design and research. Future work will focus on adding further language technology components (e.g. NLU, DM, NLG) and on improving the currently limited ASR language resources.

Acknowledgments

The presented research is conducted as part of the vAssist project (AAL-2010-3-106), which is partially funded by the European Ambient Assisted Living Joint Programme and the National Funding Agencies from Austria, France and Italy.

References

D. Bohus, A. Raux, T. K. Harris, M. Eskenazi, and A. I. Rudnicky. 2007. Olympus: An open-source framework for conversational spoken language interface research. In Proc. of NAACL-HLT, pages 32–39.

P. Cenek, M. Melichar, and M. Rajman. 2005. A framework for rapid multimodal application design. In Proceedings of TSD, pages 393–403.

S. Dow, B. Macintyre, J. Lee, C. Oezbek, J. D. Bolter, and M. Gandy. 2005. Wizard of oz support throughout an iterative design process. IEEE Pervasive Computing, 4(4):18–26.

J. D. Gould, J. Conti, and T. Hovanyecz. 1983. Composing letters with a simulated listening typewriter. Communications of the ACM, 26(4):295–308.

S. R. Klemmer, A. K. Sinha, J. Chen, J. A. Landay, N. Aboobaker, and A. Wang. 2000. SUEDE: A wizard of oz prototyping tool for speech user interfaces. In Proc. of UIST, pages 1–10.

C. Lee, S. Jung, and G. G. Lee. 2008. Robust dialog management with n-best hypotheses using dialog examples and agenda. In Proc. of ACL-HLT, pages 630–637.

S. Schlögl, G. Doherty, N. Karamanis, and S. Luz. 2010. WebWOZ: a wizard of oz prototyping framework. In Proc. of the ACM EICS Symposium on Engineering Interactive Systems, pages 109–114.

M. Schröder and J. Trouvain. 2003. The German text-to-speech synthesis system MARY: A tool for research, development and teaching. International Journal of Speech Technology.

S. Sutton, R. Cole, J. de Villiers, J. Schalkwyk, P. Vermeulen, M. Macon, Y. Yan, E. Kaiser, B. Rundle, K. Shobaki, P. Hosom, A. Kain, J. Wouters, D. Massaro, and M. Cohen. 1998. Universal speech tools: The CSLU toolkit.

M. Turunen and J. Hakulinen. 2000. Jaspis - a framework for multilingual adaptive speech applications. In Proc. of ICSLP, pages 719–722.

M. Wirén, R. Eklund, F. Engberg, and J. Westermark. 2007. Experiences of an In-Service Wizard-of-Oz Data Collection for the Deployment of a Call-Routing Application. In Proc. of NAACL-HLT, pages 56–63.