Getting started with VoiceXML 2.0
VoiceXML Overview

What is VoiceXML? It is an XML language for writing Web pages that you
interact with by listening to spoken prompts and jingles, and control by means of
spoken input. VoiceXML brings the Web to telephones. If you want to get a
hands-on feel for what this is like, a growing number of voice
portals let you phone in and try it out for yourself. Several sites also
offer free hosting for VoiceXML. Some pointers to these sites can be found in
the FAQ on the overview page.

VoiceXML isn't HTML. HTML was designed for visual Web pages and lacks the
control over the user-application interaction that is needed for a speech-based
interface. With speech you can only hear one thing at a time (kind of like looking
at a newspaper with a times 10 magnifying glass). VoiceXML has been carefully
designed to give authors full control over the spoken dialog between the user
and the application. The application and user take it in turns to speak: the
application prompts the user, and the user in turn responds.

VoiceXML documents describe:


• spoken prompts (synthetic speech)
• output of audio files and streams
• recognition of spoken words and phrases
• recognition of touch tone (DTMF) key presses
• recording of spoken input
• control of dialog flow
• telephony control (call transfer and hangup)
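
As a rough illustration of the recording and telephony items in this list, the
sketch below records a short message, plays it back and hangs up. It assumes
the record, audio and disconnect elements as defined in VoiceXML 2.0. The form
id, the variable name and the prompt wording are made up for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <!-- "voicemail" and "message" are hypothetical names -->
  <form id="voicemail">
    <!-- Record at most 10 seconds; any key press ends the recording -->
    <record name="message" beep="true" maxtime="10s" dtmfterm="true">
      <prompt>Please leave a message after the beep.</prompt>
    </record>
    <block>
      <!-- Play the recording back, then end the call -->
      <prompt>You said: <audio expr="message"/></prompt>
      <disconnect/>
    </block>
  </form>
</vxml>
```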

VoiceXML makes it easy to rapidly create new applications and shields
developers from low-level implementation details. It separates user
interaction from service logic. The W3C VoiceXML 2.0 specification is the
definitive reference to VoiceXML. You can also find other related work on
W3C's voice browser overview page.

Key Concepts
A session begins when the user starts to interact with a VoiceXML interpreter
and continues as VoiceXML documents are loaded and unloaded. The session
ends when requested by the user, VoiceXML document or interpreter context.
The platform defines the default session behavior, although this can be
overridden in part by VoiceXML.

VoiceXML documents define applications as a set of named dialog states. The
user is always in one dialog state at any time. Each dialog specifies the next
dialog to transition to using a URL.
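
A minimal sketch of two named dialog states and a transition between them. It
assumes the goto element, whose next attribute takes a URL. A bare fragment
such as "#farewell" refers to a dialog in the same document. The dialog names
are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <!-- First dialog state; "greeting" is a hypothetical name -->
  <form id="greeting">
    <block>
      <prompt>Hello.</prompt>
      <!-- Transition to the dialog named "farewell" in this document -->
      <goto next="#farewell"/>
    </block>
  </form>

  <!-- Second dialog state -->
  <form id="farewell">
    <block>
      <prompt>Goodbye.</prompt>
      <exit/>
    </block>
  </form>
</vxml>
```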

VoiceXML has two kinds of dialogs: forms and menus. A menu presents the user
with a choice of options and then transitions to another dialog state based
upon the user's selection. A form defines an interaction that collects values
for each of the fields in the form. Each field may specify a prompt, the
expected input, and evaluation rules. The form can be submitted to a server in
much the same way as an HTML form.

An application is a set of VoiceXML documents that share the same application
root document. The root document is automatically loaded whenever one of the
application documents is loaded, and remains loaded until there is a transition
to a different application, or until the call is disconnected. The root
document information is available to all documents in the same application.

Each dialog state has one or more grammars associated with it, which describe
the expected user input, either spoken input or touch-tone (DTMF) key
presses. In the simplest case, only the dialog's own grammars are active in that
dialog. In more complex cases, the active grammars can include:
• grammars defined within the dialog itself
• external grammars referenced by links
• grammars defined at the document level and marked as being globally
active
• grammars defined in the root application document and active throughout
the application
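
The sketch below shows one way a grammar can stay active outside a single
dialog: a link element placed at the document level. It assumes inline grammars
in the XML form of the W3C Speech Recognition Grammar Specification 1.0. The
URL and the rule names are placeholders, not part of the original tutorial.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <!-- Document-level link: its grammars are active in every dialog below.
       The target URL is hypothetical. -->
  <link next="http://www.example.com/help.vxml">
    <grammar mode="voice" version="1.0" root="helpcmd">
      <rule id="helpcmd" scope="public">
        <one-of>
          <item>help</item>
          <item>operator</item>
        </one-of>
      </rule>
    </grammar>
    <!-- Touch-tone alternative: pressing 0 follows the same link -->
    <grammar mode="dtmf" version="1.0" root="helpkey">
      <rule id="helpkey" scope="public"> 0 </rule>
    </grammar>
  </link>

  <form id="main">
    <!-- The field's own built-in grammar is active only in this dialog -->
    <field name="answer" type="boolean">
      <prompt>Do you want to continue?</prompt>
    </field>
  </form>
</vxml>
```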

A subdialog is like a function call: it allows you to call out to a new dialog and
then return to the original dialog, retaining the local state information for that
dialog. Subdialogs can be used to handle confirmations and to create a library
of reusable dialogs for common tasks.

VoiceXML allows you to define named variables for holding data. These can be
defined at any level and their scope follows an inheritance model. You can test
the values of variables to determine which dialog state to transition to next.
Variable expressions can also be used for conditional prompts, grammars and
so on.
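
Here is a small sketch of a document-level variable that is tested to choose
the next dialog state. The PIN value, the limit of three attempts and all names
are assumptions made for this example. It relies on the built-in digits field
type with a length parameter.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <!-- Document-scope variable, visible in every dialog of this document -->
  <var name="attempts" expr="0"/>

  <form id="login">
    <field name="pin" type="digits?length=4">
      <prompt>Please enter your four digit PIN.</prompt>
      <filled>
        <assign name="attempts" expr="attempts + 1"/>
        <!-- '1234' is a hard-coded placeholder, for illustration only -->
        <if cond="pin == '1234'">
          <goto next="#welcome"/>
        <elseif cond="attempts &gt;= 3"/>
          <goto next="#locked"/>
        <else/>
          <prompt>That PIN is not correct.</prompt>
          <clear namelist="pin"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="welcome">
    <block><prompt>Welcome back.</prompt></block>
  </form>

  <form id="locked">
    <block>
      <prompt>Too many attempts. Goodbye.</prompt>
      <exit/>
    </block>
  </form>
</vxml>
```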

Events are thrown when the user fails to respond to a prompt, or when the input
can't be understood. VoiceXML allows you to write handlers for catching events.
These follow an inheritance model, and events can be caught at a higher level if
there is no corresponding handler at the dialog level.
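
The handler inheritance described above might look like the following sketch.
The field-level noinput handler is used for silence in that field, while the
document-level catch handles nomatch there, because the field defines no
nomatch handler of its own. The names and prompt wording are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <!-- Document-level handler: used when a dialog has no handler of its own -->
  <catch event="noinput nomatch">
    <prompt>Sorry, I need an answer to continue.</prompt>
    <reprompt/>
  </catch>

  <form id="survey">
    <field name="rating" type="digits?length=1">
      <prompt>Please rate our service from 1 to 5.</prompt>
      <!-- Field-level handler: takes precedence for noinput in this field -->
      <noinput>
        <prompt>I did not hear a rating.</prompt>
        <reprompt/>
      </noinput>
    </field>
    <block>
      <prompt>Thank you.</prompt>
    </block>
  </form>
</vxml>
```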

VoiceXML allows you to use scripting (ECMAScript) when you need additional
control over the application.

VoiceXML employs a form-filling metaphor. You can define a complex grammar for
collecting the values of several fields in a single response. Any unfilled
fields can be handled by special subdialogs defined inline within each dialog.
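
As a sketch of the scripting support mentioned above, the following document
defines an ECMAScript function in a script element and calls it from a prompt.
The function name and the price of 49 per traveller are invented for this
example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <script>
    <![CDATA[
      // Hypothetical helper: total fare at an assumed 49 per traveller
      function fare(travellers) {
        return travellers * 49;
      }
    ]]>
  </script>

  <form id="quote">
    <field name="travellers" type="number">
      <prompt>How many are travelling?</prompt>
    </field>
    <block>
      <!-- The expression is evaluated when the prompt is queued -->
      <prompt>The total fare is <value expr="fare(travellers)"/> pounds.</prompt>
    </block>
  </form>
</vxml>
```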

VoiceXML Examples
Here is a very simple VoiceXML application. It says "Welcome to Travel
Planner!", plays a short audio advertising jingle and then exits:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <form>
    <block>
      <prompt bargein="false">Welcome to Travel Planner!
        <audio src="http://www.adline.com/mobile?code=12s4"/>
      </prompt>
    </block>
  </form>
</vxml>

The following example offers a menu of three choices: sports, weather or news.

<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu>
    <prompt>
      Say one of: <enumerate/>
    </prompt>
    <choice next="http://www.sports.example/start.vxml">
      Sports
    </choice>
    <choice next="http://www.weather.example/intro.vxml">
      Weather
    </choice>
    <choice next="http://www.news.example/news.vxml">
      News
    </choice>
    <noinput>Please say one of <enumerate/></noinput>
  </menu>
</vxml>

This dialog might proceed as follows:

Computer: Say one of: Sports; Weather; News.
Human: Astrology
Computer: I did not understand what you said. (a platform-specific default message.)
Computer: Say one of: Sports; Weather; News.
Human: Sports
Computer: (proceeds to http://www.sports.example/start.vxml)

Here is another example, this time using a form to ask the user to choose a city
and the number of travellers. Once this information has been collected, it is
submitted to a web server:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <form>

    <field name="city">
      <prompt>Where do you want to travel to?</prompt>
      <option>Edinburgh</option>
      <option>New York</option>
      <option>London</option>
      <option>Paris</option>
      <option>Stockholm</option>
    </field>

    <field name="travellers" type="number">
      <prompt>How many are travelling to <value expr="city"/>?</prompt>
    </field>

    <block>
      <submit next="http://localhost/handler" namelist="city travellers"/>
    </block>

  </form>
</vxml>

VoiceXML allows you to give progressively more detailed prompts when the
user is having difficulty answering. This relies on a counter that increments each
time the field is visited. The following example shows how this works for a field
that collects the number of people travelling. The user is initially asked: "How
many are travelling to Boston?" If this doesn't get a satisfactory answer, the user
is then asked: "Please tell me the number of people travelling." The nomatch
element allows you to provide a reminder if the user said something other than a
number:

<field name="travellers" type="number">

  <prompt count="1">
    How many are travelling to <value expr="city"/>?
  </prompt>

  <prompt count="2">
    Please tell me the number of people travelling.
  </prompt>

  <prompt count="3">
    To book a flight, you must tell me the number
    of people travelling to <value expr="city"/>.
  </prompt>

  <nomatch>
    <prompt>Please say just a number.</prompt>
    <reprompt/>
  </nomatch>

</field>

Here is an example that checks the value of a field after it has been collected.
This is used to issue a warning when the number of travellers in the group is
greater than twelve:

<field name="travellers" type="number">

  <prompt>How many are travelling to <value expr="city"/>?</prompt>

  <filled>
    <var name="num_travellers" expr="travellers + 0"/>
    <if cond="num_travellers &gt; 12">
      <prompt>
        Sorry, we only handle groups of up to 12 people.
      </prompt>
      <clear namelist="travellers"/>
    </if>
  </filled>

</field>

VoiceXML allows you to define subdialogs that can be used for common tasks.
Subdialogs are analogous to subroutines in programming languages. Here is an
example of a confirmation subdialog that asks the user whether to accept an
earlier input:

<form id="ynconfirm">
  <var name="user_input"/>

  <field name="yn" type="boolean">

    <prompt>Did you say <value expr="user_input"/></prompt>

    <filled>
      <var name="result" expr="'false'"/>
      <if cond="yn">
        <assign name="result" expr="'true'"/>
      </if>
      <return namelist="result"/>
    </filled>

  </field>

</form>

If the speech recognizer indicates that it wasn't quite sure of what the user said,
VoiceXML allows you to tailor the dialog appropriately. In the following example,
the user is asked for a confirmation if the confidence score for the city name is
less than 0.7, but if it is less than 0.3, the user will be asked to say the city name
again:

<field name="city">
  <prompt>Which city?</prompt>
  ...
  <filled>
    <if cond="city$.confidence &lt; 0.3">
      <prompt>Sorry, I didn't get that</prompt>
      <clear namelist="city"/>
    <elseif cond="city$.confidence &lt; 0.7"/>
      <assign name="utterance" expr="city$.utterance"/>
      <goto nextitem="confirmcity"/>
    </if>
  </filled>
</field>

<subdialog name="confirmcity" src="#ynconfirm" cond="false">
  <param name="user_input" expr="utterance"/>
  <filled>
    <if cond="confirmcity.result=='false'">
      <clear namelist="city"/>
    </if>
  </filled>
</subdialog>

If the confidence is less than 0.3, the user will be told "Sorry, I didn't get that",
and will then be reprompted for the city name. Otherwise, if the confidence is less
than 0.7, the generic confirmation subdialog is invoked. The subdialog element acts
like a subroutine call. The param element is used to pass data to the subdialog.

You can also use grammars in separate files. The following example makes use
of grammars in "trade.xml":

<form id="trader">

  <field name="company">
    <prompt>Which company do you want to trade?</prompt>
    <grammar src="trade.xml#company" type="application/grammar+xml"/>
  </field>

  <field name="action">
    <prompt>
      Do you want to buy or sell shares in
      <value expr="company"/>?
    </prompt>
    <grammar src="trade.xml#action" type="application/grammar+xml"/>
  </field>

</form>

You can use the import element to import grammar rules so that you can refer
to them in locally defined grammars. In the following example it is assumed that
"politeness.xml" defines rules named "startPolite" (e.g. 'please') and "endPolite"
(e.g. 'thank you'):

<grammar xml:lang="en">
  <import uri="http://please.com/politeness.xml" name="polite"/>

  <rule name="command" scope="public">
    <ruleref import="polite#startPolite"/>
    <ruleref uri="#action"/>
    <ruleref uri="#company"/>
    <ruleref import="polite#endPolite"/>
  </rule>

  <rule name="action" scope="public">
    <choice>
      <item tag="buy"> buy </item>
      <item tag="sell"> sell </item>
    </choice>
  </rule>

  <rule name="company" scope="public">
    <choice>
      <item tag="ericsson"> ericsson </item>
      <item tag="nokia"> nokia </item>
    </choice>
  </rule>
</grammar>

In the following example for a stock trading application, the user can respond
with a short phrase such as "buy ericsson" that sets both the company and the
trade (buy or sell). The grammar for this is defined in the file "trade.xml". If the
user fails to respond adequately, the application tries a simpler approach,
prompting first for the company and then for the trade. The field elements are
skipped if the corresponding field value has already been filled.

<form id="trader">

  <grammar src="trade.xml#command" type="application/grammar+xml"/>

  <initial name="start">
    <prompt>What trade do you want to make?</prompt>
    <nomatch count="1">
      <prompt>Please say something like 'buy ericsson'</prompt>
      <reprompt/>
    </nomatch>
    <nomatch count="2">
      Sorry, I didn't understand your request. Let's try something
      simpler.
      <assign name="start" expr="true"/>
    </nomatch>
  </initial>

  <field name="company"> ... </field>

  <field name="action"> ... </field>
</form>

The application may give the user the chance to change to a different task by
speaking the appropriate command. The grammar for this can be specified at
the document level or in the application root document. Here is an example of a
document-level command menu:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">

  <form id="trader">
    ...
  </form>

  <menu id="portal-commands" scope="document">
    <choice next="http://www.wl.com?action=car">Car hire</choice>
    <choice next="http://www.wl.com?action=hotel">Hotel reservations</choice>
    <choice next="http://www.wl.com?action=news">Today's news</choice>
  </menu>

  ...
</vxml>

To reference the application root document, you use the application
attribute on the vxml element:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en"
      application="http://buster/portal?sessionID=12d4rf65hg4">

  ...
</vxml>

Here is an example of a root document that makes available a command for
returning to the portal home page. The example also includes a handler for
catching "noinput" events in case these haven't been caught by lower-level
handlers, e.g. on each dialog:

<form id="portal-commands" scope="document">
  <field name="action">
    <grammar src="http://buster/portal/commands.xml"
             type="application/grammar+xml"/>
  </field>
  <block>
    <submit next="http://www.wl.com"/>
  </block>
</form>

<!-- expr must be an ECMAScript expression, so the text is a quoted string -->
<var name="portal_help" expr=
  "'To return to your portal home, say \'home page\', or press 0.'"/>

<catch event="noinput">
  Sorry, I didn't hear anything.
</catch>
