
Modern education for the knowledge society /

The project is co-financed from EU sources

INTERAKTÍVNE TELEKOMUNIKAČNÉ
SYSTÉMY A SLUŽBY
INTERACTIVE TELECOMMUNICATION
SYSTEMS AND SERVICES

Faculty of Electrical Engineering and Informatics

Juhár Jozef, Ondáš Stanislav


This publication was produced with financial support from the European Social Fund within the Operational Programme EDUCATION.

Priority axis 1: Reform of education and vocational training


Measure 1.2: Higher education institutions and research and development as engines of the development of the knowledge society.

Project title: Balík prvkov pre skvalitnenie a inováciu vzdelávania na TUKE (A package of components for improving and innovating education at TUKE)


ITMS 26110230070

TITLE: Interaktívne telekomunikačné systémy a služby (Interactive Telecommunication Systems and Services)


AUTHORS: prof. Ing. Jozef Juhár, CSc., Ing. Stanislav Ondáš, PhD.
PUBLISHER: Technická univerzita v Košiciach (Technical University of Košice)
YEAR: 2015
EXTENT: 90 pages
PRINT RUN: 50 copies
EDITION: first
ISBN: 978-80-553-2018-2

The manuscript has not undergone language editing.


The authors are responsible for the technical content.

Table of contents
FOREWORD ...................................................................................................................................................... 5

1. INTRODUCTION ....................................................................................................................................... 6

1.1. SPOKEN DIALOGUE SYSTEMS ........................................................................................................................ 7


1.2. MULTIMODAL INTERACTIVE SYSTEMS AND SERVICES ......................................................................................... 9
1.3. APPLICATIONS OF SDS AND MDS SYSTEMS .................................................................................................. 12

2. COMPONENTS OF THE SPEECH-BASED HUMAN-MACHINE INTERFACE .................................................. 15

2.1. AUTOMATIC SPEECH RECOGNITION ............................................................................................................ 15


2.2. NATURAL LANGUAGE UNDERSTANDING....................................................................................................... 17
2.3. DIALOGUE MANAGEMENT ......................................................................................................................... 18
2.4. SPOKEN LANGUAGE GENERATION AND SPEECH SYNTHESIS................................................................................ 20

3. TECHNOLOGIES FOR DESIGNING ITSS .................................................................................................... 25

3.1. LANGUAGES FOR DESIGNING ITSS .............................................................................................................. 25


3.1.1. W3C Speech Interface Framework ................................................................................................ 25
3.1.2. W3C Multimodal Interaction ......................................................................................................... 33
3.1.3. Speech Application Language Tags ............................................................................................... 38
3.2. INTERFACES IN INTERACTIVE TELECOMMUNICATION SYSTEMS ........................................................................... 38
3.2.1. Java Speech API ............................................................................................................................. 38
3.2.2. Microsoft Speech API ..................................................................................................................... 39
3.2.3. Media Resource Control Protocol .................................................................................................. 40
3.2.4. Web Speech API ............................................................................................................................. 41
3.3. TOOLS AND SOLUTIONS ............................................................................................................................ 41
3.3.1. Speech Recognition toolkits........................................................................................................... 42
3.3.2. Language modeling toolkits .......................................................................................................... 44
3.3.3. Speech synthesis toolkits ............................................................................................................... 44
3.3.4. Dialog manager tools .................................................................................................................... 46
3.3.5. VoIP software gateways ................................................................................................................ 47

4. DESIGNING VOICEXML-BASED VOICE SERVICES ..................................................................................... 48

4.1. BASIC PRINCIPLES .................................................................................................................................... 48


4.1.1. Dialogue structure and flow .......................................................................................................... 48
4.1.2. Writing prompts and grammars.................................................................................................... 54
4.2. WRITING VOICEXML APPLICATIONS ........................................................................................................... 55

5. WEB APPLICATIONS WITH VOICE MODALITY ........................................................................................ 64

5.1. TECHNOLOGIES FOR CREATING WEB APPLICATION .......................................................................................... 65


5.1.1. HTML5 ........................................................................................................................................... 65
5.1.2. JavaScript ...................................................................................................................................... 65
5.2. TECHNOLOGIES FOR ENABLING SPEECH MODALITY TO WEB APPLICATIONS ........................................................... 66
5.2.1. Google Chrome and the speech control ........................................................................................ 67
5.3. AUDIO SIGNAL CAPTURING USING HTML5 ................................................................................................... 69
5.3.1. getUserMedia API interface .......................................................................................................... 69
5.3.2. Web Audio API interface ............................................................................................................... 70
5.4. PUBLIC LIGHTING CONTROL WEB APPLICATION – A CASE STUDY ......................................................................... 71

6. TABLE OF PICTURES ............................................................................................................................... 75


7. REFERENCES .......................................................................................................................................... 77

APPENDIX A: PUBLIC LIGHTING CONTROL APPLICATION SOURCE CODES ....................................................... 79

APPENDIX B: VOICEXML 2.0 ELEMENTS .......................................................................................................... 87

APPENDIX C: SRGS ELEMENTS (XML FORM) ................................................................................................... 89


Foreword

Interactivity, mobility and real-time multimedia streaming have become new attributes of modern services. The rapid rise of such services drives new technologies, but it is equally true that new technologies give rise to new services.

This book aspires to capture this effect. It focuses mainly on interactive telecommunication systems and services from the perspective of maturing speech technologies, such as automatic speech recognition and text-to-speech synthesis. The improving quality of these technologies, together with other factors, is changing the way of human-machine interaction. It increasingly approaches human-human interaction, which reduces the user's cognitive load and also increases the perceived quality of the service.

Given the considerable breadth of the topic, it was not possible to deal with each technology in detail, but we hope that this book will help the reader obtain important information about the modern technologies behind interactive telecommunication systems and services.

The authors


1. Introduction

Interactive telecommunication systems (ITS) are systems that enable human-machine interaction and are provided through telecommunication networks. There are several categories of such systems, which can be classified according to several criteria.

According to the number of modalities involved in human-machine communication, we can distinguish between unimodal and multimodal systems.

According to the type of telecommunication network, we can divide ITSs into systems that are provided through the telephone network and offer voice-based services, and systems that deliver their services over the Internet.

Many other classifications are possible, but we prefer to focus on another view of such systems: in general, they are human-machine interfaces (HMI). This book focuses on speech-based human-machine interfaces, in line with the authors' own research, which is carried out in the Laboratory of Speech and Mobile Technologies in Telecommunications. A speech-based HMI is inspired by human-human dialogue, and its principle is described by the human-machine communication chain (see Fig. 1).

Fig. 1. Human machine communication chain

The modeled human-machine communication starts with the formulation of a message in the user's (operator's) brain, followed by its pronunciation. The spoken message is captured by an electroacoustic transducer (a microphone) and transformed into an electrical signal.


The signal preprocessing block performs an operation similar to the human hearing system: it transforms the incoming electrical signal into a sequence of parametric vectors, which represent the important characteristics of the acoustic signal over time. This sequence of parametric vectors is the input of the automatic speech recognition (ASR) block, which decodes it and generates the corresponding sequence of words. To decode the meaning of the incoming words, further processing has to be done in the block of speech (natural language) understanding (NLU). Here, the incoming utterance is transformed into some kind of semantic representation usable by the dialogue manager unit to control the human-machine dialogue interaction. The dialogue manager is "the brain" of the speech-based HMI. It decides about the next step in the interaction and constructs the response to the user in the form of a semantic representation of the information that should be presented. Such information often has the form of attribute-value pairs and needs to be transformed into a natural language utterance. This is done in the natural language generation block. After the message has been constructed in the form of such an utterance, the text-to-speech (TTS) block transforms it into an acoustic signal, which, after post-processing, can be delivered back to the user.

The described scenario represents the basic interaction loop that enables a dialogue exchange in speech-based human-machine interfaces. Spoken dialogue systems directly adopt this scenario.

1.1. Spoken dialogue systems


Spoken dialogue systems (SDS) are computer programs developed to interact with users employing speech in order to provide them with specific automated services [1]. SDSs are unimodal systems, which means they accept speech as input and produce speech-based output. They use dialogue interaction to obtain information from the user and to fulfill the user's goals. Services provided by SDSs are usually called automated voice services. Typical examples are weather forecasts, timetable services, call centers, help desks, and room/flight booking services. Nowadays, spoken dialogue systems are often integrated as in-car communication systems or personal assistants, where they are frequently extended to allow other modalities.

Fig. 2 Pipeline architecture of the Spoken Dialogue System


The basic architecture of an SDS directly implements the human-machine communication chain. This type of architecture is often called a "pipeline architecture" (Fig. 2), because the components of the system are arranged serially: the output of the previous component is the input of the next one, and the processing of incoming information is synchronous. Such an arrangement has its disadvantages, mainly the processing delay and the missing backchannel between components. Other types of architectures, which enable more processing flexibility and closer cooperation between components, have therefore been developed. A typical representative of SDS architectures is a distributed architecture with a hub-server topology.

A typical example of a distributed architecture is the hub-server architecture used in the Galaxy Communicator. The hub architecture was developed in a project supported by DARPA. The DARPA Communicator [2] is an open system whose main component is called the "Hub". The Hub process distributes the communication and services among the other components, the servers. The successor of the DARPA Communicator is the Galaxy Communicator [3]. Its scheme is shown in Fig. 3.

Fig. 3 Hub-server architecture of the Galaxy Communicator1

The Galaxy Communicator is a distributed, message-based, hub-and-spoke software infrastructure optimized for the development of spoken dialogue systems. Its main components are the Generator (NLG), Synthesizer (TTS), Recognizer (ASR), Dialogue (DM), and Parser (NLU). The Audio client, as a further component, provides voice data and telephony events to the system. The Backend server is responsible for communication with external data sources, and the Builtin server provides built-in features to the system.

There is also a group of systems that use a hub-server architecture but place the dialogue manager, instead of a hub, at the center of the system as the central router.

1 Source: http://communicator.sourceforge.net/sites/MITRE/distributions/GalaxyCommunicator/


1.2. Multimodal interactive systems and services


Although spoken dialogue systems are very popular and helpful, interaction between people does not rely on speech alone; it is typically multimodal. People usually use speech, gestures, sight, posture and other modalities in interpersonal communication. Multimodal interaction is therefore expected to be the best way of interacting with computers, robots or "virtual humans" as well. However, there is still a long way to go before virtual humans are capable of human-like interaction. Adding such capabilities to a virtual human requires knowledge from several research areas, e.g. informatics, communications, cognitive science, neurology, psychology and linguistics. There are also many interaction aspects that should be considered, such as the processing of nonverbal signals, emotions, the production of feedback, or cross-modality issues.

Multimodal dialogue systems (MDS) are interactive systems that enable human-machine interaction using several input and output modalities, such as speech, gestures, touch, etc.

Fig. 4 Architecture of the multimodal dialogue system2

In multimodal dialogue systems, the group of "core" technologies (ASR, NLU, DM, NLG, TTS) is extended with further technologies that recognize and interpret other modalities and perform multimodal integration (fusion) and multimodal generation (fission) of the supported modalities.

A general architecture of a multimodal dialogue system is shown in Fig. 4. Each input modality first has to be captured and recognized. Semantic interpretation can then be performed separately for each input modality, or the recognized outputs from all modalities can be interpreted together in the multimodal integration (fusion) block. Data fusion, or multimodal integration, combines information arriving through several input modalities to produce one meaningful semantic message for the dialogue manager.

2 Source: http://nlp.postech.ac.kr/research/dialog_system/mdm/


We can illustrate multimodal fusion with the following example. Assume an application for a mobile device with a touchscreen that enables the user to select an area on a map and search for points of interest (POI) in this area. Assume that the application allows both touchscreen gestures and speech input, and that the user's input is as shown in Fig. 5.

Fig. 5. An example of user's input in a multimodal interface

The user selected an area on the map and spoke the utterance "Please, find all pubs here". In the sketched scenario, the fusion block has to integrate the input from the touchscreen (the gesture that selected the area, the coordinates of the selected area or some identification point) and the spoken utterance. In practice, the word "here" can be replaced with the name of the area (e.g. the city center), which may be representative enough for creating a database request.

In general, methods of multimodal fusion can be divided according to the phase in which the fusion is performed:

 Data-level fusion
 Feature-level fusion
 Decision-level fusion

Data-level fusion directly integrates the data streams captured by input devices such as microphones, cameras or sensors. A good example is the combination of several camera video streams to enable multiple views. Feature-level fusion integrates feature vectors from at least two inputs, usually to enhance recognizer performance. This fusion technique is successfully used with features computed from speech and lip movement, where a significant improvement of speech recognition can be achieved, especially in noisy environments. The scenario illustrated in Fig. 5 represents a typical situation where decision-level fusion is suitable.
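For the scenario in Fig. 5, the result of decision-level fusion could be a single combined interpretation handed to the dialogue manager. The following sketch is purely illustrative (the element and slot names are hypothetical and not taken from any particular standard); in W3C-based architectures such combined interpretations are typically represented in the EMMA annotation language developed by the W3C Multimodal Interaction group (see Section 3.1.2):

<interpretation intent="find_poi">
  <!-- filled from the spoken utterance -->
  <slot name="poi_type" source="speech">pub</slot>
  <!-- filled from the touchscreen gesture -->
  <slot name="area" source="gesture">city_center</slot>
</interpretation>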


Typical input modalities are:

 Speech and speech characteristics (emotions)
 Pointing or touching (touchscreen)
 Gestures (hand gestures, head gestures, body motion)
 Facial expressions and gaze tracking
 Biometric signals
 Brain activity
 Others

Multimodal generation (fission) is the inverse process to multimodal integration (fusion). The fission module decides how the output is presented. It selects the modalities that should be involved in the output presentation and also decides how the output information is distributed among them. Fission methods are typically platform-dependent, which means that each platform usually uses its own approach to multimodal generation according to the available output modalities.

Typical output modalities are as follows:

 Visual modalities (text, graphics, animations or virtual agent embodiment)
 Auditory modalities (speech, non-speech sound)
 Others

Fig. 6. MOBILTEL Weather Forecast multimodal application


Fig. 6 shows a set of output windows, which form the visual part of the multimodal output generated by the MOBILTEL multimodal system developed in our laboratory. The second output modality is synthetic speech.

1.3. Applications of SDS and MDS systems


In the beginning, speech-based human-machine interfaces had the form of Voice User Interfaces (VUI) that allowed only DTMF input and produced speech-based output. Such interfaces were typically available through the telephone network and provided simple services such as pizza ordering, weather forecasts or simple call centers.

With the growing availability and performance of automatic speech recognition, Voice User Interfaces evolved into typical Spoken Dialogue Systems, which make it possible to provide more complex voice services. Moreover, interaction based on speech input is more comfortable. Whereas the first SDS systems were capable of interpreting speech input only in the form of isolated spoken words or short phrases, more modern systems allow more natural language, employing large vocabulary continuous speech recognition (LVCSR). This enables the user to formulate an intention at once, as a sentence, which significantly decreases the time needed to complete the desired task. More complex booking services (flight booking, holiday booking, etc.), voice banking applications and more complex call centers were typical voice services provided by such systems.

Later, the spread of Internet connectivity in mobile devices and the arrival of touchscreens radically changed the way of human-computer interaction. These changes moved interaction from keyboard-based input to touch/touch-gesture input and from the unimodal mode (spoken interaction) to a multimodal one. The Google search interface, for example, can be seen as a simple multimodal system too: it allows the text to be searched to be typed or spoken. This is an example of an "asymmetric" multimodal interface, which enables multimodal input but only unimodal output (in the form of a list of web pages).

Thanks to the increasing quality and usability of automatic speech recognition (ASR) and text-to-speech (TTS) systems, we can observe the rapid development of advanced multimodal human-machine interfaces (HMI) that enable spoken dialogue in a human-like way. Such systems are often called virtual agents or assistants. The Apple Siri virtual assistant, introduced in 2011, can be considered the first widely used, modern, genuinely usable and helpful dialogue-based HMI, demonstrating, among other things, state-of-the-art ASR, TTS and natural language processing (NLP) technologies. Google Now was introduced in 2012, and Microsoft Cortana arrived in 2014. Fig. 7 shows one of the interaction scenarios with the Apple Siri assistant.

There is also another group of virtual agents that take on a human embodiment. They are also known as embodied conversational agents (ECA). Their human-like appearance gives them new possibilities in interaction: they can communicate using human-like gestures, body and head movements or changes of gaze direction, and they can express emotions or attitudes in reaction to human behavior.


Fig. 7. An example of an interaction scenario with the Apple Siri virtual assistant3 and a selection of its capabilities4

One of the popular frameworks that enable the integration of an embodied conversational agent is GRETA, designed at Telecom ParisTech (see Fig. 8).

Fig. 8. GRETA - Embodied Conversational Agent5

From embodied agents it is only one step to robots, which also require a multimodal human-machine interface. Robots that rely on multimodal, human-like interaction can be further divided into humanoid robots and other robotic systems (e.g. healthcare robots) that do not take a human embodiment. The types of input/output modalities used differ accordingly. Humanoid robots can communicate using the same output modalities as virtual agents, whereas, for example, healthcare or accompanying robots, which often use a touchscreen as the interface hardware, assume other modalities. Fig. 9 shows the robotic system Care-O-bot 3 (Fraunhofer IPA), designed for healthcare and for accompanying disabled and elderly people, on the left, and one of the most popular humanoid robots, NAO (Aldebaran), on the right.

3 Source: www.pcmag.com
4 Source: www.applemagazin.com
5 Source: https://trac.telecom-paristech.fr/trac/project/greta


Fig. 9. Care-O-bot 3 robot (left) vs. Aldebaran NAO humanoid robot (right)6

Another application of speech technologies and multimodal interaction systems are in-vehicle infotainment systems with a speech interface. They are designed to increase safety while controlling in-vehicle subsystems such as the air conditioning, radio, music playback or navigation. Such systems are also perceived as more user friendly. Fig. 10 shows a Fujitsu prototype of the Interactive Voice-Recognition Car Navigation Unit.

Fig. 10 Fujitsu prototype of the Interactive Voice-recognition Car Navigation Unit 7

6 Sources: http://library.isr.ist.utl.pt/docs/roswiki/care(2d)o(2d)bot.html, https://www.aldebaran.com/
7 Source: http://www.fujitsu-ten.com/release/2014/20140217_e.html


2. Components of the speech-based human-machine interface

2.1. Automatic Speech Recognition


Automatic Speech Recognition (ASR) technology is an important part of every human-machine speech interface. The process of automatic speech recognition can be defined as the transformation of an incoming acoustic signal, captured by a microphone, into the corresponding sequence of words. Because the output of ASR has the form of text, such systems are sometimes called speech-to-text converters. The recognized sequence of words can be the final product, or it can be further processed in subsequent modules, typically in the language understanding block.

The general architecture of the ASR system is shown in Fig. 11.

Fig. 11 General architecture of the ASR system

The main components of an ASR system are the feature extraction (parameterization) block and the decoding (search) block. Three resources are used by the decoding component: the acoustic model, the language model and the lexicon.

The feature extraction block converts the incoming speech signal into a parametric representation suitable for the speech recognition process (e.g. MFCC parametric vectors).


The decoding component implements one of the recognition approaches. In general, it performs acoustic matching: it compares the parametric representation of the input signal with acoustic patterns or with statistical models representing the recognizable units (words, phonemes, ...). This acoustic matching is often called decoding or searching, because it tries to find either the pattern with the minimum distance to the input signal or the model that gives the maximum likelihood of having generated the input signal.
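In statistical ASR systems, this search is usually written as the standard Bayes decision rule (O denotes the sequence of parametric vectors and W a word sequence):

\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\, P(W)

where P(O | W) is supplied by the acoustic model and P(W) by the language model described below.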

In the case of statistical recognition methods, the recognition units are modeled by statistical acoustic models, predominantly Hidden Markov Models (HMM). The HMMs representing the basic recognition units are trained on a speech corpus, which consists of speech recordings and their exact transcriptions. To train a speaker-independent system, a large corpus with more than a hundred hours of speech recorded by many speakers is required.

The language model is the next important resource involved in the recognition process. It models the real occurrence probabilities of utterances in a language and in this way reduces the complexity of the recognition space. There are two types of language models: deterministic and stochastic. Deterministic models are rule-based speech grammars that directly define the allowed words and phrases and their possible combinations. Stochastic models learn real language statistics from text corpora and model the occurrence probabilities of so-called N-grams, i.e. sequences of N words. Whereas deterministic models (grammars) are well suited to the recognition of simple commands or short utterances, stochastic N-gram models have to be trained on large corpora to enable dictation or the recognition of natural continuous speech.
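For example, a trigram model (N = 3) approximates the probability of a word sequence w_1, ..., w_K as

P(w_1, \dots, w_K) \approx \prod_{i=1}^{K} P(w_i \mid w_{i-2}, w_{i-1})

where the conditional probabilities are estimated from the counts of word triples in the training corpus.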

The third resource required for speech recognition is the lexicon or dictionary, which contains all the words that can be recognized by the ASR system, together with their pronunciations. The lexicon serves for converting the recognized sequence of phonemes into its grapheme equivalents, which form meaningful words.

During the last thirty years, corpus-based ASR methods have become the preferred approach. Such methods are based on statistical modeling algorithms, where the system models are trained automatically on text and speech data.

According to the decoding technique, corpus-based ASR systems can be divided into systems based on:

 Pattern-matching algorithms
 Hidden Markov model theory
 Neural networks
 Hybrid approaches

Pattern-matching algorithms are the oldest ones. In this method, words or sequences of words are the smallest recognition units. The decoding process searches for the lowest distance between the parametric representation of the pattern words or phrases in the system vocabulary and the representation of the incoming word or phrase. To make the distance computation feasible, a nonlinear transformation of the time axis is necessary to ensure the equal length of the parametric vector sequences of the input signal and the pattern. The most popular algorithm of this group is the Dynamic Time Warping (DTW) algorithm, where the nonlinear transformation of the time axis (warping) is controlled by the local distances between parametric vectors.
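The warping is typically computed by the standard dynamic-programming recursion over the accumulated distance matrix (d(i, j) denotes the local distance between the i-th input vector and the j-th pattern vector):

D(i, j) = d(i, j) + \min \{ D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \}

and the final value D(I, J) serves as the global distance between the input and the pattern.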

Pattern matching methods have several important limitations. The technique is speaker-dependent, which means that it gives sufficient results only if the speaker is the same person who recorded the patterns. The next important limitation is that the method works well only for tasks where isolated words are to be recognized, because in this case the word boundaries are detectable and the difference between the lengths of the pattern and the incoming word is limited. This is not true for continuous speech, where the word boundaries are harder to detect. Another limitation of the described method, which uses whole words as the basic unit, is the unbounded set of patterns, which grows with the number of recognizable words. In contrast, approaches that use smaller units (phonemes, triphones or syllables) work with a finite set of "patterns".

Automatic speech recognition systems based on Hidden Markov Model (HMM) theory are dominant in this area. In such systems, whole words or, more often, sub-word units (phonemes, syllables) are modeled by HMMs, and words can then be modeled by concatenating these models. The HMMs are usually combined with a language model, which provides information about the probability of occurrence of word sequences. The HMM-based approach became the standard approach to the ASR task because it overcomes the limits of the previously introduced method: it makes it possible to train a speaker-independent system, which can be used not only for the recognition of isolated words but also for large vocabulary continuous speech recognition (LVCSR).

Another group of ASR algorithms is based on Artificial Neural Networks (ANN) and on hybrid systems, which combine HMMs and ANNs. Using ANNs in the ASR task is a logical step towards decoding speech in a manner similar to that of a human listener.

2.2. Natural Language Understanding


The output of automatic speech recognition has the form of text, which requires further processing to be usable for interaction purposes. Natural Language Understanding (NLU) technology transforms the incoming textual representation of a spoken utterance into a corresponding representation of its meaning; it extracts so-called semantic information. The output of the NLU module usually has the form of attribute-value pairs or structured semantic frames.

There are three main types of approaches to the interpretation of natural language utterances:

 Linguistic methods
 Statistical approaches
 Other, simplified approaches

The linguistic method of NLU is the most complex approach; it performs a full analysis on the morphological, syntactic, semantic and contextual layers. Although this approach seems to be the best way to understanding, its complexity usually prevents its use in standard spoken dialogue systems. Statistical approaches represent an alternative to linguistic methods, although they also build on linguistics. The use of labeled corpora is typical for statistical approaches: models are trained automatically from the labeled data using one of the machine learning algorithms. The main disadvantage of this approach is the need for corpora, which often have to be labeled manually by human annotators.

The high complexity and cost of designing NLU systems based on the previously introduced methods led to a search for simpler approaches that can provide sufficient understanding in task-oriented, domain-dependent interactive dialogue systems. Thanks to the W3C standardization effort, the approach based on semantic grammars can be considered the most popular one. Its idea is similar to that of speech grammars, where all possible words and phrases are described in the grammar. An NLU system based on semantic grammars also decodes meaning by matching the text of the incoming utterance against the grammar, which is able to associate parts of the utterance with semantic slots. The languages of the W3C Speech Interface Framework make it possible to join a speech grammar (written according to the SRGS recommendation) and rules for semantic interpretation (written according to the SISR recommendation) together. In this case, the rules for meaning decoding are included directly in the speech grammar.

Here we can note that the serial concatenation of ASR and NLU is not the only way to transform an acoustic signal into a usable meaning representation. To obtain better results in more complex communication scenarios (such as conversation), some kind of backchannel is required that makes it possible to influence the selection of the best recognition hypothesis.

2.3. Dialogue management


The dialogue management unit, or dialogue manager (DM), is the central element of every speech-based human-machine interface. The main purpose of the DM is to manage the dialogue interaction, i.e. to decide about the next step in the dialogue or about the way the system reacts. It combines the incoming input from the user with the current dialogue state and the interaction history in order to make a decision about the next step in the interaction. To realize the interaction, the DM also has to manage the other components of the interactive dialogue system (e.g. ASR, TTS, etc.).

There are many approaches commonly used in dialogue management, and they can be divided according to various criteria. One possible division is as follows [4]:

 Finite state/dialogue grammar-based approaches
 Plan-based approaches
 Collaborative approaches
 HMM-based approaches

Other divisions include, among others, statistically based methods. Regardless of this classification, it can be concluded that any system can, in the end, be implemented as a finite state machine.


Dialogue management systems based on finite state machines model the dialogue as a set of states (nodes) that are arranged in some structure (the state machine) and interconnected by edges. In this method, a state represents a particular dialogue exchange, in which the system typically asks the user a question. Each edge then corresponds to one possible user answer. States can also be associated with other system actions (not only asking the user). An example of a finite state machine dialogue is shown in Fig. 12.

Fig. 12 Example of dialogue finite state machine8
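A dialogue state machine of this kind can be written down, for example, in the W3C SCXML language mentioned later in Section 3.1.1. The following minimal sketch is illustrative only; the state and event names are hypothetical:

<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="ask_city">
  <state id="ask_city">
    <!-- the system asks for the departure city and waits for the answer -->
    <transition event="user.city" target="ask_date"/>
    <transition event="user.noinput" target="ask_city"/>
  </state>
  <state id="ask_date">
    <!-- the system asks for the travel date -->
    <transition event="user.date" target="done"/>
  </state>
  <final id="done"/>
</scxml>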

The dialogue grammar is a similar approach, in which the system tries to identify patterns of dialogue or patterns of speech acts [5] and their responses. Dialogue grammars, similarly to speech grammars, prescribe the dialogue using sequencing regularities in dialogues. Such regularities are called adjacency pairs. Jefferson [6] proposed in 1972 that a dialogue is a collection of such pairs (question-answer, proposal-acceptance/denial, etc.).

8 Source: http://www.ling.gu.se/~lager/teaching/dialogue_systems/labs/simple_python_dialog_system.html


Finite state machines and dialogue grammars are approaches that lack flexibility in dialogue. On the other hand, they make it possible to manage the dialogue interaction reliably and to lead the interaction successfully to its goal.

More flexibility is offered by frame-based approaches, which do not define a fixed structure of "states", called frames here. This approach is also known as the form-filling method: the dialogue can be seen as a form that is filled in by voice, and the order of filling in the individual items of the form is arbitrary.

Plan-based approaches focus more on the goals that the dialogue participants plan to achieve in the interaction. One of the most important goals is changing the mental state of the listener. Here, utterances are seen as speech acts [5] and are used to achieve these goals. A speech act represents the communicative intention of a particular utterance. The listener has to identify the speaker's plan and respond accordingly.

Collaborative approaches focus on the collaborative process, which is one of the attributes of dialogue interaction. The main idea is that the dialogue participants work together to achieve mutual understanding, which can lead to achieving the interaction goals. Collaborative approaches try to focus on the motivations behind a dialogue and on the dialogue mechanisms, rather than concentrating on the task structure.

Although plan-based and collaborative approaches provide more flexibility than frame-based or finite-state methods, and also provide a better dialogue model, their implementation, portability and maintenance are more demanding and expensive. A large group of spoken dialogue systems and other speech-based human-machine interfaces therefore use a simpler dialogue management technique based on the dialogue description language (DDL) VoiceXML. VoiceXML is a markup language that makes it possible to write the control instructions for a spoken dialogue in the form of interpretable scripts. This language has become an international standard for writing dialogues for task-oriented speech-based human-machine interfaces. More information about VoiceXML is provided in Section 3.1.1.1.

2.4. Spoken language generation and speech synthesis


The step that follows the DM's decision about the next step in the dialogue is the construction of the system response, which is delivered back to the user. The process of transforming the semantic representation of the system's response into an utterance in natural language is called natural language generation (NLG). NLG is one of the Natural Language Processing (NLP) tasks. It is the inverse process to natural language understanding, or meaning interpretation. The techniques usually used for NLG are similar to NLU techniques, and they can be divided into deterministic and stochastic methods. In VoiceXML-based systems, natural language generation is performed directly on the side of the dialogue manager, because VoiceXML offers a very simple deterministic approach to NLG. The utterances forming the system output are generated from constructions that contain text and slots into which the meaning representation of the desired output is inserted. Such constructions can be seen as frames. An example of a frame for generating the system output can look as follows:


FRAME: The train [train-name] leaves the station [station-name] at [departure-time].

REALIZATION: The train IC502 leaves the station Košice at 9:15.

In the previous example, the frame consists of text with three slots: train-name, station-name and departure-time. During output generation, these slots are filled with the values corresponding to the slot names, producing a meaningful sentence that can be synthesized in the next step.
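In a VoiceXML-based system, such a frame is typically written directly as a prompt template. The following sketch is illustrative; the variable names are hypothetical and are assumed to be filled by the dialogue logic:

<prompt>
  The train <value expr="trainname"/> leaves the station <value expr="stationname"/>
  at <value expr="departuretime"/>.
</prompt>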

In general, the process of natural language generation can be divided into two main phases:

 Content planning
 Utterance construction

In the first phase, the system decides, according to the current dialogue state, what should be presented to the user. In the second phase, the appropriate words and form of presentation are selected and the control instructions for output realization (e.g. for the TTS system) are prepared.

The natural language generation block is usually followed by the Text-to-Speech (TTS) component. A Text-to-Speech system transforms the incoming text (and control instructions) into its spoken equivalent in the form of an acoustic signal. This technology gives artificial systems the capability to speak; it produces an artificial voice.

In general, a TTS system consists of two main parts: text preprocessing and the speech synthesizer. In the text preprocessing part, several steps are performed: document structure detection, text normalization, and linguistic analysis for extracting phonetic and prosodic information. To produce artificial speech with correct articulation, the incoming text usually needs to be transformed into its phonetic representation, and appropriate prosodic parameters (melody, emphasis, tempo, etc.) need to be extracted. Then, according to the information from text preprocessing, the speech synthesizer generates an acoustic signal using one of the possible methods.

There are several methods that can be used for speech synthesis:

 Articulatory synthesis
 Formant synthesis
 Concatenative synthesis
 Corpus-based synthesis
 Statistical parametric synthesis

The articulatory speech synthesis method is based on a physical model of speech production and can be considered the most general. Unfortunately, its use is limited due to the significant complexity of the physical model.


Formant synthesis uses the source-filter theory, in other words the electrical model of speech production. For the input text, an appropriate set of parameters needs to be generated, and these parameters are then used to set the synthesis filters.

The most popular and widely used method of speech synthesis is the concatenation of speech units that match the input text. Methods of concatenative synthesis differ according to the length of the concatenated units. The most commonly used phonetic elements for concatenation are diphones, which make it possible to achieve intelligible speech output. Diphones are units spanning the interval from the middle of one phoneme to the middle of the following phoneme. Diphones are an acceptable compromise between the size of the unit inventory and the quality of the synthesis. The popularity of diphone synthesis also lies in the acceptable number of units for concatenation, which is in the range of about 1200 to 1600 (for the Slovak language). The block diagram of diphone speech synthesis is shown in Fig. 13.

Fig. 13 Block diagram of the diphone-based concatenative speech synthesis

The first step is the selection of the required diphones from the acoustic database. Such a database contains recordings of words that include all the necessary diphones of the target language. The next step is the modification of the prosodic parameters. For this modification it is possible to use several algorithms, such as MBROLA, RELP or TD-PSOLA. The segments with modified parameters are then concatenated to obtain the synthetic speech.

Corpus-based synthesis keeps the advantages of the concatenative approach and suppresses its disadvantages, mainly the robotic character of the speech, which is caused by the concatenation itself. Decreasing the number of concatenations is one of the successful ways to overcome this drawback: using longer concatenated units decreases the total number of concatenation points. In corpus-based synthesis, the synthesizer selects the largest possible units from the database for concatenation; in other words, it works with units of variable length. If a whole sentence can be found in the database, it can be used as the output of the TTS (after some processing). If only a word or phrase to be synthesized is found, that word or phrase is concatenated with the remaining part of the utterance to be synthesized.

Corpus-based synthesis can provide very good quality (hardly distinguishable from human speech) for domain-specific speech synthesis, on condition that the corpus was recorded from in-domain sentences. If out-of-vocabulary words and sentences need to be synthesized, the output quality will be comparable to classical concatenative synthesis (e.g. diphone synthesis).

Statistical parametric synthesis combines the source-filter approach, which uses the electrical model of speech production, with a statistical modeling algorithm. The theory of Hidden Markov Models is the preferred way of statistical modeling, and HMM-based speech synthesis has been gaining popularity in recent years. The basic principle of this method is to use context-dependent HMMs, trained from a speech corpus, as generative models for the speech synthesis process [7]. Fig. 14 shows a block diagram of HMM-based speech synthesis.

Fig. 14. Block diagram of the HMM-based speech synthesis

A speech synthesis system based on HMMs is divided into two parts: the training part and the synthesis part. The training part extracts spectral and excitation parameters from the speech corpus; context-dependent HMMs and duration models are the result.


In the synthesis stage, the trajectories of the speech parameters are generated directly from the trained models, and the final speech waveform is reconstructed using an appropriate synthesis method, e.g. an MLSA filter or STRAIGHT-based vocoding.

There is a large number of TTS systems and tools. Commercial systems are characterized by very good quality of the synthesized speech thanks to large corpora. There is also a group of free tools that make it possible to construct your own speech synthesis system or to process your own corpora and train models. A short description of these tools is given in Section 3.3.3.


3. Technologies for designing ITSS

Speech technologies can be classified into several groups as follows:

 Languages that serve as control scripts for defining the communication in HMIs
 Interfaces that make it possible to connect particular engines, e.g. ASR, TTS or a PBX
 Platforms that provide complete solutions for providing speech-based interactive services
 Tools that make it possible to prepare models, speech engines, systems or services

In all the mentioned categories, a significant effort towards standardization can be observed as a response to earlier problems with compatibility and portability, which increased the cost of developing speech-based services. This process cannot yet be considered finished, because of the ongoing development of the "core" technologies such as speech recognition, speech synthesis, language understanding and generation, and dialogue modeling. Nevertheless, there are several standardized technologies that can be considered successful and have been widely accepted by both the academic and the industrial community. We will discuss them in the following paragraphs and sections.

3.1. Languages for designing ITSS


3.1.1. W3C Speech Interface Framework
Languages grouped in the so-called "W3C Speech Interface Framework" (SIF) can be considered the most important group of languages for writing spoken dialogues. The SIF has been designed by the Voice Browser Working Group (VBWG) of the W3C (the World Wide Web Consortium), an international community developing standards that ensure the long-term growth of the Web. The VBWG focuses on the development of standards for so-called "Voice Browsers". The name of the group reflects the initial idea of bringing technologies that would enable browsing the Internet through spoken communication over the telephone network. Voice browsers were envisaged as systems that allow people to access the Web using speech synthesis, pre-recorded audio and speech recognition through their phone, but during the existence of the VBWG the situation changed rapidly. We can identify two critical changes that influenced this idea, even though several recommendations designed by the group have been successfully implemented in real-life solutions. The first change was the spread of Internet connectivity in mobile devices. This significantly influenced the area of telephony-based services, which was grounded in the idea of browsing the web through spoken dialogue interaction over the telephone network, because the Internet became available directly on mobile devices. The possibility of obtaining, for example, weather information directly through the web browser or a mobile application on the phone seems more comfortable than calling the number of a voice platform and talking with an automaton. The second change was caused by the extension of mobile devices with touchscreens, which radically changed the way of human-computer interaction. This new way of interaction, together with the availability of an Internet connection in such devices, moved interaction from keyboard-based input to touch/touch-gesture input and from the unimodal mode (spoken interaction) to a multimodal one. The new generation of mobile applications often combines touches or touch gestures with other modalities, mainly voice as an input modality and graphics as an output modality. Despite the mentioned changes, the concept of the Voice Browser is still relevant, and its idea can easily be extended to multimodal communication that includes a speech modality. This is one of the main reasons why this idea and the designed recommendations are successful and widely accepted as standards in this area.

The W3C Speech Interface Framework (SIF) is a suite of independent standards designed for the rapid development of speech-based services. They are organized around the VoiceXML standard, which is a dialogue description language (DDL). The framework consists of the following recommendations:

 VoiceXML – a language for creating spoken dialogues. It makes it possible to write dialogue scripts that define the dialogue flow, content and other attributes of the spoken dialogue interaction.
 SRGS (Speech Recognition Grammar Specification) – a recommendation that defines the format of speech grammars. Speech grammars are documents that define the words and phrases that should be accepted by the ASR system.
 SISR (Semantic Interpretation for Speech Recognition) – a recommendation that defines a format for adding semantic markup into SRGS grammars in order to extract a semantic result from the output of speech recognition.
 SSML (Speech Synthesis Markup Language) – a language that defines how to manage the system output in the form of synthetic speech, pre-recorded speech or music. It makes it possible to write instructions that change the properties of the produced synthesized speech.
 PLS (Pronunciation Lexicon Specification) – a language for representing phonetic information for ASR and TTS systems.
 CCXML (Call Control XML) – a language that makes it possible to write rules for the communication between telephony resources and the voice portal/system.
 SCXML (State Chart XML) – a markup language that makes it possible to write state machines for managing sessions, dialogues and other processes.

The following subsections provide a short description of each.

3.1.1.1. Voice eXtensible Markup Language


VoiceXML (Voice eXtensible Markup Language) is a markup language designed for composing voice applications. It makes it possible to write scripts that manage the spoken dialogue interaction between a human and a computer system. Such scripts contain instructions for the dialogue flow, the content and the way of presentation. The power of VoiceXML lies in its simple XML-based syntax, which leads to good readability and maintainability. The wide acceptance of VoiceXML as a standard brings another advantage in the form of good portability of services designed in this language. An example of VoiceXML code can be seen in Fig. 15.

Fig. 15. Example of simple VoiceXML code
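A minimal VoiceXML document of this kind might look as follows (an illustrative sketch, not a copy of the figure; the grammar file name and the submit URL are hypothetical):

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="weather">
    <field name="city">
      <prompt>Which city are you interested in?</prompt>
      <!-- external SRGS grammar restricting the recognizable city names -->
      <grammar src="cities.grxml" type="application/srgs+xml"/>
      <filled>
        <prompt>Getting the forecast for <value expr="city"/>.</prompt>
        <submit next="http://example.com/forecast" namelist="city"/>
      </filled>
    </field>
  </form>
</vxml>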

VoiceXML was not originally designed by the W3C. In 1999, four companies, AT&T, IBM, Lucent and Motorola, established the VoiceXML Forum [8] to design a language that would accelerate the development of voice applications. The first version of the language was introduced in August 1999. The first official version of the VoiceXML language (VoiceXML 1.0), prepared by the VoiceXML Forum, was presented in March 2000. After that, the W3C took over responsibility for the VoiceXML language and started working on its next versions.

Whereas the VoiceXML 1.0 specification included, besides tags (markup) for dialogue description, also tags for call management, speech grammars and speech synthesis, the second version of the language (VoiceXML 2.0) focused only on dialogue description. The markup for call management, speech grammars and speech synthesis control was adopted as the basis of the CCXML, SRGS and SSML languages. VoiceXML 2.0 was released as a W3C recommendation in March 2004. This recommendation became the industry standard in the area of voice services [9].

In June 2007, VoiceXML 2.1 was introduced, which adds a small set of additional features to the second version of the language. Work then started on a new specification (VoiceXML 3.0), based on the concept of three layers: dialog, flow and management. Work on the third version of the language is still in progress.


VoiceXML is the most important language of this group. It is supported mainly by the SRGS, SISR, SSML and CCXML recommendations; as mentioned in the previous paragraphs, it delegates several competences to them. VoiceXML 2.0 was widely accepted in the academic as well as the commercial community and became an international standard for writing spoken dialogue services. We will pay more attention to this standard in other parts of this book.

3.1.1.2. Speech recognition grammar specification


As mentioned above, a small set of markup used in the VoiceXML 1.0 language formed the basis of the Speech Recognition Grammar Specification (SRGS). The SRGS specification brings a language that makes it possible to write context-free grammars for speech or DTMF input.

A grammar can be specified either in XML or in an equivalent augmented BNF (ABNF) syntax. Work on this language started in 1999, and it became a recommendation in March 2004 (SRGS 1.0) [10].

The main advantage of SRGS is a form that is well readable both for designers and for computers. It makes it possible to describe the language structures expected from the user in the current state of the interaction (dialogue). Creating such structures helps the speech recognition system to be more accurate and faster.

The SRGS specification can also describe speech input in the form of natural language utterances, but it does not support stochastic language models (N-grams) directly. The N-gram specification was intended for that purpose, but it was never published as a W3C recommendation and its preparation was discontinued. The power of the SRGS specification lies in its cooperation with the next W3C recommendation, Semantic Interpretation for Speech Recognition (SISR) [11].

The SRGS forms of writing grammars have also been widely accepted as a standard, and a large group of platforms and systems support them. An example of a simple speech grammar in XML form that allows the user to say one of the days of the week is shown in Fig. 16.

Fig. 16. An example of SRGS grammar for entering day of week
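A grammar of this kind might look as follows (an illustrative sketch, not necessarily identical to the listing in Fig. 16):

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="day" mode="voice">
  <rule id="day" scope="public">
    <one-of>
      <item>Monday</item>
      <item>Tuesday</item>
      <item>Wednesday</item>
      <item>Thursday</item>
      <item>Friday</item>
      <item>Saturday</item>
      <item>Sunday</item>
    </one-of>
  </rule>
</grammar>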


3.1.1.3. Semantic Interpretation for Speech Recognition (SISR)


The semantic interpretation specification describes annotations to grammar rules for extracting semantic results from recognition. It provides markup and attributes that can be included in a context-free grammar, so that semantic information can be extracted by interpreting this markup. De facto, it does not really "understand", but it is an acceptable approach to the interpretation of spoken language. This approach can also be used with input utterances in natural language; in that case, the system can be viewed as a keyword-spotting system. It makes it possible to capture keywords in a natural language utterance, assign them a semantic value, and create pairs of keywords and their semantic values. This concept is very powerful in domain-specific, task-oriented voice services, but almost unusable in communication with conversational agents.

Work on this specification started in April 2003, and in April 2007 it became a W3C recommendation.

In comparison with the first two recommendations described above, SISR is often seen as a part of the SRGS recommendation, because SISR specifies only additional elements that can be added into SRGS elements, together with a definition of the interpretation process. In many applications, only the simpler form of semantic content (the string literal tag syntax) is used. An example of a grammar with SISR semantic tags is shown in Fig. 17.

Fig. 17. Speech grammar with SISR tags
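A minimal sketch of a grammar extended with SISR tags, using the string literal tag syntax mentioned above (the yes/no content is illustrative and may differ from the figure):

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="answer" mode="voice"
         tag-format="semantics/1.0-literals">
  <rule id="answer" scope="public">
    <one-of>
      <!-- The <tag> content becomes the semantic result of the rule -->
      <item>yes <tag>true</tag></item>
      <item>of course <tag>true</tag></item>
      <item>no <tag>false</tag></item>
      <item>certainly not <tag>false</tag></item>
    </one-of>
  </rule>
</grammar>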


3.1.1.4. Speech Synthesis Markup Language


The speech synthesis specification (SSML) defines a markup language for prompting users via a combination of pre-recorded speech, synthetic speech and music. It provides a uniform interface between voice platforms and Text-to-Speech engines and enables changing voice characteristics such as gender, speed, volume, etc.

As in the case of the SRGS specification, the design of the Speech Synthesis Markup Language (SSML) started in 1999 and led to the first version of the language, SSML 1.0, which became a W3C recommendation in September 2004 [12]. Although a number of TTS systems and platforms support SSML-based input, SSML is not as widespread as VoiceXML or SRGS. SSML has found its application rather as an input format for reading large texts by TTS systems. In spoken dialogue systems or user interfaces, where TTS systems often read only separate short utterances, the SSML format tends to be too verbose. An example of text input in SSML form, which contains instructions for reading information about e-mails, can be seen in Fig. 18.

Fig. 18. An example of SSML document
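A minimal sketch of an SSML document for reading e-mail information (the message content is invented for this illustration; the figure may contain a different text):

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <s>You have <emphasis>three</emphasis> new e-mail messages.</s>
    <s>
      The first message is from Anna Smith.
      <break time="300ms"/>
      Subject: <prosody rate="slow">project meeting</prosody>.
    </s>
    <s><voice gender="female">End of the message list.</voice></s>
  </p>
</speak>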

3.1.1.5. Pronunciation Lexicon specification


Pronunciation lexicons describe phonetic information for use in speech recognition and synthesis. The requirements were first published on March 12, 2001, and updated on October 29, 2004. The pronunciation lexicon is designed to enable developers to provide supplemental pronunciation information for items such as place names, proper names and abbreviations. The W3C recommendation was published in October 2008 [13]. Such a lexicon can be used both by automatic speech recognition systems and by text-to-speech systems. Support for this recommendation is still limited.


Fig. 19. An example of PLS document

The example in Fig. 19 illustrates a PLS document that defines the pronunciation of the names Newton and Scahill using the International Phonetic Alphabet (IPA). Fig. 20 shows a speech grammar that refers to the pronunciation lexicon in this PLS document (see line 9 of the figure).
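A sketch of what such a PLS document can look like (the transcription of the name Scahill is only illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Newton</grapheme>
    <phoneme>ˈnjuːtən</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Scahill</grapheme>
    <!-- illustrative transcription -->
    <phoneme>ˈskɑːhɪl</phoneme>
  </lexeme>
</lexicon>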

Fig. 20. An example of SRGS grammar with PLS document referencing

3.1.1.6. Call Control eXtensible Markup Language


The W3C designed the Call Control eXtensible Markup Language (CCXML) to enable fine-grained control of speech (signal processing) resources and telephony resources in a VoiceXML telephony platform. CCXML is designed to manage resources in a platform on the edge of the telecommunication network. It can handle actions like call screening, call waiting/answering and call transfer. Requirements for the language were prepared in April 2001 and the language now has the status of a W3C recommendation (July 2012) [14]. This specification brings a very important unification into call traffic handling, given the large range of telephony hardware producers. It relieves voice service designers from dealing with hardware-specific application interfaces and gives them a high-level interface in the form of the CCXML language. Almost all telephony platforms support the CCXML recommendation, which became an international standard. A fragment of CCXML code, which answers an incoming telephone call and then connects it to a VoiceXML dialog, is shown in Fig. 21.

Fig. 21. Fragment of CCXML code for managing the phone connection
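A minimal sketch of such a CCXML fragment (the document name weather.vxml is only a placeholder):

<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <!-- An incoming call is signalled: answer it -->
    <transition event="connection.alerting">
      <accept/>
    </transition>
    <!-- Once the call is connected, start a VoiceXML dialog on it -->
    <transition event="connection.connected">
      <dialogstart src="'weather.vxml'"/>
    </transition>
    <!-- When the caller hangs up, end the CCXML session -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>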

3.1.1.7. State Chart eXtensible Markup Language


State Chart XML (SCXML), or the State Machine Notation for Control Abstraction, is the last part of the Speech Interface Framework. SCXML is a candidate for the control language within VoiceXML 3.0, the future version of CCXML, and the multimodal authoring language. Its development started in July 2005 and currently the W3C Last Call Working Draft [15] has been published, a document status one step before W3C recommendation. SCXML provides an XML-based language for writing state machines that define control mechanisms for various tasks, one of which can be the management inside a voice platform.
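As an illustration, a minimal SCXML state machine might describe the control logic of a simple voice platform as follows (the state and event names are invented for this sketch):

<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="idle">
  <!-- Waiting for an incoming call -->
  <state id="idle">
    <transition event="call.incoming" target="dialog"/>
  </state>
  <!-- A dialog is running; return to idle when it finishes -->
  <state id="dialog">
    <transition event="dialog.done" target="idle"/>
    <transition event="error" target="failed"/>
  </state>
  <final id="failed"/>
</scxml>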

3.1.1.8. The W3C-based architecture of the voice browser


The W3C Speech Interface Framework languages find their main use in voice browsers as well as in spoken dialogue systems. The languages of the SIF determine the key ideas about the cooperation between voice browser components; de facto they determine the architecture of such a system. At the end of 1999, the Voice Browser Working Group published a working draft of the document "Model Architecture for Voice Browser Systems" [16]. The authors of the document state that they only wanted to illustrate one possible solution of the voice browser architecture and that other types of architecture can be adopted for the implementation of the Speech Interface Framework languages. Interfaces between the voice browser's components were not specified directly; the languages within the SIF determine the way of communication between them, which is the main advantage of the Speech Interface Framework. The model architecture of the voice browser from the document mentioned above, redrawn into a well-arranged form in [1], with a stand-alone I/O component, is displayed in Fig. 22.

Fig. 22 Voice Browser Architecture

The shaded ellipses represent SIF languages that should be supported by the voice browser components. The main components of the browser are the dialogue manager, the automatic speech recognition system (ASR), the text-to-speech system (TTS) and the input/output component. The NLU component is responsible for extracting the meaning from the user's input. The DTMF processor handles DTMF input, and both types of input (speech/DTMF) are finally processed in the interpreter. On the output side, there are the media planner component and the audio output block. A prosodic processing block is often part of the TTS system.

It can be concluded that the W3C voice browser architecture is similar to the architecture of a general spoken dialogue system. Real voice platforms implement this architecture with platform-dependent changes. Natural language understanding and interpretation are often included directly in the dialogue manager. The situation is similar with the media planner, because the VoiceXML language performs media planning inside the VoiceXML code. As we concluded in the previous section, not all recommendations were widely accepted in the community of voice platform and service developers. However, almost all commercial platforms support the VoiceXML and SRGS recommendations.

3.1.2. W3C Multimodal Interaction


As we mentioned earlier in this book, interaction is nowadays moving from unimodal communication to multimodal interaction, which relates to the availability of devices that are capable of interacting using several modalities. Such devices integrate cameras, microphones, depth sensors, touchscreens and displays, and they include software for movement tracking, localization, face and body detection, eye-tracking, long-distance speech recognition, gesture recognition, or speech synthesis. There is also another group of hardware-software solutions that are constructed to have a human embodiment and try to interact in a human-like way. We are referring to artificial agents (avatars) and humanoid robots. Due to the large complexity of such systems, many technologies are involved, which use a large group of languages, standards and recommendations. This group defines ways of managing the movement of avatars/robots, the acquisition and interpretation of modalities, the fusion and fission of modalities, the representation of their meaning, behavior management and description, and others.

From this large group of languages, we have selected those developed by another W3C working group, named W3C Multimodal Interaction. The suite of specifications designed by this group is known as the W3C Multimodal Interaction Framework and currently consists of:

 Multimodal Architecture and Interfaces
 Extensible Multi-Modal Annotations (EMMA) specification
 InkML - an XML language for digital ink traces
 Emotion Markup Language (EmotionML) 1.0

3.1.2.1. Multimodal Architecture


The W3C Multimodal Architecture and Interfaces specification became a W3C recommendation in 2012 [17]. It describes the architecture of the Multimodal Interaction (MMI) framework (MMIF) and sketches the interfaces between its components. The aim of the MMI Working Group was to provide a general and flexible framework that increases the interoperability and portability of MMIF components. Rather than imposing strong restrictions on the multimodal architecture components, the authors focus on defining the basic infrastructure for controlling applications and platform services, as well as general means for interaction.

This recommendation defines a list of the basic constituents of the MMI architecture and a run-time architecture diagram. The following components are identified as the basic constituents of the MMI architecture:

 The Interaction Manager, which coordinates the different modalities.
 The Data Component, which provides the common data model.
 The Modality Components, which provide modality-specific interaction capabilities.
 The Runtime Framework, which provides the basic infrastructure and enables communication among the other Constituents.

The listed constituents are arranged in the run-time architecture as shown in Fig. 23. We can see that, in comparison to the W3C voice browser architecture (Fig. 22), the MMI architecture is described at a higher level of abstraction, due to the greater variability of such an architecture caused by the involvement of various modalities, interaction scenarios and target platforms and devices.


Fig. 23. Run-Time MMI Architecture Diagram with examples of components realization

To make this abstraction more concrete, a possible arrangement is shown in the next example, where existing languages are associated with particular constituents.

 CCXML could be used as both the Controller Document and the Interaction Manager
language, with the CCXML interpreter serving as the Runtime Framework and
Interaction Manager.
 SCXML could be used as the Controller Document and Interaction Manager language
 HTML could be used as the markup for a Modality Component.
 VoiceXML could be used as the markup for a Modality Component.
 SVG (Scalable Vector Graphics language) could be used as the markup for a Modality
Component.
 SMIL (Synchronized Multimedia Integration Language) could be used as the markup
for a Modality Component.

Together with the architecture, the interfaces between its constituents also have to be defined. The interfaces defined in the described architecture are based on events, mostly on request/response pairs. Communication between the components and the interaction manager is asynchronous.

3.1.2.2. Extensible MultiModal Annotation markup language


The Extensible MultiModal Annotation markup language (EMMA) defines a data exchange format for the interface between input processing components (e.g. speech or handwriting recognizers, NLU, fusion modules) and the interaction management in multimodal and voice-enabled systems. It makes it possible to annotate the semantics of recognized input with information such as confidence scores, time stamps, and input mode classification (e.g. touch, speech, or pen). EMMA also allows representing alternative recognition hypotheses.


EMMA 1.0 has been a W3C recommendation since 2009 [18]; a new version, 1.1, which will incorporate new features, is being prepared.

Fig. 24. An example of EMMA document

An example of an EMMA document is shown in Fig. 24. Such XML code can be generated by the NLU block as a description of the recognition result and its semantics. The pair of <one-of> markups (lines 8 to 21 of the figure) encapsulates two hypotheses (interpretations). Each interpretation is enclosed in <emma:interpretation> tags (lines 11 to 14 and 16 to 19). The analyzed utterance "from Košice to Budapešť" is included in the emma:tokens attribute. Inside the interpretation blocks, two XML markups are located (<origin> and <destination>), which contain the extracted semantic information about the analyzed tokens. The interpretation blocks also provide information about the confidence scores of both interpretations (the emma:confidence attribute).
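A minimal sketch of an EMMA document with the structure described above (the confidence values and the application namespace are illustrative, and the line numbering of the original figure does not apply here):

<?xml version="1.0" encoding="UTF-8"?>
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
           xmlns="http://www.example.com/travel">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice"
               emma:tokens="from Košice to Budapešť">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <origin>Košice</origin>
      <destination>Budapešť</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.45">
      <origin>Košice</origin>
      <destination>Budapest</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>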

Ink Markup Language

Ink Markup Language (InkML), an XML language for digital ink traces, provides a range of
features to support real-time ink streaming or richly annotated ink archival [19]. InkML
enables a complete and accurate representation of digital ink. It allows describing the pen
position over time, information about device characteristics and detailed dynamic behavior to
support applications such as handwriting recognition and authentication. InkML reached the W3C recommendation stage in 2011.

A simple example of InkML code can be found in Fig. 25. Five traces are described using InkML <trace> markups. Each trace is described by a sequence of X and Y value pairs that define the points of the particular trace. On the right side of the figure, the rendered text described by the example code is shown.


Fig. 25. An example of InkML code and rendered text (Source: http://www.w3.org/TR/2011/REC-InkML-20110920/)
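A minimal sketch of such InkML code (only three short traces with arbitrary coordinates are shown here; the figure contains five):

<?xml version="1.0" encoding="UTF-8"?>
<ink xmlns="http://www.w3.org/2003/InkML">
  <!-- Each trace is a sequence of "x y" point pairs sampled along one pen stroke -->
  <trace>10 0, 9 14, 8 28, 7 42, 6 56</trace>
  <trace>25 20, 25 34, 25 48</trace>
  <trace>18 27, 32 27</trace>
</ink>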

3.1.2.3. Emotion Markup Language


With the spread of human-like human-machine interfaces, our expectations towards HMIs are also rising. We expect human-like behavior, which includes recognition and understanding of the user's emotions as well as generation of emotions by machines. To support the use of emotions in HMI, a common language for describing emotions is required. The Emotion Markup Language (EmotionML) 1.0 reflects these needs [20]. It makes it possible to represent emotions and related states for HMI applications. It can be used for manual annotation of data, automatic emotion recognition, and generation of emotion-related system behavior.

Fig. 26. Simple example of EmotionML inside an EMMA document (Source: http://www.w3.org/TR/emotionml/#s5.2)

Fig. 26 offers a simple example of an emotion description using the EmotionML language embedded in an EMMA document. In this example, a non-verbal vocalization was analyzed (emma:mode="voice" emma:verbal="false"). The emotion conveyed by this non-verbal expression is described in the pair of <emotion> and </emotion> markups on lines 6 to 8. The properties of the detected emotion are described by the attributes of the <category> element: the emotion category "bored" was recognized with an intensity of 0.1 and a confidence score of 0.1.
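A minimal sketch of such an EmotionML annotation embedded in an EMMA interpretation (the vocabulary reference and the numeric values follow the description above; the exact figure content may differ):

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
           xmlns="http://www.w3.org/2009/10/emotionml">
  <emma:interpretation id="int1" emma:mode="voice" emma:verbal="false">
    <emotion category-set="http://www.w3.org/TR/emotion-voc/xml#everyday-categories">
      <!-- detected emotion with its intensity (value) and confidence -->
      <category name="bored" value="0.1" confidence="0.1"/>
    </emotion>
  </emma:interpretation>
</emma:emma>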

EmotionML can also be used together with the SSML or SMIL languages. In the case of SSML, it can serve to influence the emotion delivered to the user through the synthetic voice. A simple example of EmotionML usage in an SSML document is presented in Fig. 27.
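A hedged sketch of how an EmotionML annotation might be embedded in an SSML document (the integration shown here follows the informative examples of the EmotionML specification and is not a normative recipe):

<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:emo="http://www.w3.org/2009/10/emotionml" xml:lang="en-US">
  <s>
    <!-- the requested emotional colouring of the following sentence -->
    <emo:emotion category-set="http://www.w3.org/TR/emotion-voc/xml#big6">
      <emo:category name="happiness"/>
    </emo:emotion>
    Nice to see you again!
  </s>
</speak>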



Fig. 27. Example of including an EmotionML description in SSML code (Source: http://www.w3.org/TR/emotionml/#s5.2)

3.1.3. Speech Application Language Tags


When describing languages that enable the preparation of speech-based HMI systems, we cannot forget to mention the SALT language (Speech Application Language Tags) as an alternative to VoiceXML, developed by the SALT Forum founded by Microsoft in 2001. The SALT language defines a group of markup elements that make it possible to integrate spoken dialog into existing services. It also enables multimodal access to information through applications or web services. It was designed as an extension to web languages such as HTML, XHTML or XML. SALT consists of a small set of XML markups, similar to VoiceXML. SALT has not been able to achieve the popularity of the VoiceXML language and remained alive only thanks to Microsoft support in Internet Explorer and the Microsoft Speech Server (MSS). However, MSS started to support the VoiceXML language in 2006 and SALT became only the secondary option.

3.2. Interfaces in interactive telecommunication systems
Defining inter-component interfaces is equally important as the languages for designing speech-based human-machine interfaces. The importance of defining standards for inter-component interfaces lies in the complexity of such systems, which require the cooperation of several complex technologies. There are many vendors and research teams that focus on only one technology from the variety of speech technologies. To be able to join the required components together, there is only one possible way – standardization of interfaces.

In the following text, we briefly introduce the four most important inter-component interfaces in the area of speech-based systems. They are presented in chronological order.

3.2.1. Java Speech API


The history of the Java Speech API started in 1998, when Sun Microsystems published the first public draft of the Java Speech API (JSAPI). JSAPI is an application programming interface (API) that helps to integrate speech technologies such as speech recognition, dictation or text-to-speech into Java-based applications. It defines control commands and other control mechanisms for the speech recognition and speech synthesis process. JSAPI defines only an interface, not an implementation (there are a few implementations created by third parties, e.g. FreeTTS or implementations of JSAPI 2.0).

The Java Speech API consists of two languages (the Java Speech API Markup Language and the Java Speech Grammar Format) and of JSAPI classes and interfaces.

The Java Speech API Markup Language (JSML) is a text format that can be used to annotate text input for text-to-speech engines. It defines elements that describe document structure, word pronunciation, voice pitch, emphasis, speech rate and other attributes of the produced synthetic speech. While the first version of JSAPI (1.0) used the JSML language, the newer release 2.0 adopts the W3C SSML standard.

The Java Speech Grammar Format (JSGF) is a platform-independent, text-based speech grammar format for the speech recognition process. It is a predecessor of the SRGS grammar specification; the grammar markups included in the VoiceXML 1.0 specification were directly derived from JSGF. Besides JSGF, JSAPI 2 supports the more popular SRGS format, but unfortunately it does not support dictation.

Java Speech API offers three packages of classes. javax.speech offers classes and interfaces
for a generic speech engine. javax.speech.synthesis contains classes and interfaces for
speech synthesis and javax.speech.recognition holds classes and interfaces for speech
recognition.

The Java Speech API is based on event-handling. Events generated by the speech engine can
be identified and handled as required. Speech events can be handled through the
EngineListener interface, and more specifically through the RecognizerListener and the
SynthesizerListener.

3.2.2. Microsoft Speech API


One of the most important application interfaces is the Microsoft Speech API (MS SAPI), because it makes it possible to integrate speech technologies into the Windows environment and to use Microsoft's embedded speech engines in applications written for MS Windows. Using MS SAPI, speech technologies can be integrated with MS Office applications or other applications.

Fig. 28. Architecture of MS SAPI


MS SAPI defines an interface (middleware) between applications and speech engines, as is illustrated in Fig. 28. Such an arrangement brings a significant simplification to the development of speech-based interfaces, because it reduces development time by using only higher-level programming methods.

The MS Speech API consists of two interfaces – an application-level interface and an engine-level interface. The first serves application developers so that they can easily use speech engines in their applications. On the other side, the engine-level interface makes it possible to extend a newly designed speech engine with the MS SAPI interface, which makes such an engine easily usable in the MS Windows environment or with applications that support MS SAPI.

MS SAPI is delivered both as a part of the Speech SDK and as a part of each MS Windows distribution. Over time, several MS SAPI versions have been released (the current version is 5.4). In contrast to earlier versions, SAPI 5 uses a completely different interface, where applications and engines cannot communicate directly; they communicate through the SAPI runtime component.

Besides the API definition and the runtime component, MS SAPI is accompanied by the following components, which together make up the Speech Software Development Kit (Speech SDK):

 Control Panel configuration utility for selecting and configuring speech recognizer and
synthesizer.
 Text-To-Speech engines for most common languages.
 Speech Recognition engines for most common languages.
 Redistributable components, which allow developers to package the engines and
runtime with their application code.
 Sample application code.
 Sample engines - implementations of the necessary engine interfaces but with no true
speech processing.
 Documentation.

3.2.3. Media Resource Control Protocol


The Media Resource Control Protocol (MRCP) is another type of inter-component interface, focused on data exchange and communication inside distributed platforms providing speech services. It makes it possible to control devices (resources) such as speech synthesizers, speech recognizers and others through a telecommunication network. For that purpose it involves two important protocols – RTSP (Real Time Streaming Protocol) and SIP (Session Initiation Protocol) – to control the connection and the streaming from/to external devices. It also makes it possible to implement distributed SDS platforms, which can use VoiceXML. MRCP defines requests, responses and events that are required for resource management. It also defines a state machine for each device (resource) and the transitions between states for each request. The current version of MRCP is 2.0. The general client-server architecture of MRCP-based communication is shown in Fig. 29.


Fig. 29. General architecture of MRCP-based communication

MRCP was developed by Cisco Systems, Inc., Nuance Communications and SpeechWorks, Inc., and it became an industrial standard for inter-component communication in speech-based platforms. It has been adopted by a wide range of products, such as IBM WebSphere Voice Server, Microsoft Speech Server, LumenVox Speech Engine or Nuance Recognizer.

3.2.4. Web Speech API


In February 2013, Google released a new version of the Chrome web browser (v. 25) with support for the new Web Speech API. The Web Speech API is a JavaScript API that makes it possible to integrate speech recognition technology into web pages and applications.

The speech recognition provided by Google supports the largest set of languages in comparison with other providers of speech recognition. Moreover, Google's speech recognition technology works very well. It uses a client-server solution, which means that the audio signal is captured on the user's device (PC, notebook, tablet, smartphone) and then sent to Google servers, where speech recognition is performed and the recognition result is returned back to the user (through the web). The Web Speech API, together with the new version of HTML (HTML5), makes it possible to access Google's speech recognition technology from the web browser. HTML5 together with Google speech recognition can be considered a critical milestone in the domain of speech interfaces; it brings speech technologies closer to everyday use. A more detailed description of the mentioned technologies will be provided later in this book.

3.3. Tools and Solutions


Due to the high complexity of the particular speech technologies, it is often not feasible to design them from scratch. Fortunately, there are many more or less free tools that can be helpful when designing and developing speech-based interfaces. We divide such tools into the following categories:


 Speech recognition tools
 Language modeling tools
 Speech synthesis tools
 Dialog manager tools
 VoIP software gateways

3.3.1. Speech Recognition toolkits


There are a few popular speech recognition toolkits or systems that can be used to prepare resources for speech recognition or to build a complete ASR system. We can divide these tools into two categories. The first category consists of open-source and free toolkits, which make it possible to process corpora, train acoustic models and prepare a complete ASR system. Moreover, their source codes can be modified to optimize or adapt them for a concrete solution or for research purposes. The most popular toolkits in this category are HTK/ATK, Julius and Kaldi. ASR systems that can be used only as they are fall into the second category. They are usually provided free for non-commercial use, but often only in a server-client mode, where the ASR system is located on a remote server and the user sends voice through a predefined interface and obtains the related text back. The two biggest companies providing such recognition systems are Google and Microsoft.

3.3.1.1. HTK/ATK
The Hidden Markov Model Toolkit (HTK), together with the Application Toolkit (ATK), is a set of tools that can be used to carry out the complex process of preparing and manipulating hidden Markov models (HMMs) and to build complete ASR systems and applications. It was developed at the Machine Intelligence Laboratory of the Cambridge University Engineering Department (CUED). In 1993, Entropic Research Laboratory acquired the rights to HTK, and in 1999 Microsoft obtained HTK and licensed it back to CUED, which provides it via the HTK3 web site.

HTK offers a set of modules and tools in the form of C libraries. These libraries perform speech analysis, HMM training, testing and result analysis. HTK also contains two recognition engines, HVite and HDecode, which are primarily intended as testing tools. In comparison with other toolkits, HTK provides extensive, helpful documentation and many examples. The success of the HTK tools is demonstrated by the large number of ASR engines built on HTK.

Together with HTK, the ATK Real-Time API for HTK (ATK) is provided. It makes it possible to build experimental applications based on HTK; it adds a C++ layer on top of the HTK libraries to help build working systems easily. ATK supports:

 Multi-threading to allow efficient and responsive real-time running
 Synchronized audio input/output with barge-in support
 Finite-state grammars and trigram language models
 Ability to return recognition results word-by-word as they are recognised to reduce latency


 N-best recognition output
 Support for HLDA
 Integrated Flite speech synthesis
 Make files for single-build under Linux and Windows

3.3.1.2. Julius
Another popular recognition tool is Julius, an open-source large vocabulary CSR engine, which has been developed since 1997 within several projects; currently its development is continued by the Interactive Speech Technology Consortium (ISTC).

"Julius" is a high-performance, large vocabulary continuous speech recognition (LVCSR)


system (decoder). It offers real-time speech recognition based on word N-gram language
models and context-dependent HMM acoustic models. Julius adopts standard formats to
support with other free modeling toolkit such as HTK. Julius is distributed with open license
together with source codes.

3.3.1.3. Kaldi
Kaldi is a toolkit for speech recognition, similar to HTK. Its development started in 2009 at Johns Hopkins University. Kaldi is written in C++ and is provided under the Apache License v2.0. It is focused on providing modern and flexible code that is easy to modify and extend. One of its important features is the support of finite state transducers (FSTs).

3.3.1.4. Windows Speech recognizer


Microsoft offers the Windows Speech Recognition system (version 8.0) in its operating systems (Vista, Windows 7, 8). Microsoft's ASR system currently supports English, French, Spanish, German, Japanese, Simplified Chinese, and Traditional Chinese. Support of a concrete language is tied to the corresponding localization of Windows. The recognizer is available through the MS SAPI interface, which makes it possible to use it from Windows applications. Microsoft provides tools for integrating its functions in Microsoft Visual Studio.

3.3.1.5. Google server-based ASR


Google provides speech recognition functionality through a web interface that uses the JavaScript Web Speech API together with the new HTML5 standard. Google's speech recognition is a server-client solution, where the ASR system is located on Google's servers and users only send their voice to the server through the web interface. Google's speech recognition system can be considered one of the best free recognition tools, with the broadest possible support of languages. The need for online connectivity can be seen as one of the drawbacks of using Google's ASR. On the other hand, this solution ensures that the latest version is always used, without the need for updates.


3.3.2. Language modeling toolkits


Language models are one of the resources required for automatic speech recognition. Deterministic language models in the form of speech grammars are often created manually in a text editor or in special grammar builders. Statistical language models require training on a text database. There are several training tools that can be used, but three most popular toolkits can be identified: the SRI Language Modeling Toolkit (SRILM) [21], the MIT Language Modeling Toolkit (MITLM) [22] and the IRSTLM Toolkit [23].

The most popular set of tools for statistical language modeling is the SRI Language Modeling
Toolkit, which contains tools for training, estimation, evaluation, combination and adaptation
of various language models (N-gram models). It also contains tools for the segmentation and labeling of corpus data.

Another popular tool is the MIT LM Toolkit, which, in comparison with SRILM, also provides other adaptation and LM combination techniques. The IRSTLM Toolkit adds only a pruning technique based on the weighted difference method.

All of the mentioned toolkits support the standard ARPA format for language models.

3.3.3. Speech synthesis toolkits


TTS systems can be divided into two groups – commercial systems and free systems.

Commercial TTS systems focus mainly on the world's major languages, such as English, French, Italian, Spanish and others, but recently the situation has been improving and smaller languages are becoming supported as well (e.g. Google Translate uses a TTS system that supports Slovak). Moreover, almost all commercial TTS systems offer high-quality speech output. The TTS systems IVONA, CereProc or NeoSpeech can be considered a representative group of high-quality commercial TTS systems.

On the other side, there is a group of free tools or systems that can be used to prepare synthetic voices or a complete text-to-speech system. Such systems are usually available under a license that allows non-commercial use (systems with a more permissive license also appear from time to time). These systems are widely used in the academic community for research and development.

The group of the most popular TTS toolkits includes Festival, HTS, MARY TTS and MBROLA.

3.3.3.1. Festival
Festival is a toolkit that offers a framework for creating text-to-speech systems and preparing synthetic voices. It supports multiple languages and has been used to build many TTS systems in several languages. It was developed at the Centre for Speech Technology Research at the University of Edinburgh.


Festival can be used in three usage scenarios:

 Direct use of the Festival TTS system and its free voices (English or Spanish) to generate synthetic speech
 Use of Festival to create a TTS system for your own application
 Use of Festival for research and development of new TTS methods and voices

Festival supports three synthesis methods – diphone concatenative synthesis, corpus-based synthesis and HMM synthesis (the statistical parametric approach). It also contains the Festvox tool for creating new synthetic voices.

A TTS system and voices prepared using Festival can also be used in the more compact and faster tool Flite.

3.3.3.2. HMM-based Speech Synthesis System - HTS


The HMM-based Speech Synthesis System (HTS) is a toolkit that offers several important tools for training models for synthetic voices and for the TTS system itself, based on the principles of parametric speech synthesis using HMM models. It has been developed mainly by the HTS working group. The toolkit has been implemented as a modified version of the HTK toolkit (Hidden Markov Model Toolkit), which is a toolkit for training acoustic models for speech recognition and for developing ASR systems.

The HTS toolkit makes it possible to train models for the parametric speech synthesis method. The training process can be seen as a process of setting the parameters of context-dependent models (HMMs) according to the labeled speech recordings that enter this process. Such models are then used to generate the parameters of the speech synthesizer.

The HTS toolkit does not contain tools for input text analysis and processing; therefore, it should be used with other systems that offer such processing (e.g. Festival, MARY TTS, Flite, etc.).

3.3.3.3. MARY TTS


MaryTTS is another popular toolkit or platform for creating TTS systems. It was originally developed in the Language Technology Lab at DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz GmbH / German Research Center for Artificial Intelligence) and at the Institute of Phonetics at Saarland University. Currently, work on this platform continues under the Multimodal Speech Processing Group in the Cluster of Excellence MMCI and DFKI.

MaryTTS now offers a TTS engine that supports German, British and American English, French, Italian, Swedish, Russian, Turkish, and Telugu. Moreover, MaryTTS offers a toolkit that makes it easy to add a new language and to build a TTS system based on either the unit-selection method or the HMM-based method. The MaryTTS system has been integrated into a large number of HMI systems by creating a connector (an interface) to the MaryTTS engine.


3.3.3.4. MBROLA
The MBROLA project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons (Belgium), comes with another concept. Its main idea lies in collecting speech synthesis resources for a large group of languages and providing TTS systems developed from these resources to the research community. Supporting research in the area of speech synthesis can be seen as the main goal of this project. MBROLA offers a diphone speech synthesizer, also called MBROLA, based on concatenation, which requires as its input sequences of phonemes together with prosodic information in order to produce synthetic speech. That means the MBROLA synthesizer does not accept raw text as its input and requires another text preprocessing system that performs the text-to-phoneme conversion.

The creation of a new voice requires an agreement between the MBROLA authors and the author of the new voice. The process of voice creation then has two phases. In the first step, the author of the new voice creates a database of diphones. This database is then sent to the MBROLA project, where it is processed and the MBROLA system is adapted to the new voice. The result of this process is support for the new language or voice, which is freely available for non-commercial purposes.

The creation of the diphone database consists of three steps. Text corpus preparation is the first step: the text corpus for a diphone concatenation TTS system has to consist of the list of all diphones in the desired language and of a set of specific sentences constructed to cover all diphones. Then, the sentences from the text corpus have to be recorded by the speaker in studio quality. Corpus segmentation, which means diphone labeling, is the last step.

3.3.4. Dialog manager tools


The situation with available dialogue manager tools and modules is worse in comparison with the availability of speech recognition or speech synthesis toolkits. The dialogue manager is often an integral part of an SDS or a platform. There are only a few dialogue managers that can be considered usable for incorporation into one's own solution. This situation is also caused by the fact that there is still no consensus about the preferred approach to dialogue management. Although VoiceXML became an industrial standard and has been used for dialogue management in many commercial platforms, it can be used only in a limited area of applications with well-structured, domain-specific dialogs. Platform-specific approaches can be observed when more complicated dialogues are expected. With respect to the discussed issues, three freely available representatives will be described – JVoiceXML, RavenClaw and TrindiKit.

3.3.4.1. JVoiceXML
The JVoiceXML interpreter is almost the only freely available non-commercial VoiceXML interpreter; it is developed by the team around Dirk Schnelle-Walka. JVoiceXML is written in Java and provides an open architecture. It implements the Java interfaces JSAPI and JTAPI (Java Telephony API). JVoiceXML supports the VoiceXML 2.0 and 2.1 W3C recommendations.


Unfortunately, even this implementation of VoiceXML does not support all VoiceXML markups.

3.3.4.2. RavenClaw
The RavenClaw dialogue manager represents another approach to dialogue management. It is a plan-based or agenda-based system that uses two data structures for dialogue management – the task tree and the agenda. The task tree represents a plan for performing domain-specific tasks. The agenda has the form of an ordered list of agents and serves to join inputs with an appropriate agent in the task tree. The operation of the RavenClaw system is driven by the input (the semantic representation of the incoming input). When a concrete input occurs, the corresponding agent, which processes it, is activated. RavenClaw can also be used for managing the dialogue interaction in multimodal systems, because it does not depend on the type of the input; it relies on semantic concepts. The system was designed by Rudnicky and Bohus (see [24]).

The RavenClaw dialogue manager is an integral part of the RavenClaw/Olympus architecture, which is a complete spoken/multimodal interaction system developed at Carnegie Mellon University (CMU).

3.3.4.3. TrindiKit
The TrindiKit toolkit is based on the idea of "information states" proposed by Traum et al. [25] and Larsson and Traum [26]. It was developed within three projects – TRINDI, SIRIDUS and TALK – and was funded by the Centre for Language Technology (CLT) at the University of Gothenburg.

A TrindiKit system has two main components – an information state representation and a dialogue move engine (DME), which updates that state according to the observed dialogue moves. An appropriate dialogue move is generated as the output of the system. The information state here is a structure where the agent stores the information necessary for task completion. Dialogue moves depend on information state changes. TrindiKit, rather than a complete architecture, specifies formats for the definition of information states, update rules, dialogue moves, and associated algorithms. To build a dialogue move engine, the update rules, moves and algorithms, as well as the internal structure of the information state, have to be defined.

3.3.5. VoIP software gateways


The Asterisk VoIP gateway can be considered the most popular and widely used free and open-source VoIP framework for building applications with a telephony interface. The Digium company sponsors its development. Asterisk makes it possible to build IP PBX systems, VoIP gateways, conference servers and others. Asterisk is well suited for IVR solutions, because it supports audio playback and recording, digit collection, database and web service access, and optional speech recognition and synthesis. IVR applications can be built using the Dialplan language or through the Asterisk Gateway Interface and can integrate with virtually any external system. Low costs and open source code are the benefits of using Asterisk.


4. Designing VoiceXML-based voice services

This chapter focuses on the design process of voice services based on the VoiceXML language. Basic design principles that improve the usability and perceived quality of the service will be discussed first. Then a short introduction to writing VoiceXML applications and speech grammars according to the W3C SRGS and SISR recommendations will be offered. At the end of the chapter, an example of a VoiceXML application with its source code will be described.

4.1. Basic principles


A very important question is: "What does it mean to design a voice service?" We can say that it means to design:

 dialogue structure and flow (call flow)
 system prompts (questions) and expected user responses
 speech grammars, which delimit the allowable user utterances
 access to information databases and other resources

Each of the listed items has its own rules that should be considered during the development process.

4.1.1. Dialogue structure and flow


The dialogue structure and dialogue flow are determined mostly by the information that has to be acquired through the dialog and by the structure of the information that will be provided to the user.

The first step in the design process should be an analysis of the user scenarios and of the task that should be completed through the dialogue interaction. The client who plans to provide the designed voice service should formulate the main goal of the desired service. Then, the values that need to be obtained from the user in the dialogue interaction should be identified, together with the type of the presented information. The dialogue should be constructed in such a way that it is able to collect all the values necessary for task completion.

In this stage, several decisions have to be made. Such decisions relate to the dialog strategy, error recovery, or the selection of the confirmation strategy.

4.1.1.1. Dialogue strategies


There are three different dialogue strategies:

 Strategy with the initiative on the side of the system (system-initiative dialog)
 Strategy with the initiative on the side of the user (user-initiative dialog)
 Strategy with the mixed initiative (mixed initiative dialog)


A system-initiative dialog can be characterized as an interaction where the system asks questions and the user answers them. Such a dialogue strategy usually leads to successful task completion without an increased cognitive load on the user, but it often makes the dialogue too long and tiresome. It can be considered the most primitive interaction pattern; it has the lowest requirements on modules such as ASR, TTS and NLU and leads to good reliability of the service. A system-initiative dialogue interaction can look like the interaction in the next example:

System: Welcome to the Weather forecast service.


S: For what city do you want to obtain Weather forecast?
User: Košice
S: For what day?
U: Friday
S: Did you select city Košice and day Friday?
U: Yes
S: …

The opposite scenario occurs in the case of a dialog with the initiative on the side of the user. In this case, the user asks questions and the system answers them (see the example dialog below). Such an interaction scenario places high demands on the system's components, mainly the ASR and NLU units, because of the high variability of the possible user utterances. This is also the reason why dialogues with user initiative are less successful in real-life applications.

System: Welcome to the Weather forecast service.


U: Hello, I would like to hear Weather forecast for Caracas.
S: Ok
U: for Friday.
S: Forecast for Caracas and Friday.
U: Yes, right.

Mixed-initiative dialogues best model human-human dialogues. Here, the dialog initiative moves from one participant to the other according to the needs of the interaction. The VoiceXML language supports mixed-initiative dialogues in a specific, limited form, which also puts limited requirements on the voice platform modules. VoiceXML's mixed-initiative strategy enables the system to ask the user a so-called "How may I help you?" question, which the user can answer with a more natural utterance. The user can summarize in the answer all the information that is required for task completion. If some values are missing, the system takes the initiative and asks the user in order to collect the missing information. Such a mixed-initiative dialog can look like the following one:


System: Welcome to the Weather forecast service.


S: How may I help you?
U: Hello, I need Weather forecast for Caracas, Friday.
S: For Caracas, Friday?
U: Yes

As mentioned above, the VoiceXML language makes it possible to create only system-initiative dialogues and a limited version of mixed-initiative dialogues. Nevertheless, it is important to decide about the initiative in the dialog at the beginning of the voice service design, because it relates to the capabilities of the platform and to the design of the prompts and grammars.

4.1.1.2. Error recovery strategies


The next issue is the error recovery strategy. The error recovery process can use several strategies. In the case of the VoiceXML language, the situation is simpler and the errors that can occur are reduced to the following categories:

 Nomatch error (event) – represents situations where the user provides an input utterance that does not match the active grammar in the particular dialog stage.
 Noinput – represents situations where the user does not provide any input within the specified time interval.
 Wrong recognition without a nomatch event – represents situations where the user's input was recognized incorrectly, but it matched the active grammar and the confidence level was high.
 Other system errors – represents any other system errors that can occur during the interaction.

Nomatch and noinput are standard events defined in the VoiceXML language, which provides handling mechanisms to recover from the mentioned errors. There is also a more general way of catching errors using the <catch> element.

In the case of nomatch and noinput events or errors, one of the simplest recovery strategies is replaying the last system prompt, which repeatedly asks the user to provide the input (the <reprompt> element can be used). A better solution uses a strategy with several layers, with the following steps [27]; a VoiceXML sketch of such escalating handlers is shown after the list:

1. Tell the user what happened (e.g. "Sorry, I didn't catch that.").
2. Then, tell the user what to do (e.g. "Please repeat your answer.").
3. Give the user more information on what to do next (provide an example of the expected answer, e.g. "Please say a city and state, for example: L.A. and California.").
4. If needed, tell the user about the help items (e.g. "Please say -help- for more help about allowed inputs.").
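A minimal VoiceXML sketch of such layered recovery, using escalating nomatch handlers and a noinput handler (the grammar file city.grxml and the prompt texts are only illustrative):

<field name="city">
  <prompt>For what city do you want to obtain the weather forecast?</prompt>
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <!-- first failure: say what happened and what to do -->
  <nomatch count="1">
    Sorry, I didn't catch that. Please repeat your answer.
    <reprompt/>
  </nomatch>
  <!-- second failure: give an example of the expected answer -->
  <nomatch count="2">
    Sorry, I still didn't understand. Please say a city name, for example Košice.
    <reprompt/>
  </nomatch>
  <!-- no input at all within the timeout -->
  <noinput>
    I did not hear anything. Please say the name of a city.
    <reprompt/>
  </noinput>
</field>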


A more difficult situation can occur in the case of wrong recognition of the user input that is not caught by the system. In this case, the user has to initiate the error recovery process, which can be a difficult situation for the system. For such situations, the user should be guided to say one of the universal navigation commands, which help the user return to the wrongly recognized item and repair it. Typical universal navigation commands are [27]: repeat, cancel, back, backup, quit, help, and exit. Support of such commands is platform dependent, but VoiceXML defines the <help> and <exit> elements, which require a default platform grammar with the corresponding commands to be active during the whole interaction.

4.1.1.3. Input confirmation


Confirmation is another important capability required in dialogue interaction. As in human-human dialogues, also in human-machine dialogs situations can occur when the obtained information needs to be confirmed, because there is some level of uncertainty. Confirmation strategies can be divided into implicit and explicit methods.

The implicit confirmation strategy (see the next example) incorporates the confirmation act directly into the next system prompt, which also has another communication function. This method shortens the dialog and can be perceived as more natural, but the risk of error is higher, because speech grammars able to catch the corresponding user reactions are significantly more complex.

System: For what city do you want to obtain Weather forecast?


User: Košice
S: For what day do you want to know forecast in Košice?
a) U: No, not Košice – Vienna.
b) U: For Friday.

On the other side, the explicit confirmation strategy offers a more reliable way of confirmation, where the confirmation is done by a new system prompt with no other communication function (see the example below). It has significantly lower requirements on speech grammars, but it lengthens the dialog and can be perceived as tiresome.

System: For what city do you want to obtain Weather forecast?


User: For Košice.
S: Did you select Košice?
U: Yes
S: For what day?
U: Friday
S: Did you select Friday?
U: Yes


Conditional confirmation is an approach that can help to reduce the drawbacks of the explicit confirmation strategy. This method uses the confidence score of the recognized user utterance to decide whether confirmation is needed. If the score exceeds a determined threshold (usually higher than 90 %), the utterance can be evaluated as reliable enough and the confirmation can be skipped.
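A minimal sketch of conditional confirmation in VoiceXML, using the standard application.lastresult$ shadow variable (the 0.9 threshold and the form names confirm_city and ask_day are only illustrative):

<field name="city">
  <prompt>For what city do you want to obtain the weather forecast?</prompt>
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <filled>
    <if cond="application.lastresult$.confidence &lt; 0.9">
      <!-- low confidence: go to an explicit confirmation dialog -->
      <goto next="#confirm_city"/>
    <else/>
      <!-- high confidence: skip the confirmation -->
      <goto next="#ask_day"/>
    </if>
  </filled>
</field>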

A further improvement can be obtained by joining the confirmation of several input items into one explicit confirmation prompt. Instead of asking the user for confirmation after the acquisition of each input value, the confirmation can be done after several items have been collected.

4.1.1.4. Input and output properties


There are also some other decisions that should be made in the initial design of the dialogue flow and structure. One needs to decide about the type of the system output, which can have the form of synthesized speech or prerecorded audio files. Whereas using TTS offers more flexibility, the output quality of artificial speech is still not perfect. On the other side, while prerecorded speech can provide studio quality with a pleasant voice, it lacks flexibility. For some languages, using audio recordings is the only possible way.

If we look at inputs, not all input information can be delivered to the system by voice. In the case of PIN codes or other identification or personal information, another input type has to be selected. Using DTMF (Dual-Tone Multi-Frequency) tones or another input modality (a virtual keyboard) should be considered safer. DTMF tones, which correspond to the telephone buttons, can also be helpful in situations when the spoken communication is disturbed by noise or other conditions.

Another decision relates to the possibility for the user to interrupt the system prompt. Such a possibility is called "barge-in". Interrupting the other participant in the dialog belongs to the frequently used interaction patterns and can significantly accelerate the interaction. VoiceXML platforms have to support the barge-in function; it can be switched on/off by an attribute of the <prompt> element.
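For illustration, the bargein attribute of the <prompt> element can be set as follows (the prompt texts are invented):

<!-- the user may interrupt this welcome prompt -->
<prompt bargein="true">Welcome to the weather forecast service.</prompt>
<!-- this notice has to be played completely -->
<prompt bargein="false">Calls may be recorded for quality purposes.</prompt>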

After all of these issues have been decided, the dialog diagram can be sketched in the form of a flow chart. The nodes of such a flow chart represent the system questions and the transitions represent possible reactions of the user. An example of a flow chart for a pizza delivery dialogue service is shown in Fig. 30.


Fig. 30 Example of a pizza delivery service flow chart (Source: http://www.ling.gu.se/~lager/teaching/dialogue_systems/labs/img/dialog_flow1.gif)



4.1.2. Writing prompts and grammars


A few recommendations can be formulated for writing system prompts that lead to more reliable and more ergonomic voice services. An appropriate formulation of the system prompts significantly influences the user answers and the speech grammars that should match them.

The level of prompt "openness" determines how restrictive the system question (prompt) will be. Open prompts are less restrictive and allow users to use freer language. A typical representative of an open prompt is

“How may I help you?“

prompt, which can be answered by any spoken language utterance. It is clear that writing an appropriate deterministic speech grammar for such input utterances is difficult work. Preparing a statistical language model can be seen as an appropriate solution, but not together with the VoiceXML language, because it is designed predominantly to be used with deterministic grammars written according to the SRGS and SISR recommendations. Of course, the mentioned recommendations also offer a solution in the form of garbage rules and fillers, but their support depends on the voice platform (or SDS). In spite of the described difficulties, VoiceXML makes it possible to use open prompts in the mixed-initiative mode, but it still expects only the filling of a predefined set of value slots.

Closed prompts stand on the opposite side. They directly restrict the user to choose from a few options. A typical closed prompt can look like:

“Please, choose one of: sport, movies or music.”

In almost all situations, we will construct prompts somewhere in the range from open to
closed prompts.

The following recommendations should be considered when writing prompts:

 To formulate prompts with an appropriate length.
o The length of the prompts is an important property. Prompts should not be too long, because the user's attention can decrease; too long prompts also prolong the dialogue, which can be uncomfortable for the user. On the opposite side, too short prompts may not be explanatory enough.
 To formulate polite prompts that are pleasant for the user.
 To consider the capabilities of the TTS system during prompt construction. It can happen that some combinations of diphones (or some words) are synthesized worse than others. Synthesized prompts should be listened to before the dialog application is presented to users, to ensure understandability and a sufficient level of quality.
 To construct prompts that are helpful enough and lead the user to provide the information that is expected by the system.


The initial design of the dialog, including the constructed prompts, can be evaluated using the Wizard-of-Oz method, which can be very helpful for constructing appropriate speech grammars for particular prompts. Wizard-of-Oz is a research method in which test subjects interact with a computer system thinking that the system is autonomous, while in fact the system is operated or partially operated by an unseen human being. In a Wizard-of-Oz experiment, a trained operator replaces the dialogue manager and manages the interaction with the test subjects in order to obtain information about the users' behavior and the language they usually use. The obtained user answers can help the service designer to construct speech grammars that cover the utterances usually spoken by users as answers to the considered questions (prompts).

Another often used approach for preparing speech grammars for the desired dialog is a brainstorming method, where the designer tries to collect all possible answers to the designed system prompts. The designer then tries to uncover the logical structure of such answers and transforms this logic into speech grammar rules.

4.2. Writing VoiceXML applications


VoiceXML applications consist of one or more VoiceXML documents. These VoiceXML document files are denoted by the ".vxml" file extension. Although each document can run as a standalone application, documents are often joined together using a so-called Application Root Document (ARD). The ARD makes it possible to set global variables and settings and to share any other variables defined by the designer in the application scope. The typical structure of a VoiceXML application is shown in Fig. 31.

Fig. 31 The structure of typical VoiceXML application

All VoiceXML documents start with the XML declaration at the first line:

<?xml version="1.0"?>


The content of the document is delimited by the pair of <vxml> and </vxml> tags. Within this tag, the version attribute is required; it determines the version of VoiceXML being used, as follows:

<vxml version="2.0">

document content …

</vxml>

Each VoiceXML document consists of a header and one or more dialogues, which can be written using the <form> or <menu> elements. The header usually contains metadata and declarations of variables with document scope. For including metadata, VoiceXML offers the <metadata> and <meta> elements. In the example in Fig. 32, metadata of type author is included, together with its value. The variable welcome_mess is declared with the initial value "Welcome". The type of a variable in VoiceXML is set according to the first assigned value; in the presented example, the type of the variable is text.

Fig. 32. VoiceXML header example
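A header of this kind (with a placeholder author name) might be sketched as follows:

<?xml version="1.0"?>
<vxml version="2.0">
  <meta name="author" content="service designer"/>
  <var name="welcome_mess" expr="'Welcome'"/>
  <!-- forms and menus of the document follow here -->
</vxml>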

The main building units of a VoiceXML application are forms and menus, which represent particular dialogues. Each form has a unique name and contains instructions for executing that dialogue. Forms are delimited by the pair of <form> … </form> tags. Each form has a mandatory attribute "id", which defines the name of the form. Defining the form's name makes it possible to reference it from other places in the application (within the same VoiceXML document, or from another document of the same application) or from another application. Forms can contain various elements, which perform the tasks required to execute the particular dialogue. Fig. 33 brings an example of a simple form, which says hello to the user.


Fig. 33 Simple VoiceXML Hello application

The example above is a VoiceXML document with only one form, which contains a block of executable content encapsulated in the <block> </block> elements. The <prompt> element inside this block serves for prompting the user with an utterance, which can be synthesized using the TTS system.
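A minimal sketch of such a document could look like this:

<?xml version="1.0"?>
<vxml version="2.0">
  <form id="hello">
    <block>
      <prompt>Hello!</prompt>
    </block>
  </form>
</vxml>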

The VoiceXML code presented in Fig. 33 cannot be used to obtain user input; it is only able to say "Hello!" to the user. To prepare a truly interactive voice service, other form items have to be used inside the form as well, including so-called "input fields" that are designed to collect the user's input.

Form items are a group of elements that may be used inside the <form> tag to perform tasks related to dialogue execution. These items can be divided into two main categories: field items and control items. Field items collect information from the user to fill variables (input field variables). They may contain prompts directing the user what to say, grammars that define the interpretation of what is said, and event handlers. Control items, e.g. blocks, enclose executable content, but they cannot be used to collect the user's input.

Forms can contain the following input fields (field items):

<field> collects input from the user via speech or DTMF recognition

<record> enables recording the user's input as an audio recording

<transfer> makes it possible to transfer the user's call to another phone number

<object> invokes a platform-specific object, which can collect other types of user
input and returns the result as an ECMAScript object

<subdialog> reroutes the dialogue into another dialogue or document and returns
the result as an ECMAScript object


There are two types of control items:

<block> can be used to define executable content and to play a message to
the user

<initial> serves for creating a mixed-initiative interaction with the user

Each form item has an associated form item variable, which is set to undefined by default when the form is entered. This form item variable serves for storing the result of the item's interpretation. In the case of input items, the variable is called an input item variable, and it holds the value collected from the user. A form item variable can be given a name using the name attribute, or left nameless (in which case an internal name is generated by the platform).

Each form item also has a guard condition, which is tested to decide whether that form item can be selected by the Form Interpretation Algorithm (FIA) for interpretation. The default guard condition simply tests whether the form item variable already has a value; if it does, the form item will not be visited.

Fig. 34 Simple VoiceXML application Favorite Day


The most common way to collect input from the user is to construct a simple form with one <field> input field element. Fig. 34 shows a simple VoiceXML application that asks the user to say his or her favorite day of the week.
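A condensed sketch of such a form (the exact listing is in Fig. 34 and the line numbers cited below refer to that figure; the prompt wording and the set of days here are only illustrative) might be:

<form id="favorite">
  <field name="favorite_day">
    <prompt>What is your favorite day of the week?</prompt>
    <grammar version="1.0" mode="voice" root="day">
      <rule id="day">
        <one-of>
          <item>Monday</item>
          <item>Saturday</item>
          <item>Sunday</item>
        </one-of>
      </rule>
    </grammar>
    <filled>
      <if cond="favorite_day == 'Sunday'">
        <prompt>Sunday is my favorite day too.</prompt>
      <else/>
        <prompt>I see, you like <value expr="favorite_day"/>.</prompt>
      </if>
    </filled>
  </field>
</form>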

The presented VoiceXML code enables two interaction scenarios, shown in Fig. 35:

Fig. 35. Interaction scenarios in simple VoiceXML application Favorite Day

The proposed VoiceXML code consists of one form (<form> </form> elements), in which an input field <field> is encapsulated (lines 5 to 29). The input field has a name attribute, which defines the name of its input item variable to be "favorite_day". The field encapsulates a dialogue unit defined by the prompt, the speech grammar and the piece of code inside the <filled> </filled> elements, which is executed after collecting the user input.

The input prompt is specified by the <prompt> element and is played to the user after entering the field. Entering the field also activates the speech recognition process, and after prompting the user with the system prompt, the system waits for the user's utterance.

All possible user inputs are defined by the speech grammar, which can be specified using the <grammar> VoiceXML element. Speech grammars in VoiceXML applications are written according to the W3C SRGS specification, which defines two formats – an XML-based form and an ABNF form. In our example, the grammar is written in the XML form of SRGS (lines 7 to 19) directly in the VoiceXML document (an internal grammar). Due to the complexity of speech grammars, especially in real applications, grammars are more often constructed as separate documents with the extension ".grxml"; in that case, we call such a grammar external. Fig. 36 provides an example of referencing an external grammar file with the same content as its internal version in Fig. 34.
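Referencing such an external grammar then reduces to a single element inside the field (the file name is only illustrative):

<grammar src="favorite_day.grxml" type="application/srgs+xml"/>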

When the speaker provides an answer to the system prompt, the utterance is converted to text and compared against the active grammar. If the utterance matches the grammar, the input field variable is filled with this utterance or with its semantic value. In our example, considering Scenario 1 from Fig. 35, the variable favorite_day will be filled with the text string "Sunday".


Then, the input field processing moves into the <filled> element, which is executed after collecting the user input. Here, the conditional logic element is located (lines 22 to 27), which decides about the next dialogue flow according to the condition defined in the cond attribute. If this condition is true, the <if> content is executed; otherwise, the <else> content is executed.

If we consider the code from the previous example, the interaction stops at this point, because no transition to another form or document is specified. To specify a transition to another dialogue item, dialogue or document, the <goto> element can be used, as illustrated in the example below (Fig. 36). The <goto> element is used inside the <filled> element (line 21) and routes the dialogue to the next form. It can use one of the next, nextitem, expr or expritem attributes to define the destination where the dialogue should continue. The next and nextitem attributes can hold identifiers (id) or names of the next form item, form (using the # prefix) or VoiceXML document. The expr and expritem attributes make it possible to define an expression that is evaluated to obtain the name of the item where the dialogue should move.

Fig. 36. VoiceXML application to obtain favorite day of the user with the external grammars and event handlers


In the example in Fig. 36, the <goto> element contains the next attribute with the name of the next dialogue – goodbye. Referring to the next form requires the # prefix. At this step, interpretation of the form with the id hello stops and the interpreter enters the goodbye dialogue, which causes the goodbye prompt to be played to the user.
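A sketch of this transition (the prompt wording is only illustrative) could be:

<form id="hello">
  ...
  <filled>
    <goto next="#goodbye"/>
  </filled>
  ...
</form>

<form id="goodbye">
  <block>
    <prompt>Goodbye!</prompt>
  </block>
</form>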

The interaction that can result from interpreting the previous VoiceXML document can be seen in Fig. 37. Scenario 1 shows a smooth, errorless dialogue. A completely different situation is illustrated by Scenario 2, where some misunderstandings occurred. The first problem occurred when the user did not answer the first system prompt within the defined time interval (lines 10 and 11). The second problem occurred when the user provided an unexpected answer (line 13). Finally, the user spoke the help command, which helped him to move forward in the dialogue.

As we can see in Scenario 2, three types of events occurred: noinput, nomatch and help. VoiceXML offers several ways to catch and process these and other events. For these three types of events, special elements are defined in the VoiceXML specification – <noinput>, <nomatch> and <help>. There is also an <error> element to catch other types of errors, usually produced by the system. In addition, VoiceXML offers a general event handling mechanism to catch and process other types of events that can occur during the dialogue interaction: the <catch> element can be used to catch and process any other events, which may be platform specific.

Fig. 37 Interaction scenarios in VoiceXML application Favorite Day with transition to goodbye dialog and with event
handlers

Event handling elements can be defined on one of the application levels, which determines their scope. If event handling elements are defined in the application root document, they are able to catch events in every place of the application. Event handling elements can also be defined on the document level, in the document header; in this case, they catch events that occur in the same document. Event handlers defined on the dialogue level (inside a <form> element) process events in that dialogue. In our example in Fig. 36, the event handlers (lines 7-9) are defined on the "anonymous" level or scope, which means that they catch only events that occur inside the same form item (here the <field> element which starts at line 5). All three event handlers are constructed in the simplest way: they play a related message to the user, which should help him to provide an appropriate input. After playing the defined message, the <field> element is re-entered without replaying the prompt defined inside it (to replay the input field prompt, a <reprompt> element can be included in the event handler), and the user input is expected once again.
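The handlers described above might be sketched as follows (the message texts are only illustrative; the commented line shows where an optional <reprompt/> could be added):

<field name="favorite_day">
  <prompt>What is your favorite day of the week?</prompt>
  <noinput>Sorry, I did not hear anything.</noinput>
  <nomatch>Sorry, I did not understand you.</nomatch>
  <help>Please say one day of the week, for example Sunday.</help>
  <!-- adding <reprompt/> inside a handler would replay the field prompt -->
  ...
</field>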

The proposed tutorial can be considered the minimum needed to write simple dialogues produced by the interpretation of static VoiceXML documents. More information about VoiceXML can be found directly in the W3C Recommendation [9] or in one of the many tutorials on the web. The complete list of VoiceXML 2.0 elements can also be found in Appendix B: VoiceXML 2.0 elements.

Up to now, we have not spoken about connecting VoiceXML applications to data sources such as databases or the internet. This topic relates to VoiceXML applications with "dynamic" content, which means that not all VoiceXML documents that make up the application are written before the application starts; part of the application can be generated dynamically by a web server or "document server", as shown in Fig. 38. The main idea is that, after obtaining the required user data necessary for querying a data source (database, internet), the VoiceXML interpreter sends a request to the document server with a group of attribute-value pairs that hold the obtained data. The document server then performs the query, obtains the information that should be delivered to the user, and generates a new VoiceXML document that presents that information.

Fig. 38 Dynamic VoiceXML architecture model

Such a VoiceXML document can be generated by a web page with dynamic content written in PHP or JSP. The generated VoiceXML document is delivered back to the VoiceXML interpreter and the dialogue continues with the interpretation of the dynamically generated document.


Such a scenario is realized using the <submit> element of the VoiceXML language. The <submit> element enables submitting information to the web server, where the VoiceXML document with dynamically generated content will be created. It also performs the transition to the returned document, so that the dialogue continues with its interpretation.

The <submit> element offers the important possibility of sending a list of variables to the document server via an HTTP GET or POST request. The values of those variables are then used as parameters of a database query.

For example, imagine a weather forecast service. There are two important values that need to be acquired from the user – the city and the day. If we assume the same variable names, these values can be submitted to the web server as illustrated in Fig. 39.

Fig. 39 Submitting variables city and day to the web server
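A sketch of such a submit, with a placeholder server URL, might look as follows:

<filled>
  <submit next="http://example.com/weather.php"
          namelist="city day" method="get"/>
</filled>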

After querying the weather forecast service, a new VoiceXML document is generated by the web server using one of the scripting languages (PHP, JSP, ASP). The generated document is then returned to the VoiceXML platform. It can look like the VoiceXML document in Fig. 40.

Fig. 40 Dynamically generated VoiceXML document with the required information
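The generated document can again be a plain VoiceXML form that only presents the retrieved information; a sketch with purely illustrative forecast text might be:

<?xml version="1.0"?>
<vxml version="2.0">
  <form id="forecast">
    <block>
      <prompt>
        The forecast for the selected city and day is: sunny, 25 degrees.
      </prompt>
    </block>
  </form>
</vxml>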


5. Web applications with voice modality

Web applications represent a new trend in the area of interactive services. The main idea of web applications lies in providing services accessible through a variety of devices without any specific requirements, which means hardware and software independence. This is a natural requirement given the current situation, where people use several personal devices – personal computers, notebooks, netbooks, tablets, smartphones, etc. – and require access to their favorite everyday services from all of them. If these services (like e-mail, Facebook, etc.) are provided through a unified interface, they can be seen as user-friendly and their use is more comfortable. To be able to provide such services, an appropriate platform needs to be found that brings hardware and software independence. Web browsers have become established as such a platform. The combination of the HTML5 language, the JavaScript language and the new CSS3 language makes it possible to write modern web applications, and is also well suited for multimedia streaming and for accessing other technologies such as ASR or TTS. In the case of speech recognition, joining the previously introduced concept with Google's speech recognition technology makes it possible to prepare web applications with a speech interface.

Fig. 41 HTML5 + CSS3 for creating different HMIs

In the following text, we will focus on technologies for creating web applications with a speech interface.


5.1. Technologies for creating web applications


5.1.1. HTML5
HTML5 is the new version of the HTML standard (published as a W3C standard in October 2014), which has been prepared to support the latest multimedia. HTML5 extends and improves the markup that can be used to write web pages, mainly to make it easy to add and process multimedia content. It also brings application programming interfaces (APIs) that enable the preparation of complex web applications. HTML5 was also designed with its use on low-powered devices (smartphones, tablets) in mind. Because of these properties, it is a candidate for cross-platform mobile applications. W3C also published a new logo for the standard, as shown in Fig. 42.

Fig. 42. HTML5 Logo

The new APIs turn the web browser into a new platform that can provide a variety of services. For example, thanks to HTML5, web browsers can store data and can also run in offline mode.

HTML5 alone cannot be used for animations or to bring interactivity. For this purpose, it relies on the JavaScript and CSS3 standards.

5.1.2. JavaScript
The JavaScript programming language is required to make a web page dynamic, which means that the page starts to react to events invoked by the user. There are other ways to write dynamic pages (e.g. Adobe Flash), but they require the installation of plugins; moreover, such plugins may not be available for all operating systems or browsers. The JavaScript language and its APIs became an important technology in the new HTML5 standard. Whereas HTML5 defines the web page content, JavaScript specifies the web page behavior, and CSS, as the third important technology, defines the way of presentation.

Almost every modern web page also contains a piece of JavaScript code, and all modern web browsers support it (including browsers in game consoles, tablets, smartphones, smart TVs, etc.).

JavaScript can be connected with HTML documents using one of its interfaces, called the DOM (Document Object Model). The DOM represents the document as a hierarchy of nodes arranged into a tree, where the root node has the type "document" (see Fig. 43). Nodes in this model can be added, manipulated, traversed or removed. The DOM represents a platform- and language-independent interface that allows dynamic access to the HTML document.

Fig. 43 Tree structure of the HTML DOM model.

JavaScript together with the DOM makes it possible to add dynamic behavior to a web page, which means that the page is able to react to events invoked by the user. These reactions take the form of changes to the content and the look of the page.

JavaScript also makes it possible to add further application interfaces (APIs) to web pages written in HTML5. Using this approach, web pages turn into web applications, which offer more interactivity.
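A minimal sketch of this pattern – a button whose click handler changes the page content through the DOM – could be:

<button id="sayHello">Say hello</button>
<p id="output"></p>

<script>
  // react to a user event and change the page content through the DOM
  document.getElementById("sayHello").addEventListener("click", function () {
    document.getElementById("output").innerHTML = "Hello!";
  });
</script>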

5.2. Technologies for enabling speech modality in web applications

HTML5 interfaces that are important for adding voice control functionality to web applications will be described here.

The simplest way to add speech recognition capability to a web application is to use Google's speech recognition available over the internet. This can be done using Google's Chrome browser and either the Speech Input API or the Web Speech API.

In this section, HTML5 interfaces that enable access to the device's microphone in real time will also be introduced. The data stream obtained through the web page can be sent over the internet to a remote server with a recognition engine; the recognized text can then be sent back to the web application.


5.2.1. Google Chrome and speech control


At the beginning of 2011, Google's new interface, the Speech Input API, was integrated into the Chrome browser. In mid-2014, this interface was replaced by a new version named the Web Speech API. Both interfaces enable sending the user's speech to Google's remote speech recognition system. The first of them, the Speech Input API, directly extended the HTML <input> markup to allow speech input. The second one, the Web Speech API, is a JavaScript-based interface, which adds support for the user's speech using JavaScript code.

5.2.1.1. Speech Input API


Google introduced its version of the Speech Input API at the beginning of 2011 and implemented this interface in its own web browser, Chrome. The Speech Input API extends the HTML element <input> with the attribute speech, carrying the vendor prefix, as illustrated by the following piece of code:

<input type="text" x-webkit-speech />

With the x-webkit-speech attribute, a microphone icon appeared on the right side of the input field (as shown in Fig. 44). A window with the text "Start speaking" appeared after clicking on the icon. After the utterance was spoken, the data obtained from the microphone were converted from speech to text using a web service and the recognized text was delivered back to the web page. This text can be obtained through the DOM as the value of the input HTML element; it can then be assigned to another variable and processed.

Fig. 44. HTML input field, which allows filling by voice using x-webkit-speech interface

Thanks to the remote Google speech recognition engine, the Speech Input API supported many languages (including Slovak). The language of the browser is also the default language for the speech recognizer; it can be changed using the lang attribute, as demonstrated below.

<input type="text" lang="en" x-webkit-speech />

Although the Speech Input API offered a very simple and convenient interface, it also had a few limitations:


• The recognition process stops after a short pause, which is not suitable for entering longer texts (e.g. dictation).
• Each new recognition rewrites the previously recognized text in the same input field.
• Recognition can be done only in the HTML element <input>.

The Speech Input API was well suited for voice search, controlling web page navigation or simple web games; however, it was completely replaced by the Web Speech API in mid-2014.

5.2.1.2. Web Speech API


Following the previous specifications and the final report of the W3C HTML Speech Incubator Group, Google published an extended version of the Speech JavaScript API Specification at the end of 2011, later renamed the Web Speech API. The Web Speech API Specification was finished in October 2012 and was first implemented in version 25 of the Google Chrome browser.

The Web Speech API interface is based on JavaScript. To enable the use of speech recognition technology, a new JavaScript object webkitSpeechRecognition() has to be created as follows:

var reco = new webkitSpeechRecognition();

The Web Speech API defines several handlers for events related to the speech recognition process. These events are fired when the recognition process starts or stops, when speech is detected, when results are available, or when an error occurs.

Fig. 45. Speech interface created using the Web Speech API

In comparison with the Speech Input API, creating a web application with a speech interface requires more complex source code.
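A minimal sketch of the typical usage pattern (the handlers here only log to the console) could be:

var reco = new webkitSpeechRecognition();
reco.lang = "en-US";          // language of the recognizer
reco.continuous = true;       // keep listening after short pauses
reco.interimResults = true;   // deliver partial hypotheses as well

reco.onstart  = function ()      { console.log("recognition started"); };
reco.onerror  = function (event) { console.log("error: " + event.error); };
reco.onresult = function (event) {
  for (var i = event.resultIndex; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      console.log(event.results[i][0].transcript);
    }
  }
};

reco.start();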

The Web Speech API also solves security problems that occurred with the Speech Input API. When the standard HTTP protocol is used, a confirmation window appears (Fig. 46) that asks the user to allow microphone usage. When the HTTPS protocol is used, the connection is considered safe and allowing the microphone is not required.

Fig. 46. Popup window for allowing microphone usage


The possibility of capturing continuous user speech (limited to 60 seconds) is the main extension in comparison with the Speech Input API. Another important extension is the possibility of displaying partial recognition results (hypotheses), although with lower confidence. Thus the Web Speech API can be used for text dictation or e-mail writing.

The Web Speech API is also supported by several web browsers designed for mobile devices. Google Chrome for Android (4.0) supports the Web Speech API from version 32.0. On iOS, the Web Speech API is supported by the Safari browser from its seventh version.

5.3. Audio signal capturing using HTML5


Although Google offers the simplest way to implement speech recognition technology in web pages, it is not the only possibility. There can be many reasons to involve another speech recognition system in a web application. Especially in the case of commercial applications, you cannot use Google's engine, which is provided free of charge. Another important reason is that you cannot adapt Google's speech recognition or speech synthesis for your own purposes: it is not possible to change its acoustic or language models, nor its pronunciation dictionary. Therefore, it is important to know other HTML5 interfaces that make it possible to capture and process the audio stream and to send the data to a remote server that provides the speech recognition functionality.

5.3.1. getUserMedia API interface


The getUserMedia API is an interface that makes it possible to access external device data (such as a webcam video stream or a microphone audio stream) without involving any plug-in. This API is currently an experimental feature; therefore, the prefixed name navigator.webkitGetUserMedia is used. It can be used for real-time communication or for creating other applications that require real-time multimedia, such as home monitoring or on-line courses.

The main purpose of the interface is to obtain a data stream with the audio or video signal; however, the getUserMedia interface is not able to send these data or to save them into a file. For those purposes another interface needs to be used, e.g. the Web Audio API.

The getUserMedia API offers one main method, getUserMedia() (webkitGetUserMedia() in its prefixed form), which belongs to the navigator object. It takes three parameters. The first parameter is an object that specifies the type of media you want to access (audio, video or both). The next two parameters are callback functions, which are called in case of success or failure. The use of the described function can look like the following example:

navigator.webkitGetUserMedia({audio:true}, onSuccess, onFail);

The first parameter can be replaced by a richer JavaScript object, which makes it possible to have more control over the data stream.
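For completeness, the two callbacks from the example above might be sketched as follows (the bodies here only log the outcome):

function onSuccess(stream) {
  // the MediaStream with the live microphone signal is available here
  console.log("microphone stream obtained");
}

function onFail(error) {
  console.log("access to the microphone failed or was denied: " + error);
}

navigator.webkitGetUserMedia({ audio: true }, onSuccess, onFail);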


5.3.2. Web Audio API interface


The next important HTML5 interface is the Web Audio API, which enables processing of the audio stream acquired using the getUserMedia API. The obtained data stream is returned as a parameter of the success callback function and can be further processed using the Web Audio API.

The Web Audio API is a high-level JavaScript API for processing and synthesizing audio in web applications. It enables audio signal processing, including mixing, filtering or applying effects, which makes it possible to create interactive web applications, games or advanced music applications.

Basic audio operations can be performed using so-called audio nodes, which are connected together to form an audio routing graph. This modular design provides the flexibility needed for creating complex audio functions.

Audio nodes are connected together using their inputs and outputs to form a chain. Such a chain can start with one or more sources, pass through one or more processing nodes, and end at an output destination. An example workflow with several input sources for web audio is shown in Fig. 47.

Fig. 47 Example workflow of the web audio

Such a workflow in the Web Audio API lives inside an AudioContext. The Web Audio API is available to developers through JavaScript. It offers several objects and functions that can be used to deliver and transform the signal from audio sources to the desired destination. The Web Audio API interface enables the web browser to process audio content in order to support audio and speech.
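A sketch of how these pieces could be combined to capture microphone audio for a remote recognizer (assuming the prefixed getUserMedia described above; the buffer size and the idea of sending the samples to a server are only illustrative) might be:

navigator.webkitGetUserMedia({ audio: true }, function (stream) {
  // an AudioContext represents the audio routing graph
  var audioCtx = new (window.AudioContext || window.webkitAudioContext)();

  // turn the microphone stream into a source node of the graph
  var source = audioCtx.createMediaStreamSource(stream);

  // a processing node that exposes the raw samples to JavaScript
  var processor = audioCtx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = function (event) {
    var samples = event.inputBuffer.getChannelData(0);
    // here the samples could be buffered and sent to a remote ASR server,
    // for example over a WebSocket connection
  };

  source.connect(processor);
  processor.connect(audioCtx.destination);
}, function (error) {
  console.log("microphone access failed: " + error);
});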


5.4. Public lighting control web application – a case study

In this section, a public lighting control web application is described as a case study of a modern web application with a multimodal user interface. The described web application was designed to provide a multimodal interface for controlling public lighting (switching it on/off) in a city environment. Three types of input modalities can be used to control the application – speech commands, touchscreen pointing or mouse clicking on the map, and clicking on buttons. The graphical output in the form of a map with lamp icons was designed as the output modality, where the color of each lamp changes according to its state. The main interface of the public lighting control web application is shown in Fig. 48.

Fig. 48 Multimodal interface of the public lighting control web application

The described application was developed in HTML5, including CSS3 and JavaScript. To enable voice control, the Google Web Speech API was used to access Google speech recognition from the web. For accessing the map of the desired city area and for placing the lamp icons on it, the Google Maps API was used.

To switch the lamps on a particular street on or off, the user can speak one of the "switch on"/"switch off" commands followed by the street name. Another way to switch lamps on or off is to click (touch) a lamp icon on the map located on the street whose lamps should be switched. The third possibility is to use one of the buttons located on the left side. After switching lamps on, the color of the lamp changes to yellow; when lamps are switched off, the color changes back to gray.


Registration in the "Google Developers Console" is needed to obtain a unique key that makes it possible to use the Google Maps API. This key has to be included in the head of the web page as follows:
<script type="text/javascript"

src="https://maps.googleapis.com/maps/api/js?key=AIzaSyBb4RbvSomKDJyvy2
fbjizJy6L9895TRIE&amp;sensor=false">
</script>
To insert the map into the web page, a new JavaScript object of the Google Maps API needs to be created:

var mapa = new google.maps.Map(document.getElementById('divMapy'),
                               moznostiMapy);

where divMapy is the identifier of the HTML <div> element where the map will be located. The second parameter of the Map object, moznostiMapy, is a JavaScript object with the basic values of variables that are important for loading the map (map center coordinates, initial zoom level, the possibility of further control, etc.).

The JavaScript object moznostiMapy with the basic parameters can look as follows:

var moznostiMapy = {
    center: new google.maps.LatLng(48.366928, 21.867789),
    zoom: 16,
    mapTypeControl : true,
    mapTypeControlOptions : {
        mapTypeIds: [google.maps.MapTypeId.ROADMAP,
                     google.maps.MapTypeId.SATELLITE]
    }
};

Lamp icons were created using the JavaScript object of the interface for creating map markers, google.maps.Marker. Such an object has two mandatory parameters – the coordinates of the marker and the map identifier. Further parameters are optional and make it possible to assign the name of the marker, its icon and the z-position. The marker definition is shown below:

var znackovac = new google.maps.Marker({
    position: new google.maps.LatLng(48.37, 21.86),
    map: mapa,
    icon: 'lampa_vyp.png',
    title: 'nazov_ulice',
    zIndex: 1});


Fig. 49 Lamp icons

Lamps are grouped by streets and their coordinates are stored in an array. Each lamp is then associated with an event handler. Event handlers catch input events, which occur as a reaction to a speech command or to clicking on a lamp or a button. Such an event handler function changes the state (switched on/off) of the lamps on the given street and, in the future, it could call another function that produces real control commands for the public lighting control system.

To enable speech control, the Web Speech API is used. First, a new JavaScript object webkitSpeechRecognition() is created:

var rec = new webkitSpeechRecognition();

Then, the speech recognition process can be started using the start() method inside the function that is invoked after clicking on the microphone icon (the ziskajPovel(event) function):
function ziskajPovel(event){
    if(nahravanie){
        rec.stop();
        return;
    }
    konecnyPrepis = "";
    rec.lang = "sk";
    rec.start();
    document.getElementById("disp").innerHTML = "";
    document.getElementById("mic").src = "mic-slash.gif";
}
In this function, the variable nahravanie is checked first; it is true if recognition has already been started, in which case it needs to be stopped before the next start. Then, the variable konecnyPrepis, which stores the final version of the recognition result, needs to be cleared. Before starting the recognition process, the language can be set by rewriting the lang attribute of the rec object (rec.lang = "sk";). Then, the recognition process can be started by calling the start() method of the rec object (rec.start();).

After calling the start() method, the user's speech starts to be captured, and the speech signal, together with the value of the lang attribute, is delivered to the Google server with the speech recognition technology, where the speech is recognized. A SpeechRecognitionEvent object containing the recognition results is delivered back from the server to the web application. This object is passed as an argument to the onresult event handler:
rec.onresult = function(event){
    for(var i = event.resultIndex; i < event.results.length; i++){
        if(event.results[i].isFinal){
            konecnyPrepis += event.results[i][0].transcript;
        }
    }
    document.getElementById("disp").innerHTML = konecnyPrepis;
}

The delivered event object contains the array results[]; each result provides two parts: transcript, where the recognized words are located, and confidence, which holds the confidence score of these words.

In the onresult event handler in the previous piece of source code, we can see the cyclic reading of the results[] array and the construction of the final recognized word sequence, which is then placed into the window next to the microphone icon.

Further functions compare the recognized text with the allowed speech commands and, in case of a match, invoke the function for switching the lamps on the given street on or off.

The source code of the described web application can be found in Appendix A: Public lighting control application source codes. It can be concluded that implementing the speech modality in a web application using the Web Speech API is a relatively simple task.


6. Table of pictures

Fig. 1. Human machine communication chain .................................................................. 6


Fig. 2 Pipeline architecture of the Spoken Dialogue System ............................................ 7
Fig. 3 Hub-server architecture of the Galaxy Communicator ........................................... 8
Fig. 4 Architecture of the multimodal dialogue system .................................................... 9
Fig. 5. An example of user's input in multimodal interface .......................................... 10
Fig. 6. MOBILTEL Weather Forecast multimodal application ...................................... 11
Fig. 7. An example of interaction scenario of Apple SIRI virtual assistant and a selection
of its possibilities ............................................................................................. 13
Fig. 8. GRETA - Embodied Conversational Agent ........................................................ 13
Fig. 9. Car-O-bot 3 robot (left) vs. Aldebaran NAO humanoid robot (right) ................. 14
Fig. 10 Fujitsu prototype of the Interactive Voice-recognition Car Navigation Unit ..... 14
Fig. 11 General architecture of the ASR system ............................................................. 15
Fig. 12 Example of dialogue finite state machine ........................................................... 19
Fig. 13 Block diagram of the diphone-based concatenative speech synthesis ................ 22
Fig. 14. Block diagram of the HMM-based speech synthesis ......................................... 23
Fig. 15. Example of simple VoiceXML code ................................................................. 27
Fig. 16. An example of SRGS grammar for entering day of week ................................. 28
Fig. 17. Speech grammar with SISR tags ........................................................................ 29
Fig. 18. An example of SSML document ........................................................................ 30
Fig. 19. An example of PLS document ........................................................................... 31
Fig. 20. An example of SRGS grammar with PLS document referencing ..................... 31
Fig. 21. Fragment of CCXML code for managing the phone connection....................... 32
Fig. 22 Voice Browser Architecture ............................................................................... 33
Fig. 23. Run-Time MMI Architecture Diagram with examples of components
realization ........................................................................................................ 35
Fig. 24. An example of EMMA document...................................................................... 36
Fig. 25. An example of InkML code and rendered text .................................................. 37
Fig. 26. Simple example of EmotionML inside EMMA document ................................ 37
Fig. 27. Example of including EmotionML description into the SSML code ................ 38
Fig. 28. Architecture of MS SAPI ................................................................................... 39
Fig. 29. General architecture of MRCP-based communication ...................................... 41
Fig. 30 Example of pizza delivery service flow chart ..................................................... 53


Fig. 31 The structure of typical VoiceXML application ................................................. 55


Fig. 32. VoiceXML header example ............................................................................... 56
Fig. 33 Simple VoiceXML Hello application ................................................................. 57
Fig. 34 Simple VoiceXML application Favorite Day ..................................................... 58
Fig. 35. Interaction scenarios in simple VoiceXML application Favorite Day .............. 59
Fig. 36. VoiceXML application to obtain favorite day of the user with the external
grammars and event handlers .......................................................................... 60
Fig. 37 Interaction scenarios in VoiceXML application Favorite Day with transition to
goodbye dialog and with event handlers ......................................................... 61
Fig. 38 Dynamic VoiceXML architecture model ............................................................ 62
Fig. 39 Submitting variables city and day to the web server .......................................... 63
Fig. 40 Dynamically generated VoiceXML document with the required information ... 63
Fig. 41 HTML5 + CSS3 for creating different HMIs ..................................................... 64
Fig. 42. HTML5 Logo ..................................................................................................... 65
Fig. 43 Tree structure of the HTML DOM model. ......................................................... 66
Fig. 44. HTML input field, which allows filling by voice using x-webkit-speech
interface ........................................................................................................... 67
Fig. 45. Speech interface created using the Web Speech API ........................................ 68
Fig. 46. Popup window for allowing microphone usage................................................. 68
Fig. 47 Example workflow of the web audio .................................................................. 70
Fig. 48 Multimodal interface of the public lighting control web application ................. 71
Fig. 49 Lamp icons .......................................................................................................... 73


7. References

[1] R. López-Cózar Delgado, M. Araki, Spoken, Multilingual and Multimodal Dialogue
Systems: Development and Assessment, John Wiley & Sons, ISBN 0-470-02155-1

[2] M. A. Walker, R. Passonneau, and J. E. Boland. Quantitative and qualitative evaluation


of DARPA Communicator spoken dialogue systems. In Proc. of the Meeting of the
Association of Computational Lingustics, ACL 2001, 2001.

[3] Galaxy communicator website, http://communicator.sourceforge.net/

[4] Y. Wilks et al. Some background on dialogue management and conversational speech for
dialogue systems, Computer Speech & Language (2008), doi:10.1016/j.csl.2010.03.001

[5] John R. Searle, Speech Acts: An Essay in the Philosophy of Language, Cambridge
University Press, 1969, ISBN 9780521096263

[6] G. Jefferson, Side sequences. In D.N. Sudnow (Ed.) Studies in social interaction, pp.294-
33, New York, NY: Free Press, 1972

[7] K. Fukui, Y. Ishikawa, E. Shintaku, K. Ohno, N. Sakakibara, A. Takanishi, M. Honda:
Vocal Cord Model to Control Various Voices for Anthropomorphic Talking Robot. In: The
8th International Seminar on Speech Production, Strasbourg, France, 2008, pp. 341-344.

[8] www.voicexml.org

[9] VoiceXML 2.0 W3C Recommendation, http://www.w3.org/TR/voicexml20/, 16 March


2004

[10] Speech Recognition Grammar Specification Version 1.0, W3C Recommendation 16


March 2004, http://www.w3.org/TR/speech-grammar/

[11] Semantic Interpretation for Speech Recognition (SISR) Version 1.0, W3C
Recommendation 5 April 2007, http://www.w3.org/TR/semantic-interpretation/

[12] Speech Synthesis Markup Language (SSML) Version 1.0, W3C Recommendation 7
September 2004, http://www.w3.org/TR/speech-synthesis/

[13] Pronunciation Lexicon Specification (PLS) Version 1.0, W3C Recommendation 14


October 2008, http://www.w3.org/TR/pronunciation-lexicon/

[14] Voice Browser Call Control: CCXML Version 1.0, W3C Recommendation 05 July 2011,
http://www.w3.org/TR/ccxml/


[15] State Chart XML (SCXML): State Machine Notation for Control Abstraction, W3C Last
Call Working Draft 29 May 2014, http://www.w3.org/TR/scxml/

[16] Model Architecture for Voice Browser Systems, W3C Working Draft 23 December
1999, http://www.w3.org/TR/voice-architecture/

[17] Multimodal Architecture and Interfaces, W3C Recommendation 25 October 2012,


http://www.w3.org/TR/mmi-arch/

[18] EMMA: Extensible MultiModal Annotation markup language, W3C Recommendation


10 February 2009, http://www.w3.org/TR/emma/

[19] Ink Markup Language (InkML), W3C Recommendation 20 September 2011,


http://www.w3.org/TR/InkML/

[20] Emotion Markup Language (EmotionML) 1.0, W3C Recommendation 22 May 2014,
http://www.w3.org/TR/emotionml/

[21] A. Stolcke, SRILM – an extensible language modeling toolkit, Proc. of ICSLP, Denver,
Colorado, 2002, pp. 901-904

[22] B.-J. (Paul) Hsu and J. Glass. Iterative Language Model Estimation: Efficient Data
Structure & Algorithms. In Proc. Interspeech, 2008.

[23] M. Federico, N. Bertoldi, M. Cettolo, IRSTLM: an Open Source Toolkit for Handling
Large Scale Language Models, In proc. Interspeech 2008.

[24] D. Bohus, and A. Rudnicky, RavenClaw: Dialog Management Using Hierarchical Task
Decomposition and an Expectation Agenda, in Eurospeech-2003, Geneva, Switzerland,
2003

[25] D. Traum et al., A model of dialogue moves and information state revision, Tech.rept.
Deliverable D2.1. Trindi, 1999

[26] S. Larsson and D. Traum, Information state and dialogue management in the TRINDI
Dialogue Move Engine Toolkit. In Natural Language Engineering Special Issue on Best
Practice in Spoken Language Dialogue Systems Engineering, Cambridge University
Press, U.K., pp. 323-340, 2000

[27] Ch. Sharma, J. Kunins, VoiceXML: Strategies and Techniques for Effective Voice
Application Development with VoiceXML 2.0, John Wiley & Sons, illustrated edition
(4 Feb. 2002), ISBN: 0471418935

[28] M. F. McTear, Spoken Dialogue Technology, Springer; Softcover reprint of the original
1st ed. 2004 edition (August 12, 2004), ISBN: 1852336722, 2004


Appendix A: Public lighting control


application source codes

index.html
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<title>Verejné osvetlenie</title>
<link rel="stylesheet" type="text/css" href="main.css">
<script type="text/javascript"
src="https://maps.googleapis.com/maps/api/js?key=AIzaSyBb4RbvSomKDJ
yvy2fbjizJy6L9895TRIE&amp;sensor=false">
</script>
<script src="osvetlenie.js" type="text/javascript"></script>
</head>

<body>
<div id="kontajner">
<h1>Ovládanie verejného osvetlenia</h1>
<div id="mapa"></div>
<h3>Hlasové ovladanie</h3>
<h3 id="varovanie">Prehliadač nepodporuje<br>hlasové rozpoznávanie</h3>
<div id="rec">
<div id="hlasoveOvladanie">
<img id="mic" src="mic.gif" alt="microfon">
<span id="disp"></span>
</div>
<p class="info">Po kliknutí na mikrofón, je potrebné povoliť jeho
použitie v hornej časti prehliadača. Povolené hlasové povely:</p>
<ul>
<li>"zapnúť" + meno ulice (napr. "zapnúť hlavná")</li>
<li>"vypnúť" + meno ulice (napr. "vypnúť hlavná")</li>
<li>"zapnúť všetky"</li>
<li>"vypnúť všetky"</li>
</ul>
</div>
<div id="rucne">
<h3>Alternatívne ovládanie</h3>
<p class="info">Kliknite na konkrétnu lampu alebo použite
tlačidlá!</p>
<button id="control1" type="button">Hlavná</button>
<button id="control2" type="button">Kamenecká</button>
<button id="control3" type="button">Karčianská</button>
<button id="control4" type="button">Ružová</button>


<button id="control5" type="button">Cintorínska</button>


<button id="control6" type="button">Východná</button>
<button id="control7" type="button">Všetky</button>
</div>
</div>
</body>
</html>

main.css
@charset "utf-8";
/* CSS Document */

#kontajner { height: 600px;


width: 1000px;
margin-right: auto;
margin-left: auto;
background-color: rgba(204,255,255,1); }
h1 { text-align: center;
height: 25px;
width: 1000px; }
#mapa { height: 500px;
width: 700px;
margin-right: 20px;
float: right; }
#hlasoveOvladanie { height: 50px;
width: 200px;
background-color: rgba(255,255,255,1);
position: relative;
left: 20px;
top: 0px; }
#disp{ position: relative;
top: -20px; }
h3 { position: relative;
left: 20px; }
.info { text-align: left;
position: relative;
left: 20px;
font-size: 12px;
width: 250px; }
li { font-size: 12px;
color: rgba(153,0,0,1); }
#varovanie { color: rgba(255,0,0,1);
visibility: visible; }
#rec { position: relative;
top: -50px;
visibility: hidden;
width: 250px; }


#rucne { position: relative;


top: -50px;
width: 250px; }

osvetlenie.js
// JavaScript Document
window.onload = init;

function init(){
verejneOsvetlenie.nacitanieMapy();
if(webkitSpeechRecognition)
webSpeech();
}

// Declaration of the verejneOsvetlenie object as an object literal


var verejneOsvetlenie = {
ikona_vyp : "lampa.png", // icon of a switched-off lamp
ikona_zap : "lampa_on.png", // icon of a switched-on lamp

// Array with lamp coordinates grouped by street


lampy : [
["Kamenecká", 48.370180, 21.862339, 48.36975, 21.865086, 48.369394,
21.867811, 48.369380, 21.870729],
["Hlavná", 48.368652, 21.867188, 48.368011, 21.865922, 48.366799,
21.866008, 48.365217, 21.866287, 48.364162, 21.866695, 48.363093,
21.867188, 48.361938, 21.868326, 48.361211, 21.869978],
["Karčianská", 48.367441, 21.865043, 48.36784, 21.862318, 48.366628,
21.861996, 48.365188, 21.862318, 48.36419, 21.864463, 48.363164, 21.865322,
48.361924, 21.866201],
["Ružová", 48.364404, 21.865365, 48.364689, 21.867339, 48.365374,
21.868669, 48.365787, 21.87118, 48.366243, 21.873411],
["Východná", 48.3666, 21.870622, 48.367500, 21.870343, 48.36841,
21.869914],
["Cintorínska", 48.364362, 21.867639, 48.363292, 21.868304]
],

vsetkyLampy : [], // array containing all "lampa" (lamp) objects

// Function that loads the Google map and attaches the event listeners


nacitanieMapy : function(){
var mapOptions = {
center: new google.maps.LatLng(48.366928, 21.867789),
zoom: 16,
mapTypeControl : true,
mapTypeControlOptions : {
mapTypeIds : [google.maps.MapTypeId.ROADMAP,
google.maps.MapTypeId.SATELLITE]
}
};


verejneOsvetlenie.mapa = new
google.maps.Map(document.getElementById("mapa"),mapOptions);
verejneOsvetlenie.stanovenieLampy(verejneOsvetlenie.mapa,
verejneOsvetlenie.lampy);

verejneOsvetlenie.pridatUdalost();

document.getElementById("control1").addEventListener("click",
verejneOsvetlenie.zapnutHlavnu, false);
document.getElementById("control2").addEventListener("click",
verejneOsvetlenie.zapnutKamenecku, false);
document.getElementById("control3").addEventListener("click",
verejneOsvetlenie.zapnutKarciansku, false);
document.getElementById("control4").addEventListener("click",
verejneOsvetlenie.zapnutRuzovu, false);
document.getElementById("control5").addEventListener("click",
verejneOsvetlenie.zapnutCintorinsku, false);
document.getElementById("control6").addEventListener("click",
verejneOsvetlenie.zapnutVychodnu, false);
document.getElementById("control7").addEventListener("click",
verejneOsvetlenie.zapnutVsetky, false);
},

/* Function that places markers as lamps on the map and stores
the created lamps in the array */
stanovenieLampy: function(mapa, poloha){
var j = 1;
var k = 2;
for(var i = 0; i < poloha.length; i++){
for(var l = 1; l < (poloha[i].length); l++){
var lampa = new google.maps.Marker({
position : new google.maps.LatLng(poloha[i][j], poloha[i][k]),
map : verejneOsvetlenie.mapa,
icon : verejneOsvetlenie.ikona_vyp,
title : poloha[i][0],
zIndex : 1
})
verejneOsvetlenie.vsetkyLampy.push(lampa);
j += 2;
k += 2;
}
j = 1;
k = 2;
}
},

// Function that assigns a click event listener to every lamp on the map


pridatUdalost : function(){
for(var i = 0; i < verejneOsvetlenie.vsetkyLampy.length; i++){


google.maps.event.addListener(verejneOsvetlenie.vsetkyLampy[i],
"click", verejneOsvetlenie.zapnutLampy);
}
},

// Function for switching lamps: clicking a lamp reads its title
// and toggles the corresponding street accordingly
zapnutLampy: function(){
var nazov = this.getTitle();
switch(nazov){
case "Hlavná": verejneOsvetlenie.zapnutHlavnu();
break;
case "Kamenecká": verejneOsvetlenie.zapnutKamenecku();
break;
case "Karčianská": verejneOsvetlenie.zapnutKarciansku();
break;
case "Ružová": verejneOsvetlenie.zapnutRuzovu();
break;
case "Východná": verejneOsvetlenie.zapnutVychodnu();
break;
case "Cintorínska": verejneOsvetlenie.zapnutCintorinsku();
break;
}
},
// Function that toggles a lamp: if it is off it is switched on, and vice versa
prepnutLampy: function(lampa){
if(lampa.getIcon() == "lampa.png"){
lampa.setIcon("lampa_on.png");
}
else{
lampa.setIcon("lampa.png");
}
},
// Toggle the lamps on Hlavná street
zapnutHlavnu : function(){
for(var i = 0; i < verejneOsvetlenie.vsetkyLampy.length; i++){
if(verejneOsvetlenie.vsetkyLampy[i].getTitle() == "Hlavná"){
verejneOsvetlenie.prepnutLampy(verejneOsvetlenie.vsetkyLampy[i]);
}
}
},
// Toggle the lamps on Kamenecká street
zapnutKamenecku : function(){
for(var i = 0; i < verejneOsvetlenie.vsetkyLampy.length; i++){
if(verejneOsvetlenie.vsetkyLampy[i].getTitle() == "Kamenecká"){
verejneOsvetlenie.prepnutLampy(verejneOsvetlenie.vsetkyLampy[i]);
}
}


},
// Toggle the lamps on Karčianská street
zapnutKarciansku: function(){
for(var i = 0; i < verejneOsvetlenie.vsetkyLampy.length; i++){
if(verejneOsvetlenie.vsetkyLampy[i].getTitle() == "Karčianská"){
verejneOsvetlenie.prepnutLampy(verejneOsvetlenie.vsetkyLampy[i]);
}
}
},
// Toggle the lamps on Ružová street
zapnutRuzovu : function(){
for(var i = 0; i < verejneOsvetlenie.vsetkyLampy.length; i++){
if(verejneOsvetlenie.vsetkyLampy[i].getTitle() == "Ružová"){
verejneOsvetlenie.prepnutLampy(verejneOsvetlenie.vsetkyLampy[i]);
}
}
},
// Toggle the lamps on Východná street
zapnutVychodnu : function(){
for(var i = 0; i < verejneOsvetlenie.vsetkyLampy.length; i++){
if(verejneOsvetlenie.vsetkyLampy[i].getTitle() == "Východná"){
verejneOsvetlenie.prepnutLampy(verejneOsvetlenie.vsetkyLampy[i]);
}
}
},
// Toggle the lamps on Cintorínska street
zapnutCintorinsku : function(){
for(var i = 0; i < verejneOsvetlenie.vsetkyLampy.length; i++){
if(verejneOsvetlenie.vsetkyLampy[i].getTitle() == "Cintorínska"){
verejneOsvetlenie.prepnutLampy(verejneOsvetlenie.vsetkyLampy[i]);
}
}
},
// Toggle all lamps at once
zapnutVsetky : function(){
verejneOsvetlenie.zapnutHlavnu();
verejneOsvetlenie.zapnutKamenecku();
verejneOsvetlenie.zapnutVychodnu();
verejneOsvetlenie.zapnutRuzovu();
verejneOsvetlenie.zapnutCintorinsku();
verejneOsvetlenie.zapnutKarciansku();
},

// Voice control: the values are results from the speech recognizer


zapnutHlasom : function(){
var povel = document.getElementById("disp").innerHTML;
switch(povel){
case "zapnúť hlavná" : verejneOsvetlenie.zapnutHlavnu();
break;


case "vypnúť hlavná" : verejneOsvetlenie.zapnutHlavnu();


break;
case "zapnúť kamenecká" : verejneOsvetlenie.zapnutKamenecku();
break;
case "vypnúť kamenecká" : verejneOsvetlenie.zapnutKamenecku();
break;
case "zapnúť východná" : verejneOsvetlenie.zapnutVychodnu();
break;
case "vypnúť východná" : verejneOsvetlenie.zapnutVychodnu();
break;
case "zapnúť ružová" : verejneOsvetlenie.zapnutRuzovu();
break;
case "vypnúť ružová" : verejneOsvetlenie.zapnutRuzovu();
break;
case "zapnúť cintorínska" : verejneOsvetlenie.zapnutCintorinsku();
break;
case "vypnúť cintorínska" : verejneOsvetlenie.zapnutCintorinsku();
break;
case "zapnúť karča" : verejneOsvetlenie.zapnutKarciansku();
break;
case "vypnúť karča" : verejneOsvetlenie.zapnutKarciansku();
break;
case "zapnúť všetky" : verejneOsvetlenie.zapnutVsetky();
break;
case "vypnúť všetky" : verejneOsvetlenie.zapnutVsetky();
break;
}
}
}

// Function that initializes the speech recognizer and displays the results


function webSpeech(){
document.getElementById("varovanie").style.visibility = "hidden";
document.getElementById("rec").style.visibility = "visible";
document.getElementById("mic").addEventListener("click", ziskajPovel,
false);
var konecnyPrepis ="";
var nahravanie = false;
var rec = new webkitSpeechRecognition();

rec.onstart = function(){
nahravanie = true;
document.getElementById("mic").src = "mic-animate.gif";
}

// When recognition ends, call the function that switches the lamps


rec.onend = function(){
nahravanie = false;
document.getElementById("mic").src = "mic.gif";


verejneOsvetlenie.zapnutHlasom();
}

// Store the recognition results in a variable


rec.onresult = function(event){
for(var i = event.resultIndex; i < event.results.length; i++){
if(event.results[i].isFinal){
konecnyPrepis += event.results[i][0].transcript;
}
}
document.getElementById("disp").innerHTML = konecnyPrepis;
}

// Invoked by clicking the microphone: starts speech recognition,
// or stops it if it is already running; the recognition language is also set here
function ziskajPovel(event){
if(nahravanie){
rec.stop();
return;
}
konecnyPrepis = "";
rec.lang = "sk";
rec.start();
document.getElementById("disp").innerHTML = "";
document.getElementById("mic").src = "mic-slash.gif";
}
}


Appendix B: VoiceXML 2.0 elements

Element Purpose
<assign> Assign a variable a value
<audio> Play an audio clip within a prompt
<block> A container of (non-interactive) executable code
<catch> Catch an event
<choice> Define a menu item
<clear> Clear one or more form item variables
<disconnect> Disconnect a session
<else> Used in <if> elements
<elseif> Used in <if> elements
<enumerate> Shorthand for enumerating the choices in a menu
<error> Catch an error event
<exit> Exit a session
<field> Declares an input field in a form
<filled> An action executed when fields are filled
<form> A dialog for presenting information and collecting data
<goto> Go to another dialog in the same or different document
<grammar> Specify a speech recognition or DTMF grammar
<help> Catch a help event
<if> Simple conditional logic
<initial> Declares initial logic upon entry into a (mixed initiative)
form
<link> Specify a transition common to all dialogs in the link's
scope
<log> Generate a debug message
<menu> A dialog for choosing amongst alternative destinations
<meta> Define a metadata item as a name/value pair
<metadata> Define metadata information using a metadata schema
<noinput> Catch a noinput event
<nomatch> Catch a nomatch event
<object> Interact with a custom extension


<option> Specify an option in a <field>


<param> Parameter in <object> or <subdialog>
<prompt> Queue speech synthesis and audio output to the user
<property> Control implementation platform settings.
<record> Record an audio sample
<reprompt> Play a field prompt when a field is re-visited after an
event
<return> Return from a subdialog.
<script> Specify a block of ECMAScript client-side scripting logic
<subdialog> Invoke another dialog as a subdialog of the current one
<submit> Submit values to a document server
<throw> Throw an event.
<transfer> Transfer the caller to another destination
<value> Insert the value of an expression in a prompt
<var> Declare a variable
<vxml> Top-level element in each VoiceXML document

Source: http://www.w3.org/TR/voicexml20/


Appendix C: SRGS elements (XML form)

Element Purpose
<grammar> Root element of an XML grammar
<meta> Header declaration of meta content of an HTTP equivalent
<metadata> Header declaration of XML metadata content
<lexicon> Header declaration of a pronunciation lexicon
<rule> Declare a named rule expansion of a grammar
<token> Define a word or other entity that may serve as input
<ruleref> Refer to a rule defined locally or externally
<item> Define an expansion with optional repeating and probability
<one-of> Define a set of alternative rule expansions
<example> Element contained within a rule definition that provides an example of
input that matches the rule
<tag> Define an arbitrary string to be included inline in an expansion, which
may be used for semantic interpretation

Source: http://www.w3.org/TR/speech-grammar/


