
International Journal of Advanced Computer Science, Vol. 3, No. 9, Pp. 434-439, Sep., 2013.

Multiplatform Instantiation Speech Engines Produced with FIVE


Alexandre M. A. Maciel & Edson C. B. Carvalho
Abstract Oral communication is, without the shadow of a doubt, the most natural form of human communication. By virtue of human-computer interaction becoming more and more common, a natural demand has arisen for systems with a voice-based interface. This paper presents a set of applications, each with a voice interface, which were implemented on various technological platforms so as to investigate the process of instantiating speech engines produced with FIVE. The experiments undertaken presented some technical restrictions, which, however, did not prevent the applications from being run.

Manuscript
Received: 19 Mar. 2013; Revised: 15 May 2013; Accepted: 10 Jul. 2013; Published: 15 Aug. 2013

Keywords
VUI, FIVE, Portability

1. Introduction
In recent years, voice-based interfaces have received considerable attention from academics for two reasons: first, improvements in the performance of systems that automatically process speech, including speech recognition, spoken language translation and voice synthesis; and second, the convergence of devices and the mass production of multimedia content, which have come to require faster and more efficient modes of interaction with users [1].

Besides being one of the most natural forms of communication between people, a voice-based interface offers many advantages over other forms of interface, for example: speed (most people can easily talk at rates of 200 words per minute, while few manage to type more than 60 words per minute); mobility (in many environments it is not suitable to use a keyboard and mouse, or the user's eyes must remain fixed on a display); and safety (the voice is a biometric mechanism that can be used to verify access to restricted systems and environments) [2].

Despite these advantages, the voice interface area still has some weaknesses that permeate research into it. The process of developing applications with a voice interface, whether in constructing speech engines or in instantiating these engines in the application layer, suffers from deficiencies caused by the lack of a policy on productivity, an innate feature of software that implements best practices in Software Engineering. Moreover, the tools available to assist this process serve specific sub-areas of the voice interface, so integrating them with diverse technology environments is not always trivial [3].

Given this scenario, FIVE (Framework for an Integrated Voice Environment) was developed with the objective of assisting the development of applications with a voice interface [4]. With FIVE, engines for Automatic Speech Recognition (ASR), Automatic Speaker Verification (ASV) and Text-To-Speech (TTS) can be created and, using their specific API (Application Programming Interface), these engines can be instantiated in several technological environments.

This article sets out the process of integrating speech engines created with FIVE in telephone, mobile device and digital TV environments. To this end, the state of the art of the architecture of voice-based interface applications was studied; FIVE was used to construct speech engines; and finally three applications with a reduced scope were developed so as to evaluate the process of instantiating these engines in the proposed environments.

Alexandre M. A. Maciel, University of Pernambuco, alexadre.maciel@upe.br; Edson C. B. Carvalho, Federal University of Pernambuco, ecdbcf@cn.ufpe.br.

2. Voice User Interface


The Voice User Interface (VUI) consists of the interaction of a person with a system via voice, using a spoken language application [5]. VUIs emerged from research on Artificial Intelligence in the 1950s, especially from work on "conversational interfaces", but only since the late 1990s has the technology shown significant improvement [6].

In order to meet society's great demand for applications with a VUI, voice technologies can be classified into the following subareas: coding, voice synthesis, speech recognition, speaker recognition and language identification [7].

According to Huang et al. [7], the typical architecture for developing an application with a voice interface has three main components: the first is the set of engines responsible for speech recognition or voice synthesis; the second is an API that is normally used to facilitate communication between the engines and the applications; and the third is the set of applications that can be developed. Figure 1 gives an overview of these components within this architecture.

Alexandre A.M. Maciel et al.: Multiplatform Instantiation Speech Engines produced with FIVE.


[Figure 1: three stacked layers. Top: Application 1, Application 2, ..., Application N. Middle: API (Application Programming Interface). Bottom: Engine 1, Engine 2, ..., Engine N.]

Fig. 1 Architecture for voice user applications
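The three layers of Fig. 1 can be illustrated with a minimal Python sketch. It is not taken from FIVE or from any real speech API: the class names (Engine, EchoRecognizer, UppercaseSynthesizer, SpeechApi) and the function color_command_app are hypothetical, and the two "engines" only manipulate text so that the example stays runnable.

```python
from typing import Protocol


class Engine(Protocol):
    """Bottom layer: anything that turns an input into a result."""

    def process(self, data: str) -> str: ...


class EchoRecognizer:
    """Hypothetical stand-in for a speech recognition engine."""

    def process(self, data: str) -> str:
        # A real engine would decode audio; here the "audio" is already text.
        return data.strip().lower()


class UppercaseSynthesizer:
    """Hypothetical stand-in for a voice synthesis engine."""

    def process(self, data: str) -> str:
        # A real engine would return audio samples; a string keeps this runnable.
        return f"<audio:{data.upper()}>"


class SpeechApi:
    """Middle layer: the only object that applications talk to."""

    def __init__(self) -> None:
        self._engines: dict[str, Engine] = {}

    def register(self, name: str, engine: Engine) -> None:
        self._engines[name] = engine

    def call(self, name: str, data: str) -> str:
        return self._engines[name].process(data)


def color_command_app(api: SpeechApi, utterance: str) -> str:
    """Top layer: an application that depends on the API, not on the engines."""
    word = api.call("asr", utterance)
    return api.call("tts", f"color set to {word}")


api = SpeechApi()
api.register("asr", EchoRecognizer())
api.register("tts", UppercaseSynthesizer())
print(color_command_app(api, "  Yellow "))
```

Because the application only sees SpeechApi, either engine could be swapped for another implementation of process without touching the application code, which is the separation of tasks that the layered architecture is meant to provide.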

This architecture simplifies maintenance, since a component of one layer can be altered without affecting another, which enables a clear separation between the task of constructing engines and that of instantiating them in the application layer. Thus, a team of speech processing specialists can look after developing engines, whether for speech recognition, speech synthesis or speaker verification, while another team, comprising interface designers, visual artists, computer scientists and users, can be directed to develop the interface [8].

A. Construction of engines

The construction of engines, whether for speech recognition or voice synthesis, consists essentially of a pattern recognition process. According to Duda et al. [9], this process normally follows an architecture common to any kind of pattern (static image, voice, financial data, etc.) and can be partitioned into four main modules, as shown in Figure 2.
[Figure 2: four modules in sequence: Pattern Acquisition, Feature Extraction, Pattern Classification, Analyzing the Results.]

Fig. 2 Components of a Pattern Recognition System

Specifically for speech recognition systems, the pattern acquisition module consists of capturing the voice signal and converting it into digital samples. The sound waves are normally captured with a microphone and converted by a digitizing sound board. Together with the acquired audio, speech systems require the textual transcription of the audio content as a way to adjust the language models to the audio samples [10].

The feature extraction module is responsible for processing the audio signal and generating the representative items of information needed by the pattern classification module. By using only the important features of the signal, the amount of data used for comparison is considerably reduced, so less computational power and less processing time are required. According to O'Shaughnessy [11], the two main parameterizations of speech are Linear Prediction Coefficients (LPC) and Mel-Frequency Cepstral Coefficients (MFCC).

The pattern classification module uses algorithmic approaches to establish consistent representations of the voice patterns based on a set of labeled training samples, and to make a reliable comparison with the test samples [12]. According to O'Shaughnessy [11], the most commonly used approaches to classifying patterns are based on Hidden Markov Models, Artificial Neural Networks and Support Vector Machines. Specifically for problems of speaker verification, Holmes [13] states that the main approaches to classifying patterns are based on Vector Quantization or on Gaussian Mixture Models.

Finally, the module for analyzing the results consists of a set of metrics used to offer a better presentation of the outputs of the pattern classification tests. For speech recognition systems, the Word Error Rate (WER) associated with a Confusion Matrix is one of the most commonly used metrics [14]. For speech synthesis systems, there are several approaches to assessing synthetic voices. The choice of approach depends on the purpose of the assessment and on the specific application for which the synthetic voice is intended [15].

B. Instantiation of the engines

The step of instantiating speech engines in an application is normally supported by a software layer that hides the implementation details of a voice-based interface from the developer. This software layer, usually called an API, enables speech recognition and voice synthesis engines, as well as the audio input/output interfaces, to be controlled in real time [16]. In the context of applications with a voice interface, Huang et al. [7] state that the use of an API is important because it ensures that multiple applications can work with a large number of components (engines) made available by different suppliers of speech technology, in various computing environments.

Currently there is a large variety of initiatives that aim at making APIs available for instantiating speech engines. The Microsoft Speech API and the Loquendo API are proprietary solutions that allow interaction with diverse types of engines independently of the manufacturer, whereas the Java Speech API and W3C VoiceXML [17] are solutions created with the objective of offering a standard for recognizing and synthesizing speech independently of the platform.

In general, the functions available in an API are accessible only by means of programming. That is, the requests that developers wish to make to the API are necessarily carried out by means of calls inserted into the source code of their applications. The syntax needed to make these calls is normally defined in the documentation made available with the API, without which such a task becomes unviable.

Most APIs provide functionalities for two areas: speech recognition and voice synthesis. This is useful for simulating the natural way in which people communicate with one another; that is, the interaction should simulate a dialogue, with conversations characterized by changes of initiative and by verbal and nonverbal feedback [19].
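As an illustration of the Word Error Rate mentioned for the results-analysis module, the sketch below uses the usual textbook definition: the word-level edit distance (substitutions, deletions and insertions) between a reference transcription and the recognizer output, divided by the number of reference words. The function name word_error_rate and the example sentences are assumptions made for this illustration; the paper does not describe how FIVE computes the metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from output
            insertion = current[j - 1] + 1   # extra word in the output
            current[j] = min(substitution, deletion, insertion)
        previous = current
    return previous[-1] / len(ref)


# One deleted word ("the") and one substituted word ("on" -> "off")
# over four reference words.
print(word_error_rate("turn the light on", "turn light off"))
```

For isolated-word recognition, where each utterance is a single word, this measure reduces to the fraction of misrecognized words, which is the complement of a success rate.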
International Journal Publishers Group (IJPG)


3. Framework FIVE
FIVE is an object-oriented framework developed in the Java programming language. It uses the Model-View-Controller pattern for separating responsibilities [18] and the Hibernate persistence framework to provide independence from the database system. The general architecture of FIVE was designed around three independent modules [4]: the CORE module contains the algorithms needed to construct the engines, namely algorithms for natural language processing (grapheme-to-phoneme rules), feature extraction (LPC and MFCC) and pattern classification (HMM, SVM and GMM); the API module consists of a reduced implementation of JSAPI with a set of useful features to mediate communication between the engines produced by the CORE module and the applications; and the GUI module consists of a graphical user interface that facilitates the development process for users with little experience in speech processing and pattern recognition.

The GUI module was designed as a sequence of steps in order to meet the usability requirements of FIVE, as shown in Figure 3. These steps are the same for any type of engine and are presented in seven tabs: Speakers, where information about the speakers is available; Utterances, where the audio transcriptions are entered; Dictionary, where the phonetic representations are entered; Samples, where audio samples are captured and related to the corresponding speakers and utterances; Feature Extraction, where the parameters of the feature extraction technique are adjusted; Pattern Classification, where the parameters of the pattern classification techniques are adjusted; and Engine Generation, where the features needed by specific engines (speech, speaker and synthesis) are merged.

4. Speech Engines
To streamline the demonstration of instantiating engines in example applications, the authors decided to use three engines previously developed by Maciel [3]. The sections that follow present the parameters used in creating a speech recognition engine, a speech synthesis engine and a speaker verification engine, together with the results obtained.

A. Speech recognition

To validate the results of the speech engines built using FIVE, it was decided to create an isolated-word recognition engine. Forty volunteer speakers (30 men and 10 women) with a mean age of 23.5 years, from the Northeast Region of Brazil, were selected, and 20 isolated words representing control commands were chosen to form the vocabulary of the engine. For feature extraction, three variants were used: the standard MFCC algorithm, MFCC based on the HTK format [19], and MFCC based on the ETSI format [20]. For pattern classification, HMM techniques were used with the MFCC_HTK and HTK_ETSI features, and the phonetic models adopted were whole word, phoneme and triphoneme. For the experiments using the SVM technique, the features were those of the standard MFCC algorithm and the kernel functions were linear, polynomial and sigmoid. Table 1 shows the best success rates obtained in these experiments.
TABLE 1 RESULTS FROM THE SPEECH RECOGNITION EXPERIMENTS

Technique   Phonetic unit / kernel   Success rate
HMM-HTK     Word                     88.12%
HMM-HTK     Phoneme                  95.72%
HMM-HTK     Triphoneme               87.89%
HMM-ETSI    Word                     96.21%
HMM-ETSI    Phoneme                  97.28%
HMM-ETSI    Triphoneme               89.10%
SVM         Linear                   98.30%
SVM         Polynomial               98.30%
SVM         Sigmoid                  97.67%
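For readers unfamiliar with the three SVM kernels compared in Table 1, the sketch below gives their common textbook forms. The paper does not report the kernel parameters used in the experiments, so the names and default values of degree, gamma and coef0 are assumptions chosen only for illustration, and the two short vectors stand in for MFCC feature vectors.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(x: Vector, y: Vector) -> float:
    """Inner product of two feature vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x: Vector, y: Vector) -> float:
    return dot(x, y)


def polynomial_kernel(x: Vector, y: Vector, degree: int = 3,
                      gamma: float = 1.0, coef0: float = 1.0) -> float:
    # degree, gamma and coef0 are illustrative defaults, not values from the paper.
    return (gamma * dot(x, y) + coef0) ** degree


def sigmoid_kernel(x: Vector, y: Vector, gamma: float = 1.0,
                   coef0: float = 0.0) -> float:
    # gamma and coef0 are illustrative defaults, not values from the paper.
    return math.tanh(gamma * dot(x, y) + coef0)


x = [1.0, 2.0]
y = [3.0, 4.0]
print(linear_kernel(x, y))
print(polynomial_kernel(x, y, degree=2))
print(sigmoid_kernel(x, y, gamma=0.1))
```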

Fig. 3 Screenshot of FIVE framework

FIVE is free software. Its maintainers aim to build a large network of collaborators so that the academic community can help to enhance and improve the tool.

B. Voice synthesis

The validation of the results from the synthesis engines was conducted in two stages: the first consisted of training to obtain acoustic models of two synthetic voices (one male and one female), and the second consisted of a survey to assess the quality of these voices. The training stage started with the selection of two professional speakers, who received speech-therapy guidance in order to avoid accents and regionalisms. Then 800 phonetically balanced utterances were selected for training. The training process used in FIVE to obtain acoustic models of the synthetic voices followed the HMM-based model proposed by Maia [21]. At least 10 hours of processing were required on a machine with a Pentium Dual Core CPU (2.1 GHz) and 3 GB of RAM.

The evaluation stage consisted of synthesizing 30 phonetically balanced sentences with varying numbers of words, including both interrogatives and statements. These sentences were synthesized with male and female voices based on the acoustic models obtained with FIVE, on Mbrola diphone models [22] and on the Loquendo commercial synthesizer. To evaluate the quality of the transcriptions of the spoken sentences, the percentages of sentences transcribed in full and of individual words transcribed correctly were calculated. Figure 4 shows the best percentages of correct transcriptions obtained in these experiments for the six voices analyzed (MM: Mbrola male, MF: Mbrola female, FM: FIVE male, FF: FIVE female, LF: Loquendo female, and LM: Loquendo male).

Fig. 4 Results for voice synthesis experiments

C. Speaker verification

To investigate the process of generating speaker engines, it was decided to create a text-dependent speaker verification engine. Thirty volunteer speakers (24 men and 6 women) with a mean age of 22.8 years, from the Northeast Region of Brazil, were selected, and the utterances that served as passwords were each speaker's name and surname. The audio samples were acquired as follows: the speakers were separated into groups of six, and each speaker recorded his/her own password 10 times, as well as the five passwords of his/her colleagues in the group, which served as a sample space of impostors. Features were extracted using the MFCC technique and the patterns were classified using the GMM technique. The results were compared to a threshold based on the mean and standard deviation. The basic error measures provided by FIVE are the False Acceptance Rate (FAR), the False Rejection Rate (FRR) and the Total Success Rate (TSR). Table 2 shows the results for each configuration of Gaussian mixtures.

TABLE 2 RESULTS FROM SPEAKER VERIFICATION EXPERIMENTS

Mixtures   FAR      FRR     TSR
16         11.23%   8.92%   89.92%
32         10.98%   8.47%   90.27%
64         9.43%    8.32%   91.12%

5. Samples of Instantiation

With a view to evaluating the instantiation of the speech recognition, speech synthesis and speaker verification engines produced by Maciel [3], three applications with a voice interface were developed in diverse technology environments: digital TV, mobile devices and telephones. To minimize the effort of studying the technologies that support these environments, the applications were implemented with a limited scope that was nevertheless sufficient to evaluate the instantiation features of the FIVE API.

A. Digital TV

The evaluation of instantiating FIVE in the digital TV environment was carried out by instantiating the speech recognition engine, described in section 4.A, in an interactive application compatible with the Brazilian standard for Digital TV. The objective of this application is to provide the user with a mechanism for interaction based on coloring a chameleon: whenever the user says a color, the chameleon shown changes to that color.

The application was developed using the Ginga platform [23]. Ginga is the middleware that enables interactive applications to be developed for Digital TV to the Brazilian standard independently of the hardware platform of the receiver manufacturers. The integration between FIVE and the Ginga platform occurred through the GingaJ API [24], which provides an infrastructure for running Java applications and extensions specifically geared to the TV environment. However, this API is not compatible with version 1.7 of Java (the version in which FIVE was developed). To work around this problem, a solution based on sockets was implemented to allow communication between the Chameleon application and FIVE.

The architecture used for implementing this project was all local. That is, FIVE was hosted locally on the (emulated) device that accessed the voice features by instantiating the API provided by FIVE. Figure 5 shows the demonstration screen of the digital TV application in its initial state, as soon as it is initialized on the emulator, with the green chameleon. Then the user sends a voice command with the word "yellow" and the chameleon changes its color to yellow.

Fig. 5 Demonstration screen of Digital TV application
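The socket-based workaround used for the digital TV application can be pictured with the following minimal sketch, in which two processes that cannot share a runtime exchange one request and one reply over a local TCP socket. It is written in Python for brevity (the original components are Java), and everything in it is hypothetical: the line-based protocol, the KNOWN_COLORS vocabulary and the recognize placeholder, which treats the request text as the spoken word instead of decoding audio. The paper does not describe the actual protocol.

```python
import socket
import threading

KNOWN_COLORS = {"green", "yellow", "red", "blue"}  # hypothetical vocabulary


def recognize(payload: bytes) -> str:
    """Placeholder for the engine call: the payload is treated as the spoken word."""
    word = payload.decode("utf-8").strip().lower()
    return word if word in KNOWN_COLORS else "unknown"


def serve_once(server: socket.socket) -> None:
    """Engine side: accept one connection, read one line, reply with one line."""
    conn, _ = server.accept()
    with conn, conn.makefile("rb") as reader:
        request = reader.readline()
        conn.sendall(recognize(request).encode("utf-8") + b"\n")


def ask_engine(port: int, utterance: str) -> str:
    """Application side: send one request and wait for the recognized word."""
    with socket.create_connection(("127.0.0.1", port)) as client, \
            client.makefile("rb") as reader:
        client.sendall(utterance.encode("utf-8") + b"\n")
        return reader.readline().decode("utf-8").strip()


def demo() -> str:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server,))
        worker.start()
        reply = ask_engine(port, "Yellow")
        worker.join()
        return reply


print(demo())
```

The value of this arrangement is that the two sides only need to agree on the bytes exchanged, so each can run on its own Java version; the cost is an extra process boundary and a protocol that has to be maintained by hand.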

B. Mobile Devices

The evaluation of instantiating FIVE in the mobile device environment was carried out by instantiating the speech synthesis engine, described in section 4.B, in an application for reading text messages. The aim of this application is to provide the user with a text box so that he/she can test the quality of the synthesized speech produced by FIVE.

The application was developed using the Android platform [25]. Android is a mobile operating system developed by Google that allows developers to write software in the Java programming language and to control the device via development libraries. The integration between FIVE and the Android platform was achieved using the Android SDK [26]. This tool provides a rich set of features that help to build applications for mobile devices, including an emulator and tools for debugging code and for analyzing memory use and performance. A plugin available for the Eclipse IDE allows greater integration between the application in question and other features of the Java language.

The architecture used to implement this project was all local. That is, FIVE was hosted locally on the (emulated) device that accessed the voice features by instantiating the API that FIVE made available. Figure 6 shows the screen in its initial state, i.e., as soon as the application is initialized in the emulator. The user then types in the desired text and listens to the speech synthesized by FIVE.

Fig. 6 Screen for Mobile Device Application

C. Telephones

The evaluation of instantiating FIVE in the telephone environment was carried out by instantiating the speaker verification engine, described in section 4.C, in an access control application operated via a phone call. The objective of this application is to provide the user with a mechanism for accessing a call centre system using his/her voice as a password.

The application was developed using the Asterisk platform [27]. Asterisk is free, open-source telephony software that can turn an ordinary computer into a communications server. With it, IP PABX systems, VoIP gateways, conference servers and much more can be created. Asterisk has been widely used by small and large businesses, call centers and governments all around the world. The integration between FIVE and the Asterisk platform was achieved using the Asterisk Gateway Interface library [28]. This library consists of an interface that adds functionalities to the Asterisk platform by means of several programming languages, including Java. An important component of this library is the FastAGI protocol, which allows communication between Asterisk features and applications developed in Java.

The architecture used to implement this project had two servers: the telephone server, responsible for controlling the Asterisk features, and the web server, responsible for hosting the access control application. The telephone server was equipped with a telephone board and configured in the Linux environment. Apache Tomcat version 6 and the speaker verification engine created by FIVE were installed on the web server. The capture of the telephone audio and its transmission to the web server were handled via the FastAGI protocol. Thus FIVE was able to process the input and return the result to the telephone server.

Figure 7 and Figure 8 show two dialogues carried out by the resulting access control system. The first shows the process of registering a voice password for a new user, while the second shows the access process.

System: Welcome to the access system. Let's start the registration process. Please say your voice password aloud and then press the five key when finished.
User: Alexandre Maciel
System: Please say your voice password aloud again and press the five key when finished.
User: Alexandre Maciel
System: OK. Say your voice password aloud one last time.
User: Alexandre Maciel
System: Voice password registered successfully.

Fig. 7 Process for registering voice password

System: Welcome to the access control system. Please say your voice password aloud.
User: Alexandre Maciel
System: Access granted.

Fig. 8 Access process by voice password

6. Conclusions

This paper presented a set of applications developed in digital TV, mobile device and telephone environments in order to investigate the ability of the FIVE framework to be instantiated on multiple platforms. The success rates of the engines used for instantiation, developed by Maciel [3], were sufficient to undertake the instantiation process, and the functionalities made available by the FIVE API proved to be efficient and easy to use. The limited scope of the applications did not adversely affect the evaluation of the instantiation process.

However, some difficulties were encountered. For the digital TV application, a socket-based solution was needed to get round the Java version incompatibility. Two alternatives could mitigate this problem: either an upgrade of the GingaJ platform to version 1.7 of Java or a downgrade of FIVE to version 1.4 of Java.

As to the application for mobile devices, there were no problems regarding the instantiation of FIVE on the Android platform. However, some resources of the framework are written in C, and this code had to be compiled specifically for Android. One way to mitigate this problem would be to port the remaining C code to Java.

Regarding the telephone application, no integration problem was found. Because the telephone server provided a simple hardware infrastructure in the Linux environment, instantiating the API in the application did not require considerable effort.

Given these results, the authors successfully managed to instantiate the engines produced by FIVE in digital TV, mobile and telephone environments. As FIVE is still a tool under development, many improvements can still be made in order to make it more platform-independent.

References
[1] Salvador, V. F. M., Oliveira Neto, J. S., Paiva, M. G. (2010) Evaluating Voice User Interfaces in Ubiquitous Applications. In: Designing Solutions-Based Ubiquitous and Pervasive Computing: New Issues and Trends. Edited by Milton Mendes, Pedro Fernandes. Natal, RN: IGI Global.
[2] Fechine, J. M. (2002) Reconhecimento Automático de Identidade Vocal utilizando Modelagem Híbrida: Paramétrica e Estatística. Thesis (Doctorate in Electrical Engineering). Federal University of Paraíba.
[3] Maciel, A. M. A. (2012) Investigação de um Ambiente de Desenvolvimento Integrado de Interface de Voz. Thesis (Doctorate in Computing Science). Federal University of Pernambuco.
[4] Maciel, A. M. A., Carvalho, E. C. B. (2010) FIVE - Framework for an Integrated Voice Environment. In: Proceedings of International Conference on Systems, Signal and Image Processing, Rio de Janeiro.
[5] Shneiderman, B. (2000) The limits of speech recognition. In: Communications of the ACM, v. 43, n. 9, p. 24-27.
[6] Cohen, M. H., Giangola, J. P., Balogh, J., Voice User Interface Design, Addison Wesley, 2004.
[7] Huang, X., Acero, A., Hon, H. W., Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001.
[8] Martins, V. F. (2011) Avaliação de Usabilidade para Sistemas de Transcrição Automática de Laudos em Radiologia. Thesis (Doctorate in Engineering). University of São Paulo.
[9] Duda, R. O., Hart, P. E., Stork, D. G., Pattern Classification, Wiley-Interscience, 2000.
[10] McLoughlin, I., Applied Speech and Audio Processing. Cambridge, Cambridge University Press, 2009.
[11] O'Shaughnessy, D. (2008) Automatic Speech Recognition: History, Methods and Challenges. In: Pattern Recognition, v. 41, p. 2965-2979.
[12] Gaikwad, S. K., Gawali, B. W., Yannawar, P. (2010) A Review on Speech Recognition Technique. In: International Journal of Computer Applications, v. 10, n. 3.
[13] Holmes, J., Speech Synthesis and Recognition, CRC Press, 2002.
[14] Stehman, P. (1997) Selecting and Interpreting Measures of Thematic Classification Accuracy. In: Remote Sensing of Environment, v. 62, p. 77-89.
[15] Cryer, H. and Home, S. (2010) Review of Methods for Evaluating Synthetic Speech. In: RNIB Centre for Accessible Information (CAI), Technical report # 8.
[16] Maciel, A. M. A., Veiga, A., Neves, C., Lopes, J., Lopes, C., Perdigão, F., Sá, L. A. (2008) Robust Speech Command Recognizer for Embedded Applications. In: Proceedings of International Conference on Signal Processing and Multimedia Application. Porto.
[17] McGlashan, S. et al. (2010) Voice Extensible Markup Language (VoiceXML) 3.0. Available at <http://www.w3.org/TR/voicexml30>. Accessed in March, 2013.
[18] Gamma, E., Helm, R., Johnson, R., Vlissides, J., Design Patterns: Elements of Reusable Object-Oriented Software, Addison Wesley, 1995.
[19] Maciel, A. M. A. (2007) Investigação de um Ambiente de Processamento de Voz utilizando VoiceXML. Dissertation (Master in Computer Science). Federal University of Pernambuco.
[20] HTK - Hidden Markov Models ToolKit. Available at <http://htk.eng.cam.ac.uk>. Accessed on: March 16, 2013.
[21] ETSI - ETSI ES 202 050, 2007. Technical Report. Available at <http://www.etsi.org/>. Accessed on: March 16, 2013.
[22] Maia, R. S. (2006) Speech Synthesis and Phonetic Vocoding for Brazilian Portuguese based on Parameter Generation from Hidden Markov Models. Thesis (Doctorate in Engineering). Nagoya Institute of Technology.
[23] Mbrola - The Mbrola Project. Available at <http://tcts.fpms.ac.be/synthesis>. Accessed on: March 16, 2013.
[24] Ginga - Middleware Aberto do Sistema Brasileiro de TV Digital. Available at <http://www.ginga.org.br>. Accessed on: March 16, 2013.
[25] GingaJ - API para desenvolvimento Java em ambiente de TV digital. Available at <http://www.lavid.ufpb.br>. Accessed on: March 16, 2013.
[26] Android - Sistema Operacional Móvel. Available at <http://www.android.com>. Accessed on: March 16, 2013.
[27] Android Software Development Kit (SDK). Available at <http://developer.android.com/sdk>. Accessed on: March 16, 2013.
[28] Asterisk - The Open Source Telephony Projects. Available at <http://www.asterisk.org>. Accessed on: March 16, 2013.
[29] Asterisk Gateway Interface. Available at <http://asterisk-java.org>. Accessed on: March 16, 2013.

Alexandre M. A. Maciel holds a doctorate in Computing Science from the Federal University of Pernambuco. He currently teaches on the Information Systems course at the University of Pernambuco and has done research at the Institute of Telecommunications of the University of Coimbra.

Edson C. B. Carvalho holds a PhD in Artificial Intelligence from the University of Canterbury. He has been an assistant professor at the Federal University of Pernambuco since 1982, has supervised more than 50 masters and doctorate students and has had more than 100 papers published in journals and conferences.
