
2013 Seventh International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing

A Remote Computer Control System Using Speech Recognition Technologies of Mobile Devices

Hae-Duck J. Jeong, Sang-Kug Ye, Jiyoung Lim, Ilsun You, and WooSeok Hyun*
Department of Computer Software, Korean Bible University, Seoul, South Korea
Email: hdjjeong@gmail.com, pajamasi726@hanmail.net, {jylim, isyou, wshyun}@bible.ac.kr

Hee-Kyoung Song
Software Development Center, KT Corporation, Seoul, South Korea
Email: hk.song@kt.com

*Corresponding Author

Abstract—This paper presents a remote control computer system using speech recognition technologies of mobile devices for the blind and physically disabled population. These people experience difficulty and inconvenience using computers through a keyboard and/or mouse. The purpose of this system is to provide a way that the blind and physically disabled population can easily control many functions of a computer via speech. The configuration of the system consists of a mobile device such as a smartphone, a PC server, and a Google server that are connected to each other. Users can command a mobile device to do something via speech, such as directly controlling computers, writing emails and documents, calculating numbers, checking the weather forecast, and managing a schedule; the commands are then immediately executed. The proposed system also provides blind people with a function via TTS (text to speech) of the Google server if they want to receive the contents of a document stored in a computer.

Keywords-Speech recognition technology; mobile device; Android; wireless communication technique

I. INTRODUCTION

Speech recognition technology, which recognizes human speech and converts it to text or performs a command, has emerged as the 'Next Big Thing' of the IT industry. Speech recognition allows equipment and services to be controlled by voice, without input devices such as a mouse or keyboard. Research on it began in the 1950s, but it was not popularized until the mid-2000s because of low recognition accuracy. Speech recognition technologies, which had long been used only for limited, special-purpose applications, have recently been evolving rapidly because of the proliferation of portable computing terminals such as smartphones, interconnected with the expansion of the cloud infrastructure [1].

One of the most prominent examples of a mobile voice interface is Siri, the voice-activated personal assistant that comes built into the latest iPhone. But voice functionality is also built into Android, the Windows Phone platform, and most other mobile systems, as well as many applications. While these interfaces still have considerable limitations, we are inching closer to machine interfaces we can actually talk to [2].

This paper presents a remote control computer system using speech recognition technologies of mobile devices for the blind and physically disabled population. These people experience difficulty and inconvenience using computers through a keyboard and/or mouse. The purpose of this system is to provide a way that the blind and physically disabled population can easily control many functions of a computer via speech. The configuration of the system consists of a mobile device such as a smartphone, a PC server, and a Google server that are connected to each other. Users command a mobile device to do something via speech, such as directly controlling computers, writing emails and documents, calculating numbers, checking the weather forecast, and managing a schedule; the commands are then immediately executed. The proposed system also provides blind people with a function via TTS (text to speech) of the Google server when they want to receive the contents of a document stored in a computer.

In Section II, related works and technologies of the proposed remote computer control system are discussed. Section III presents how the proposed system using speech recognition technologies is designed and implemented, and finally the conclusions are given in Section IV.

II. RELATED WORKS AND TECHNOLOGIES

The related works and technologies of the proposed remote computer control system using speech recognition technologies of mobile devices are Android and speech recognition algorithms, as follows.

A. Android

Android is a Linux-based open mobile platform for mobile devices such as smartphones and tablet computers. It is composed of not only an operating system but also middleware, a user interface (UI), a browser, and applications. It also includes C/C++ libraries that are used in components of various Android systems [3]. Figure 1 shows that the Android system architecture is divided into five hierarchical categories: applications, application framework, libraries, Android runtime, and Linux kernel [4], [5], [6]. The proposed application was designed and developed on Android.

Figure 1. Android system architecture.
B. Speech Recognition Algorithms

1) Google Speech Recognition
Google uses artificial intelligence algorithms to recognize spoken sentences, stores voice data anonymously for analysis purposes, and cross-matches spoken data with written queries on the server. Key problems of computational power, data availability, and managing large amounts of information are handled with ease using the android.speech.RecognizerIntent package [5]. The client application starts up and prompts the user for input using Google Speech Recognition. The input data is sent to the Google server for processing, and the recognized text is returned to the client. The input text is then passed to the natural language processing (NLP) server using an HTTP (HyperText Transfer Protocol) POST request¹, and the server performs NLP. The data flow diagram of speech recognition in Figure 2 shows that there are several steps involved in NLP, as in the following:

a) Lexical Analysis converts a sequence of characters into a sequence of tokens.
b) Morphological Analysis identifies, analyzes, and describes the structure of a given language's linguistic units.
c) Syntactic Analysis analyzes texts, which are made up of a sequence of tokens, to determine their grammatical structure.
d) Semantic Analysis relates syntactic structures, from the levels of phrases and sentences, to their language-independent meanings.

Figure 2. Data flow diagram of speech recognition.

¹A POST request is used to send data to a server. The string detected by the speech recognizer is passed to the server using this method, which is accomplished with the built-in HttpCore API (i.e., the org.apache.http package). The server performs processing and returns a JSON (JavaScript Object Notation) response. JSON is a lightweight data-interchange format, is based on a subset of the JavaScript programming language, and is completely language independent. In Java, org.json.JSONObject is used to parse strings [5].
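The footnote describes how the recognized string is POSTed to the NLP server and how the JSON reply is parsed with org.json.JSONObject. The listing below is a minimal sketch of that exchange, not the authors' code: the server URL, the form field name "query", and the response field "action" are assumptions for illustration, since the paper does not specify them.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.http.HttpResponse;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.util.EntityUtils;
    import org.json.JSONObject;

    public class NlpClient {
        // Sends the recognized text to the NLP server and returns the parsed action.
        public static String queryNlpServer(String recognizedText) throws Exception {
            HttpClient client = new DefaultHttpClient();
            HttpPost post = new HttpPost("http://example.com/nlp");      // assumed server URL

            List<NameValuePair> params = new ArrayList<NameValuePair>();
            params.add(new BasicNameValuePair("query", recognizedText)); // assumed field name
            post.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));

            HttpResponse response = client.execute(post);                // HTTP POST (see footnote)
            String body = EntityUtils.toString(response.getEntity(), "UTF-8");

            JSONObject json = new JSONObject(body);                      // parse the JSON reply
            return json.getString("action");                             // assumed response field
        }
    }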
2) Hidden Markov Model
Modern general-purpose speech recognition systems are based on Hidden Markov Models (HMMs). An HMM is a doubly stochastic process with an underlying stochastic process that is not observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observed symbols [7], [8]. HMMs are statistical models that output a sequence of symbols or quantities, and they are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. On short time-scales (e.g., 10 milliseconds), speech can be approximated as a stationary process. Speech can be thought of as a Markov model for many stochastic purposes [9]. Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech, de-correlating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
The following notations for a discrete observation HMM are defined. Let $t = 1, 2, \ldots, T$ index the observations (i.e., the clock times), where $T$ is the length of the observation sequence. Let $Q = \{q_1, q_2, \ldots, q_N\}$ be the states, where $N$ is the number of states; $V = \{v_1, v_2, \ldots, v_M\}$ be the discrete set of possible symbol observations, where $M$ is the number of possible observations; $A = \{a_{ij}\}$ be the state transition probability distribution, where $a_{ij} = \Pr(q_j \text{ at } t+1 \mid q_i \text{ at } t)$; $B = \{b_j(k)\}$ be the observation symbol probability distribution in state $j$, where $b_j(k) = \Pr(v_k \text{ at } t \mid q_j \text{ at } t)$; and $\pi = \{\pi_i\}$ be the initial state distribution, where $\pi_i = \Pr(q_i \text{ at } t = 1)$ [8].

The mechanism of the HMM is explained in the following:

Step-1. Choose an initial state, $i_1$, according to the initial state distribution, $\pi$.
Step-2. Set $t = 1$.
Step-3. Choose $O_t$ according to $b_{i_t}(k)$, the symbol probability distribution in state $i_t$.
Step-4. Choose $i_{t+1}$ according to $\{a_{i_t i_{t+1}}\}$, $i_{t+1} = 1, 2, \ldots, N$, the state transition probability distribution for state $i_t$.
Step-5. Set $t = t + 1$; return to Step-3 if $t < T$; otherwise terminate the procedure.
We use the compact notation $\lambda = (A, B, \pi)$ to represent an HMM. For every fixed state sequence $I = i_1 i_2 \cdots i_T$, the probability of the observation sequence $O$ is $\Pr(O \mid I, \lambda)$, where

$\Pr(O \mid I, \lambda) = b_{i_1}(o_1)\, b_{i_2}(o_2) \cdots b_{i_T}(o_T)$.   (1)

In other words, the probability of such a state sequence $I$ is

$\Pr(I \mid \lambda) = \pi_{i_1} a_{i_1 i_2} a_{i_2 i_3} \cdots a_{i_{T-1} i_T}$.   (2)

The joint probability of $O$ and $I$ is simply the product of the above two terms,

$\Pr(O, I \mid \lambda) = \Pr(O \mid I, \lambda)\, \Pr(I \mid \lambda)$.   (3)

Then the probability of $O$ is obtained by summing this joint probability over all possible state sequences:

$\Pr(O \mid \lambda) = \sum_{\text{all } I} \Pr(O \mid I, \lambda)\, \Pr(I \mid \lambda)$   (4)

$\qquad = \sum_{i_1, i_2, \ldots, i_T} \pi_{i_1} b_{i_1}(o_1)\, a_{i_1 i_2} b_{i_2}(o_2) \cdots a_{i_{T-1} i_T} b_{i_T}(o_T)$.   (5)
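As a concrete illustration of the generation mechanism (Steps 1-5) and of equations (4)-(5), the sketch below samples an observation sequence from a small discrete HMM and evaluates $\Pr(O \mid \lambda)$ by brute-force enumeration of all state sequences. The parameter values are made-up toy numbers, not taken from the paper, and a real recognizer would use the forward algorithm instead of this enumeration, which grows exponentially in $T$.

    import java.util.Random;

    // Toy discrete HMM: N = 2 states, M = 3 observation symbols (made-up values).
    public class HmmDemo {
        static final double[]   PI = {0.6, 0.4};                          // initial state distribution pi
        static final double[][] A  = {{0.7, 0.3}, {0.4, 0.6}};            // state transition probabilities a_ij
        static final double[][] B  = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};  // observation probabilities b_j(k)

        // Steps 1-5: generate an observation sequence of length T from the model.
        static int[] sample(int T, Random rnd) {
            int[] obs = new int[T];
            int state = draw(PI, rnd);                 // Step-1: initial state from pi
            for (int t = 0; t < T; t++) {              // Step-2/Step-5: loop over clock times
                obs[t] = draw(B[state], rnd);          // Step-3: emit a symbol from b_{i_t}(k)
                state = draw(A[state], rnd);           // Step-4: move to the next state via a_ij
            }
            return obs;
        }

        // Equations (4)-(5): Pr(O|lambda) by summing Pr(O,I|lambda) over every state sequence I.
        static double evaluate(int[] obs) {
            int N = PI.length, T = obs.length;
            double total = 0.0;
            for (int seq = 0; seq < Math.pow(N, T); seq++) {
                double p = 1.0;
                int prev = -1, code = seq;
                for (int t = 0; t < T; t++) {
                    int state = code % N; code /= N;   // decode one state sequence from the counter
                    p *= (t == 0 ? PI[state] : A[prev][state]) * B[state][obs[t]];
                    prev = state;
                }
                total += p;                            // accumulate the joint probability of eq. (3)
            }
            return total;
        }

        // Sample an index from a discrete probability distribution.
        static int draw(double[] dist, Random rnd) {
            double u = rnd.nextDouble(), c = 0.0;
            for (int i = 0; i < dist.length; i++) { c += dist[i]; if (u <= c) return i; }
            return dist.length - 1;
        }

        public static void main(String[] args) {
            int[] obs = sample(5, new Random(42));
            System.out.println("Pr(O|lambda) = " + evaluate(obs));
        }
    }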
3) Neural Networks
Neural networks emerged as an attractive acoustic modeling approach in automatic speech recognition (ASR) in the late 1980s. Since then, neural networks have been used in many aspects of speech recognition such as phoneme classification, isolated word recognition, and speaker adaptation [9], [10]. In contrast to HMMs, neural networks make no assumptions about feature statistical properties and have several qualities making them attractive recognition models for speech recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient manner, and few assumptions on the statistics of input features are made. However, in spite of their effectiveness in classifying short-time units such as individual phones and isolated words, neural networks are rarely successful for continuous recognition tasks, largely because of their lack of ability to model temporal dependencies. Thus, one alternative approach is to use neural networks as a pre-processing stage (e.g., feature transformation or dimensionality reduction) for HMM-based recognition.

4) Other Speech Recognition Systems
Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics, and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and a linear discriminant analysis (LDA)-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques that dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE), and minimum phone error (MPE) [9].
III. IMPLEMENTATION AND RESULTS

Figure 3. Command transmission methods among a mobile device, a Google server, and a personal computer server.

Figure 3 shows the architecture of the proposed system and the command transmission methods among a mobile device, a Google server, and a personal computer server. The roles of each numbered step are as follows:

1) A user issues a command using the speech recognition application of the mobile device.
2) Execute STT (speech to text) through the Google server.
3) Transmit the results obtained from STT to the mobile device.
4) Transmit the results obtained from STT to the personal computer server via wireless communications such as 3G, WiFi, and Bluetooth.
5) The personal computer server analyzes the corresponding commands and distinguishes between information that is sent to the Google server and information that is executed on the personal computer server (a sketch of this dispatch step follows the list).
6) Transmit information to the Google server if any of the commands require the Google server.
7) The Google server returns the corresponding values after analyzing the corresponding services.
8) Give the user the information received from the Google server as voice messages, or execute it.
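The dispatch in step 5 can be pictured as a small server loop that receives the recognized command text from the mobile device and decides whether to handle it locally or to forward it to the Google server. The listing below is only an illustration under assumed details: the port number, the command keywords, and the helpers forwardToGoogle and executeLocally are hypothetical, since the paper does not describe its transport protocol.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Minimal PC-server loop: receive a recognized command string and dispatch it (step 5).
    public class CommandServer {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(9000);          // assumed port
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream(), "UTF-8"))) {
                    String command = in.readLine();                // one command per connection
                    if (command == null) continue;
                    // Decide whether the Google server is needed (steps 6-8) or the
                    // command can be executed directly on the PC (keywords assumed).
                    if (command.contains("weather") || command.contains("schedule")) {
                        forwardToGoogle(command);                  // hypothetical helper
                    } else {
                        executeLocally(command);                   // hypothetical helper
                    }
                }
            }
        }

        static void forwardToGoogle(String command) { /* send the query on to the Google server */ }
        static void executeLocally(String command)  { /* keyboard/mouse/application control on the PC */ }
    }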
Figure 4. Overall flowchart of the proposed system.

Figure 4 shows the overall flowchart of the proposed system, which contains five main functions: speech recognition, keyboard control, mouse control, simple mode, and text transmission.

The remote computer control system using speech recognition technologies of mobile devices was implemented in the Java programming language, and the proposed application was designed and developed on Android. Figure 5 shows the Java code of speech recognition for the proposed application: startVoiceRecognitionActivity fires an intent to start the speech recognition activity, and onActivityResult handles the results from the recognition activity.

Figure 5. Java code of speech recognition.
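Figure 5 is reproduced in the paper only as an image, so the listing below is a reconstruction of what such code typically looks like rather than the authors' exact source: a sketch of the two methods named in the text, using the standard android.speech.RecognizerIntent API. The request code, the prompt string, and the sendCommandToServer helper are assumptions.

    import android.app.Activity;
    import android.content.Intent;
    import android.speech.RecognizerIntent;
    import java.util.ArrayList;

    public class SpeechControlActivity extends Activity {
        private static final int VOICE_RECOGNITION_REQUEST_CODE = 1234;   // assumed request code

        // Fires an intent to start the speech recognition activity (Google Speech Recognition).
        private void startVoiceRecognitionActivity() {
            Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                    RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
            intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speak a command");  // assumed prompt
            startActivityForResult(intent, VOICE_RECOGNITION_REQUEST_CODE);
        }

        // Handles the results returned by the speech recognition activity.
        @Override
        protected void onActivityResult(int requestCode, int resultCode, Intent data) {
            if (requestCode == VOICE_RECOGNITION_REQUEST_CODE && resultCode == RESULT_OK) {
                ArrayList<String> matches =
                        data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
                if (matches != null && !matches.isEmpty()) {
                    sendCommandToServer(matches.get(0));   // hypothetical helper: forward text to the PC server
                }
            }
            super.onActivityResult(requestCode, resultCode, data);
        }

        private void sendCommandToServer(String command) { /* transmit via 3G, WiFi, or Bluetooth */ }
    }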

Figure 6 shows speech recognition by touching the mobile device screen. When executing speech recognition by touching the top of the mobile device screen, all speech contents are typed and saved on the computer. When executing speech recognition by touching the bottom, the corresponding service is executed by recognizing the speech contents. For example, a user commands the mobile device with 'What is today's weather?' and the remote system then answers 'Today is 20 degrees Celsius and the weather is fine.' Another example is that a user outside commands his/her mobile device with 'Send the meeting document in the document folder.' and the system then finds it in the folder and transmits it to the user's mobile device or to a personal computer that he/she wants. Figure 7 demonstrates computer keyboard control by touching the smartphone screen; Figure 8 presents computer mouse control by touching the smartphone screen, with double click, left click, and right click buttons (a sketch of the server-side click injection follows the figure captions below). Execution of the applications users want in the simple mode is shown in Figure 9.

Figure 6. Speech recognition by touching the smartphone screen.

Figure 7. Computer keyboard control by touching the smartphone screen.

Figure 8. Computer mouse control by touching the smartphone screen. There are double click, left click, and right click buttons.

Figure 9. Execution of applications users want on the simple mode.
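On the PC-server side, keyboard and mouse control of the kind shown in Figures 7 and 8 can be realized with the standard java.awt.Robot class. The following is a minimal sketch, not the authors' implementation; the command names and their mapping to actions are assumptions.

    import java.awt.Robot;
    import java.awt.event.InputEvent;
    import java.awt.event.KeyEvent;

    // Minimal sketch of server-side keyboard/mouse injection for commands
    // received from the smartphone (command names are assumed).
    public class RemoteInputExecutor {
        private final Robot robot;

        public RemoteInputExecutor() throws Exception {
            robot = new Robot();
        }

        public void execute(String command) {
            switch (command) {
                case "LEFT_CLICK":
                    robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
                    robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
                    break;
                case "RIGHT_CLICK":
                    robot.mousePress(InputEvent.BUTTON3_DOWN_MASK);
                    robot.mouseRelease(InputEvent.BUTTON3_DOWN_MASK);
                    break;
                case "DOUBLE_CLICK":
                    execute("LEFT_CLICK");
                    execute("LEFT_CLICK");
                    break;
                case "TYPE_A":                       // example of a single key press
                    robot.keyPress(KeyEvent.VK_A);
                    robot.keyRelease(KeyEvent.VK_A);
                    break;
                default:
                    System.out.println("Unknown command: " + command);
            }
        }

        public static void main(String[] args) throws Exception {
            new RemoteInputExecutor().execute("DOUBLE_CLICK");
        }
    }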
IV. CONCLUSION

This paper presented a remote control computer system using speech recognition technologies of mobile devices for the blind and physically disabled population. Those people experience difficulty and inconvenience in using computers through a keyboard and/or mouse. The major purpose of this system was to provide a means by which the blind and physically disabled population can easily control many functions of a computer via speech; the system is very useful for the general population as well. Users command a mobile device to do something via speech, such as directly controlling computers, writing emails and documents, calculating numbers, checking the weather forecast, and managing the schedule; the commands are then immediately executed. The proposed system also provides blind people with a function via TTS (text to speech) of the Google server if they want to receive the contents of a document stored in a computer.

ACKNOWLEDGMENT

The authors would like to give thanks to the funding agencies for providing financial support. Parts of this work were supported by a research grant from Korean Bible University. The authors also thank Robert Hotchkiss and the anonymous referees for their constructive remarks and valuable comments.
REFERENCES

[1] Korea Creative Contents Agency, "Trends and Prospects of Speech Recognition Technologies," 2011.

[2] W. Knight, "Where Speech Recognition Is Going," MIT Technology Review, technologyreview.com, 2012.

[3] Android, "Android Operating System," Wikipedia, http://en.wikipedia.org/wiki/Android_OS.

[4] J. Aguero, M. Rebollo, C. Carrascosa, and V. Julian, "Does Android Dream with Intelligent Agents?" Advances in Soft Computing, vol. 50, pp. 194–204, 2009.

[5] A. Agarwal, K. Wardhan, and P. Mehta, "JEEVES - A Natural Language Processing Application for Android," http://www.slideshare.net, 2012.

[6] C.-Y. Lee, B. An, and H.-Y. Ahn, "Android based Local SNS," Institute of Webcasting, Internet Television and Telecommunication, vol. 10, no. 6, pp. 93–98, 2010.

[7] S.-S. Jarng, "Analysis of HMM Voice Recognition Algorithm," Journal of Advanced Engineering and Technology, vol. 3, no. 3, pp. 241–249, 2010.

[8] L. Rabiner and B. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, pp. 4–16, 1986.

[9] Wikipedia, "Speech Recognition," http://en.wikipedia.org/wiki/Speech_recognition.

[10] Z.-H. Tan and I. Varga, "Network, Distributed and Embedded Speech Recognition: An Overview," Advances in Pattern Recognition, 2008.
