
Automatic Speech Recognition
JOHN LEVIS AND RUSLAN SUVOROV

Definition

Automatic speech recognition (ASR) is an independent, machine-based process of decoding
and transcribing oral speech. A typical ASR system receives acoustic input from a speaker
through a microphone, analyzes it using some pattern, model, or algorithm, and produces
an output, usually in the form of a text (Lai, Karat, & Yankelovich, 2008).
It is important to distinguish speech recognition from speech understanding (or speech iden-
tification), the latter being the process of determining the meaning of an utterance rather
than its transcription. Speech recognition is also different from voice recognition: whereas
speech recognition refers to the ability of a machine to recognize the words that are spoken
(i.e., what is said), voice recognition involves the ability of a machine to recognize speaking
style (i.e., who said something).

Historical Overview

Pioneering work on ASR dates to the early 1950s. The first ASR system, developed at Bell
Telephone Laboratories by Davis, Biddulph, and Balashek (1952), could recognize isolated
digits from 0 to 9 for a single speaker. In 1956, Olson and Belar created a phonetic typewriter
that could recognize ten discrete syllables. It was also speaker-dependent and required
extensive training.
These early ASR systems used template-based recognition based on pattern matching
that compared the speaker’s input with prestored acoustic templates or patterns. Pattern
matching operates well at the word level for recognition of phonetically distinct items in
small vocabularies, but is less effective for larger vocabulary recognition. Another limita-
tion of pattern matching is its inability to match and align input speech signals with
prestored acoustic models of different lengths. The performance of these ASR systems
was therefore limited: their acoustic approaches recognized only basic units of speech
clearly enunciated by a single speaker (Rabiner & Juang, 1993).
An early attempt to construct speaker-independent recognizers by Forgie and Forgie
(1959) was also the first to use a computer. Later, researchers experimented with time-
normalization techniques (such as dynamic time warping, or DTW) to minimize differences
in speech rates of different talkers and to reliably detect speech starts and ends (e.g., Martin,
Nelson, & Zadell, 1964; Vintsyuk, 1968). Reddy (1966) attempted to develop a system
capable of recognizing continuous speech by dynamically tracking phonemes.
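To make the time-normalization idea concrete, the following is a minimal sketch of DTW on toy one-dimensional sequences; the values stand in for real spectral feature vectors, and the implementation is illustrative rather than drawn from any of the systems cited above.

```python
# Minimal sketch of dynamic time warping (DTW): align two feature
# sequences of different lengths by finding the minimum-cost warping path.
# Toy 1-D "acoustic" sequences stand in for real spectral feature vectors.

def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between frames
            cost[i][j] = d + min(cost[i - 1][j],      # a stretched
                                 cost[i][j - 1],      # b stretched
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# A slow and a fast rendition of the "same word": DTW tolerates the
# difference in speaking rate that frame-by-frame comparison cannot.
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
fast = [1, 2, 3, 2, 1]
print(dtw_distance(slow, fast))          # small: same shape, different rate
print(dtw_distance(slow, [5, 5, 5, 5]))  # large: a different "word"
```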
The 1970s were marked by several milestones: focus on the recognition of continuous
speech, development of large vocabulary speech recognizers, and experiments to create
truly speaker-independent systems. During this period, the first commercial ASR system,
the VIP-100, appeared and won a US National Award. This success prompted the
Advanced Research Projects Agency (ARPA) of the US Department of Defense to fund
the Speech Understanding Research (SUR) project from 1971 to 1976 (Markowitz, 1996).
The goal of SUR was to create a system capable of understanding the connected speech
of several speakers from a 1,000-word vocabulary in a low-noise environment with an
error rate of less than ten percent. Of the six resulting systems, the most viable were Hearsay II, HWIM
(hear what I mean), and Harpy, the only system that completely achieved SUR's goal
(Rodman, 1999). The systems had a profound impact on ASR research and development
by demonstrating the benefits of data-driven statistical models over template-based
approaches and helping move ASR research toward statistical modeling methods such as
hidden Markov modeling (HMM).
Unlike pattern matching, HMM is based on complex statistical and probabilistic analyses
(Peinado & Segura, 2006). In simple terms, hidden Markov models represent language
units (e.g., phonemes or words) as a sequence of states with transition probabilities between
states. To move from one state to another, the model follows the transition with the highest
probability (see Figure 1).

Figure 1 A simple four-state Markov model with transition probabilities: (a 0.2 | the 0.8) → (student 0.7 | book 0.2 | rain 0.1) → (must 1.0) → (learn 0.6 | go 0.4)
The main strength of HMM is that it can describe the probability of states and represent
their order and variability through techniques such as the Baum-Welch (training) and
Viterbi (decoding) algorithms. In other words, HMM can analyze both the temporal and spectral
variations of speech signals, and can recognize and efficiently decode continuous speech
input. However, HMMs require extensive training, a large amount of memory, and huge com-
putational power for model-parameter storage and likelihood evaluation (Burileanu, 2008).
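As a concrete illustration of Viterbi decoding over such a model, here is a minimal sketch using an invented two-state HMM; the state names, observation symbols, and probabilities are toy assumptions, not values from the chapter or from any real recognizer.

```python
# Minimal Viterbi decoding over a toy two-state HMM. States stand for
# hypothetical "vowel"/"consonant" sound classes; all probabilities are
# invented for illustration.

trans = {"V": {"V": 0.3, "C": 0.7}, "C": {"V": 0.6, "C": 0.4}}
emit = {"V": {"lo": 0.2, "hi": 0.8}, "C": {"lo": 0.9, "hi": 0.1}}
start = {"V": 0.5, "C": 0.5}

def viterbi(observations):
    # best[s] = (probability, path) of the likeliest sequence ending in s
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in start}
    for obs in observations[1:]:
        best = {
            s: max(
                ((p * trans[prev][s] * emit[s][obs], path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda t: t[0],
            )
            for s in start
        }
    return max(best.values(), key=lambda t: t[0])

prob, path = viterbi(["hi", "lo", "lo", "hi"])
print(path, prob)  # most likely hidden state sequence and its probability
```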
Although HMM became the primary focus of ASR research in the 1980s, this period
was also characterized by the reintroduction of artificial neural network (ANN) models,
abandoned since the 1950s due to numerous practical problems. Neural networks are
modeled on the human neural system. A network consists of interconnected processing
elements (units) combined in layers with different weights that are determined on the
basis of the training data (see Figure 2). A typical ANN takes an acoustic input, processes
it through the units, and produces an output (i.e., a recognized text). To classify and
recognize the input correctly, the network uses the values of these weights.

Figure 2 A simple artificial neural network: four input units, one hidden layer, one output unit
The main advantage of ANNs lies in the classification of static patterns (including noisy
acoustic data), which is particularly useful for recognizing isolated speech units. However,
pure ANN-based systems are not effective for continuous speech recognition, so ANNs
are often integrated with HMM in a hybrid approach (Torkkola, 1994).
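The forward pass described above can be sketched in a few lines. Everything here is illustrative: the weights are arbitrary placeholders (in a real recognizer they would be learned from training data), and the four inputs mirror the simple network of Figure 2.

```python
import math

# Minimal sketch of a feedforward network like Figure 2: four inputs,
# one hidden layer, one output. Weights are arbitrary stand-ins for
# values that would be learned from training data.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # Each unit: weighted sum of its inputs passed through a nonlinearity.
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# 4 inputs -> 3 hidden units -> 1 output
W_hidden = [[0.5, -0.2, 0.1, 0.7],
            [-0.3, 0.8, 0.2, -0.1],
            [0.1, 0.4, -0.6, 0.3]]
b_hidden = [0.0, 0.1, -0.1]
W_out = [[1.2, -0.7, 0.5]]
b_out = [0.0]

acoustic_input = [0.9, 0.1, 0.4, 0.7]  # e.g., normalized spectral features
hidden = layer(acoustic_input, W_hidden, b_hidden)
output = layer(hidden, W_out, b_out)
print(output)  # score that the input matches a particular speech unit
```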
The use of HMM and ANN in the 1980s led to considerable efforts toward constructing
systems for large-vocabulary continuous speech recognition. During this time ASR was
introduced in public telephone networks and portable speech recognizers were offered to
the public. Commercialization continued in the 1990s, when ASR was integrated into
products from PC-based dictation systems to air-traffic-control training systems.
During the 1990s, ASR research focused on extending speech recognition to large vocabu-
laries for dictation, spontaneous speech recognition, and speech processing in noisy
environments. This period was also marked by systematic evaluations of ASR technologies
based on word or sentence error rates, and by the construction of applications able
to mimic human-to-human speech communication by having a dialogue with a human
speaker (e.g., Pegasus and How May I Help You?). Additionally, work on visual speech
recognition (i.e., recognition of speech using visual information such as lip position and
movements) began and continued after 2000 (Liew & Wang, 2009).
The 2000s witnessed further progress in ASR, including the development of new algo-
rithms and modeling techniques, advances in noisy speech recognition, and the integration
of speech recognition into mobile technologies. A more recent trend is the development
of emotion recognition systems that identify emotions from speech using facial expressions,
voice tone, and gestures (Schuller, Batliner, Steidl, & Seppi, 2009).

Characteristics of ASR Systems

Speech recognition systems can be characterized by three main dimensions: speaker depend-
ence, speech continuity, and vocabulary size. According to the speech data in the training
database, ASR systems can be speaker-dependent (when the system has to be trained for
each individual speaker), speaker-independent (when the training database contains numer-
ous speech examples from different speakers so the system can accurately recognize any
new speaker), and adaptive (when the system starts out as speaker-independent and then
gradually adapts to a particular user through training). From the dimension of speech
continuity, there are (a) isolated (or discrete) word recognition systems, which identify words
uttered in isolation; (b) connected word recognition systems, which recognize separate
words pronounced without pauses between them; (c) continuous speech recognition systems,
which can recognize whole sentences without pauses between words; and
(d) word-spotting systems, which extract individual words and phrases from a continuous
stream of speech. Finally, ASR systems can be characterized based on the vocabulary size
(i.e., small or large vocabulary) to which they are trained.
The performance of speech recognition systems depends on each of the three dimensions
and is prone to three types of errors: errors in discrete speech recognition, errors in con-
tinuous speech recognition, and errors in word spotting (Rodman, 1999). Errors in discrete
speech recognition include deletion errors (when a system ignores an utterance due to the
speaker’s failure to pronounce it loudly enough), insertion errors (when a system perceives
noise as a speech unit), substitution errors (when a recognizer identifies an utterance
incorrectly, e.g., We are thinking instead of We are sinking), and rejection errors (when the
speaker’s word is rejected by a system, for instance, because it has not been included in
the vocabulary). Errors in continuous speech recognition can also involve the same types
of errors. In addition, this group contains splits, when one speech unit is mistakenly recog-
nized as two or more units (e.g., euthanasia for youth in Asia), and fusions, when two or
more speech units are perceived by a system as one unit (e.g., deep end as depend). Finally,
errors in word spotting include false rejects, when a word in the input is missed, and false
alarms, when a word is misidentified.
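The deletion, insertion, and substitution categories above are exactly what the standard word error rate (WER) metric counts: the minimum number of such edits needed to turn the reference transcript into the recognizer's hypothesis, divided by the number of reference words. A minimal sketch, assuming whitespace-tokenized strings:

```python
# Minimal sketch of word error rate (WER): a minimum-edit-distance
# alignment of reference against hypothesis, where each edit is a
# substitution, insertion, or deletion.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = fewest edits to turn ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i            # i deletions
    for j in range(m + 1):
        d[0][j] = j            # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[n][m] / n

# A substitution error of the "We are sinking" kind:
print(wer("we are thinking", "we are sinking"))  # 1 edit / 3 words ≈ 0.33
# A fusion ("deep end" heard as "depend") costs a substitution plus a deletion:
print(wer("the deep end", "the depend"))         # 2 edits / 3 words ≈ 0.67
```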
According to another classification, errors in ASR systems can be direct, intent, and
indirect (Halverson, Horn, Karat, & Karat, 1999). A direct error appears when a human
misspeaks or stutters. An intent error occurs when the speaker decides to restate what has
just been said. Finally, an indirect error is made when an ASR system misrecognizes the
speaker’s input.

Challenges and Applications of ASR

Automatic speech recognition is multidisciplinary. State-of-the-art ASR systems require
knowledge from disciplines such as linguistics, computer science, signal processing,
acoustics, communication theory, statistics, physiology, and psychology. Developing an
effective ASR system poses a number of challenges. They include speech variability (e.g.,
intra- and interspeaker variability such as different voices, accents, styles, contexts, and
speech rates), recognition units (e.g., words and phrases, syllables, phonemes, diphones,
and triphones), language complexity (e.g., vocabulary size and difficulty), ambiguity (e.g.,
homophones, word boundaries, syntactic and semantic ambiguity), and environmental
conditions (e.g., background noise or several people speaking simultaneously).
Despite these challenges, there are numerous commercial ASR products, including
Dragon NaturallySpeaking, Embedded ViaVoice, Loquendo, LumenVox, VoCon, and
Nuance Recognizer. Some of these have applications in computer-system interfaces (e.g.,
voice control of computers, data entry, dictation), education (e.g., toys, games, language
translators, language learning software), healthcare (e.g., systems for creating various
medical reports, aids for blind and visually impaired patients), telecommunications (e.g.,
phone-based interactive voice response systems for banking services, information services),
manufacturing (e.g., quality control monitoring on an assembly line), military (e.g., voice
control of fighter aircraft), and consumer products and services (e.g., car navigation systems,
household appliances, and mobile devices).

ASR in Applied Linguistics

ASR has tremendous potential in applied linguistics. In one application area, that of
language teaching, Eskenazi (1999) compares the strengths of ASR-based practice to those
of effective immersion language learning in developing spoken-language skills. ASR-based
systems can give foreign-language learners a way to hear the language spoken by many
different speakers, to produce large amounts of speech themselves, and to receive relevant
feedback. In addition, Eskenazi (1999) suggests that using ASR-based computer-assisted language
learning (CALL) materials allows learners to feel at greater ease and get more consistent
assessment of their skills. ASR can also be used for virtual dialogues with native speakers
(Harless, Zier, & Duncan, 1999) and for pronunciation training (Dalby & Kewley-Port,
1999). Most importantly, learners enjoy ASR applications. Study after study indicates that
appropriately designed software that includes ASR offers language learners substantial benefits
in terms of practice, motivation, and the feeling that they are actually communicating in the
language rather than simply repeating predigested words and sentences.
The holy grail of a computer that matches human speech recognition remains out of
reach at present. A number of limitations are consistently obvious in attempts to apply
ASR systems to foreign-language-learning contexts. The major limitations occur because
most ASR systems have been designed to work with a limited range of native speech
patterns. Consequently, most ASR systems perform poorly in recognizing non-native speech,
both at the phone level and for prosody (Eskenazi, 1999). In one study, Derwing, Munro
and Carbonaro (2000) tested Dragon NaturallySpeaking’s ability to identify errors in speech
of very advanced L2 speakers of English. While human listeners were able to successfully
transcribe between 95% and 99.7% of the words, the recognition rates by the program were
a respectable 90% for native English speakers, but only 71–72% for the non-native speakers.
In addition, ASR systems have been built for recognition rather than assessment and
feedback, but language learners require feedback on their speech to make progress.

Automatic Rating of Pronunciation

Many studies have examined whether ASR systems can identify pronunciation errors in
non-native speech and give feedback that can help learners and teachers know what areas
of foreign-language pronunciation are most important for intelligibility. Dalby and Kewley-
Port (1999) demonstrate that such diagnosis and assessment are possible (to some extent)
for minimal pairs, and that automatic ratings of pronunciation accuracy can correlate with
human ratings. However, the kind of feedback given to learners is not usually very helpful.
For systems that attempt to provide such feedback, there are two options: giving a global
pronunciation rating or identifying specific errors. To reach either goal, ASR systems need
to identify word boundaries, accurately align speech to intended targets, and compare the
segments produced with those that should have been produced. A variety of systems
have been designed to provide global evaluations of pronunciation using automatic measures
including speech rate, duration, and spectral analyses (e.g., Neumeyer, Franco, Digalakis,
& Weintraub, 2000; Witt & Young, 2000). These studies consistently find that automatic
measures never match human ratings, but that a combination of automatic measures
always outperforms any single measure.
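A minimal sketch of the combination idea, with invented toy data: several automatic measures per utterance are fitted, by least squares, to human ratings, so that the combined score tracks the raters better than any single measure would. The feature names, values, and ratings below are placeholders, not data from the studies cited.

```python
import numpy as np

# Hypothetical sketch: combine automatic pronunciation measures into one
# global score by fitting weights against human ratings (in the spirit of
# Neumeyer et al., 2000). All values below are invented toy data.

# Rows: utterances; columns: speech rate, pause proportion, spectral match.
X = np.array([[0.62, 0.40, 0.75],
              [0.80, 0.10, 0.90],
              [0.30, 0.55, 0.40],
              [0.55, 0.30, 0.60]])
y = np.array([3.2, 4.5, 1.8, 2.9])  # human ratings on a 1-5 scale

# Least-squares fit of a linear combination of measures (plus intercept).
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, y, rcond=None)
predicted = A @ w
print(predicted.round(2))  # the combined score tracks the human ratings
                           # more closely than any single column would
```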
ASR systems also have trouble precisely identifying specific errors in articulation,
sometimes flagging correct speech as erroneous while missing errors that
actually occur. Neri, Cucchiarini, Strik, and Boves (2002) found that as few as 25% of pro-
nunciation errors were detected by their ASR system, while some correct productions were
identified as errors. Truong, Neri, de Wet, Cucchiarini, and Strik (2005) studied whether
an ASR system could identify mispronunciations of three sounds typically mispronounced
by learners of Dutch. Errors were successfully detected for one of the three sounds, but
the ASR system was less successful for the other sounds. Overall, ASR systems still cannot
behave like trained human listeners, and the systems remain far more successful for phone-
level recognition and assessment than for prosody.

Other ASR Applications in Applied Linguistics

There are other areas in which ASR has been used by applied linguists: reading instruc-
tion, feedback for spoken liveliness, and the use of ASR in dialogue systems used with
language-learning software.
One use of ASR that seems to have been particularly successful has been in teaching
children to read. Mostow and Aist (1999) found that ASR used in conjunction with an
understanding of teacher–student classroom behavior was successful in teaching oral
reading skills and word recognition. In a more recent study, Poulsen, Hastings, and
Allbritton (2007) found that reading interventions for young learners of English were far
more effective when an ASR system was included.
An unusual use of ASR in providing feedback is evaluation of spoken liveliness. ASR
is particularly ill-suited to providing feedback on prosody, despite moderate success
reported by Eskenazi (1999). However, Hincks and Edlund's (2009) system used
automatic measures of pitch-range variation with learners of English giving oral
presentations. Using overlapping 10-second measures of pitch-range
variation, learners were given feedback on how much “liveliness” their voices projected.
By increasing pitch-range variations, learners were able to control the movement of the
feedback display, and thus increase the amount of engagement in their speech.
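A hypothetical sketch of this kind of liveliness measure follows: pitch values are converted to semitones and their spread is computed over overlapping 10-second windows. Hincks and Edlund's actual metric and parameters may differ; the window length is the only detail taken from the description above.

```python
import math

# Hypothetical sketch of a "liveliness" measure: pitch variation over
# overlapping 10-second windows. Frame rate, hop size, and the use of a
# standard deviation are assumptions for illustration.

def liveliness(f0_hz, frame_rate=100, window_s=10, hop_s=5):
    # Convert F0 to semitones so variation is speaker-independent,
    # then take the standard deviation within each overlapping window.
    semitones = [12 * math.log2(f) for f in f0_hz if f > 0]  # skip unvoiced
    win, hop = window_s * frame_rate, hop_s * frame_rate
    scores = []
    for start in range(0, max(1, len(semitones) - win + 1), hop):
        chunk = semitones[start:start + win]
        mean = sum(chunk) / len(chunk)
        scores.append((sum((x - mean) ** 2 for x in chunk) / len(chunk)) ** 0.5)
    return scores  # one liveliness value per window, for on-screen feedback

# Flat monotone vs. varied pitch: the second speaker scores higher.
monotone = [120.0] * 1500
lively = [120.0 + 30.0 * math.sin(i / 40.0) for i in range(1500)]
print(liveliness(monotone))
print(liveliness(lively))
```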
A third use of ASR technology is in spoken CALL dialogue systems. If a software
program for practicing spoken language provides the first line of a dialogue, learners speak
one of two possible responses. If these responses are sufficiently dissimilar, the ASR
system can recognize which sentence has been spoken, even with pronunciation errors or
missing words.
The computer can then respond, allowing the learner to respond again from a menu of
possible responses (see Bernstein, Najmi, & Ehsani, 1999; Harless, Zier, & Duncan, 1999).
O’Brien (2006) reviews a number of such programs.
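A minimal sketch of how such branching can work when the two scripted responses are dissimilar: the system simply picks whichever expected line shares the most words with the (possibly error-laden) ASR hypothesis. The responses and ASR output below are invented.

```python
# Hypothetical sketch of dialogue branching in a CALL program: choose the
# scripted response with the greatest word overlap with the ASR output.

def closest_response(hypothesis, candidates):
    hyp = set(hypothesis.lower().split())
    # Score each scripted response by word overlap with the ASR output.
    return max(candidates, key=lambda c: len(hyp & set(c.lower().split())))

options = ["Yes, I would like a table for two",
           "No thank you, I am just looking"]
# ASR output with a missing word and a misrecognition:
asr_output = "yes like table for too"
print(closest_response(asr_output, options))  # -> the "table for two" line
```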

Future Directions

Automatic speech recognition holds great promise for applied linguistics, although this
promise has not yet been fully realized. ASR research and usability testing are happening
in areas likely to impact applied linguistics (e.g., Anderson, Davidson, Morton, & Jack,
2008). For example, the annual INTERSPEECH conference (held through the International
Speech Communication Association, or ISCA) and the International Conference on Acoustics,
Speech and Signal Processing (ICASSP) bring together those working in areas that will
eventually influence linguistic applications.
The connections between ASR and text-to-speech software have been insufficiently
explored, but both hold promise for non-native speech applications. Also, the ubiquity of
mobile devices that use ASR-based applications will allow L2 learners to practice their L2
speaking skills and receive feedback on their pronunciation. Further progress in ASR will
result in interactive language-learning systems capable of providing authentic interaction
opportunities with real or virtual interlocutors. These systems will also eventually be
able to produce specific, corrective feedback to learners on their pronunciation errors.
Additionally, the development of noise-resistant ASR technologies will allow language
learners to use ASR-based products in various noise-prone environments such as class-
rooms, transportation, and other public places. Finally, the performance of ASR systems
will improve as emotion recognition and visual speech recognition (based, for instance,
on a webcam capturing learners’ lip movements and facial expressions) become more
effective and widespread.

SEE ALSO: Computer-Assisted Pronunciation Teaching; Emerging Technologies for
Language Learning

References

Anderson, J. N., Davidson, N., Morton, H., & Jack, M. A. (2008). Language learning with
interactive virtual agent scenarios and speech recognition: Lessons learned. Computer
Animation and Virtual Worlds, 19, 605–19.
Bernstein, J., Najmi, A., & Ehsani, F. (1999). Subarashii: Encounters in Japanese spoken language
education. CALICO Journal, 16(3), 361–84.
Burileanu, D. (2008). Spoken language interfaces for embedded applications. In D. Gardner-
Bonneau & H. E. Blanchard (Eds.), Human factors and voice interactive systems (2nd ed.,
pp. 135–61). Norwell, MA: Springer.
Dalby, J., & Kewley-Port, D. (1999). Explicit pronunciation training using automatic speech
recognition technology. CALICO Journal, 16(3), 425–45.
Davis, K. H., Biddulph, R., & Balashek, S. (1952). Automatic recognition of spoken digits. The
Journal of the Acoustical Society of America, 24(6), 637–42.
Derwing, T. M., Munro, M. J., & Carbonaro, M. (2000). Does popular speech recognition software
work with ESL speech? TESOL Quarterly, 34, 592–603.
Eskenazi, M. (1999). Using a computer in foreign language pronunciation training: What
advantages? CALICO Journal, 16(3), 447–69.
Forgie, J. W., & Forgie, C. D. (1959). Results obtained from a vowel recognition computer
program. The Journal of the Acoustical Society of America, 31(11), 1480–9.
Halverson, C. A., Horn, D. A., Karat, C., & Karat, J. (1999). The beauty of errors: Patterns of
error correction in desktop speech systems. In M. A. Sasse & C. Johnson (Eds.), Human-
computer interaction—INTERACT ’99 (pp. 133–40). Edinburgh, Scotland: IOS Press.
Harless, W., Zier, M., & Duncan, R. (1999). Virtual dialogues with native speakers: The evalu-
ation of an interactive multimedia method. CALICO Journal, 16(3), 313–37.
Hincks, R., & Edlund, J. (2009). Promoting increased pitch variation in oral presentations with
transient visual feedback. Language Learning and Technology, 13, 32–50.
Lai, J., Karat, C.-M., & Yankelovich, N. (2008). Conversational speech interfaces and technolo-
gies. In A. Sears & J. A. Jacko (Eds.), The human-computer interaction handbook: Fundamentals,
evolving technologies, and emerging applications (2nd ed., pp. 381–91). New York, NY: Erlbaum.
Liew, A., & Wang, S. (2009). Visual speech recognition: Lip segmentation and mapping. Hershey,
PA: Medical Information Science Reference.
Markowitz, J. A. (1996). Using speech recognition. Upper Saddle River, NJ: Prentice Hall.
Martin, T. B., Nelson, A. L., & Zadell, H. J. (1964). Speech recognition by feature abstraction tech-
niques (Technical Report AL-TDR-64-176). Air Force Avionics Lab.
Mostow, J., & Aist, G. (1999). Giving help and praise in a reading tutor with imperfect listening
—because automated speech recognition means never being able to say you’re certain.
CALICO Journal, 16(3), 407–24.
Neri, A., Cucchiarini, C., Strik, H., & Boves, L. (2002). The pedagogy-technology interface
in computer assisted pronunciation training. Computer-Assisted Language Learning, 15(5),
441–67.
Neumeyer, L., Franco, H., Digalakis, V., & Weintraub, M. (2000). Automatic scoring of pronun-
ciation quality. Speech Communication, 30, 83–93.
O’Brien, M. (2006). Teaching pronunciation and intonation with computer technology. In
L. Ducate & N. Arnold (Eds.), Calling on CALL: From theory and research to new directions in
foreign language teaching (pp. 127–48). San Marcos, TX: CALICO Monograph Series.
Peinado, A. M., & Segura, J. C. (2006). Speech recognition over digital channels: Robustness and
standards. England: John Wiley.
Poulsen, R., Hastings, P., & Allbritton, D. (2007). Tutoring bilingual students with an automated
reading tutor that listens. Journal of Educational Computing Research, 36(2), 191–221.
Rabiner, L., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs, NJ:
Prentice Hall.
Reddy, D. (1966). An approach to computer speech recognition by direct analysis of the speech wave
(Technical Report No. C549). Stanford, CA: Stanford University.
Rodman, R. D. (1999). Computer speech technology. Norwood, MA: Artech House.
Schuller, B., Batliner, A., Steidl, S., & Seppi, D. (2009). Emotion recognition from speech: Putting
ASR in the loop. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP ’09) (pp. 4585–8). Taipei, Taiwan.
Torkkola, K. (1994). Stochastic models and artificial neural networks for automatic speech recogni-
tion. In E. Keller (Ed.), Fundamentals of speech synthesis and speech recognition (pp. 149–69).
England: John Wiley.
Truong, K., Neri, A., de Wet, F., Cucchiarini, C., & Strik, H. (2005). Automatic detection of
frequent pronunciation errors made by L2 learners. Proceedings of InterSpeech (pp. 1345–8).
Lisbon, Portugal.
Vintsyuk, T. K. (1968). Speech discrimination by dynamic programming. Kibernetika, 4(2), 81–8.
Witt, S., & Young, S. (2000). Phone-level pronunciation scoring and assessment for interactive
language learning. Speech Communication, 30, 95–108.

Suggested Readings

Holland, V. M. (Ed.). (1999). Tutors that listen: Speech recognition for language learning (Special
issue). CALICO Journal, 16(3).
Holmes, J., & Holmes, W. (2001). Speech synthesis and recognition (2nd ed.). London, England:
Taylor & Francis.
Junqua, J.-C., & Haton, J.-P. (1996). Robustness in automatic speech recognition: Fundamentals and
application. Boston, MA: Kluwer.
Scharenborg, O. (2007). Reaching over the gap: A review of efforts to link human and automatic
speech recognition research. Speech Communication, 49, 336–47.
Wachowicz, K., & Scott, B. (1999). Software that listens: It’s not a question of whether, it’s a
question of how. CALICO Journal, 16(3), 253–76.
