Definition
Historical Overview
Pioneering work on ASR dates to the early 1950s. The first ASR system, developed at Bell
Telephone Laboratories by Davis, Biddulph, and Balashek (1952), could recognize isolated
digits from 0 to 9 for a single speaker. In 1956, Olson and Belar created a phonetic typewriter
that could recognize ten discrete syllables. It was also speaker-dependent and required
extensive training.
These early ASR systems used template-based recognition based on pattern matching
that compared the speaker’s input with prestored acoustic templates or patterns. Pattern
matching operates well at the word level for recognition of phonetically distinct items in
small vocabularies, but is less effective for larger vocabulary recognition. Another limita-
tion of pattern matching is its inability to match and align input speech signals with
prestored acoustic models of different lengths. Therefore, the performance of these ASR
systems was lackluster because they used acoustic approaches that only recognized basic
units of speech clearly enunciated by a single speaker (Rabiner & Juang, 1993).
An early attempt to construct speaker-independent recognizers by Forgie and Forgie
(1959) was also the first to use a computer. Later, researchers experimented with time-
normalization techniques (such as dynamic time warping, or DTW) to minimize differences
in speech rates of different talkers and to reliably detect speech starts and ends (e.g., Martin,
Nelson, & Zadell, 1964; Vintsyuk, 1968). Reddy (1966) attempted to develop a system
capable of recognizing continuous speech by dynamically tracking phonemes.
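The time-normalization idea behind DTW can be sketched in a few lines of Python. This is a minimal illustration, not any historical system: the sequences are simplified to one-dimensional numbers, whereas real recognizers align multidimensional acoustic feature frames.

```python
def dtw_distance(a, b):
    """Align two sequences of possibly different lengths and return the
    minimal cumulative distance along the optimal warping path."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance at this cell
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances alone (compression)
                cost[i][j - 1],      # b advances alone (expansion)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]

# A template and a slower rendition of the same contour align at zero cost,
# despite the length mismatch that defeats rigid frame-by-frame comparison.
template = [1, 2, 3, 4, 3, 2, 1]
slow_utt = [1, 1, 2, 2, 3, 3, 4, 4, 3, 3, 2, 2, 1, 1]
print(dtw_distance(template, slow_utt))  # 0.0
```

This is why DTW addressed the alignment limitation of simple pattern matching: the warping path absorbs differences in speaking rate that a fixed-length template comparison cannot.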
The 1970s were marked by several milestones: focus on the recognition of continuous
speech, development of large vocabulary speech recognizers, and experiments to create
truly speaker-independent systems. During this period, the first commercial ASR system
called VIP-100 appeared and won a US National Award. This success triggered the
Advanced Research Projects Agency (ARPA) of the US Department of Defense to fund
the Speech Understanding Research (SUR) project from 1971 to 1976 (Markowitz, 1996).
The goal of SUR was to create a system capable of understanding the connected speech
[Figure: a word lattice showing transition probabilities between candidate words (the 0.8, book 0.2, rain 0.1, must 1.0, go 0.4)]
[Figure: an artificial neural network with four inputs and one output]
it through the units, and produces an output (i.e., a recognized text). To correctly classify
and recognize the input, a network uses the values of the weights.
The main advantage of ANNs lies in the classification of static patterns (including noisy
acoustic data), which is particularly useful for recognizing isolated speech units. However,
pure ANN-based systems are not effective for continuous speech recognition, so ANNs
are often integrated with HMM in a hybrid approach (Torkkola, 1994).
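The weighted-unit idea described above can be illustrated with a toy single-unit network in Python. The weights and feature vectors here are hypothetical hand-picked values standing in for what a real network would learn from acoustic data.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into an activation between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def classify(features, weights, bias=0.0):
    """Return class 1 if the unit's activation crosses 0.5, else class 0.
    The classification depends entirely on the values of the weights."""
    activation = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
    return 1 if activation >= 0.5 else 0

# Hypothetical "acoustic" feature vectors for two isolated speech units:
weights = [1.0, -1.0]   # values a real network would learn in training
unit_a = [0.9, 0.1]     # pattern the weights map to class 1
unit_b = [0.1, 0.9]     # pattern the weights map to class 0
print(classify(unit_a, weights))  # 1
print(classify(unit_b, weights))  # 0
```

A real ANN stacks many such units into layers; the static-pattern classification shown here is the strength noted above, while modeling the temporal structure of continuous speech is what the HMM half of a hybrid system supplies.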
The use of HMM and ANN in the 1980s led to considerable efforts toward constructing
systems for large-vocabulary continuous speech recognition. During this time, ASR was
introduced into public telephone networks, and portable speech recognizers were offered
to the public. Commercialization continued in the 1990s, when ASR was integrated into
products ranging from PC-based dictation systems to air-traffic-control training systems.
During the 1990s, ASR research focused on extending speech recognition to large vocabu-
laries for dictation, spontaneous speech recognition, and speech processing in noisy
environments. This period was also marked by systematic evaluations of ASR technologies
based on word and sentence error rates, and by the construction of applications designed
to mimic human-to-human speech communication by holding a dialogue with a human
speaker (e.g., Pegasus and How May I Help You?). Additionally, work on visual speech
recognition (i.e., recognition of speech using visual information such as lip position and
movements) began and continued after 2000 (Liew & Wang, 2009).
The 2000s witnessed further progress in ASR, including the development of new algo-
rithms and modeling techniques, advances in noisy speech recognition, and the integration
of speech recognition into mobile technologies. A more recent trend is the development
of emotion recognition systems, which identify a speaker’s emotional state from cues such
as voice tone, facial expressions, and gestures (Schuller, Batliner, Steidl, & Seppi, 2009).
Speech recognition systems can be characterized by three main dimensions: speaker depend-
ence, speech continuity, and vocabulary size. According to the speech data in the training
database, ASR systems can be speaker-dependent (when the system has to be trained for
each individual speaker), speaker-independent (when the training database contains numer-
ous speech examples from different speakers so the system can accurately recognize any
new speaker), and adaptive (when the system starts out as speaker-independent and then
gradually adapts to a particular user through training). From the dimension of speech
continuity, there are (a) isolated (or discrete) word recognition systems, which identify words
uttered in isolation; (b) connected word recognition systems, which can recognize strings of
words pronounced with little or no pause between them; (c) continuous speech recognition systems,
which are capable of recognizing whole sentences without pauses between words; and
(d) word spotting systems that extract individual words and phrases from a continuous
stream of speech. Finally, ASR systems can be characterized based on the vocabulary size
(i.e., small or large vocabulary) to which they are trained.
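The three dimensions can be summarized as a simple data structure; the class and value names below are illustrative, not any standard API.

```python
from dataclasses import dataclass
from enum import Enum

class SpeakerDependence(Enum):
    DEPENDENT = "speaker-dependent"      # trained for each individual speaker
    INDEPENDENT = "speaker-independent"  # trained on many different speakers
    ADAPTIVE = "adaptive"                # starts independent, adapts to one user

class Continuity(Enum):
    ISOLATED = "isolated word"
    CONNECTED = "connected word"
    CONTINUOUS = "continuous speech"
    WORD_SPOTTING = "word spotting"

class VocabularySize(Enum):
    SMALL = "small"
    LARGE = "large"

@dataclass
class ASRSystemProfile:
    """One point in the three-dimensional design space described above."""
    dependence: SpeakerDependence
    continuity: Continuity
    vocabulary: VocabularySize

# e.g., a typical PC dictation product of the 1990s:
dictation = ASRSystemProfile(SpeakerDependence.ADAPTIVE,
                             Continuity.CONTINUOUS,
                             VocabularySize.LARGE)
```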
The performance of speech recognition systems depends on each of the three dimensions
and is prone to three types of errors: errors in discrete speech recognition, errors in con-
tinuous speech recognition, and errors in word spotting (Rodman, 1999). Errors in discrete
speech recognition include deletion errors (when a system ignores an utterance due to the
speaker’s failure to pronounce it loudly enough), insertion errors (when a system perceives
noise as a speech unit), substitution errors (when a recognizer identifies an utterance
incorrectly, e.g., We are thinking instead of We are sinking), and rejection errors (when the
speaker’s word is rejected by a system, for instance, because it has not been included in
the vocabulary). Errors in continuous speech recognition can also involve the same types
of errors. In addition, this group contains splits, when one speech unit is mistakenly recog-
nized as two or more units (e.g., euthanasia for youth in Asia), and fusions, when two or
more speech units are perceived by a system as one unit (e.g., deep end as depend). Finally,
errors in word spotting include false rejects, when a word present in the input is missed,
and false alarms, when the system detects a word that was not actually spoken.
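The substitution, insertion, and deletion errors described above are exactly what standard word-error-rate scoring counts through edit-distance alignment; a minimal Python sketch:

```python
def error_counts(reference, hypothesis):
    """Count substitutions, insertions, and deletions between what the
    speaker said (reference) and what the recognizer produced (hypothesis),
    using edit-distance alignment. Returns (total, subs, ins, dels)."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = best (total, subs, ins, dels) aligning ref[:i] with hyp[:j]
    d = [[None] * (m + 1) for _ in range(n + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        d[i][0] = (i, 0, 0, i)            # all reference words deleted
    for j in range(1, m + 1):
        d[0][j] = (j, 0, j, 0)            # all hypothesis words inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                match = d[i - 1][j - 1]                       # correct word
            else:
                t, s, ins, dele = d[i - 1][j - 1]
                match = (t + 1, s + 1, ins, dele)             # substitution
            t, s, ins, dele = d[i][j - 1]
            insert = (t + 1, s, ins + 1, dele)                # insertion
            t, s, ins, dele = d[i - 1][j]
            delete = (t + 1, s, ins, dele + 1)                # deletion
            d[i][j] = min(match, insert, delete)
    return d[n][m]

total, subs, ins, dels = error_counts("we are sinking", "we are thinking")
print(subs)  # 1 (the substitution example from the text)
```

Dividing the total error count by the number of reference words gives the word error rate used in the systematic evaluations mentioned earlier.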
According to another classification, errors in ASR systems can be direct, intent, and
indirect (Halverson, Horn, Karat, & Karat, 1999). A direct error appears when a human
misspeaks or stutters. An intent error occurs when the speaker decides to restate what has
just been said. Finally, an indirect error is made when an ASR system misrecognizes the
speaker’s input.
ASR has tremendous potential in applied linguistics. In one application area, that of
language teaching, Eskenazi (1999) compares the strengths of ASR to effective immersion
language learning in developing spoken-language skills. ASR-based systems can provide
a way for learners of a foreign language to hear large amounts of the foreign language
spoken by many different speakers, produce speech in large amounts, and get relevant
feedback. In addition, Eskenazi (1999) suggests that using ASR computer-assisted language
learning (CALL) materials allows learners to feel at greater ease and get more consistent
assessment of their skills. ASR can also be used for virtual dialogues with native speakers
(Harless, Zier, & Duncan, 1999) and for pronunciation training (Dalby & Kewley-Port,
1999). Most importantly, learners enjoy ASR applications. Study after study indicates that
appropriately designed software that includes ASR benefits language learners in terms
of practice, motivation, and the feeling that they are actually communicating in the
language rather than simply repeating predigested words and sentences.
The holy grail of a computer that matches human speech recognition remains out of
reach at present. A number of limitations are consistently obvious in attempts to apply
ASR to language learning.
Many studies have examined whether ASR systems can identify pronunciation errors in
non-native speech and give feedback that can help learners and teachers know what areas
of foreign-language pronunciation are most important for intelligibility. Dalby and Kewley-
Port (1999) demonstrate that such diagnosis and assessment are possible (to some extent)
for minimal pairs, and that automatic ratings of pronunciation accuracy can correlate with
human ratings. However, the kind of feedback given to learners is not usually very helpful.
For systems that attempt to provide such feedback, there are two options: giving a global
pronunciation rating or identifying specific errors. To reach either of these goals, ASR systems
need to identify word boundaries, accurately align speech to intended targets and compare
the segments produced with those that should have been produced. A variety of systems
have been designed to provide global evaluations of pronunciation using automatic measures
including speech rate, duration, and spectral analyses (e.g., Neumeyer, Franco, Digalakis,
& Weintraub, 2000; Witt & Young, 2000). These studies have found that automatic
measures do not match the reliability of human ratings, but that a combination of
automatic measures consistently outperforms any single measure.
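The idea that combining automatic measures beats any single one can be sketched as a weighted combination of normalized measures. The measure names, reference ranges, and weights below are hypothetical, not those used by the cited systems.

```python
def global_score(measures, norms, weights):
    """Combine per-measure raw values into one global pronunciation rating.
    Each measure is clamped to a reference range (e.g., derived from native
    speakers) and scaled to 0..1 before weighting."""
    score = 0.0
    for name, weight in weights.items():
        lo, hi = norms[name]
        value = min(max(measures[name], lo), hi)   # clamp into the range
        score += weight * (value - lo) / (hi - lo)  # scale to 0..1, weight
    return score / sum(weights.values())            # normalized to 0..1

# Hypothetical learner measurements and reference ranges:
measures = {"speech_rate": 3.1, "pause_ratio": 0.25}
norms = {"speech_rate": (2.0, 5.0), "pause_ratio": (0.0, 0.5)}
weights = {"speech_rate": 2.0, "pause_ratio": 1.0}
print(round(global_score(measures, norms, weights), 3))  # 0.411
```

Real systems fit the weights (and the direction of each scale, since a high pause ratio is presumably bad) by regression against human ratings, which is how the combined score comes to correlate with them.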
ASR systems also have trouble precisely identifying specific errors in articulation,
sometimes flagging correct speech as erroneous while failing to detect errors that
actually occur. Neri, Cucchiarini, Strik, and Boves (2002) found that as few as 25% of
pronunciation errors were detected by their ASR system, while some correct productions were
identified as errors. Truong, Neri, de Wet, Cucchiarini, and Strik (2005) studied whether
an ASR system could identify mispronunciations of three sounds typically mispronounced
by learners of Dutch. Errors were successfully detected for one of the three sounds, but
the ASR system was less successful for the other sounds. Overall, ASR systems still cannot
behave like trained human listeners, and the systems remain far more successful for phone-
level recognition and assessment than for prosody.
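Detection performance of the kind reported by Neri et al. (2002) is conventionally summarized as precision (how many flagged segments were real errors) and recall (how many real errors were flagged); a small sketch with hypothetical segment labels:

```python
def detection_metrics(flagged, actual_errors):
    """Precision and recall of a mispronunciation detector. `flagged` is
    the set of segments the system marked as mispronounced; `actual_errors`
    is the set a trained human listener marked."""
    true_positives = len(flagged & actual_errors)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(actual_errors) if actual_errors else 1.0
    return precision, recall

# Hypothetical case in the spirit of the findings above: the system finds
# only a quarter of the real errors and also flags a correct production.
actual = {"seg1", "seg2", "seg3", "seg4"}   # errors per the human listener
flagged = {"seg1", "seg9"}                  # seg9 was pronounced correctly
p, r = detection_metrics(flagged, actual)
print(p, r)  # 0.5 0.25
```

A recall of 0.25 corresponds to the "as few as 25% of errors detected" result, while the precision below 1.0 captures the false alarms on correct productions.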
There are other areas in which ASR has been used by applied linguists: reading instruc-
tion, feedback for spoken liveliness, and the use of ASR in dialogue systems used with
language-learning software.
One use of ASR that seems to have been particularly successful has been in teaching
children to read. Mostow and Aist (1999) found that ASR used in conjunction with an
understanding of teacher–student classroom behavior was successful in teaching oral
reading skills and word recognition. In a more recent study, Poulsen, Hastings, and
Allbritton (2007) found that reading interventions for young learners of English were far
more effective when an ASR system was included.
Future Directions
Automatic speech recognition holds great promise for applied linguistics, although this
promise has not yet been fully realized. ASR research and usability testing are under way
in areas likely to impact applied linguistics (e.g., Anderson, Davidson, Morton, & Jack,
2008). For example, the annual INTERSPEECH conference (held through the International
Speech Communication Association, or ISCA) and the International Conference on Acoustics,
Speech and Signal Processing (ICASSP) bring together those working in areas that will
eventually influence linguistic applications.
The connections between ASR and text-to-speech software have been insufficiently
explored, but both hold promise for non-native speech applications. Also, the ubiquity of
mobile devices that use ASR-based applications will allow L2 learners to practice their L2
speaking skills and receive feedback on their pronunciation. Further progress in ASR will
result in interactive language-learning systems capable of providing authentic interaction
opportunities with real or virtual interlocutors. These systems will also eventually be
able to produce specific, corrective feedback to learners on their pronunciation errors.
Additionally, the development of noise-resistant ASR technologies will allow language
learners to use ASR-based products in various noise-prone environments such as class-
rooms, transportation, and other public places. Finally, the performance of ASR systems
will improve as emotion recognition and visual speech recognition (based, for instance,
on a webcam capturing learners’ lip movements and facial expressions) become more
effective and widespread.
References
Anderson, J. N., Davidson, N., Morton, H., & Jack, M. A. (2008). Language learning with
interactive virtual agent scenarios and speech recognition: Lessons learned. Computer
Animation and Virtual Worlds, 19, 605–19.
Bernstein, J., Najmi, A., & Ehsani, F. (1999). Subarashii: Encounters in Japanese spoken language
education. CALICO Journal, 16(3), 361–84.
Truong, K., Neri, A., de Wet, F., Cucchiarini, C., & Strik, H. (2005). Automatic detection of
frequent pronunciation errors made by L2 learners. Proceedings of InterSpeech (pp. 1345–8).
Lisbon, Portugal.
Vintsyuk, T. K. (1968). Speech discrimination by dynamic programming. Kibernetika, 4(2), 81–8.
Witt, S., & Young, S. (2000). Phone-level pronunciation scoring and assessment for interactive
language learning. Speech Communication, 30, 95–108.
Suggested Readings
Holland, V. M. (Ed.). (1999). Tutors that listen: Speech recognition for language learning (Special
issue). CALICO Journal, 16(3).
Holmes, J., & Holmes, W. (2001). Speech synthesis and recognition (2nd ed.). London, England:
Taylor & Francis.
Junqua, J.-C., & Haton, J.-P. (1996). Robustness in automatic speech recognition: Fundamentals and
application. Boston, MA: Kluwer.
Scharenborg, O. (2007). Reaching over the gap: A review of efforts to link human and automatic
speech recognition research. Speech Communication, 49, 336–47.
Wachowicz, K., & Scott, B. (1999). Software that listens: It’s not a question of whether, it’s a
question of how. CALICO Journal, 16(3), 253–76.