Article in International Journal of Computer Applications · December 2012
DOI: 10.5120/9722-4190


International Journal of Computer Applications (0975 – 8887)
Volume 60– No.9, December 2012

Automatic Speech Recognition: A Review


Shipra J. Arora
Research Scholar, CSE Department
GJUST, Hisar

Rishi Pal Singh
CSE Department
GJUST, Hisar

ABSTRACT
This paper presents a literature review of Automatic Speech Recognition (ASR). It discusses the advances made over past years so as to show the progress that has been accomplished in this area of research. One of the important challenges for researchers is ASR accuracy. The review covers the difficulties with ASR, the basic building blocks of speech processing, feature extraction, speech recognition techniques and performance evaluation. The main objective of the paper is to bring to light the progress made in ASR for different languages and the technological viewpoint of ASR in different countries, to compare and contrast the techniques used at various stages of speech recognition, and to identify research topics in this challenging field. We do not present exhaustive descriptions of systems or mathematical formulations but rather the distinctive and novel features of selected systems and their relative merits and demerits.

Keywords
Automatic speech recognition, Language Model, Speech Processing, Database, Pattern Recognition, Hidden Markov Model.

1. INTRODUCTION
Speech is one of the most important tools for communication between humans and their environment, so building ASR systems has long been desirable. Speech recognition makes it feasible for machines to understand human languages. As information technology impinges on more and more aspects of our lives every year, the problem of communication between human beings and information-processing devices becomes increasingly significant. Up to now, this communication has been almost entirely through keyboards and screens, but speech is the most widely used, most natural and fastest means of communication for people. In a speech recognition system, many parameters affect the accuracy of recognition: dependence on or independence from the speaker, discrete or continuous word recognition, vocabulary, environment, acoustic model, language model, perplexity, transducer, etc. Problems such as a noisy environment, different pronunciations of one word by the same person at different times, dissimilar pronunciations of one word by two different speakers, and mismatch between training and test conditions have so far prevented systems from achieving complete recognition. Resolving each of these problems is a step towards this aim.

2. SPEECH AND LANGUAGE PROCESSING
2.1 Basic Building Blocks
Figure 1 shows some building blocks that perform transformations pertinent to speech recognition. A waveform modifier, Figure 1(a), takes an input speech signal and produces a modified signal. The modification might be a clipping of large values of the signal, or a frequency-spectrum filtering that alters the shape of the signal or enhances the speech and de-emphasises any noise that is present. A symbol transducer, Figure 1(b), takes in one discrete symbol sequence and yields a modified sequence at its output. If the input were a sequence of words in one language and the output an equivalent word sequence in another language, this transducer would be a language translation device. A parameter extractor, Figure 1(c), takes an input speech signal and yields parameters of the speech wave; in recognizers it is often called the pre-processor. A feature extractor, Figure 1(d), receives parameters and produces a more abstract set of important information-carrying features, such as what portions of the speech are voiced, or whether a sound is loud and resonant like a vowel. A segmenter and labeller, Figure 1(e), receives the set of features and produces a linear string of phonemes or other identified segments. The unit identifier, Figure 1(f), takes an input symbol sequence which may be compared to expected reference sequences for various units, to determine what linguistic units appear in the input. The most common unit identifier is a 'word matcher', which finds the closest matching word based on which word's stored pronunciation string is most like the input string. With these building blocks, we have the essential prerequisites for discussing the main knowledge sources needed for machine understanding of speech.
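The waveform modifier of Figure 1(a) can be illustrated with a small sketch. A pre-emphasis filter is one typical modification; it boosts high frequencies relative to low ones before feature extraction. The coefficient 0.97 below is a conventional choice, not a value taken from this paper:

```python
def pre_emphasis(signal, alpha=0.97):
    """Waveform modifier: y[n] = x[n] - alpha * x[n-1].

    Attenuates slowly varying (low-frequency) content and
    emphasises rapid changes, a common front-end step.
    """
    out = [signal[0]]  # first sample has no predecessor
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

# A constant (DC) signal is almost entirely suppressed
# after the first sample:
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```

Running the filter on a constant signal shows its high-pass behaviour: every sample after the first shrinks to 1 − 0.97 = 0.03.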


Figure 1 Basic Building Blocks for processing speech and language

2.2 Types of Speech
Speech can be classified into the following categories.

2.2.1 Isolated words
Isolated word recognisers accept a single word at a time. These systems have "listen/not-listen" states, where they require the speaker to wait between utterances. "Isolated utterance" might be a better name for this class.

2.2.2 Connected words
Connected word speech recognition deals with fluent strings of words separated by pauses, where the set of strings is derived from a small-to-moderate size vocabulary such as digit strings, spelled letter sequences or combinations of alphanumerics. Like isolated word speech recognition, this class has the property that the basic speech-recognition unit is, to a large extent, the word or phrase.

2.2.3 Continuous speech
Continuous speech recognition deals with speech where words are connected together instead of being separated by pauses. As a result, unknown word-boundary information, co-articulation, the production of surrounding phonemes and the rate of speech all affect the performance of continuous speech recognition systems. Recognizers with continuous speech capabilities are among the most difficult to create, because they must use special methods to determine utterance boundaries.

2.3 Classification of ASR system
The classification of speech recognition systems is shown in Figure 2.


Figure 2 Speech Processing Classification

Speech Processing
    Analysis/Synthesis
    Speech Recognition
        Speaker Recognition
        Speech Recognition
            Speech Mode: Isolated Speech, Continuous Speech
            Speaker Mode: Speaker Independent, Speaker Dependent
            Vocabulary Size: Small, Medium, Large
            Speaking Style: Dictation, Spontaneous
        Language Identification
    Coding

Speech recognition systems are classified as discrete or continuous systems that are speaker dependent or independent. Discrete systems maintain a separate acoustic model for each word, combination of words or phrase, referred to as isolated word speech recognition (ISR). Continuous speech recognition (CSR) systems respond to a user who pronounces words, phrases or sentences in a series in a specific order. A speaker-dependent system requires that the user record an example of the word, sentence or phrase prior to its being recognized by the system, i.e. the user trains the system. A speaker-independent system does not require any recording prior to system use; it is developed to operate for any speaker of a particular type. Speaker-dependent systems are simpler to construct and more accurate than speaker-independent systems. As a result, the focus of early voice recognition systems was primarily on speaker-dependent isolated word systems with limited vocabulary. At the time, overcoming the restrictions in the state of technology required a greater focus on human-to-computer interaction. The challenge was to identify how improved speech recognition technology could be used to support the enhancement of human interaction with machines. An important element in the creation of a speech recognition system is the size of the vocabulary, which affects the complexity and accuracy of the system. The vocabulary can be small, medium or large. Obviously, it is much easier to look up one of 50 words in a 50-word dictionary than one of the hundreds of thousands of words in a Webster's dictionary. Another important qualifier in determining the complexity of a speech recognition system is the type of speech that the system uses: discrete or continuous. In a discrete speech system, the user must pause between words, which makes the recognition task much easier. Continuous speech is more difficult for several reasons. First, it is difficult to find the start and end boundaries of words. Second, the production of each phoneme is affected by the production of surrounding phonemes. Also, the speech rate affects the recognition of continuous speech.

2.4 Difficulties with ASR
Following are some of the difficulties with ASR.

2.4.1 Human comprehension of speech
Humans use knowledge about the speaker and the subject while listening, not just their ears. Words are not sequenced together arbitrarily; there is a grammatical structure that humans use to predict words not yet uttered. In ASR, we have only the speech signal. We can construct a model for grammatical structure and use statistical models to improve prediction, but there remains the problem of how to model world knowledge


and the knowledge of the speaker. Of course, we cannot model world knowledge completely; the question is how much we actually require to measure up to human comprehension in ASR.

2.4.2 Spoken language is not equal to written language
Spoken language is less formal than written language, and humans make many more performance errors while speaking. Speech is two-way communication, as compared to written communication which is one-way. The most important issue is disfluency in speech: normal speech contains repetitions, slips of the tongue, changes of subject in the middle of an utterance, etc. In the last few years, it has become clear that spoken language differs from written language. In ASR, we have to identify and address these differences.

2.4.3 Noise
Speech is uttered in an environment of sounds such as a ticking clock, another speaker in the background, or a TV playing in another room. This unwanted information in the speech signal is known as noise. In ASR, we have to identify this noise and filter it out of the speech signal. The echo effect is another kind of noise, in which the speech signal bounces off some surrounding object and arrives at the microphone a few milliseconds later.

2.4.4 Body Language
A human speaker communicates not only with speech but also with body gestures such as waving hands, moving eyes, postures, etc. In ASR, such information is not available. This problem is addressed in the multimodality research area, where studies are conducted to incorporate body language to improve human-machine communication.

2.4.5 Channel Variability
Channel variability is the context in which the acoustic wave is uttered: different types of microphones, noise that changes over time, and anything else that affects the content of the acoustic wave between the speaker and its discrete representation in a computer.

2.4.6 Speaker Variability
Humans speak differently. The voice not only differs between speakers but also varies widely within one specific speaker. Some of the variations are given below.

2.4.6.1 Speaking Style
All speakers speak differently due to their unique personality. They have a distinctive way of pronouncing words, and speaking style varies with the situation: we do not speak in the same way in public as with our teachers or friends. Humans also express emotions while speaking, e.g. happiness, sadness, fear, surprise or anger.

2.4.6.2 Realization
If the same word is pronounced again and again, the resulting speech signal is never identical; there are always small differences in the acoustic wave produced. Speech realization changes over time.

2.4.6.3 Speaker Sex
Males and females have different voices. Females have a shorter vocal tract than males, which is why the pitch of the female voice is roughly twice that of the male.

2.4.6.4 Dialects
Dialects are group-related variations within a language.
Regional dialect: involves features of vocabulary, pronunciation and grammar according to the geographical area the speaker belongs to.
Social dialect: involves features of vocabulary, pronunciation and grammar according to the social group of the speaker.
In many cases, we may be forced to treat a dialect as another language because of the large differences between two dialects.
We have considered some of the difficulties of speech recognition, but not all. The most problematic issue is strong variability. Our goal is an efficient user interface, not natural verbal communication.

3 SPEECH RECOGNITION TECHNIQUES CLASSIFICATION
Basically there are three approaches to speech recognition:
 Acoustic Phonetic Approach
 Pattern Recognition Approach
 Artificial Intelligence Approach

3.1 Acoustic Phonetic Approach
The acoustic phonetic approach (Hemdal and Hughes 1967) postulates that there exist distinguishing phonetic units in spoken language and that these units are characterized by a set of acoustic properties. The acoustic properties of phonetic units are highly variable, both across speakers and with neighbouring sounds (the co-articulation effect), but it is assumed in this approach that the rules governing the variability are straightforward and can be readily learned by a machine. The first stage in this approach is spectral analysis of the speech. The next stage is feature extraction, which converts the spectral measurements into a set of features that describe the acoustic properties of the different phonetic units. The next stage is segmentation and labelling, in which the speech signal is segmented into isolated regions, followed by attaching one or more phonetic labels to each segmented region. The last stage finds a valid word from the phonetic label sequences produced by the segmentation and labelling. The acoustic phonetic approach has not been widely used.


Figure 3 Speech Recognition Techniques Classification

Speech Recognition Techniques
    Acoustic Phonetic Approach
    Pattern Recognition Approach
        Template Based Approach
        Model Based Approach (HMM, SVM, DTW, Bayesian Classification)
    Artificial Intelligence Approach
        Rule Based Approach
        Knowledge Based Systems

3.2 Pattern Recognition Approach
The pattern-matching approach (Rabiner and Juang 1993), shown in Figure 4, is used to extract patterns based on certain conditions and to separate one class from the others. It has four stages: feature measurement, pattern training, pattern classification and decision logic. To define a test pattern, a sequence of measurements is made on the input signal. A reference pattern is created by taking one or more test patterns corresponding to speech sounds; it can be in the form of a speech template or a statistical model (e.g. an HMM) and can be applied to a sound, a word or a phrase. In the pattern classification stage, a direct comparison is made between the unknown test pattern and each class reference pattern, and a measure of distance between them is computed. The decision logic stage determines the identity of the unknown pattern according to the goodness of match of the patterns. This approach has become the prime method for speech recognition over the last six decades. A block diagram of this approach is shown below. There are two methods: the template method and the stochastic model method.

3.2.1 Template Method
In this method, unknown speech is compared against a set of templates in order to find the best match. During the last six decades, template-based methods for speech recognition have provided a family of techniques. A collection of prototypical speech patterns is stored as reference patterns representing the dictionary of candidate words. Recognition is then carried out by matching an unknown spoken utterance against each of these reference templates and selecting the category of the best-matching pattern. This method provides good recognition performance for a variety of practical applications, but it has the disadvantage that modelling the variations in speech requires many templates per word, which eventually becomes infeasible.
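Template matching is usually implemented with dynamic time warping (DTW), which finds the minimum-cost alignment between a test utterance and each stored template, absorbing differences in speaking rate. A minimal sketch over 1-D feature sequences follows (real systems align frame vectors such as MFCCs, and the toy templates here are purely illustrative):

```python
def dtw_distance(test, template):
    """Dynamic time warping cost between two feature sequences."""
    n, m = len(test), len(template)
    INF = float("inf")
    # cost[i][j] = best alignment cost of test[:i] vs template[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(test[i - 1] - template[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch test
                                 cost[i][j - 1],      # stretch template
                                 cost[i - 1][j - 1])  # diagonal match
    return cost[n][m]

def recognize(test, templates):
    """Pick the template word with the smallest DTW distance."""
    return min(templates, key=lambda w: dtw_distance(test, templates[w]))

templates = {"one": [1, 3, 4, 3, 1], "two": [2, 2, 5, 5, 2]}
# A time-stretched rendition of "one" still matches its template:
print(recognize([1, 1, 3, 4, 4, 3, 1], templates))
```

Because the warping path may repeat template frames, the stretched test sequence aligns to the "one" template at zero cost, which is exactly the rate-invariance the template method relies on.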


Figure 4 Pattern Recognition Approach

In the block diagram, the input signal S(n) passes through an analysis system (filter bank/LPC) to produce a test pattern. Pattern training produces the templates or models stored as reference patterns. A pattern classifier (using a local distance measure and dynamic time warping) compares the test pattern against the reference patterns, and decision logic outputs the recognized speech.

Figure 5 Template matching method

Training system: a signal Si, together with its declared message identity (or intended response), is presented to the computer system in training mode; one or more representative pronunciations are stored as templates for the message categories.

Recognition system: a signal Sx is presented to the computer system in recognition mode, which selects the closest of the stored reference templates for the message categories.


3.2.2 Stochastic Method
Stochastic methods use probabilistic models to deal with incomplete information. In speech recognition, incompleteness arises from various sources, e.g. speaker variability and contextual effects. The most popular stochastic approach nowadays is hidden Markov modelling (HMM). An HMM is characterized by a finite-state Markov model and a set of output distributions. The transition parameters in the Markov model capture temporal variability, while the parameters of the output distributions capture spectral variability. These two types of variability are the core of speech recognition. The approach is more general than the template-based approach. The primary problems in HMM design are: a) the evaluation of the probability of a sequence of observations given a specific HMM; b) the determination of a best sequence of model states; and c) the adjustment of model parameters so as to best account for the observed signal. Once these fundamental problems are solved, HMMs can be applied to selected problems in speech recognition.

3.3 Artificial Intelligence Approach
The artificial intelligence approach is a fusion of the acoustic phonetic approach and the pattern recognition approach. Some researchers developed recognition systems by using acoustic phonetic knowledge to develop classification rules for speech sounds. While template-based methods provided little insight about human speech processing, they have been very effective in the design of a variety of speech recognition systems. On the other hand, linguistic and phonetic literature provided insights about human speech processing. However, this approach had only partial success, due to the difficulty of quantifying expert knowledge. Another problem is the integration of the levels of human knowledge: phonetics, lexical access, syntax, semantics and pragmatics.

4 PHASES OF ASR
An automatic speech recognition system involves two phases: a training phase and a recognition phase. A rigorous training procedure is followed to map basic speech units, such as phones and syllables, to the acoustic observations. In the training phase, known speech is recorded, pre-processed and then enters the first stage, feature extraction. The next three stages are HMM creation, HMM training and HMM storage. The recognition phase starts with the acoustic analysis of the unknown speech signal. The captured signal is converted to a series of acoustic feature vectors, and the input observations are processed using an appropriate algorithm. The speech is compared against the HMM networks and the recognized word is displayed. An ASR system can only recognize what it has learned during the training process. However, the system is able to recognize even words that are not present in the training corpus, provided the sub-word units of the new word are known to the system and the new word exists in the system dictionary.

4.1 Modules of ASR
The modules of a speech recognition system, shown in Figure 6, are:
i. Speech signal acquisition
ii. Feature extraction
iii. Acoustic modelling
iv. Language & lexical modelling
v. Recognition
Two of these modules, speech acquisition and feature extraction, are common to both phases of ASR.

Figure 6 ASR block diagram

Model adaptation is meant to minimize the dependence on the speaker's voice, acoustic environment, microphone and transmission channel, and to improve the generalization capability of the system.

4.1.1 Feature Extraction
Feature extraction requires much attention because recognition performance relies heavily on this phase. Techniques for feature extraction include LPC, MFCC, AMFCC, RAS, DAS, ΔMFCC, higher-lag autocorrelation coefficients, PLP, MF-PLP, BFCC, RPLP, etc. It has been found that noise-robust spectral estimation is possible on the higher-lag autocorrelation coefficients; therefore, eliminating the lower lags of the noisy speech signal's autocorrelation removes the main noise components.

4.1.2 Acoustic Model
The acoustic model is the main component of an ASR system and accounts for most of its computational load and performance. The acoustic model is developed for detecting the spoken phonemes. Its creation involves the use of audio recordings of speech and their text transcripts, which are compiled into a statistical representation of the sounds that make up words.
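The acoustic model is typically an HMM, and the first HMM design problem noted in Section 3.2.2 (evaluating the probability of an observation sequence given a model) is solved by the forward algorithm. A minimal sketch with a toy discrete-output HMM follows; the two states, symbols and all probabilities are purely illustrative, not values from the paper:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: P(observation sequence | HMM).

    alpha[t][s] = probability of emitting obs[:t+1] and being
    in state s at time t. Summing over predecessor states at
    each step avoids enumerating all state sequences.
    """
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({
            s: sum(alpha[t - 1][r] * trans_p[r][s] for r in states)
               * emit_p[s][obs[t]]
            for s in states
        })
    return sum(alpha[-1][s] for s in states)

states = ("voiced", "unvoiced")
start_p = {"voiced": 0.6, "unvoiced": 0.4}
trans_p = {"voiced": {"voiced": 0.7, "unvoiced": 0.3},
           "unvoiced": {"voiced": 0.4, "unvoiced": 0.6}}
emit_p = {"voiced": {"low": 0.8, "high": 0.2},
          "unvoiced": {"low": 0.1, "high": 0.9}}

print(forward(("low", "high"), states, start_p, trans_p, emit_p))
```

Replacing the sum with a max (plus backpointers) turns this into the Viterbi algorithm, which answers the second design problem, finding the best state sequence.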
4.1.3 Lexical Models
To provide the pronunciation of each word in a given language, a lexicon is developed. Various combinations of phones are defined through the lexical model to give valid words


for the recognition. Neural networks have helped to develop lexical models for non-native speech recognition.

4.1.4 Language Models
The language model is a main component; it operates on millions of words, consists of millions of parameters and, with the help of a pronunciation dictionary, models the word sequences in a sentence. ASR systems use n-gram language models, which search for the correct word sequence by predicting the likelihood of the nth word on the basis of the n−1 preceding words. The probability of occurrence of a word sequence W is calculated as:

P(W) = P(w1, w2, ..., wm-1, wm) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wm|w1 w2 w3 ... wm-1)

For large-vocabulary speech recognizers, two problems occur during the construction of n-gram language models. First, for real applications, the large amount of training data generally leads to large models. Second is the sparseness problem, faced during the training of domain-specific models. Language models are also non-deterministic. Both these features make them complicated to build.

5 LITERATURE SURVEY
Speech recognition came into existence around 1920, when the first machine to recognize voice, a toy called Radio Rex, was manufactured. Bell Labs demonstrated a speech synthesis machine at the World's Fair in New York, but later discarded its recognition efforts based on the incorrect conclusion that artificial intelligence would ultimately be required for success. Attempts to develop ASR systems were made in the 1950s, when researchers studied the fundamental concepts of acoustic-phonetics. Most of the systems in the 1950s [1] recognized speech by examining the spectral resonances of the vowels in each utterance. At Bell Labs, Davis, Biddulph and Balashek (1952) built an isolated digit recognition system for a single speaker [2] using formant frequencies estimated during the vowel regions of each digit. At RCA Labs, Olson and Belar (1956) built a 10-syllable recognizer for a single speaker [3], and Forgie and Forgie built a speaker-independent 10-vowel recognizer [4] at MIT Lincoln Lab by measuring the spectral resonances of vowels. Fry and Denes (1959) tried to build a phoneme recognizer to recognize four vowels and nine consonants [5] at University College in England, using a spectrum analyser and a pattern matcher to make the recognition decision. Japanese labs entered the recognition field in 1960-70. As computers were not fast enough, they designed special-purpose hardware as part of their systems. In Tokyo, Nagata et al. described a system of the Radio Research Lab that was a hardware vowel recognizer [6]. Another effort was the work of Sakai and Doshita of Kyoto University, who built a hardware phoneme recognizer in 1962 [7]. In 1963, Nagata and co-workers at NEC Labs built a digit recognizer [7], which led to a long and productive research program.

In the 1970s, the key focus of research was on isolated word recognition. IBM researchers worked on large-vocabulary speech recognition. At AT&T Bell Labs, researchers began speaker-independent speech recognition experiments [8]. A large number of clustering algorithms were used to find the number of distinct patterns required to represent words so as to achieve speaker-independent speech recognition. This research has been refined so that techniques for speaker-independent patterns are now widely used. Carnegie Mellon University's Harpy system [9] recognized speech with a vocabulary of 1011 words with reasonable accuracy. It was the first to make use of a finite state network to reduce computation and determine the closest matching strings efficiently.

In the 1980s, the key focus of research was on connected word speech recognition. In the beginning of the 1980s, Moshey J. Lasry studied speech spectrograms of letters and digits [10] and developed feature-based speech recognition. In the 1980s there was a shift in technology from template-based approaches to statistical modelling approaches, especially HMM [11, 12], in speech research. The most significant paradigm shift has been the introduction of statistical methods, especially stochastic processing with HMMs (Baker, 1975 & Jelinek, 1976) in the early 1970s (Poritz 1988). More than 30 years later, this methodology still predominates, and despite their simplicity, n-gram language models have proved remarkably powerful. Nowadays, most practical speech recognition systems are based on the statistical approach, whose results with additional improvements were achieved in the 1990s. The hidden Markov model (HMM) approach was one of the key technologies developed in the 1980s. IBM, the Institute for Defense Analyses (IDA) and Dragon Systems understood HMMs, but the approach was not widely known until the mid-1980s. The application of neural networks to speech recognition problems is another technology that was reintroduced in the late 1980s.

In the 1990s, the pattern recognition approach was developed further. It traditionally followed the Bayes framework, but was altered into an optimization problem with minimization of the empirical recognition error [13]. The reason for this alteration is that the distribution functions for the speech signal could not be chosen accurately, and under these conditions Bayes' theory cannot be applied; the aim, however, is to design a recognizer with the least recognition error rather than the best fit to the given data. The techniques used for error minimization are Minimum Classification Error (MCE) and Maximum Mutual Information (MMI). These techniques complemented the maximum-likelihood-based approach [14] to speech recognition performance. A weighted HMM algorithm was proposed to address the robustness and discrimination issues of HMM-based speech recognition. To decrease the acoustic mismatch between a given set of speech models and a test utterance, a maximum likelihood stochastic matching approach [15] was proposed. A novel approach [16] for HMM speech recognition systems is based on the use of a neural network as a vector quantizer, a remarkable innovation in training the neural network. Nam Soo Kim et al. described a variety of methods for estimating a robust output probability distribution based on HMM [17]. An extension of the Viterbi algorithm [18] made second-order HMMs efficient as compared to the existing Viterbi algorithm. In the 1990s, the DARPA program continued; the centre of attention became the Air Travel Information Service (ATIS) task and later the transcription of broadcast news (BN). Advances in


continuous speech recognition and noisy environment speech SCARF: It is a software toolkit designed for doing speech
recognition, have been explained. In the area of noisy robust recognition with the help of segmental conditional random
speech recognition, minor work has been done.For noisy fields.
environment, for robust speech recognition, a new approach
MICROPHONES: They are being used by researchers for
to an auditory model was proposed[26]. This approach is
recording speech database. Sony and I-ball has developed
computationally efficient as compared with other models. A
model based spectral estimation algorithm has been some microphones which are unidirectional and noiseless.
developed[27].
In 2000, a variational Bayesian estimation technique was
7 PERFORMANCE OF SPEECH
developed[19]. It is based on posterior distribution of RECOGNITION SYSTEM
parameters. Giuseppe Richardi have developed a technique to The performance of speech recognition is specified in terms
solve the problem of adaptive learning[20] in ASR. In 2005, of accuracy and speed. Accuracy is measured in terms of
some improvements have been made on Large Vocabulary performance accuracy which is known as word error rate
Continuous Speech recognition system for performance (WER) whereas speed is measured with the real time factor.
improvement[21]. A 5-year national project Corpus of
Spontaneous Japanese (CSJ) [22] was conducted in Japan. It Word Error Rate
consists of 7 millions of words approximately, corresponding It is a common metric of the speech recognition performance.
As recognized word sequence have a different length from the
to speech of 700 hours. The techniques used in this project
reference word sequence, there is difficulty in measuring
are acoustic modelling, sentence boundary detection, performance.
pronunciation modelling, acoustic as well as language model
adaptation, and automatic speech summarization [23]. S DI
Utterance verification is being investigated [24] to further
increase the robustness of speech recognition systems, WER = N
especially for spontaneous speech,. When humans speak to
each other, they use multimodal communication. It increases where S is number of substitutions
D is number of deletions
the rate of successful transfer of information when the
I is number of insertions
communication take place in a noisy environment. In speech and N is number of words in the reference.
recognition, the use of the visual face information, specially
lip movement, has been examined, and results show that using Sometimes word recognition rate (WRR) is used instead of
both mode of information gives better performances than WER while describing performance of speech recognition.
using only the audio or only the visual information, specially,
in noisy environment. WRR = 1- WER

Speed
Speed is measured by the real time factor. If a system takes time T to process an input of duration D, the real time factor is defined by

RTF = T / D

RTF ≤ 1 implies real-time processing.

6 TOOLS FOR ASR
Following are the various tools used for ASR:

PRAAT: Free software (latest version 5.3.04) that runs on a wide range of OS platforms and is meant for recording and analysis of human speech in mono or stereo.

AUDACITY: Free, open-source software (latest version 1.3.14 Beta) that runs on a wide range of OS platforms and is meant for recording and editing sounds.

CSL: Computerised Speech Lab is a highly advanced speech and signal processing workstation (software and hardware). It possesses robust hardware for data acquisition and a versatile suite of software for speech analysis.
HTK: The basic application of the open-source Hidden Markov Model Toolkit (HTK), written completely in ANSI C, is to build and manipulate hidden Markov models.

SPHINX: Sphinx 4 is the latest version of the Sphinx series of speech recognizers, written completely in the Java programming language. It provides a more flexible framework for research in speech recognition.

8 SUMMARY
Research in speech recognition has been carried out intensively for the last 60 years. The technological progress can be summarized in Table 1 [28].


Table 1. Speech recognition summarization

Sr.No.  Past                            Present
1       Template matching               Corpus-based statistical modelling, e.g. HMM, n-grams
2       Distance-based methods          Likelihood-based methods
3       Maximum likelihood approach     Discriminative approach, e.g. MCE/GPD and MMI
4       Isolated word recognition       Continuous speech recognition
5       Small vocabulary                Large vocabulary
6       Clean speech recognition        Noisy/telephone speech recognition
7       Single-speaker recognition      Speaker-independent/adaptive recognition
8       Read speech recognition         Spontaneous speech recognition
9       Single modality                 Multimodal speech recognition

9 DISCUSSIONS AND CONCLUSIONS
Speech is the most prominent and primary mode of communication between human beings. Over the past five decades, research in the area of speech has been a first step towards natural man-machine communication, although some limitations have been encountered: what we know about speech processing is still very limited. This paper attempts to give a comprehensive survey of research in speech recognition, its year-wise progress to date, and its current status. Although a significant amount of work has been done in the last two decades, much work remains to be done.

At present, research is focusing on creating and developing systems that are much more robust against variability and shifts in the acoustic environment, speaker characteristics, language characteristics, external noise sources, etc. HMM has been found to be the best technique for developing the language model. Speech recognition is a fascinating problem; it has attracted scientists and researchers, made a technological impact on society, and is expected to grow further in this area of man-machine interaction.

10 ACKNOWLEDGEMENT
I would like to thank my husband and son for their unwavering support. My special thanks to Dr. S.S.Agrawal, Dr. R.K.Jain, Dr. Harsh Vardhan Kamrah and Mrs. Neelima Kamrah for their kind support and encouragement.

11 REFERENCES
[1] Sadaoki Furui, November 2005, "50 Years of Progress in Speech and Speaker Recognition Research", ECTI Transactions on Computer and Information Technology, Vol. 1, No. 2.
[2] K. H. Davis, R. Biddulph, and S. Balashek, 1952, "Automatic Recognition of Spoken Digits", J. Acoust. Soc. Am., 24(6):637-642.
[3] H. F. Olson and H. Belar, 1956, "Phonetic Typewriter", J. Acoust. Soc. Am., 28(6):1072-1081.
[4] J. W. Forgie and C. D. Forgie, 1959, "Results Obtained from a Vowel Recognition Computer Program", J. Acoust. Soc. Am., 31(11), pp. 1480-1489.
[5] D. B. Fry, 1959, "Theoretical Aspects of Mechanical Speech Recognition", and P. Denes, "The Design and Operation of the Mechanical Speech Recognizer at University College London", J. British Inst. Radio Engr., 19:4, 211-299.
[6] K. Nagata, Y. Kato, and S. Chiba, 1963, "Spoken Digit Recognizer for Japanese Language", NEC Res. Develop., No. 6.
[7] T. Sakai and S. Doshita, 1962, "The Phonetic Typewriter, Information Processing 1962", Proc. IFIP Congress.
[8] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, and J. G. Wilpon, August 1979, "Speaker Independent Recognition of Isolated Words Using Clustering Techniques", IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-27:336-349.
[9] B. Lowerre, 1990, "The HARPY Speech Understanding System", Trends in Speech Recognition, W. Lea, Ed., Speech Science Pub., pp. 576-586.
[10] R. K. Moore, 1994, "Twenty Things We Still Don't Know About Speech", Proc. CRIM/FORWISS Workshop on Progress and Prospects of Speech Research and Technology.
[11] J. Ferguson, 1980, "Hidden Markov Models for Speech", IDA, Princeton, NJ.
[12] L. R. Rabiner, February 1989, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. IEEE, 77(2):257-286.
[13] B. H. Juang and S. Furui, 2000, "Automatic Speech Recognition and Understanding: A First Step Toward Natural Human-Machine Communication", Proc. IEEE, 88, 8, pp. 1142-1165.
[14] K. P. Li and G. W. Hughes, 1974, "Talker Differences as They Appear in Correlation Matrices of Continuous Speech Spectra", J. Acoust. Soc. Am., 55, pp. 833-837.
[15] Ananth Sankar, May 1996, "A Maximum Likelihood Approach to Stochastic Matching for Robust Speech Recognition", IEEE Transactions on Audio, Speech and Language Processing, Vol. 4, No. 3.
[16] Gerhard Rigoll, Jan. 1994, "Maximum Mutual Information Neural Networks for Hybrid Connectionist-HMM Speech Recognition Systems", IEEE Transactions on Audio, Speech and Language Processing, Vol. 2, No. 1, Part II.


[17] Nam Soo Kim et al., July 1995, "On Estimating Robust Probability Distribution in HMM-Based Speech Recognition", IEEE Transactions on Audio, Speech and Language Processing, Vol. 3, No. 4.
[18] Jean Francois, Jan. 1997, "Automatic Word Recognition Based on Second-Order Hidden Markov Models", IEEE Transactions on Audio, Speech and Language Processing, Vol. 5, No. 1.
[19] Mohamed Afify and Olivier Siohan, January 2004, "Sequential Estimation With Optimal Forgetting for Robust Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 12, No. 1.
[20] Giuseppe Riccardi, July 2005, "Active Learning: Theory and Applications to Automatic Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 4.
[21] Mohamed Afify, Feng Liu, and Hui Jiang, July 2005, "A New Verification-Based Fast-Match for Large Vocabulary Continuous Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 4.
[22] S. Furui, 2005, "Recent Progress in Corpus-Based Spontaneous Speech Recognition", IEICE Trans. Inf. & Syst., E88-D, 3, pp. 366-375.
[23] S. Furui, 2004, "Speech-to-Text and Speech-to-Speech Summarization of Spontaneous Speech", IEEE Trans. Speech & Audio Processing, 12, 4, pp. 401-408.
[24] Eduardo Lleida et al., March 2000, "Utterance Verification in Decoding and Training Procedures", IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 2.
[25] Geoff Bristow, 1986, "Electronic Speech Recognition: Techniques, Technology and Applications", Collins.
[26] Doh-Suk Kim, 1999, "Auditory Processing of Speech Signals for Robust Speech Recognition in Real-World Noisy Environments", IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1.
[27] Adoram Erell et al., 1993, "Filter Bank Energy Estimation Using Mixture and Markov Models for Recognition of Noisy Speech", IEEE Transactions on Audio, Speech and Language Processing, Vol. 1, No. 1.
[28] M. A. Anusuya and S. K. Katti, 2009, "Speech Recognition by Machine: A Review", International Journal of Computer Science and Information Security, Vol. 6, No. 3.


