Speech recognition systems are classified as discrete or continuous systems that are speaker dependent or independent. Discrete systems maintain a separate acoustic model for each word, combination of words or phrase; this is referred to as isolated word speech recognition (ISR). Continuous speech recognition (CSR) systems respond to a user who pronounces words, phrases or sentences in series, in a specific order. A speaker-dependent system requires that the user record an example of the word, sentence or phrase before it can be recognized by the system, i.e. the user trains the system. A speaker-independent system does not require any recording prior to system use; it is developed to operate for any speaker of a particular type. Speaker-dependent systems are simpler to construct and more accurate than speaker-independent systems. As a result, the focus of early voice recognition systems was primarily on speaker-dependent isolated word systems with limited vocabularies. At the time, overcoming the restrictions in the state of technology required a greater focus on human-to-computer interaction. The challenge was to identify how improved speech recognition technology could be used to enhance human interaction with machines.

An important element in the creation of a speech recognition system is the size of the vocabulary, which affects the complexity and accuracy of the system. The vocabulary can be small, medium or large. Obviously, it is much easier to look up one of 50 words in a 50-word dictionary than one of the hundreds of thousands of words in a Webster's dictionary. Another important qualifier in determining the complexity of a speech recognition system is the type of speech it handles: discrete or continuous. In a discrete speech system, the user must pause between words, which makes the recognition task much easier. Continuous speech is more difficult for several reasons. First, it is difficult to find the start and end boundaries of words. Second, the production of each phoneme is affected by the surrounding phonemes. Finally, the speech rate affects the recognition of continuous speech.

2.4 Difficulties with ASR
The following are some of the difficulties with ASR.

2.4.1 Human comprehension of speech
Humans use knowledge about the speaker and the subject while listening; they rely on more than their ears. Words are not strung together arbitrarily; there is a grammatical structure that humans use to predict words not yet uttered. In ASR, we have only the speech signal. We can construct a model of grammatical structure and use statistical models to improve prediction, but there remains the problem of how to model world knowledge
and the knowledge of the speaker. Of course, we cannot model all world knowledge, but the question is how much we actually require to measure up to human comprehension in ASR.

2.4.2 Spoken language is not equal to written language
Spoken language is less careful than written language, and humans make many more performance errors while speaking. Speech is two-way communication, whereas written communication is one-way. The most important issue is disfluency: normal speech contains repetitions, slips of the tongue, changes of subject in the middle of an utterance, and so on. In the last few years, it has become clear that spoken language is different from written language, and in ASR we have to identify and address these differences.

2.4.3 Noise
Speech is uttered in an environment of sound, such as a ticking clock, another speaker in the background, or a TV playing in another room. This unwanted information in the speech signal is known as noise. In ASR, we have to identify this noise and filter it out of the speech signal. The echo effect is another kind of noise, in which the speech signal bounces off some surrounding object and arrives at the microphone a few milliseconds later.

2.4.4 Body Language
A human speaker communicates not only with speech but also with body gestures such as waving hands, eye movements and posture. In ASR, such information is not available. This problem is addressed in the multimodality research area, where studies are conducted on incorporating body language to improve human-machine communication.

2.4.5 Channel Variability
Channel variability refers to the context in which the acoustic wave is uttered: different types of microphones, noise that changes over time, and anything else that affects the content of the acoustic wave on its way from the speaker to its discrete representation in a computer.

2.4.6 Speaker Variability
Humans speak differently. The voice not only differs between speakers; there is also wide variation within one specific speaker. Some of the variations are given below.

2.4.6.1 Speaking Style
All speakers speak differently due to their unique personalities. They have distinctive ways of pronouncing words, and speaking style varies across situations: we do not speak the same way in public, with our teachers or with friends. Humans also express emotions while speaking, e.g. happiness, sadness, fear, surprise or anger.

2.4.6.2 Realization
If the same word were pronounced again and again, the resulting speech signal would never be the same; there would always be small differences in the acoustic wave produced. Speech realization also changes over time.

2.4.6.3 Speaker Sex
Male and female voices are different. Females have a shorter vocal tract than males, which is why the pitch of a female voice is roughly twice that of a male voice.

2.4.6.4 Dialects
Dialects are group-related variations within a language.
Regional dialect: features of vocabulary, pronunciation and grammar that depend on the geographical area the speaker belongs to.
Social dialect: features of vocabulary, pronunciation and grammar that depend on the speaker's social group.
In many cases, we may be forced to treat a dialect as another language because of the large differences between two dialects.
We have considered some, but not all, of the difficulties of speech recognition. The most problematic issue is strong variability. Our goal is an efficient user interface, not natural verbal communication.

3 SPEECH RECOGNITION TECHNIQUES CLASSIFICATION
There are basically three approaches to speech recognition:
Acoustic phonetic approach
Pattern recognition approach
Artificial intelligence approach

3.1 Acoustic Phonetic Approach
The acoustic phonetic approach (Hemdal and Hughes 1967) postulates that there exist distinguishing phonetic units in spoken language and that these units are characterized by a set of acoustic properties. Although the acoustic properties of phonetic units are highly variable, both across speakers and with neighbouring sounds (the co-articulation effect), this approach assumes that the rules governing the variability are straightforward and can readily be learned by a machine. The first stage of this approach is spectral analysis of the speech. The next stage is feature extraction, which converts the spectral measurements into a set of features that describe the acoustic properties of the different phonetic units. The next stage is segmentation and labelling, in which the speech signal is segmented into isolated regions and one or more phonetic labels are attached to each segmented region. The last stage finds a valid word from the phonetic label sequences produced by segmentation and labelling. The acoustic phonetic approach has not been widely used.
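As a rough illustration of the first two stages of this approach (spectral analysis and feature extraction), the following Python sketch computes a short-time power spectrum with NumPy and reduces each frame to a few log band energies. The frame length, hop size and number of bands are illustrative choices, not values prescribed by the acoustic phonetic approach, and the synthetic tone merely stands in for a real recording.

import numpy as np

def spectral_features(signal, frame_len=400, hop=160, n_bands=8):
    """Short-time spectral analysis followed by simple band-energy features."""
    window = np.hamming(frame_len)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2        # power spectrum of one frame
        bands = np.array_split(spectrum, n_bands)         # crude uniform filter bank
        features.append(np.log([band.sum() + 1e-10 for band in bands]))
    return np.array(features)                             # shape: (n_frames, n_bands)

# Example: one second of a 440 Hz tone at 16 kHz stands in for real speech.
t = np.arange(16000) / 16000.0
print(spectral_features(np.sin(2 * np.pi * 440 * t)).shape)

The later stages (segmentation, phonetic labelling and lexical lookup) would then operate on feature sequences such as the one returned here.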
[Figure: Classification of speech recognition techniques into the acoustic phonetic approach, the pattern recognition approach (HMM, SVM, DTW, Bayesian classification) and the artificial intelligence approach.]
3.2 Pattern Recognition Approach
The pattern-matching approach (Rabiner and Juang 1993), shown in Figure 4, is used to extract patterns based on certain conditions and to separate one class from the others. It has four stages: feature measurement, pattern training, pattern classification and decision logic. To define a test pattern, a sequence of measurements is made on the input signal. A reference pattern is created by taking one or more test patterns corresponding to speech sounds; it can take the form of a speech template or a statistical model (e.g. an HMM) and can be applied to a sound, a word or a phrase. In the pattern classification stage, a direct comparison is made between the unknown test pattern and the class reference patterns, and a measure of the distance between them is computed. The decision logic stage determines the identity of the unknown according to the quality of the match between the patterns. This approach has become the prime method for speech recognition over the last six decades. A block diagram of this approach is shown below. There are two methods: the template method and the stochastic model method.

3.2.1 Template Method
In this method, unknown speech is compared against a set of templates in order to find the best match. Over the last six decades, template-based methods for speech recognition have provided a family of techniques. A collection of prototypical speech patterns is stored as reference patterns representing the dictionary of candidate words. Recognition is then carried out by matching an unknown spoken utterance against each of these reference templates and selecting the category of the best-matching pattern. This method provides good recognition performance for a variety of practical applications, but it has the disadvantage that variations in speech can only be modelled by using many templates per word, which eventually becomes infeasible.
[Figure 4: Block diagram of the pattern recognition approach. Training system: the input signal S(n) passes through analysis and pattern training to produce reference patterns (templates or models). Recognition system: in recognition mode, the test pattern is compared by a pattern classifier against the stored templates for the message categories, and decision logic selects the closest reference pattern as the recognized speech.]
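As an illustration of the template method, the sketch below compares an unknown utterance against stored reference templates using dynamic time warping (DTW), one of the classic alignment-based distance measures used for template matching. Both the utterance and the templates are assumed to be sequences of feature vectors (for example, frame-level spectral features), and the function names are illustrative rather than taken from any particular toolkit.

import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two feature sequences (n x d and m x d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])       # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # stretch the template
                                 cost[i, j - 1],          # stretch the utterance
                                 cost[i - 1, j - 1])      # align the two frames
    return cost[n, m]

def recognize(utterance, templates):
    """Return the label of the best-matching reference template."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

Because every candidate word needs its own template (and ideally several, to cover pronunciation variation), the dictionary passed to recognize grows quickly, which is exactly the scalability limitation noted above.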
… for the recognition. Neural networks have helped to develop lexical models for non-native speech recognition.

4.1.4 Language Models
The language model is the main component that operates on millions of words; it consists of millions of parameters and, with the help of a pronunciation dictionary, models the word sequences in a sentence. ASR systems use n-gram language models, which are used to search for the correct word sequence by predicting the likelihood of the nth word on the basis of the n−1 preceding words. The probability of occurrence of a word sequence W is calculated as:

P(W) = P(w1, w2, ..., wm-1, wm)
     = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wm | w1 w2 ... wm-1)

For large-vocabulary speech recognizers, two problems occur during the construction of n-gram language models. First, for real applications, the large amount of training data generally leads to large models. Second is the sparseness problem, which is faced during the training of domain-specific models. Language models are also non-deterministic. Both of these features make them complicated.
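To make the chain-rule expansion above concrete, the following sketch builds a bigram model (n = 2) from a toy corpus and scores a sentence with it. The corpus, the start symbol and the add-one smoothing are all illustrative choices; smoothing is shown only as one simple way of coping with the sparseness problem mentioned above.

from collections import defaultdict

corpus = [["speech", "recognition", "is", "hard"],
          ["speech", "recognition", "is", "useful"]]

unigram_counts = defaultdict(int)   # counts of each context word
bigram_counts = defaultdict(int)    # counts of (previous word, word) pairs
for sentence in corpus:
    tokens = ["<s>"] + sentence
    for prev, word in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, word)] += 1

vocabulary = {w for s in corpus for w in s} | {"<s>"}

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing over the toy vocabulary."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocabulary))

def sentence_prob(words):
    """Approximate P(w1 ... wm) as a product of bigram probabilities."""
    p = 1.0
    tokens = ["<s>"] + words
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["speech", "recognition", "is", "hard"]))

A real recognizer would use higher-order n-grams, far larger corpora and more careful smoothing, but the structure of the computation is the same.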
5 LITERATURE SURVEY
Speech recognition came into existence in the 1920s, when the first machine to recognize voice, a toy called Radio Rex, was manufactured. Bell Labs demonstrated a speech synthesis machine at the World's Fair in New York, but later discarded its efforts on the basis of the incorrect conclusion that artificial intelligence would ultimately be required for success. Attempts to develop ASR systems were made in the 1950s, when researchers studied the fundamental concepts of acoustic-phonetics. Most of the systems of the 1950s [1] recognized speech by examining the spectral resonances of the vowels of each utterance. At Bell Labs, Davis, Biddulph and Balashek (1952) built an isolated digit recognition system for a single speaker [2] using formant frequencies estimated during the vowel regions of each digit. At RCA Labs, Olson and Belar (1950) built a 10-syllable recognizer for a single speaker [3], and Forgie and Forgie built a speaker-independent 10-vowel recognizer [4] at MIT Lincoln Lab by measuring the spectral resonances of the vowels. Fry and Denes (1959) tried to build a phoneme recognizer for four vowels and nine consonants [5] at University College London, using a spectrum analyser and a pattern matcher to make the recognition decision. Japanese labs entered the recognition field in the 1960s and 1970s. As computers were not fast enough, they designed special-purpose hardware as part of their systems. In Tokyo, Nagata et al. described a system of the Radio Research Lab that was a hardware vowel recognizer [6]. Another effort was the work of Sakai and Doshita of Kyoto University, who built a hardware phoneme recognizer in 1962 [7]. In 1963, Nagata and co-workers at NEC Labs built a digit recognizer [7], which led to a long and productive research programme.

In the 1970s, the key focus of research was isolated word recognition. IBM researchers worked on large vocabulary speech recognition. At AT&T Bell Labs, researchers began speaker-independent speech recognition experiments [8]. A large number of clustering algorithms were used to find the number of distinct patterns required to represent words, in order to achieve speaker-independent speech recognition. This research has since been refined, and the techniques for speaker-independent patterns are now widely used. Carnegie Mellon University's Harpy system [9] recognized speech with a vocabulary of 1011 words with reasonable accuracy; it was the first to make use of a finite state network to reduce computation and determine the closest matching strings efficiently.

In the 1980s, the key focus of research was connected word speech recognition. At the beginning of the 1980s, Moshey J. Lasry studied speech spectrograms of letters and digits [10] and developed feature-based speech recognition. During the 1980s, speech research shifted from template-based approaches to statistical modelling approaches, especially HMMs [11, 12]. The most significant paradigm shift has been the introduction of statistical methods, especially stochastic processing with HMMs (Baker, 1975; Jelinek, 1976) in the early 1970s (Poritz, 1988); more than 30 years later, this methodology still predominates. Despite their simplicity, n-gram language models have proved remarkably powerful. Nowadays, most practical speech recognition systems are based on the statistical approach, and further improvements to their results were made in the 1990s. The hidden Markov model (HMM) approach was one of the key technologies developed in this period; IBM, the Institute for Defense Analyses (IDA) and Dragon Systems understood HMMs, but the approach was not widely known until the mid-1980s. Neural networks applied to speech recognition problems were another technology reintroduced in the late 1980s.

In the 1990s, the pattern recognition approach was developed further. It traditionally followed Bayes' framework, but it was recast as an optimization problem that minimizes the empirical recognition error [13]. The reason for this change is that the distribution functions of the speech signal cannot be chosen accurately, and under these conditions Bayes' theory cannot be applied; instead, the aim is to design a recognizer with the least recognition error rather than the best fit to the given data. The techniques used for error minimization are Minimum Classification Error (MCE) and Maximum Mutual Information (MMI). These techniques led to improvements over the maximum likelihood based approach [14] to speech recognition. A weighted HMM algorithm was proposed to address the robustness and discrimination issues of HMM-based speech recognition. To decrease the acoustic mismatch between a given set of speech models and a test utterance, a maximum likelihood stochastic matching approach [15] was proposed. A novel approach [16] for HMM speech recognition systems is based on the use of a neural network as a vector quantizer, a remarkable innovation in training the neural network. Nam Soo Kim et al. described a variety of methods for estimating a robust output probability distribution based on HMMs [17]. An extension of the Viterbi algorithm [18] made second-order HMMs efficient compared with the existing Viterbi algorithm. In the 1990s, the DARPA programme continued; the centre of attention became the Air Travel Information Service (ATIS) task and, later, the transcription of broadcast news (BN). Advances in
continuous speech recognition and in speech recognition in noisy environments have been described, although comparatively little work has been done on noise-robust speech recognition. For robust speech recognition in noisy environments, a new approach based on an auditory model was proposed [26]; this approach is computationally efficient compared with other models. A model-based spectral estimation algorithm has also been developed [27]. In 2000, a variational Bayesian estimation technique based on the posterior distribution of the parameters was developed [19]. Giuseppe Riccardi developed a technique to solve the problem of adaptive learning [20] in ASR. In 2005, improvements were made to large vocabulary continuous speech recognition systems to increase performance [21]. A five-year national project, the Corpus of Spontaneous Japanese (CSJ) [22], was conducted in Japan; it consists of approximately 7 million words, corresponding to 700 hours of speech. The techniques used in this project are acoustic modelling, sentence boundary detection, pronunciation modelling, acoustic and language model adaptation, and automatic speech summarization [23]. Utterance verification is being investigated [24] to further increase the robustness of speech recognition systems, especially for spontaneous speech. When humans speak to each other, they use multimodal communication, which increases the rate of successful information transfer when the communication takes place in a noisy environment. In speech recognition, the use of visual face information, especially lip movement, has been examined, and results show that using both modes of information gives better performance than using only the audio or only the visual information, especially in noisy environments.

SCARF: a software toolkit designed for speech recognition with segmental conditional random fields.

MICROPHONES: used by researchers for recording speech databases. Sony and I-ball have developed microphones that are unidirectional and noiseless.

7 PERFORMANCE OF SPEECH RECOGNITION SYSTEM
The performance of a speech recognition system is specified in terms of accuracy and speed. Accuracy is measured as the word error rate (WER), whereas speed is measured with the real-time factor.

Word Error Rate
WER is a common metric of speech recognition performance. Because the recognized word sequence can have a different length from the reference word sequence, measuring performance is not straightforward; the two sequences must first be aligned, and the errors counted:

WER = (S + D + I) / N

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the number of words in the reference.

Sometimes the word recognition rate (WRR) is used instead of WER when describing the performance of speech recognition:

WRR = 1 − WER
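The sketch below computes WER (and WRR) by aligning the reference and recognized word sequences with an edit-distance computation over words. This is one common way of obtaining the substitution, deletion and insertion counts; the example transcripts are made up purely for illustration.

def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # dist[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                                  # i deletions
    for j in range(m + 1):
        dist[0][j] = j                                  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # substitution or match
                             dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1)        # insertion
    return dist[n][m] / n

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.2f}, WRR = {1 - wer:.2f}")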
5   Small vocabulary                Large vocabulary
6   Clean speech recognition        Noisy/telephone speech recognition
7   Single-speaker recognition      Speaker-independent/adaptive recognition
8   Read speech recognition         Spontaneous speech recognition

[5] D. B. Fry and P. Denes, 1959, Theoretical Aspects of Mechanical Speech Recognition; The Design and Operation of the Mechanical Speech Recognizer at University College London, J. British Inst. Radio Engr., 19:4, pp. 211-299.

[6] K. Nagata, Y. Kato and S. Chiba, 1963, Spoken Digit Recognizer for Japanese Language, NEC Res. Develop., No. 6.

[7] T. Sakai and S. Doshita, 1962, The Phonetic Typewriter, Information Processing 1962, Proc. IFIP Congress.
[17] Nam Soo Kim et al., July 1995, On Estimating Robust Probability Distribution in HMM-based Speech Recognition, IEEE Transactions on Audio, Speech and Language Processing, Vol. 3, No. 4.

[18] Jean Francois, January 1997, Automatic Word Recognition Based on Second Order Hidden Markov Models, IEEE Transactions on Audio, Speech and Language Processing, Vol. 5, No. 1.

[19] Mohamed Afify and Olivier Siohan, January 2004, Sequential Estimation With Optimal Forgetting for Robust Speech Recognition, IEEE Transactions on Speech and Audio Processing, Vol. 12, No. 1.

[20] Giuseppe Riccardi, July 2005, Active Learning: Theory and Applications to Automatic Speech Recognition, IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 4.

[21] Mohamed Afify, Feng Liu and Hui Jiang, July 2005, A New Verification-Based Fast-Match for Large Vocabulary Continuous Speech Recognition, IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 4.

[22] S. Furui, 2005, Recent Progress in Corpus-Based Spontaneous Speech Recognition, IEICE Trans. Inf. & Syst., E88-D, 3, pp. 366-375.

[23] S. Furui, 2004, Speech-to-Text and Speech-to-Speech Summarization of Spontaneous Speech, IEEE Trans. Speech & Audio Processing, 12, 4, pp. 401-408.

[24] Eduardo Lleida et al., March 2000, Utterance Verification in Decoding and Training Procedures, IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 2.

[25] Geoff Bristow, 1986, Electronic Speech Recognition: Techniques, Technology and Applications, Collins.

[26] Doh-Suk Kim, 1999, Auditory Processing of Speech Signals for Robust Speech Recognition in Real World Noisy Environment, IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1.

[27] Adoram Erell et al., 1993, Filter Bank Energy Estimation Using Mixture and Markov Models for Recognition of Noisy Speech, IEEE Transactions on Audio, Speech and Language Processing, Vol. 1, No. 1.

[28] M. A. Anusuya and S. K. Katti, 2009, Speech Recognition by Machine: A Review, International Journal of Computer Science and Information Security, Vol. 6, No. 3.