
YANG TANG, DI WANG, JING BAI, XIAOYAN ZHU, MING LI

ACM, ACM Communications

Abstract. In this paper the authors propose a new paradigm, RSVP, that improves speech recognition and translation accuracy in question answering (QA) systems. A speech recognition system's main job is to translate the spoken word into text. Most of the best-known systems use a sound-responsive element, such as a microphone, to capture the speech and convert the variable sound pressure into an equivalent variation of an electrical signal. Through a sequence of front-end processing and pattern matching steps, the electrical signal is matched against a set of words, and the identified words are then passed through a decision rule that selects a final text for each spoken word. Even though a great deal of progress has been made in the field, speech recognition remains a difficult problem because of the many sources of variability in the signal captured by the sound-responsive element. Phonetic and acoustic variability are typically addressed using training or speaker-independent recognition techniques. This paper proposes a different technique, which is domain-based and functions as a search engine: the RSVP. The RSVP system combines a search engine, a database of questions, and, at the heart of its technology, the concept of information distance.

1- Introduction
Speech recognition technology is becoming very useful in many major and ubiquitous applications such as command and control, data entry, document preparation, and integrated voice response. But the science remains difficult, mainly because of the variability in the signal captured by a sound-responsive element or device. The two main sources of this variability fall under two categories: acoustic and phonetic. To curtail the effect of signal variability on the accuracy of the transcription, the major vendors in the field generally adopt one of two main solutions: speaker-independent speech recognition or training. The training strategy overcomes the signal variability to some extent and provides a more accurate transcription by analyzing a person's specific voice and using it to fine-tune the recognition of that person's speech, whereas speaker-independent speech recognition uses

feature analysis of the speech. It first processes the input using Fourier transforms or linear predictive coding, then tries to find characteristic similarities between the expected inputs and the actual digitized voice input (similarities that will be present across a wide range of speakers, so there is no need to train the system for each user). This approach has proven good at handling, with some degree of success, accents and varying speed of delivery, pitch, volume, and inflection, but not some of the greatest hurdles, such as the variety of accents and inflections used by speakers of different nationalities. Overall, recognition accuracy for speaker-independent systems has been shown to be somewhat lower than for speaker-dependent systems, usually between 90 and 95 percent. The RSVP system proposes a method to improve the accuracy of solutions like the speaker-independent method in the specific domain of question answering. It is a system that combines a powerful search engine with an extensive question database. As input, RSVP takes the speech recognizer's output text; in the tests conducted, up to three text outputs were considered for each spoken question. At the heart of most speech recognition systems and applications is the application of statistical models such as the hidden Markov model (HMM). A hidden Markov model is a composition of two stochastic processes: a hidden Markov chain, which accounts for temporal variability, and an observable process, which accounts for spectral variability. It is this combination that has helped traditional speech recognition paradigms cope with acoustic and phonetic variability. At the heart of the RSVP search engine, however, is another paradigm, heavily based on the concept of information distance.
The information distance concept here is a mathematical model using Kolmogorov complexity (and improvements made to the theory over the years) to measure the similarity between the output of the speech recognition system (the text translation of the spoken question) and one or more questions in the RSVP database. Based on the distance value (or degree of similarity), the system selects a question from the database. The selected question is then compared to the original question to determine how accurate the output was. RSVP is a QA-domain system, and can therefore only improve the result of speech recognition when that speech is a question; its goal is to find the original question.

2- Background
Speech recognition is a hard problem, but the results of many years of research have produced solutions to some of its key difficulties. These solutions not only allowed tremendous progress on the scientific front, but also made it possible to create mathematical models which in turn spun off sophisticated algorithms used in many commercial speech recognition applications that enjoy some degree of success today. Some of the most widely used mathematical models and algorithms in this field are the hidden Markov model, the Expectation-Maximization algorithm, the Baum-Welch algorithm, the Viterbi N-best algorithm, and the N-gram language model. To understand the background of the work done and the key problems that have been overcome, we will walk through the process (Figure 2) that goes on in a typical speech recognition system, and at each step describe some of the key problems and explain some of the methods, models, and algorithms used to overcome them. This process is based on the Bayesian framework, which models the recognizer as a sequence of processes along with a set of databases. The recognizer does its work through three processing steps: 1) feature analysis, 2) pattern matching, and 3) confidence scoring. The databases are a set of trained resources (the acoustic model, the word lexicon, and the language model) used by the processes along the way to achieve the goal of outputting the correct transcription of the spoken words or sentences.

In a typical speech recognition setting we have a speaker who speaks a sequence of words that conceptually represent a sentence, W. The speech production mechanism then produces a speech waveform, s(n), which embodies all the words of W along with the extraneous sounds and pauses in the spoken input. From there the recognizer takes over, and the first thing it does is convert the speech signal s(n) into a sequence of spectral feature vectors X, where the feature vectors are measured at a constant interval (typically 10 ms) over the duration of the speech. This is the role of feature analysis, which computes a set of features that make up the spectral properties (the spectral vector) of each recorded speech sound. The typical and standard feature set is the MFCCs (mel-frequency cepstral coefficients) (Davis and Mermelstein, 1980), along with the first- and second-order derivatives of these features. The feature analysis process also samples, quantizes, cleans, segments into frames, and windows the signal; this is followed by a spectral analysis using a fast Fourier transform (Rabiner and Gold, 1975) or linear predictive coding (Atal and Hanauer, 1971; Markel and Gray, 1976), a conversion from a linear frequency scale to a mel frequency scale, a cepstral analysis (Davis and Mermelstein, 1980), equalization and normalization of the cepstral coefficients (Rahim and Juang, 1996), and finally the computation of the first- and second-order MFCC derivatives, which completes the feature analysis process. The second step's main goal is to find the closest matching word string Ŵ for each spoken sentence W. The task consists of: 1) using a syntactic decoder to generate all possible valid sentences in the language defined for the system (commonly called the task language), 2) computing a score for each generated sentence, and 3) choosing the sentence with the highest score as the recognized string Ŵ.
This method of choice is called the maximum a posteriori (MAP) decision principle, defined by Bayes. This second step is handled by the pattern matching process.

Pattern Matching process

Mathematically, the goal of the pattern matching process is to find the string of words Ŵ that maximizes the a posteriori probability of the spoken sentence W, given the feature vector X computed in step 1:

Ŵ = arg maxW P(W|X)

which can be rewritten using Bayes' law as:

Ŵ = arg maxW P(X|W) P(W) / P(X)
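The MAP decision rule above can be illustrated with toy numbers. In the sketch below, the candidate sentences, the acoustic likelihoods P(X|W), and the language-model priors P(W) are all invented for illustration; since P(X) is the same for every candidate, it can be dropped from the argmax.

```python
def map_decode(acoustic_score, lm_prob, candidates):
    """Pick the sentence W maximizing P(X|W) * P(W): a toy MAP decision.

    acoustic_score: dict mapping sentence -> acoustic likelihood P(X|W)
    lm_prob:        dict mapping sentence -> language-model prior P(W)
    """
    return max(candidates, key=lambda w: acoustic_score[w] * lm_prob[w])

# Made-up numbers: the acoustic model is nearly ambiguous between the two
# transcriptions, but the language model strongly prefers the first.
acoustic = {"recognize speech": 0.40, "wreck a nice beach": 0.45}
lm = {"recognize speech": 0.30, "wreck a nice beach": 0.02}
best = map_decode(acoustic, lm, list(acoustic))
print(best)  # -> recognize speech
```

This also shows why the language model matters: the acoustically better-scoring string loses once the prior is factored in.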

In practical terms this process is done in three steps: 1) Acoustic modeling: assign probabilities to the string equivalents of each measured acoustic signal. The most used model to accomplish this is the statistical model called the hidden Markov model (HMM) (Levinson et al., 1983; Ferguson, 1980; Rabiner, 1989; Rabiner and Juang, 1985). It is a trained model that learns the statistics of the features, X, for each acoustic signal of each word or subword defined in the task language of the specific system. Using the HMM, a quantity PA(X|W) is computed. 2) Assign probabilities, PL(W), to sequences of words recognized as valid sentences in the language. One of the methods that works best is based on a Markovian assumption that the probability of a word in a sentence is conditioned only on the previous N-1 words. That is the N-gram language model, formulated as follows:

PL(W) = P(w1, w2, ..., wK) = ∏n P(wn | wn-1, wn-2, ..., wn-N+1)

where P(wn | wn-1, ..., wn-N+1) is estimated by simply counting the relative frequencies of N-tuples in a large corpus of text. 3) Search through the word sequences in the language to find the word sequence with the maximum a posteriori probability value. In other words, we are trying to find the string sequence with the maximum likelihood.
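The relative-frequency estimation of N-gram probabilities can be sketched for the bigram case (N = 2); the three-sentence corpus below is invented for illustration.

```python
from collections import Counter

def bigram_model(corpus_sentences):
    """Estimate P(w_n | w_{n-1}) by relative frequency of bigrams in a corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        words = ["<s>"] + sent.split()  # <s> marks the sentence start
        for prev, cur in zip(words, words[1:]):
            unigrams[prev] += 1          # count of prev as a bigram start
            bigrams[(prev, cur)] += 1    # count of the pair (prev, cur)
    return lambda prev, cur: bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

corpus = ["the cat sat", "the cat ran", "the dog sat"]
p = bigram_model(corpus)
print(p("the", "cat"))  # "the" is followed by "cat" in 2 of its 3 occurrences
```

Real systems also smooth these counts, since any N-tuple unseen in the corpus would otherwise get probability zero.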

So for all practical purposes the pattern matching process is a decoder (Ney, 1984; Paul, 2001; Mohri, 1997) whose ultimate goal is to find the best match for the spectral feature vector of the input signal. In other words, it has to match the spoken words with a word sequence that is consistent with the language model and that has the highest likelihood among all the possible word sequences in the language. To achieve that goal it uses information (mostly probability values) from the acoustic model, the language model, and the word lexicon. Its real challenge comes from the third step mentioned above, in which it searches through a space that can be astronomically large and therefore costs a great deal of computing time. The challenge is to find a way to reduce that cost by orders of magnitude while still producing correct results. The most effective method used to that effect

was adopted from the field of finite state automata theory. The method provides finite state networks (FSNs) (Mohri, 1997) and a search technique based on dynamic programming. Without going into all the details, other techniques, among them weighted finite state transducers (WFSTs), are used to further reduce the size of the network and therefore speed up the search without loss of accuracy. Once a word sequence is selected, control is handed off to the confidence scoring process.

Confidence Scoring process
The scoring process has a very simple task: error checking, and identification of events related to the definition of the language vocabulary that may have affected the outcome.

The Databases
The word "database" tells only half of what the acoustic model, the word lexicon, and the language model do. They do collect most of the data used by pattern matching, but in their own right they are powerful processes with sophisticated algorithms.

The Acoustic Model: The main job of this database is to use probability measures and models to characterize sounds. The model most widely used to achieve that goal is the hidden Markov model (HMM) (Levinson et al., 1983; Ferguson, 1980; Rabiner, 1989; Rabiner and Juang, 1985). The HMM is trained by an iterative process that uses a mixture of Gaussian densities and a training set, updated and improved in each iteration until an optimal alignment and match is achieved. It can be represented as a state machine in which each state carries a probability density function characterizing the statistical behavior of the feature vectors of each subword unit in the task language, and the state transitions carry the probability pij of transitioning from state i to state j.
Utilizing the Baum-Welch algorithm (Rabiner, 1989; Baum, 1972; Baum et al., 1970) along with the state machine and a training set of words and sentences, the model can be trained to align each spoken word with its corresponding set of subwords. The algorithm is iterated until a stable alignment, one that allows the creation of a stable model for each subword, is reached. In each iteration the algorithm goes through the following three steps: 1) likelihood evaluation: calculate the probability of the observations given the model; 2) decoding: choose the optimal state sequence for a given speech utterance; 3) re-estimation: adjust the parameters of the model to maximize the probability.

The Word Lexicon: This database is more than anything a dictionary that defines the range of pronunciations of the words in the task vocabulary. It is needed to deal with the variability in pronunciation of many words due to variability of accent, and because some words have different meanings and pronunciations depending on the context in which they are used.
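The likelihood-evaluation step of the HMM training loop above is classically done with the forward algorithm, which sums over all state sequences. A minimal sketch follows; the two-state model, its transition table, and its emission probabilities are all invented for illustration.

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: compute P(obs | model) for a discrete-output HMM.

    alpha[s] holds the probability of having seen the observations so far
    and being in state s; each step folds in one more observation.
    """
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# Hypothetical two-state model over the symbols "a" and "b".
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
print(forward_likelihood(["a", "b"], states, start, trans, emit))  # -> 0.209
```

The decoding step (2) replaces the sum with a max, which is exactly the Viterbi algorithm, and re-estimation (3) reuses these forward quantities together with the analogous backward ones.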

The Language Model: The task of the language model is to define validity rules for the spoken input and make it possible to compute the probability of the word string, W, given the language model.

Speech recognition systems have been implemented for many different applications, tasks, and purposes, and many of them are still around, achieving with some degree of success the goals for which they were built. Overall, the results of the different performance evaluations have shown that the more restricted and domain-focused the system, the more effective and accurate its output. The performance measure varies with the task the system is designed to perform and the domain in which it is used. For a simple word recognition system, for example, the performance measure is simply the word error rate. For more complex systems, such as dictation applications, other errors such as word insertion, word substitution, and word deletion (Pallet and Fiscus, 1997) can occur and must be taken into account. The conventional formula used in those cases is:

WER = (NI + NS + ND) / |W|

where:
NI = number of word insertion errors,
NS = number of word substitution errors,
ND = number of word deletion errors,
|W| = the number of words in the sentence W being scored.
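In practice NI, NS, and ND come from a word-level edit-distance (Levenshtein) alignment between the reference sentence and the recognizer's hypothesis. The sketch below is simplified to return the total edit count rather than the three error types separately; the example sentences are invented.

```python
def wer_counts(ref, hyp):
    """Word-level edit distance between reference and hypothesis.

    Returns (total_errors, reference_length), where total_errors is the
    combined count of substitutions, deletions, and insertions.
    """
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)], len(r)

errors, n = wer_counts("the cat sat on the mat", "the cat sit on mat")
print(errors / n)  # 1 substitution + 1 deletion over 6 reference words
```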

Table 1 shows the word error rates of some speech recognition systems.

Corpus                                        | Type of speech         | Vocabulary size | Word Error Rate
Connected digit string (TI database)          | Spontaneous            | 11 (0-9, oh)    | 0.3%
Connected digit string (AT&T mall recordings) | Spontaneous            | 11 (0-9, oh)    | 2.0%
Connected digit string (AT&T HMIHY)           | Conversational         | 11 (0-9, oh)    | 5.0%
Resource management (RM)                      | Read speech            | 1,000           | 2.0%
Airline travel information system (ATIS)      | Spontaneous            | 2,500           | 2.5%
North American business (NAB & WSJ)           | Read text              | 64,000          | 6.6%
Broadcast news                                | Narrated news          | 210,000         | ~15%
Switchboard                                   | Telephone conversation | 45,000          | ~27%
Call-home                                     | Telephone conversation | 28,000          | ~35%

Table 1: Word error rates for a range of speech recognition systems.

3- Methods

The added module and technique proposed in this article is essentially based on using the concept of information distance to further reduce the error rate of a QA-domain speech recognition system. The module takes the output of a QA speech recognition system in text format and runs it through an algorithm that tries to find, from its question database, the one question that is closest and most likely to be the original question. The algorithm behind the concept of information distance is based on a mathematical framework using at its core what is known in algorithmic information theory as Kolmogorov complexity (introduced in the 1960s). The Kolmogorov complexity of a string x, denoted K(x), is defined as a measure of the computational resources needed to specify the string x. The conditional Kolmogorov complexity K(x|y) of a binary string x given y, with respect to a fixed universal Turing machine U, is defined as the length of the shortest (prefix-free) program that outputs x when given y as input. Bennett et al. proposed to define the information distance between two strings x and y as the minimum number of bits needed to convert x to y and vice versa, via the following formula based on Kolmogorov complexity:

E(x,y) = max{K(x|y), K(y|x)}.

This formula was further improved by Li et al., who normalized it and introduced a complementary information distance metric to improve on Bennett et al.'s formula. They proposed the following normalized form of the information distance:

d(x,y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}
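Kolmogorov complexity itself is not computable, so practical systems approximate K with a real compressor; replacing K by compressed length in the normalized information distance gives the normalized compression distance (NCD) of Cilibrasi and Vitányi. A minimal sketch using Python's zlib follows; the example questions are invented, and zlib is only one possible choice of compressor.

```python
import zlib

def C(s: str) -> int:
    """Approximate K(s) by the compressed length of s."""
    return len(zlib.compress(s.encode()))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance, a computable stand-in for
    max{K(x|y), K(y|x)} / max{K(x), K(y)}."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

q1 = "what is the capital of france"
q2 = "what is the capital of france please"
q3 = "how do hidden markov models work"
print(ncd(q1, q2) < ncd(q1, q3))  # near-duplicates compress better together
```

The intuition is that if y already contains most of the information in x, compressing the concatenation x+y adds little beyond compressing the larger string alone, so the distance is small.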

or Dmin(x,y) = min{K(x|y), K(y|x)}, a complementary information distance that disregards irrelevant information. With a stable formula for the information distance, the problem statement is: given an input question I and a database Q, find a question q that minimizes

Dmin(q,Q) + Dmax(I,q)

where Dmin(q,Q) measures the distance between q and Q, and Dmax(I,q) measures the distance between q and I. Minimizing that equation requires solving three problems: 1) encoding q using Q, 2) encoding I using q or vice versa, and 3) finding all possible candidates q and the q0 that minimizes the equation. The encoding problem is solved by partitioning Q into templates p, each covering a subset of the questions in Q, and encoding a question q according to a pattern or template p. This technique reduces the item-to-set encoding to item-to-item encoding: to convert one sentence into another we need encode only the word mismatches and the missing words. The best alignment between two sentences is found using a standard dynamic programming algorithm. In its practical implementation, the RSVP process goes through the following steps:

1) Analyze input

This step splits the input into words in order to find the best alignment among the input questions.
2) Improve input questions
Build a question based on the word-alignment results from step 1.
3) Analyze relevant database patterns
Find the relevant database questions, sorting them based on their semantic and syntactic similarities to the improved input questions from step 2.
4) Generate the candidate questions

Map the original input questions onto the patterns extracted from the database and replace the words in each pattern with the words in the input.
5) Rank candidate questions using information distance
Compute the distances between the candidate questions and the input questions, K(q|I) and K(I|q), and the distance between each candidate question and its pattern, K(q|p), then rank all the candidates using Dmin(x,y) = min{K(x|y), K(y|x)}.
6) Return the candidate with the minimum information distance score as the final result.
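The steps above can be sketched end-to-end in miniature. This is not the paper's implementation: the word alignment of step 1 is stood in for by Python's difflib, K(x|y) is approximated by the extra compressed length x adds after y (zlib), and the recognizer output and database questions are all invented for illustration.

```python
import difflib
import zlib

def align_words(a: str, b: str):
    """Word-level alignment of two sentences; returns only the mismatched
    and missing words, i.e. what would need to be encoded to convert one
    sentence into the other."""
    aw, bw = a.split(), b.split()
    sm = difflib.SequenceMatcher(a=aw, b=bw)
    return [(tag, aw[i1:i2], bw[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

def C(s: str) -> int:
    return len(zlib.compress(s.encode()))

def cond_c(x: str, y: str) -> int:
    """Approximate K(x|y) by how much extra compressed length x adds after y."""
    return C(y + x) - C(y)

def rank_candidates(recognized: str, candidates):
    """Step 5: return the database question closest to the recognizer output
    under an approximation of Dmin(x,y) = min{K(x|y), K(y|x)}."""
    dmin = lambda q: min(cond_c(q, recognized), cond_c(recognized, q))
    return min(candidates, key=dmin)

heard = "what is the captal of frans"  # hypothetical noisy recognizer output
db = ["what is the capital of france",
      "who is the president of france",
      "how tall is the eiffel tower"]
print(align_words(heard, db[0]))   # the word mismatches to be encoded
print(rank_candidates(heard, db))  # -> what is the capital of france
```

Even with a crude compressor standing in for K, the candidate sharing the most structure with the noisy input wins the ranking, which is the behavior the information-distance formulation is after.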

Description of algorithms, approaches, and implementation details.

4- Evaluation
Evaluation criteria, comparisons, discussion, pros and cons, limitations.

5- Conclusion



