Speech Recognition

Published by Justin Cook
Published by: Justin Cook on Nov 16, 2009
Copyright:Attribution Non-commercial


From Wikipedia, the free encyclopedia
Speech recognition (in many contexts also known as automatic speech recognition, computer speech recognition or erroneously as voice recognition) is the process of converting a speech signal to a sequence of words, by means of an algorithm implemented as a computer program.

Speech recognition applications that have emerged over the last few years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), domotic appliance control and content-based spoken audio search (e.g., find a podcast where particular words were spoken).

Voice recognition or speaker recognition is a related process that attempts to identify the person speaking, as opposed to what is being said.
Speech recognition technology
In terms of technology, most technical textbooks nowadays emphasize the hidden Markov model as the underlying technology. The dynamic programming approach, the neural network-based approach and the knowledge-based learning approach were studied intensively in the 1980s and 1990s.
Performance of speech recognition systems
The performance of a speech recognition system is usually specified in terms of accuracy and speed. Accuracy is measured with the word error rate, whereas speed is measured with the real time factor.

Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. Part of the confusion mainly comes from the mixed usage of the terms "speech recognition" and "dictation".

Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy (getting one to two words out of one hundred wrong) if operated under optimal conditions. These optimal conditions usually mean that the test subjects have 1) speaker characteristics that match the training data, 2) proper speaker adaptation, and 3) a clean environment (e.g., office space). (This explains why some users, especially those whose speech is heavily accented, might actually perceive the recognition rate to be much lower than the expected 98% to 99%.)

Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.

Both acoustic modeling and language modeling are important areas of study in modern statistical speech recognition. This entry focuses on the hidden Markov model (HMM) because it is very widely used in many systems. (Language modeling has many other applications, such as smart keyboards and document classification; see the corresponding entries.)

Carnegie Mellon University has made progress in increasing the speed of speech chips by using ASICs (application-specific integrated circuits) and reconfigurable chips called FPGAs (field-programmable gate arrays).
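
The two measures named at the start of this section can be made concrete with a short sketch. The Python below is an illustration only, not code from any particular recognizer. It assumes the usual definition of word error rate as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words, and of real time factor as processing time divided by audio duration. The function names word_error_rate and real_time_factor are made up for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumed definition: (substitutions + deletions + insertions) / len(reference).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration (assumed definition)."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    print(word_error_rate("call home now", "call home"))
    print(real_time_factor(5.0, 10.0))
```

Under this definition, a claimed accuracy of 98% to 99% corresponds roughly to a word error rate of 0.01 to 0.02, and a real time factor below 1 means the recognizer runs faster than the audio plays.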
Hidden Markov model (HMM)-based speech recognition
Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). These are statistical models which output a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal could be viewed as a piecewise stationary signal or a short-time stationary signal. That is, one could assume that over a short time in the range of 10 milliseconds, speech could be approximated as a stationary process. Speech could thus be thought of as a Markov model for many stochastic processes (known as

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model would output a sequence of n-dimensional real-valued vectors with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal covariance Gaussians which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.

Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phones (so phones with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation.
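
The very simplest setup described above (a Fourier transform of a short window, a cosine transform, the first 13 coefficients, and a mixture of diagonal covariance Gaussians scoring each vector) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the front end of any real system: the Hamming window, the logarithm of the power spectrum taken before the cosine transform, the small floor of 1e-10 and the naive discrete Fourier transform are choices made for this example, and practical systems normally add a mel filterbank, which is omitted here. The names cepstral_coefficients and diag_gmm_log_likelihood are hypothetical.

```python
import cmath
import math


def cepstral_coefficients(frame, num_coeffs=13):
    """First num_coeffs cepstral coefficients of one short frame of samples."""
    n = len(frame)
    # Hamming window (an assumed choice) to reduce leakage at the frame edges.
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    # Naive discrete Fourier transform; real input only needs bins 0..n/2.
    half = n // 2 + 1
    log_power = []
    for k in range(half):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(windowed)
        )
        # The small floor avoids log(0) on silent bins (assumption).
        log_power.append(math.log(abs(acc) ** 2 + 1e-10))
    # Unnormalized DCT-II of the log spectrum; keep only the first coefficients.
    return [
        sum(lp * math.cos(math.pi * c * (m + 0.5) / half)
            for m, lp in enumerate(log_power))
        for c in range(num_coeffs)
    ]


def diag_gmm_log_likelihood(vector, weights, means, variances):
    """Log-likelihood of one vector under a diagonal-covariance Gaussian mixture."""
    component_logs = []
    for weight, mean, var in zip(weights, means, variances):
        # With a diagonal covariance the density factorizes over dimensions.
        log_density = sum(
            -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(vector, mean, var)
        )
        component_logs.append(math.log(weight) + log_density)
    # Log-sum-exp over the components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


if __name__ == "__main__":
    # 25 ms of a 437 Hz tone sampled at 8 kHz (made-up test signal).
    tone = [math.sin(2 * math.pi * 437 * t / 8000) for t in range(200)]
    features = cepstral_coefficients(tone)
    print(len(features))
    print(diag_gmm_log_likelihood(features, [1.0], [[0.0] * 13], [[1.0] * 13]))
```

In an HMM-based recognizer, each state would own one such mixture, and the per-frame log-likelihoods would be the emission scores used during training and decoding.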

The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).
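
The best-path search mentioned above can be illustrated with a small Viterbi decoder. This is a sketch under simplifying assumptions, not a production decoder: it works on a single small HMM in the log domain, takes the per-frame emission log-likelihoods as given, and ignores the language model, pruning and the dynamic versus static (FST) composition choice discussed in the text. The function name viterbi and the toy two-state model in the usage example are made up for illustration.

```python
import math

NEG_INF = float("-inf")


def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence and its log score.

    log_init[s]     : log probability of starting in state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log-likelihood of frame t under state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    backpointers = []
    for frame in log_emit[1:]:
        new_score, pointers = [], []
        for s in range(num_states):
            # Best predecessor for state s at this frame.
            best_prev = max(
                range(num_states), key=lambda p: score[p] + log_trans[p][s]
            )
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score


if __name__ == "__main__":
    # Toy left-to-right model: state 0 may stay or advance, state 1 only stays.
    log_init = [0.0, NEG_INF]
    log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
    good, bad = math.log(0.9), math.log(0.1)
    log_emit = [[good, bad], [good, bad], [bad, good], [bad, good]]
    print(viterbi(log_init, log_trans, log_emit))
```

Working with log probabilities turns the products along a path into sums, which avoids numerical underflow on long utterances.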
Neural network-based speech recognition
Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used in general-purpose speech recognition applications they can handle low quality, noisy data and speaker
