
Voice Recognition

Authors - Ketan Munankar, Kushal Ruwatia, Hitesh Patil
Computer Department, Mumbai University, Konkan Gyanpeeth College of Engineering, Karjat, Dist. Raigad, Maharashtra, India

INTRODUCTION
A voice recognition system is designed to recognize spoken words and convert them into digital data, computer commands, or text. Voice recognition software allows an operator to use his or her voice as an input. A voice recognition system can be used to dictate text into the computer or to give commands to the computer, such as opening application programs, pulling down menus, or saving work. As the user speaks, the voice recognition software remembers the way that each word is said. This degree of customization enables consistent voice recognition despite differences in accent, dialect, and inflection. Speech pattern techniques involve a pattern recognition approach that requires no explicit knowledge of speech: a speech pattern is trained from some generic spectral parameter set, and patterns are then recognized through pattern comparison. Such systems are being developed out of the desire to use speech instead of typing to exchange information, since humans are used to communicating this way and can do so at greater speed. Using voice-activated software, operators can translate between languages such as Chinese or Japanese and English.

The term "voice recognition" is sometimes used to refer to recognition systems that must be trained to a particular speaker, as is the case for most desktop recognition software; recognizing the speaker can simplify the task of translating speech. Speech recognition converts spoken words to text. Speech recognition is a broader solution which refers to technology that can recognize speech without being targeted at a single speaker, such as a call-center system that can recognize arbitrary voices. Speech recognition applications include voice user interfaces such as voice dialing (e.g., "Call home"), domestic appliance control, search (e.g., finding a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a report), speech-to-text processing (e.g., word processors or email), and aircraft (usually termed Direct Voice Input). University researchers found no significant increase in recognition accuracy.

HISTORY
The first speech recognizer appeared in 1952 and consisted of a device for the recognition of single spoken digits. Another early device was the IBM Shoebox, exhibited at the 1964 New York World's Fair. One of the most notable domains for the commercial application of speech recognition in the United States has been health care, and in particular the work of the medical transcriptionist. It was also the case that voice recognition at that time was often technically deficient. The biggest limitation to speech recognition automating transcription, however, is seen as the software: the nature of narrative dictation is highly interpretive and often requires judgment that may be provided by a real human but not yet by an automated system. Another limitation has been the extensive amount of time required by the user and/or system provider to train the software.

A distinction in voice recognition is often made between "artificial syntax systems", which are usually domain-specific, and "natural language processing", which is usually language-specific. Each of these types of application presents its own particular goals and challenges.
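As a concrete illustration of the dictation and command-giving usage described in the introduction, here is a minimal sketch using the third-party Python package SpeechRecognition. The package choice, the Google web recognizer call, and the command phrases are assumptions made for illustration only; they are not tools used or evaluated in this paper.

```python
# Minimal dictation/command sketch, assuming the third-party
# "SpeechRecognition" package and a working microphone are available.
import speech_recognition as sr

COMMANDS = {"open notepad", "save work", "call home"}  # hypothetical command set

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
    print("Speak now...")
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio)  # cloud recognizer; needs internet
    if text.lower() in COMMANDS:
        print(f"Command recognized: {text}")   # would trigger the mapped action
    else:
        print(f"Dictated text: {text}")        # otherwise treat it as dictation
except sr.UnknownValueError:
    print("Speech was not understood.")
except sr.RequestError as err:
    print(f"Recognition service unavailable: {err}")
```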

APPLICATIONS

HEALTH CARE
In the health care domain, even in the wake of improving speech recognition technologies, medical transcriptionists have not yet become obsolete. The services provided may be redistributed rather than replaced. Speech recognition can be implemented in the front end or the back end of the medical documentation process. Front-end voice recognition is where the provider dictates into a speech-recognition engine, the recognized words are displayed right after they are spoken, and the dictator is responsible for editing and signing off on the document; it never goes through an MT/editor. Back-end voice recognition, or deferred voice recognition, is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognized draft document is routed along with the original voice file to the MT/editor, who edits the draft and finalizes the report. Deferred voice recognition is widely used in the industry currently. Many Electronic Medical Records (EMR) applications can be more effective and may be performed more easily when deployed in conjunction with a speech-recognition engine; searches, queries, and form filling may all be faster to perform by voice than by using a keyboard.

MILITARY

HIGH-PERFORMANCE FIGHTER AIRCRAFT
Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), a program in France installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays.

Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found that recognition deteriorated with increasing g-loads. It was also concluded that adaptation greatly improved the results in all cases, and that introducing models for breathing improved recognition scores significantly. Contrary to what might be expected, no effects of the broken English of the speakers were found. It was evident that spontaneous speech caused problems for the recognizer, as could be expected. A restricted vocabulary, and above all a proper syntax, could thus be expected to improve recognition accuracy substantially.

The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template. The system is not used for any safety-critical or weapon-critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to himself with two simple voice commands, or to any of his wingmen with only five commands.

Speaker-independent systems are also being developed and are in testing for the F-35 Lightning II (JSF) and the Aermacchi M-346 lead-in fighter trainer. These systems have produced word accuracies in excess of 98%.

HELICOPTERS
The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot generally does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and evaluation programs have been carried out in the past decade on speech recognition systems applications in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma helicopter, and there has also been much useful work in Canada. Results have been encouraging, and voice applications have included control of communication radios, setting of navigation systems, and control of an automated target handover system. As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done both in speech recognition and in overall speech recognition technology in order to consistently achieve performance improvements in operational settings.

TRAINING AIR TRAFFIC CONTROLLERS
Training for air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller which simulates the dialog the controller would have to conduct with pilots in a real ATC situation. Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel. Air controller tasks are also characterized by highly structured speech as the primary output of the controller; hence, reducing the difficulty of the speech recognition task should be possible. The USAF, USMC, US Army, US Navy, and FAA, as well as the Royal Australian Air Force and the civil aviation authorities in Italy, Brazil, and Canada, are currently using ATC simulators with speech recognition from a number of different vendors.

BATTLE MANAGEMENT
Battle management command centers generally require rapid access to and control of large, rapidly changing information databases. Commanders and system operators need to query these databases as conveniently as possible, in an eyes-busy environment where much of the information is presented in a display format. Human-machine interaction by voice has the potential to be very useful in these environments. A number of efforts have been undertaken to interface commercially available isolated-word recognizers into battle management environments. In one feasibility study, speech recognition equipment was tested in conjunction with an integrated information display for naval battle management applications. Users were very optimistic about the potential of the system, although capabilities were limited. Speech understanding programs sponsored by the Defense Advanced Research Projects Agency (DARPA) in the U.S. have focused on this problem of natural speech interface. Speech recognition efforts have focused on a database of continuous speech recognition (CSR), large-vocabulary speech which is designed to be representative of the naval resource management task. Significant advances in the state of the art in CSR have been achieved, and current efforts are focused on integrating speech recognition and natural language processing to allow spoken language interaction with a naval resource management system.

FURTHER APPLICATIONS
• Automatic translation
• Automotive speech recognition (e.g., Ford Sync)
• Telematics (e.g., vehicle navigation systems)
• Court reporting (real-time voice writing)
• Hands-free computing: voice-command computer user interfaces
• Home automation
• Interactive voice response
• Mobile telephony, including mobile email
• Multimodal interaction
• Pronunciation evaluation in computer-aided language learning applications
• Robotics
• Video games, with Tom Clancy's EndWar and Lifeline as working examples
• Transcription (digital speech-to-text)
• Speech-to-text (transcription of speech into mobile text messages)

TELEPHONY AND OTHER DOMAINS
Voice recognition in the field of telephony is now commonplace, and in the field of computer gaming and simulation it is becoming more widespread. Despite the high level of integration with word processing in general personal computing, voice recognition in the field of document production has not seen the expected increases in use. The improvement of mobile processor speeds made the speech-enabled Symbian and Windows Mobile smartphones feasible. Speech is used mostly as a part of the user interface, for creating pre-defined or custom speech commands. Leading software vendors in this field are Microsoft Corporation (Microsoft Voice Command), Nuance Communications (Nuance Voice Control), Vito Technology (VITO Voice2Go), Speereo Software (Speereo Voice Translator), and SVOX.

WORKING MODEL OF VOICE RECOGNITION
Both acoustic modeling and language modeling are important parts of modern statistically based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling has many other applications as well, such as smart keyboards and document classification.

ACOUSTIC MODEL
Speech recognition engines require two types of files to recognize speech. They require an acoustic model, which is created by taking audio recordings of speech and their transcriptions (taken from a speech corpus) and "compiling" them into statistical representations of the sounds that make up each word (through a process called "training"). They also require a language model or grammar file. A language model is a file containing the probabilities of sequences of words. A grammar is a much smaller file containing sets of predefined combinations of words. Language models are used for dictation applications, whereas grammars are used in desktop command-and-control or telephony interactive voice response (IVR) type applications.

Audio can be encoded at different sampling rates (i.e., samples per second; the most common being 8 kHz, 16 kHz, 32 kHz, 44.1 kHz, 48 kHz, and 96 kHz) and at different bits per sample (the most common being 8, 16, or 32 bits). Speech recognition engines work best if the acoustic model they use was trained with speech audio recorded at the same sampling rate and bits per sample as the speech being recognized.
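To make the sampling-rate point concrete, the sketch below reads a WAV file's header with Python's standard wave module and checks it against the format an acoustic model is assumed to expect. The file name and the 16 kHz / 16-bit / mono target are illustrative assumptions, not values mandated by any particular engine.

```python
# Sketch: verify that input audio matches the format the acoustic model
# was (assumed to be) trained on: 16 kHz, 16-bit, mono.
import wave

EXPECTED_RATE = 16000      # samples per second assumed for the acoustic model
EXPECTED_SAMPWIDTH = 2     # bytes per sample (16 bits)
EXPECTED_CHANNELS = 1      # mono

def check_audio_format(path: str) -> bool:
    """Return True if the WAV file matches the assumed training format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
    ok = (rate == EXPECTED_RATE and width == EXPECTED_SAMPWIDTH
          and channels == EXPECTED_CHANNELS)
    if not ok:
        print(f"{path}: {rate} Hz, {8 * width}-bit, {channels} channel(s) "
              f"-- convert or resample before recognition for best accuracy")
    return ok

# Example with a hypothetical file:
# check_audio_format("utterance.wav")
```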

LANGUAGE MODEL
A statistical language model assigns a probability to a sequence of m words by means of a probability distribution. In speech recognition and in data compression, such a model tries to capture the properties of a language and to predict the next word in a speech sequence. Estimating the probability of sequences can become difficult in corpora in which phrases or sentences can be arbitrarily long, so some sequences are never observed during training of the language model (the data-sparseness problem of overfitting). For that reason these models are often approximated using smoothed N-gram models. Language modeling is used in many natural language processing applications, such as speech recognition, machine translation, part-of-speech tagging, parsing, and information retrieval.

When used in information retrieval, a language model is associated with each document in a collection. With a query Q as input, retrieved documents are ranked based on the probability that the document's language model would generate the terms of the query; this way of using language models in information retrieval is the query likelihood model. In practice, unigram language models are most commonly used in information retrieval. A unigram model calculates the probability of each word in isolation, without considering any influence from the words before or after the target one. This leads to the bag-of-words model, which turns out to generate a multinomial distribution over words and has proved sufficient for telling the topic of a piece of text.
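The following sketch illustrates the unigram query-likelihood idea described above: each document gets a unigram language model (word counts with add-one smoothing, a simple stand-in for the smoothed N-gram models mentioned), and documents are ranked by the probability of generating the query. The toy documents, query, and smoothing choice are assumptions made for illustration.

```python
# Toy query-likelihood ranking with smoothed unigram language models.
from collections import Counter

docs = {  # hypothetical toy collection
    "d1": "speech recognition converts spoken words to text",
    "d2": "the language model assigns a probability to a sequence of words",
}

def unigram_lm(text: str):
    """Build a unigram model: word -> add-one-smoothed probability."""
    counts = Counter(text.split())
    total = sum(counts.values())
    vocab_size = len(counts)
    return lambda w: (counts[w] + 1) / (total + vocab_size + 1)

models = {name: unigram_lm(text) for name, text in docs.items()}

def score(query: str, model) -> float:
    """P(Q | document model) under the bag-of-words assumption."""
    p = 1.0
    for word in query.split():
        p *= model(word)
    return p

query = "probability of words"
ranking = sorted(docs, key=lambda d: score(query, models[d]), reverse=True)
print(ranking)  # documents ordered by query likelihood
```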
HIDDEN MARKOV MODEL
Modern general-purpose speech recognition systems are based on hidden Markov models (HMMs). These are statistical models which output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal: over a short time scale (e.g., 10 milliseconds), speech can be approximated as a stationary process, so speech can be thought of as a Markov model for many stochastic purposes. Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use.

In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech, de-correlating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution that is a mixture of diagonal-covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes.

Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so that phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; and for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and, in addition, might use heteroscedastic linear discriminant analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection, followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as the maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques, which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data; examples are maximum mutual information (MMI), minimum classification error (MCE), and minimum phone error (MPE). Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).
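To show what "finding the best path" means in practice, here is a minimal Viterbi decoder over a toy HMM with discrete emissions, standing in for the Gaussian-mixture likelihoods over cepstral feature vectors described above. The two-state model and all its probabilities are made-up numbers for illustration, not parameters from any real recognizer.

```python
# Toy Viterbi decoding: most likely hidden-state sequence for an HMM.
# Real recognizers use log-likelihoods from Gaussian mixtures over
# cepstral feature vectors; here emissions are a small discrete table.
import math

states = ["s1", "s2"]                      # hypothetical phone-like states
start  = {"s1": 0.6, "s2": 0.4}
trans  = {"s1": {"s1": 0.7, "s2": 0.3},
          "s2": {"s1": 0.4, "s2": 0.6}}
emit   = {"s1": {"a": 0.5, "b": 0.5},
          "s2": {"a": 0.1, "b": 0.9}}

def viterbi(observations):
    """Return (best log-probability, best state path)."""
    # Initialization with the first observation.
    vit = [{s: (math.log(start[s]) + math.log(emit[s][observations[0]]), [s])
            for s in states}]
    # Recursion: keep the best-scoring predecessor for each state.
    for obs in observations[1:]:
        prev = vit[-1]
        current = {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: prev[p][0] + math.log(trans[p][s]))
            logp = (prev[best_prev][0] + math.log(trans[best_prev][s])
                    + math.log(emit[s][obs]))
            current[s] = (logp, prev[best_prev][1] + [s])
        vit.append(current)
    # Termination: pick the best final state.
    return max(vit[-1].values(), key=lambda t: t[0])

print(viterbi(["a", "b", "b", "a"]))
```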

DYNAMIC TIME WARPING (DTW) BASED SPEECH RECOGNITION
Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed; in general, it is a method that allows a computer to find an optimal match between two given sequences (e.g., time series) with certain restrictions. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW copes with different speaking speeds by "warping" the sequences non-linearly to match each other. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW. A well-known application has been automatic speech recognition. This sequence-alignment method is often used in the context of hidden Markov models.
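A minimal dynamic-programming implementation of the DTW distance described above is sketched below. The two example sequences are arbitrary, and a real recognizer would compare frame-wise feature vectors (e.g., cepstral coefficients) rather than scalars.

```python
# Minimal dynamic time warping: cost of the best non-linear alignment
# between two sequences that may differ in length and speed.
def dtw_distance(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])       # local distance
            cost[i][j] = d + min(cost[i - 1][j],       # insertion
                                 cost[i][j - 1],       # deletion
                                 cost[i - 1][j - 1])   # match
    return cost[n][m]

# The slower, longer sequence still aligns closely with the shorter
# template -- the "warping" the text describes.
template = [1, 2, 3, 4, 3, 2, 1]
utterance = [1, 1, 2, 2, 3, 3, 4, 4, 3, 3, 2, 2, 1, 1]
print(dtw_distance(template, utterance))
```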
ABOUT VOICE RECOGNITION
Traditionally, many students with learning disabilities have depended on parents, teachers, instructors, readers, or friends to transcribe their papers as they dictate. While this has been an effective writing strategy, there are some severe drawbacks to this type of dictation. First, it makes the student with a learning disability dependent on another person. Secondly, it is difficult for the writer to have a sense of the flow of the writing without having the draft to read and reread as each sentence develops. The obvious advantage of voice recognition technology for students with learning disabilities, and for learners who have difficulty expressing themselves in writing, is that it circumvents the transcription process entirely. Voice recognition software allows students to speak into their computers, which turn their oral language into text. With the advances of voice recognition technology, students with learning disabilities are now able to become more independent as writers.

Today, voice recognition products employ continuous speech, which allows the user to speak at a natural pace. Once a significant financial investment, voice recognition programs are now more widely available and are even included in many word-processing programs. There are various voice recognition programs available; for optimal performance, all require a computer with a significant amount of memory and processor speed (hardware), a headset/microphone, the speech recognition software, and a superior interface connecting to the sound system. As with all technology, voice recognition technology is most effective when combined with direct skills instruction in punctuation, sentence skills, and paragraphing.

PEOPLE WITH DISABILITIES
People with disabilities can benefit from speech recognition programs. Speech recognition is especially useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to involved disabilities that preclude using conventional computer input devices; in fact, people who used the keyboard a lot and developed RSI became an urgent early market for speech recognition. Speech recognition is also used in deaf telephony, such as voicemail to text, relay services, and captioned telephone. Individuals with learning disabilities who have problems with thought-to-paper communication (essentially, they think of an idea but it is processed incorrectly, causing it to end up differently on paper) can benefit from the software.

EQUIPMENT

A GOOD SIGNAL
Successful voice recognition requires a clear recording of the user's voice into the computer. There are two components to clear recording: microphone quality and sound card quality. Both of these components seriously affect the accuracy of the voice files during training.

MICROPHONE
The majority of voice recognition software packages are sold with an acceptable microphone, generally worn on a headset. However, the microphone often does not continue to function over an extended period of time. It is beneficial to use the same microphone for dictation that was used for training.

SOUND CARD
New computers typically have an adequate sound card installed by the manufacturer. However, the sound card signal is not always sufficient for good voice recognition, and this creates a problem. A quick solution is to purchase a USB (universal serial bus) sound system or USB microphone. A USB sound connection bypasses the sound card, bringing the signal directly to the dictation software.

TRAINING
In order to have a useful voice recognition system, users must first train the dictation software to recognize their voice. Voice recognition software recognizes the user's voice; once the user can establish good, effective voice files, whatever is dictated into the microphone will appear as an electronic, word-processed document. The most common complaint about voice recognition software is that the training was a waste of time, and this is often the point at which many new users become frustrated. Nevertheless, the most recent generation of voice recognition software has proven exceptionally accurate, and investing more time in training will greatly improve the overall results. The following suggestions for enhanced accuracy will help avoid the frustration often associated with the training of voice recognition software.

USING THE EQUIPMENT

FLUENT READERS
Given a clear signal from the microphone and sound card, a fluent reader should be able to train voice recognition software simply by following the directions provided in the set-up information. Although the dictation software may be fairly accurate after one passage is read, expect to read at least three of the stories provided before the software will be sufficiently accurate.

At times, the software will make mistakes when dictating. Many voice recognition software packages also have a "correct that" feature, which allows the student to amend dictation errors and further refine the training. As the software is used, it will learn the student's voice more accurately with each succeeding correction.

STUDENTS WITH DIFFICULTY IN DECODING
Often, the students who most need voice recognition software are also the students for whom the training can be the most frustrating. The voice recognition program is designed to recognize fluent, continuous speech; for accurate speech recognition, the passages should be read fluently, in well-articulated phrases, and with some fluency. If the person training the software struggles over words and makes frequent reading mistakes, it could interfere with the training process. Training dictation software, therefore, usually requires several one-on-one sessions with an experienced instructor.

INSTRUCTOR INITIALIZES TRAINING
The instructor completes the initial training, and the student can refine it later. It is important that the instructor have vocal characteristics similar to those of the intended user; for example, if the software will be used by a 14-year-old female with a history of social anxiety, it would not be suitable to have an aggressive football coach initiate the training. Training programs require the user to read several longer stories before being allowed to read the shorter ones. In order to get through the longer stories quickly, these may be read by the instructor, with the instructor reciting the passage quietly over the student's shoulder; if the instructor's voice is picked up by the microphone, it can likewise interfere with the training.

PHRASE-BY-PHRASE
Having completed one or two longer stories, the training program should then make available several shorter readings; at this point, the student should dictate these into the microphone. For students with a low reading level, the instructor should pronounce complete phrases for the student to dictate.

ADVANTAGES
• Allows the user to operate a computer by speaking to it
• Frees up working space
• Eliminates handwriting and spelling problems
• Always spells correctly
• Reads back to the user what they have written, which can then be edited, helping with proofreading
• Users can write papers without being held back by spelling or typing problems
• Students can produce a large amount of writing
• Users can write as quickly as they speak, 100+ words per minute
• There is no need to use a keyboard to input information
• It recognizes 80-95% of words correctly

DISADVANTAGES
• It does not always recognize words correctly; it recognizes 5-20% of words incorrectly
• The technology is not affordable for the average user
• Requires large amounts of memory to store voice files
• Difficult to use in noisy environments
• The software has to be trained to recognize each user's voice, which is hard for poor decoders; if the user has non-standard speech, mumbles, or tends to run words together, the training process may be long
• Users have to speak distinctly in order for the software to work well
• If users talk fast, what they produce will probably be disorganized and grammatically incorrect
• Some punctuation must also be dictated
• Users have to edit the material at the final stage
• The software has specific hardware requirements
• Assists with one stage of the writing process; it is not a solution to the writing problem
• Making errors can be frustrating without adequate support

FUTURE OF VOICE RECOGNITION
• SUI - Speech-based User Interface
• Robotics

IMPROVEMENTS NEEDED
• Greater accuracy
• Greater system control/command
• More compatible software
