One of the most important inventions of the nineteenth century was the telephone. Then, at the midpoint of the twentieth century, the invention of the digital computer amplified the power of our minds, enabled us to think and work more efficiently, and made us more imaginative than we could ever have imagined. Now several new technologies have empowered us to teach computers to talk to us in our native languages and to listen to us when we speak (recognition); haltingly, computers have begun to understand what we say. Having given our computers both oral and aural abilities, we have been able to produce innumerable computer applications that further enhance our productivity. Such capabilities enable us to route phone calls automatically and to obtain and update computer-based information by telephone, using a group of activities collectively referred to as voice processing.
Three primary speech technologies are used in voice processing applications: stored speech, text-to-speech, and speech recognition. Stored speech involves the production of computer speech from an actual human voice that is stored in a computer's memory and used in any of several ways. Speech can also be synthesized from plain text in a process known as text-to-speech, which also enables voice processing applications to read from textual databases. Speech recognition is the process of deriving either a textual transcription or some form of meaning from a spoken input. Speech analysis can be thought of as the part of voice processing that converts human speech to digital forms suitable for transmission or storage by computers. Speech synthesis functions are essentially the inverse of speech analysis: they reconvert speech data from a digital form to one that is similar to the original recording and suitable for playback. Speech analysis processes can also be referred to as digital speech encoding (or simply coding), and speech synthesis can be referred to as speech decoding.
Dept. Of ECE
2. EVOLUTION OF ASR METHODOLOGIES
Speech recognition research has been ongoing for more than 80 years. Over that period there have been at least four generations of approaches, and we forecast a fifth generation that is being formulated based on current research themes. The five generations, and the technology themes associated with each of them, are as follows.

• Generation 1 (1930s to 1950s): Use of ad hoc methods to recognize sounds, or small vocabularies of isolated words.

• Generation 2 (1950s to 1960s): Use of acoustic-phonetic approaches to recognize phonemes, phones, or digit vocabularies.

• Generation 3 (1960s to 1980s): Use of pattern recognition approaches to speech recognition of small to medium-sized vocabularies of isolated and connected word sequences, including use of linear predictive coding (LPC) as the basic method of spectral analysis; use of LPC distance measures for pattern similarity scores; use of dynamic programming methods for time-aligning patterns; use of pattern recognition methods for clustering multiple patterns into consistent reference patterns; and use of vector quantization (VQ) codebook methods for data reduction and reduced computation.

• Generation 4 (1980s to 2010s): Use of hidden Markov model (HMM) statistical methods for modelling speech dynamics and statistics in a continuous speech recognition system; use of forward-backward and segmental K-means training methods; use of Viterbi alignment methods; use of maximum likelihood (ML) and various other performance criteria and methods for optimizing statistical models; introduction of neural network (NN) methods for estimating conditional probability densities; and use of adaptation methods that modify the parameters associated with either the speech signal or the statistical model so as to enhance the compatibility between model and data for increased recognition accuracy.
• Generation 5 (2000s to 2010s): Use of parallel processing methods to increase recognition decision reliability; combinations of HMMs and acoustic-phonetic approaches to detect and correct linguistic irregularities; increased robustness for recognition of speech in noise; and machine learning of optimal combinations of models.
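The dynamic-programming time alignment that defined Generation 3 can be illustrated with a minimal dynamic time warping (DTW) sketch. The one-dimensional "features" and the absolute-difference distance below are simplifications for illustration; real systems of that era compared LPC-derived spectral vectors with LPC distance measures.

```python
# Minimal DTW sketch: the dynamic-programming time-alignment idea
# behind Generation-3 template matching. Frames here are plain numbers;
# real recognizers used spectral feature vectors.

def dtw_distance(ref, test):
    """Cost of the best monotonic alignment between two feature sequences."""
    n, m = len(ref), len(test)
    INF = float("inf")
    # D[i][j] = cost of aligning ref[:i] with test[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(ref[i - 1] - test[j - 1])  # local frame distance
            # allow diagonal (match), vertical, and horizontal steps
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

# A template matches a time-stretched version of itself at zero cost,
# which is exactly why DTW tolerates variations in speaking rate.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
stretched = [1.0, 1.0, 2.0, 2.0, 3.0, 2.0, 1.0]
print(dtw_distance(template, stretched))  # 0.0: perfect warped match
```

In a template-matching recognizer, the unknown utterance would be scored this way against every stored word template, and the lowest-cost template wins.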
3. ISSUES IN SPEECH RECOGNITION
As we examine the progress made in implementing speech recognition and natural language understanding systems over the years, we will see that there are a number of issues that need to be addressed in order to define the operating range of each speech recognition system that is built. These issues include the following:

• Speech unit for recognition: ranging from words down to syllables and finally to phonemes or even phones. Early systems investigated all these types of units with the goal of understanding their robustness to context, speakers, and speaking environments.
• Vocabulary size: ranging from small (on the order of 2–100 words) to medium (on the order of 100–1000 words) to large (anything above 1000 words, up to unlimited vocabularies). Early systems tackled primarily small-vocabulary recognition problems; modern speech recognizers are all large-vocabulary systems.
• Task syntax: ranging from simple tasks with almost no syntax (every word in the vocabulary can follow every other word) to highly complex tasks where the words follow a statistical n-gram language model.
• Task perplexity (the average word branching factor): ranging from low values (for simple tasks) to values on the order of 100 for complex tasks whose perplexity approaches that of a natural language task.
• Speaking mode: ranging from isolated words (or short phrases), to connected word systems (e.g., sequences of digits that form identification codes or telephone numbers), to continuous speech (including both read passages and spontaneous conversational speech).
• Speaker mode: ranging from speaker-trained systems to speaker-adaptive systems to speaker-independent systems, which can be used by anyone without any additional training. Most modern ASR systems are speaker-independent and are utilized in a range of telecommunication applications. However, for dictation purposes, most systems are still largely speaker-dependent and adapt over time to each individual speaker.
• Speaking situation: ranging from human-to-machine dialogues to human-to-human dialogues (e.g., as might be needed for language translation systems).

• Speaking environment: ranging from a quiet room, to noisy places (e.g., offices, airline terminals), and even outdoors (e.g., via the use of cellphones).

• Transducer: ranging from high-quality microphones to telephones (wire line) to cellphones (mobile) to array microphones (which track the speaker location electronically).
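The perplexity figure used above has a direct formula: for a language model assigning probability P(w1…wN) to a test word sequence, perplexity is P(w1…wN) raised to the power −1/N. A minimal sketch, with per-word probabilities invented purely for illustration:

```python
import math

# Perplexity: the "average word branching factor" of a recognition task.
# An unconstrained vocabulary of V equiprobable words has perplexity V.

def perplexity(word_probs):
    """word_probs: per-word model probabilities along a test sequence."""
    n = len(word_probs)
    log_prob = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_prob / n)

# Connected-digit task: ten digits, equally likely, no syntax constraint.
print(perplexity([0.1] * 7))               # ≈ 10: branching factor of the digit task

# A constraining task syntax lowers perplexity even for a larger vocabulary.
print(perplexity([0.5, 0.25, 0.5, 0.25]))  # ≈ 2.83
```

This is why a highly constrained task syntax makes recognition easier: the recognizer effectively chooses among far fewer words at each point.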
4. SPEECH RECOGNITION BASICS
The following definitions are the basics needed for understanding speech recognition technology.

• Utterance: An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the computer. Utterances can be a single word, a few words, a sentence, or even multiple sentences.
• Speaker Dependency: Speaker-dependent systems are designed around a specific speaker. They are generally more accurate for the correct speaker, but much less accurate for other speakers. They assume the speaker will speak in a consistent voice and tempo. Speaker-independent systems are designed for a variety of speakers. Adaptive systems usually start as speaker-independent systems and use training techniques to adapt to the speaker and increase their recognition accuracy.
• Vocabularies: Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the SR system. Generally, smaller vocabularies are easier for a computer to recognize, while larger vocabularies are more difficult. Unlike in normal dictionaries, each entry does not have to be a single word; an entry can be as long as a sentence or two. Small vocabularies can have as few as one or two recognized utterances (e.g. "Wake Up"), while very large vocabularies can have a hundred thousand entries or more.
• Accuracy: The ability of a recognizer can be examined by measuring its accuracy, or how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying when a spoken utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more, though the acceptable accuracy of a system really depends on the application.
• Training: Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it may allow training to take place. An ASR system is trained by having the speaker repeat standard or common phrases while the system adjusts its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy. Training can also be used by speakers who have difficulty speaking or pronouncing certain words. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.
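The accuracy discussed above is conventionally reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the reference length. A minimal sketch using the standard word-level edit distance:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the road to hal", "the load to hal")
print(wer)   # 0.25: one substitution out of four reference words
```

A "98% accurate" system in this framing is one whose WER is about 2% on the test material.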
5. TYPES OF SPEECH RECOGNITION
Speech recognition systems can be separated into several different classes according to the types of utterances they are able to recognize. These classes are based on the fact that one of the difficulties of ASR is determining when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on which mode they are using.

• Isolated Words: Isolated-word recognizers usually require each utterance to have quiet (a lack of audio signal) on both sides of the sample window. This does not mean the system accepts only single words, but it does require a single utterance at a time. Often, these systems have "listen/not-listen" states, in which they require the speaker to wait between utterances (usually doing processing during the pauses). "Isolated utterance" might be a better name for this class.
• Connected Words: Connected-word systems (or, more correctly, connected-utterance systems) are similar to isolated-word systems but allow separate utterances to be run together with a minimal pause between them.
• Continuous Speech: Continuous recognition is the next step. Recognizers with continuous speech capabilities are among the most difficult to create, because they must use special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally while the computer determines the content. Basically, it is computer dictation.
• Spontaneous Speech: There appears to be a variety of definitions for what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.
• Voice Verification/Identification: Some ASR systems have the ability to identify specific users. This document does not cover verification or security systems.
6. SPEECH RECOGNITION
The days when you had to keep staring at the computer screen and frantically hit keys or click the mouse for the computer to respond to your commands may soon be a thing of the past. Today we can stretch out, relax, and tell the computer to do our bidding. Speech recognition is the process of deriving either a textual transcription or some form of meaning from a spoken input; as the conversion of speech to text, it is the inverse of synthesis. The speech recognition task is complex: the computer takes the user's speech and interprets what has been said. This allows the user to control the computer (or certain aspects of it) by voice rather than having to use the mouse and keyboard, or alternatively just to dictate the contents of a document. This has been made possible by ASR (Automatic Speech Recognition) technology. ASR would be particularly welcomed by automated telephone exchange operators, doctors, and others who seek freedom from tiresome conventional computer operation with the keyboard and mouse. It is suitable for applications in which computers are used to provide routine information and services. ASR's direct speech-to-text dictation offers a significant advantage over traditional transcription, and with further refinement of the technology, typing in text may become a thing of the past; ASR offers a solution to this fatigue-causing procedure by converting speech into text. ASR technology is presently capable of achieving recognition accuracies of 95%-98%, but only under ideal conditions; the technology is still far from perfect in the uncontrolled real world. The roots of this technology can be traced to 1968, when the term "information technology" had not even been coined and Americans had only begun to realize the vast potential of computers. The Hollywood blockbuster 2001: A Space Odyssey featured a talking, listening computer, HAL 9000, which to date remains a cult figure both in science fiction and in the world of computing. Even today, almost every speech recognition technologist dreams of designing a HAL-like computer with a clear voice and the ability to understand normal speech. Though ASR technology is still not as versatile as the imagined HAL, it can nevertheless be used to make life easier. New application-specific standard products, interactive error-recovery techniques, and better voice-activated user interfaces allow the handicapped, the computer-illiterate, and rotary-dial phone owners to talk to computers. By offering a natural human interface to computers, ASR finds applications in telephone call centres (such as airline flight information systems), learning devices, toys, etc.
6.1. HOW DOES THE ASR TECHNOLOGY WORK?
When a person speaks, compressed air from the lungs is forced through the vocal tract as a sound wave that varies according to the variations in lung pressure and the shape of the vocal tract. This acoustic wave is interpreted as speech when it falls upon a person's ear. In any machine that records or transmits the human voice, the sound wave is converted into an electrical analog signal using a microphone.
[Figure: a spoken phrase ("THE ROAD TO HAL") enters the computer as an electrical signal, passes through background-noise removal and sound amplification, is broken up into phonemes, undergoes language analysis, and candidate words (e.g. "road" vs. "load") are matched to choose the right character combination.]
Fig.6.1 Flow of speech recognition
When we speak into a telephone receiver, for instance, its microphone converts the acoustic wave into an electrical analog signal that is transmitted through the telephone network. The electrical signal's strength from the microphone varies in amplitude over time and is referred to as an analog signal or an analog waveform. If the signal results from speech, it is known as a speech waveform. Speech waveforms have the characteristic of being continuous in both time and amplitude.
A listener's ears and brain receive and process the analog speech waveforms to figure out the speech. ASR-enabled computers work on the same principle, picking up acoustic cues for speech analysis and synthesis. Because it helps in understanding the ASR technology, let us dwell a little more on the acoustic process of the human articulatory system. In the vocal tract, the process begins at the lungs. Variations in air pressure cause vibrations in the folds of skin that constitute the vocal cords; the elongated orifice between the vocal cords is called the glottis. As a result of the vibrations, repeated bursts of compressed air are released into the air as sound waves. Articulators in the vocal tract are manipulated by the speaker to produce various effects. The vocal cords can be stiffened or relaxed to modify the rate of vibration, or they can be turned off so that air still passes but without vibration. The velum acts as a gate between the oral and nasal cavities: it can be closed to isolate the two cavities or opened to couple them. The tongue, jaw, teeth, and lips can be moved to change the shape of the oral cavity. The nature of the sound pressure wave radiating outward from the lips depends upon these time-varying articulations and upon the absorptive qualities of the vocal tract's materials. The sound pressure wave exists as a continually moving disturbance of air: particles move closer together as the pressure increases and further apart as it decreases, each influencing its neighbour in turn as the wave propagates at the speed of sound. The amplitude of the wave at any position distant from the speaker is measured by the density of air molecules and grows weaker as the distance increases. When this wave falls upon the ear, it is interpreted as sound with discernible timbre, pitch, and loudness.
Air under pressure from the lungs moves through the vocal tract and comes into contact with various obstructions, including the palate, tongue, teeth, and lips. Some of its energy is absorbed by these obstructions; most is reflected. Reflections occur in all directions, so parts of waves bounce around inside the cavities for some time, blending with other waves, dissipating energy, and finally finding their way out through the nostrils or past the lips. Some waves resonate inside the tract according to their frequency and the cavity's shape at that moment, combining with other reflections and reinforcing the wave energy before exiting. Energy in waves of other, non-resonant frequencies is attenuated rather than amplified in its passage through the tract.
6.2. THE SPEECH RECOGNITION PROCESS
When a person speaks, compressed air from the lungs is forced through the vocal tract as a sound wave that varies according to the variations in lung pressure and the shape of the vocal tract. This acoustic wave is interpreted as speech when it falls upon a person's ear. Speech waveforms have the characteristic of being continuous in both time and amplitude.
Dept. Of ECE PESIT
Fig.6.2. Steps in speech recognition
Fig.6.3. Block diagram of steps in speech recognition
Any speech recognition system involves five major steps:
• Converting sounds into electrical signals: When we speak into a microphone, it converts the sound waves into electrical signals. In any machine that records or transmits the human voice, the sound wave is converted into an electrical signal using a microphone. When we speak into a telephone receiver, for instance, its microphone converts the acoustic wave into an electrical analog signal that is transmitted through the telephone network. The electrical signal's strength from the microphone varies in amplitude over time and is referred to as an analog signal or an analog waveform. The analog signal is converted into a digital signal using a sound card.
• Background noise removal: The ASR program removes noise and retains the words that have been spoken.

• Breaking up words into phonemes: The words are broken down into individual sounds, known as phonemes, which are the smallest discernible sound units. The wave is divided into small segments, and for each small interval of time some feature value is computed from the wave.
• Matching and choosing the right character combination: This is the most complex phase. The program has a big dictionary of the common words that exist in the language. Each phoneme is matched against these sounds and converted into the appropriate character group. This is where the problems begin: the program checks and compares words that sound similar to what it has heard, and all these similar words are collected.
• Language analysis: Here the program checks whether the language allows a particular syllable to appear after another. After that, there is a grammar check: it tries to find out whether or not the combination of words makes any sense.
Finally, the recognized words are assembled into text. Speech recognition programs come with their own word processor, and some can work with other word-processing packages like MS Word and WordPerfect.
6.3. VARIATIONS IN SPEECH
The speech recognition process is complicated because the production of phonemes and the transitions between them vary from person to person, and even within the same person. Different people speak differently: accents, regional dialects, sex, age, speech impediments, emotional state, and other factors cause people to pronounce the same word in different ways. Phonemes are added, omitted, or substituted; for example, the word America is pronounced in parts of New England as Amrica. The rate of speech also varies from person to person, depending upon a person's habits and regional background. Even a word or phrase spoken by the same individual differs from moment to moment: illness, tiredness, stress, or other conditions cause subtle variations in the way a word is spoken at different times. Also, voice quality varies with the position of the person relative to the microphone, the acoustic nature of the surroundings, and the quality of the recording devices. The resulting changes in the waveform can drastically affect the performance of the recognizer.
6.4. VOCABULARIES FOR COMPUTERS
Each ASR system has an active vocabulary (a set of words from which the recognition engine tries to make sense of an utterance) and a total vocabulary size (the total number of words in all possible sets that can be culled from memory). The vocabulary size and the system's recognition latency (the allowable time to accurately recognize an utterance) determine the processing horsepower required of the recognition engine. A typical active vocabulary set comprises approximately fourteen words plus "none of the above", which the recognizer chooses when none of the fourteen words is a good match. The recognition latency when using a 4-MIPS processor is about 0.5 seconds for a speaker-independent set. Processing power requirements increase dramatically for large-vocabulary recognition sets with thousands of words; real-time latencies with a vocabulary base of a few thousand words are possible only with Pentium-class processors. A small active vocabulary limits a system's search range, providing advantages in latency and search time, while a large total vocabulary enables a more versatile human interface but increases system memory requirements. A system with a small active vocabulary for each prompt usually provides faster, more accurate results. Similar-sounding words in a vocabulary set cause recognition errors, whereas a unique sound for each word enhances the recognition engine's accuracy.
6.5. WHICH SYSTEM TO CHOOSE
In choosing a speech recognition system, you should consider the degree of speaker independence it offers. Speaker-independent systems can provide high recognition accuracy for a wide range of users without needing to adapt to each user's voice. Speaker-dependent systems require that you train the system to your voice to attain high accuracy. Speaker-adaptive systems, an intermediate category, are essentially speaker-independent but can adapt their templates to each user to improve accuracy.
ADVANTAGES OF A SPEAKER-INDEPENDENT SYSTEM
The advantage of a speaker-independent system is obvious: anyone can use the system without first training it. However, its drawbacks are not so obvious. One limitation is the work that goes into creating the vocabulary templates. To create reliable speaker-independent templates, someone must collect and process numerous speech samples. This is a time-consuming task, and creating these templates is not a one-time effort. Speaker-independent templates are language-dependent, and the templates are sensitive not only to dissimilar languages but also to the differences between British and American English. Therefore, as part of your design activity, you would need to create a set of templates for each language or major dialect that your customers use. Speaker-independent systems also have a relatively fixed vocabulary because of the difficulty of creating a new template in the field at the user's site.
ADVANTAGES OF A SPEAKER-DEPENDENT SYSTEM
A speaker-dependent system requires the user to train the ASR system by providing examples of his or her own speech. Training can be a tedious process, but the system has the advantage of using templates that refer only to the specific user and not to some vague average voice. The result is language independence: you can say ja, si, or ya during training, as long as you are consistent. The drawback is that the speaker-dependent system must do more than simply match incoming speech to the templates; it must also include the resources to create those templates.
WHICH IS BETTER?
For a given amount of processing power, a speaker-dependent system tends to provide more accurate recognition than a speaker-independent system. This does not mean the speaker-independent system is inherently inferior; the difference in performance stems from the speaker-independent templates having to encompass wide speech variations.
6.6. TECHNIQUES IN VOGUE:
The most frequently used speech recognition technique involves template matching, in which vocabulary words are characterized in memory as templates: time-based sequences of spectral information taken from waveforms obtained during training. As an alternative to template matching, feature-based designs have been used, in which a time sequence of the pertinent phonetic features is extracted from the speech waveform. Different modelling approaches are used, but models involving state diagrams have been found to give encouraging performance. In particular, hidden Markov models (HMMs) are frequently applied. With HMMs, any speech unit can be modelled, and all knowledge sources can be included in a single, integrated model. Various types of HMMs have been implemented, with differing results: some model each word in the vocabulary, while others model sub-word speech units.
7. HIDDEN MARKOV MODEL
A hidden Markov model can be used to model an unknown process that produces a sequence of observable outputs at discrete intervals, where the outputs are members of some finite alphabet. It might be helpful to think of the unknown process as a black box about whose workings nothing is known except that, at each interval, it issues one member chosen from the alphabet. These models are called "hidden" Markov models precisely because the state sequence that produced the observable output is not known; it is "hidden". HMMs have been found to be especially apt for modelling speech processes.
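The "black box" view above can be made concrete with a tiny discrete HMM and the standard forward algorithm, which computes the probability that the model produced an observed output sequence. All the probabilities below are invented purely for illustration:

```python
# Toy discrete HMM: given initial, transition, and emission
# probabilities, the forward algorithm scores how likely the model is
# to have produced an observed symbol sequence, summed over all the
# hidden state paths that could have generated it.

def forward_probability(obs, init, trans, emit):
    """P(observation sequence | model)."""
    n_states = len(init)
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
                 for s in range(n_states)]
    return sum(alpha)

# Two hidden states, two output symbols (0 and 1).
init  = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit  = [[0.9, 0.1],   # state 0 mostly emits symbol 0
         [0.2, 0.8]]   # state 1 mostly emits symbol 1

print(forward_probability([0, 1, 1], init, trans, emit))
```

The state sequence never appears in the result; only its summed effect on the output probability does, which is exactly what "hidden" means here.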
CHOICE OF SPEECH UNITS
The amount of storage required and the amount of processing time for recognition are functions of the number of units in the inventory, so selection of the unit will have a significant impact. Another consideration concerns the ease with which adequate training can be provided. A further important consideration in selecting a speech unit concerns the ability to model contextual differences.
MODELING SPEECH UNITS WITH HIDDEN MARKOV MODELS
Suppose we want to design a word-based, isolated-word recognizer using discrete hidden Markov models. Each word in the vocabulary is represented by an individual HMM, each with the same number of states. A word can be modelled as a sequence of syllables, phonemes, or other speech sounds that have a temporal interpretation, and can best be modelled with a left-to-right HMM whose states represent the speech sounds. Assume the longest word in the vocabulary can be represented by a 10-state HMM, so we use a 10-state HMM like that of the figure below for each word, and let us assume the states in the HMM represent phonemes. The dotted lines in the figure are null transitions, so any state can be omitted and some words can be modelled with fewer states. The duration of a phoneme is accommodated by a state transition returning to the same state: at each clock time, a state may return to itself, and may do so at as many clock times as required to correctly model the duration of that phoneme in the word. Except for the beginning and end states, which represent transitions into and out of the word, each state in the word model has a self-transition. Assume, in our example, that the input speech waveform is coded into a string of spectral vectors, one occurring every 10 milliseconds, and that vector quantization further transforms each spectral vector into a single value that indexes a representative vector in the codebook. Each word in the vocabulary is trained through a number of repetitions by one or more talkers. As each word is trained, the transition and output probabilities of its HMM are adjusted to merge the latest word repetition into the model. During training, the codebook is iterated with the objective of deriving one that is optimum for the defined vocabulary. When an unknown spoken word is to be recognized, it is transformed into a string of codebook indices.
That string is then treated as an HMM observation sequence by the recognizer, which calculates, for each word model in the vocabulary, the probability of that HMM having generated the observations. The word corresponding to the word model with the highest probability is selected as the one recognized.
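The decision rule just described (score the codebook-index string against every word model and pick the highest probability) can be sketched with two invented two-state left-to-right models over a two-symbol codebook; a real recognizer would use trained 10-state models and far larger codebooks:

```python
# Isolated-word recognition sketch: each vocabulary word has its own
# HMM; the unknown utterance (a string of codebook indices) is scored
# against every model and the best-scoring word wins. The two tiny
# word models below are invented for illustration.

def forward_score(obs, init, trans, emit):
    """P(obs | HMM): the quantity the recognizer compares across words."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)

word_models = {
    # word: (initial, left-to-right transitions, emissions over 2 symbols)
    "yes": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "no":  ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}

def recognize(obs):
    scores = {w: forward_score(obs, *m) for w, m in word_models.items()}
    return max(scores, key=scores.get)

print(recognize([0, 0, 1, 1]))   # "yes": its model favours 0s early, 1s late
```

The left-to-right structure (no transitions back to earlier states, self-loops for duration) mirrors the word models described in the text above.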
ACOUSTIC/PHONETIC EXAMPLE USING HIDDEN MARKOV MODEL
Every speech recognition system has its own architecture. Even those that are based on HMMs have their individual designs, but all share some basic concepts and features, many of which are recognizable even though the names are often different. A representative block diagram is given below. The input to the recognizer arrives from the left in the form of a speech waveform, and an output word or sequence of words emanates from the recognizer to the right. It incorporates:

(A) SPECTRAL CODING: The purpose of spectral coding is to transform the speech signal into a digital spectral form embodying speech features that facilitate subsequent recognition tasks. This function is sometimes called spectrum analysis, acoustic parameterization, etc. Recognizers can work with time-domain coding, but spectrally coded parameters in the frequency domain have advantages and are widely used; hence the title "spectral coding".
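A minimal illustration of the time-to-frequency step that spectral coding performs: take a short frame of the waveform and compute a DFT magnitude spectrum. Real front ends add pre-emphasis, windowing, mel filterbanks, and the like; this sketch only locates the dominant frequency of a pure tone.

```python
import cmath, math

# Spectral coding sketch: transform one short frame of a waveform into
# a magnitude spectrum with a plain DFT (O(n^2), fine for a toy frame).

def dft_magnitudes(frame):
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]   # keep the non-redundant half

# One 8 ms frame of a 1 kHz tone sampled at 8 kHz (64 samples).
fs, f0, n = 8000, 1000, 64
frame = [math.sin(2 * math.pi * f0 * t / fs) for t in range(n)]

spectrum = dft_magnitudes(frame)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
print(peak_bin * fs / n)   # 1000.0 Hz: the tone's frequency
```

A stream of such per-frame spectra (or parameters derived from them) is what the unit matching module consumes next.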
Fig.7.1. A hidden Markov model recognizer
(B) UNIT MATCHING: The objective of unit matching is to transcribe the output data stream from the spectral coding module into a sequence of speech units. The function of this module is also referred to as feature analysis, phonetic decoding, phonetic segmentation, phonetic processing, feature extraction, etc.

(C) LEXICAL DECODING: The function of this module is to match strings of speech units in the unit matching module's output stream with words from the recognizer's lexicon. It outputs candidate words, usually in the form of a word lattice containing sets of alternative word choices.

(D) SYNTACTIC, SEMANTIC, AND OTHER ANALYSES: The analyses that follow lexical decoding all have the purpose of pruning the worst candidates passed along from the lexical decoding module until optimal word selections can be made. Various means, and various sources of intelligence, can be applied to this end. Acoustic information (stress, intonation, change of amplitude or pitch, relative location of formants, etc.) obtained from the waveform can be employed, but sources of intelligence from outside the waveform are also available, including syntactic, semantic, and pragmatic information.
The specific use of speech recognition technology will depend on the application. Some target applications that are good candidates for integrating speech recognition include:

• Games and Edutainment: Speech recognition offers game and edutainment developers the potential to bring their applications to a new level of play. With games, for example, traditional computer-based characters could evolve into characters that the user can actually talk to. While speech recognition enhances the realism and fun in many computer games, it also provides a useful alternative to keyboard-based control, and voice commands provide new freedom for the user in any sort of application, from entertainment to office productivity.

• Data Entry: Applications that require users to key paper-based data into the computer (such as database front-ends and spreadsheets) are good candidates for a speech recognition application. Reading data directly to the computer is much easier for most users and can significantly speed up data entry. While speech recognition technology cannot effectively be used to enter names, it can enter numbers or items selected from a small (fewer than 100 items) list. Some recognizers can even handle spelling fairly well. If an application has fields with mutually exclusive data types (for example, one field allows "male" or "female", another is for age, and a third is for city), the speech recognition engine can process the command and automatically determine which field to fill in.

• Document Editing: This is a scenario in which one or both modes of speech recognition could be used to dramatically improve productivity. Dictation would allow users to dictate entire documents without typing.
Command and control would allow users to modify formatting or change views without using the mouse or keyboard. For example, a word processor might provide commands like "bold", "italic", "change to Times New Roman font", "use bullet list text style", and "use 18 point type". A paint package might have "select eraser" or "choose a wider brush".

• Command and Control
ASR systems that are designed to perform functions and actions on the system are defined as command and control systems. Utterances like "Open Netscape" and "Start a new xterm" will do just that.

• Telephony
Some PBX/voice mail systems allow callers to speak commands instead of pressing buttons to send specific tones.

• Wearable Devices
Because inputs are limited for wearable devices, speaking is a natural possibility.

• Medical/Disabilities
Many people have difficulty typing due to physical limitations such as repetitive strain injury (RSI), muscular dystrophy, and many others. For example, people with difficulty hearing could use a system connected to their telephone to convert the caller's speech to text.

• Embedded Applications
Some newer cellular phones include command-and-control speech recognition that allows utterances such as "Call Home". This could be a major factor in the future of ASR and Linux.

Why can't I talk to my television yet?
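Before turning to that question, the data-entry idea above - that mutually exclusive field types let a recogniser decide for itself which field a spoken token belongs to - can be sketched as follows. The field names and the city list are illustrative assumptions, and a plain string stands in for the recogniser's output.

```python
# Toy field router: because the three fields accept disjoint values,
# a recognised token by itself identifies the field it fills.
# Field names and the city list are invented for this example.

CITIES = {"bangalore", "mumbai", "delhi"}

def route_token(token):
    """Return (field, value) for a recognised token, or (None, token)."""
    t = token.strip().lower()
    if t in {"male", "female"}:
        return ("sex", t)
    if t.isdigit() and 0 < int(t) < 130:      # plausible age range
        return ("age", int(t))
    if t in CITIES:
        return ("city", t)
    return (None, token)                      # ambiguous: ask user to repeat

form = dict(route_token(t) for t in ["female", "34", "Bangalore"])
print(form)  # → {'sex': 'female', 'age': 34, 'city': 'bangalore'}
```

In a real system this disambiguation would be expressed as a recognition grammar handed to the engine, so that only valid field values are even considered as hypotheses.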
Each of the speech technologies of recognition and synthesis has its limitations. The constraints on speech recognition systems centre on the idea of variability. Overcoming the tendency for ASR systems to assign completely different labels to speech signals which a human being would judge to be variants of the same signal has been a major stumbling block in developing the technology. The task has been viewed as one of de-sensitising recognisers to variability, though it is not entirely clear that this idea adequately models the parallel process in human speech perception. Human beings are extremely good at spotting similarities between input signals - whether they are speech signals or some other kind of sensory input, like visual signals. The human being is essentially a pattern-seeking device, attempting all the while to spot identity rather than difference.
By contrast, traditional computer programming techniques make it relatively easy to spot differences, but surprisingly difficult to spot similarity, even when the variability is only slight. Much effort is currently being devoted to developing techniques which can reverse this situation and turn the computer into an efficient pattern-spotting device.
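One classical technique for pattern spotting in speech is dynamic time warping (DTW), which scores two sequences as similar even when one is stretched or compressed in time - exactly the kind of variability that defeats exact, difference-oriented comparison. The sketch below uses toy numbers in place of real speech features.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    n, m = len(a), len(b)
    inf = math.inf
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch a in time
                                 d[i][j - 1],      # stretch b in time
                                 d[i - 1][j - 1])  # step both together
    return d[n][m]

fast = [1.0, 3.0, 5.0, 3.0, 1.0]
slow = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 3.0, 1.0]  # same shape, spoken slowly
other = [5.0, 1.0, 5.0, 1.0, 5.0]                # genuinely different pattern
print(dtw_distance(fast, slow) < dtw_distance(fast, other))  # → True
```

A naive point-by-point difference would call `fast` and `slow` dissimilar because they have different lengths; the warping path lets the computer recognise them as variants of the same pattern, which is the re-orientation the text describes.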
The uses of speech technology are wide-ranging. Most effort at the moment centers on trying to provide voice input and output for information systems - say, over the telephone network. A relatively new refinement here is the provision of speech systems for accessing distributed information of the kind presented on the Internet. The idea is to make this information available to people who do not have, or do not want to have, access to screens and keyboards. Essentially, researchers are trying to harness the more natural use of speech as a means of direct access to systems which are more normally associated with the technological paraphernalia of computers. Clearly, a major use of the technology is to assist people who are disadvantaged in one way or another with respect to producing or perceiving normal speech. The eavesdropping potential referred to here is not sinister. It simply means the provision of, say, a speech recognition system for providing input to a computer when the speaker's hands are engaged on some other task and cannot manipulate a keyboard - for example, a surgeon giving a running commentary on what he or she is doing, or a car mechanic on his or her back underneath a vehicle interrogating a stores computer about the availability of a particular spare part.
Speech recognition is a truly amazing human capacity, especially when you consider that normal conversation requires the recognition of 10 to 15 phonemes per second. It should be little surprise, then, that attempts to build machine (computer) recognition systems have proven difficult. Despite these problems, a variety of systems are becoming available that achieve some success, usually by addressing one or two particular aspects of speech recognition. A variety of speech synthesis systems, on the other hand, have been available for some time now. Though limited in capabilities and generally lacking the "natural" quality of human speech, these systems are now a common component in our lives.