By-NITIN RAWAT (0909533046)

Submitted to the Department of Electronics and Telecommunication in partial fulfillment of the requirements for the degree of Bachelor of Technology in Electronics and Telecommunication

Mahatma Gandhi Mission's College of Engineering & Technology, Sector 62, Noida (U.P.)
G.B. Technical University, Lucknow
February 2012


This is to certify that the project report entitled "SPEECH RECOGNITION SYSTEM", which is submitted by NITIN RAWAT in partial fulfillment of the requirement for the award of the degree of B.Tech. in the Department of Electronics & Telecommunication, is a record of the candidate's own work carried out by him under my supervision. The matter embodied in this thesis is original and has not been submitted for the award of any other degree.

Date: 29-02-12

Supervisor Mrs. Rakhi Jain


I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or other institute of higher learning, except where due acknowledgement has been made in the text.



Roll No.- 0909533046

Date- 29-02-2012


It gives us a great sense of pleasure to present the report of the B.Tech seminar undertaken during B.Tech third year. We owe a special debt of gratitude to Mrs. Rakhi Jain, Department of Electronics & Telecommunication, MGM CoET, Noida, for her constant support and guidance throughout the course of our work. Her sincerity, thoroughness and perseverance have been a constant source of inspiration for us. It is only through her cognizant efforts that our endeavors have seen the light of day.

We also take the opportunity to acknowledge the contribution of Professor Vamshi Krishna, Head of the Department of Electronics & Telecommunication, MGMCoET, Noida for his full support and assistance during the development of the project.

We would also like to acknowledge the contribution of all faculty members of the department for their kind assistance and cooperation during the development of our project. Last but not least, we acknowledge our friends for their contribution to the completion of the project.

List of Figures

1.1  Speech or voice as an analog signal
1.2  Structure of a standard speech recognition system
1.3  Raw speech and its 16-bit coded waveform at 100 fps
1.4  Acoustic models: template and state representations for the word "cat"
2.1  A speech waveform and its FFT (fast Fourier transform)
3.1  Pin configuration of IC HM2007
3.2  Programming board for HM2007
3.3  Functional diagram of the speech recognition circuit
3.4  Internal processes in HM2007
6.1  Speech recognition home automation

Table of Contents

Chapter 1  Introduction
    1.1  Fundamentals of Speech Recognition
Chapter 2  Theory
    2.1  How does the ASR technology work?
Chapter 3  Materials and methods
    3.1  The seven major steps of a speech recognition system
    3.2  Speech recognition through IC HM2007
        3.2.1  About the IC
        3.2.2  Speech acquisition
        3.2.3  Speech preprocessing
        3.2.4  Training the IC
        3.2.5  Functional diagram of IC HM2007
        3.2.6  Why IC HM2007
Chapter 4  Types of speech recognition systems
    4.1  Advantages of a speaker-independent system
    4.2  Advantages of a speaker-dependent system
Chapter 5  Persisting problems
Chapter 6  Pros and cons
Chapter 7  Conclusion
References

Speech is the vocalized form of human communication. It is based upon the syntactic combination of lexical items and names drawn from very large vocabularies. Each spoken word is created out of the phonetic combination of a limited set of vowel and consonant speech sound units. These vocabularies, the syntax which structures them, and their sets of speech sound units differ, giving rise to many thousands of mutually unintelligible human languages. Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format or digital codes. Rudimentary speech recognition software has a limited vocabulary of words and phrases and may only identify these if they are spoken very clearly. The system described here is a completely assembled, easy-to-use programmable speech recognition circuit: programmable in the sense that you train it on the words (or vocal utterances) you want it to recognize. This allows you to experiment with many facets of speech recognition technology. It has an 8-bit data output which can be interfaced with any microcontroller for further development. Some of the interfacing applications which can be built are controlling home appliances, robotic movements, speech-assisted technologies, speech-to-text translation, and many more.


Speech is a natural mode of communication for people. We learn all the relevant skills during early childhood, without instruction, and we continue to rely on speech communication throughout our lives. It comes so naturally to us that we don't realize how complex a phenomenon speech is. The human vocal tract and articulators are biological organs with nonlinear properties, whose operation is not only under conscious control but also affected by factors ranging from gender to upbringing to emotional state. As a result, vocalizations can vary widely in terms of their accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed; moreover, during transmission, our irregular speech patterns can be further distorted by background noise and echoes, as well as electrical characteristics (if telephones or other electronic equipment are used). All these sources of variability make speech recognition, even more than speech generation, a very complex problem. The first developments in speech recognition predate the invention of the modern computer by more than 50 years. Alexander Graham Bell was inspired to experiment in transmitting speech by his wife, who was deaf. He initially hoped to create a device that would transform audible words into a visible picture that a deaf person could interpret. He did produce spectrographic images of sounds, but his wife was unable to decipher them. That line of research eventually led to his invention of the telephone. Speech recognition is basically the conversion of sound or speech energy into electrical signals, which are further sampled into digital signals.


Figure 1.1 - Speech or voice as an analog signal.

The speech recognition technology would be particularly welcomed by automated telephone exchange operators, doctors, lawyers, and others who seek freedom from tiresome conventional computer operation using the keyboard and mouse. It is suitable for applications in which computers are used to provide routine information and services. ASR's direct speech-to-text dictation offers a significant advantage over traditional transcription; with further refinement of the technology, typing in text may become a thing of the past. ASR offers a solution to this fatigue-causing procedure by converting speech into text. The ASR technology is presently capable of achieving recognition accuracies of 95% - 98%, but only under ideal conditions. The technology is still far from perfect in the uncontrolled real world. The roots of this technology can be traced to 1968, when the term Information Technology hadn't even been coined and Americans had only begun to realize the vast potential of computers. The Hollywood blockbuster 2001: A Space Odyssey featured a talking, listening computer, HAL 9000, which to date is a cult figure both in science fiction and in the world of computing. Even today almost every speech recognition technologist dreams of designing a HAL-like computer with a clear voice and the ability to understand normal speech. Though the ASR technology is still not as versatile as the imaginary HAL, it can nevertheless be used to make life easier. New application-specific standard products, interactive error-recovery techniques, and better voice-activated user interfaces allow the handicapped, the computer-illiterate, and rotary-dial phone owners to talk to computers. ASR, by offering a natural human interface to computers, finds applications in telephone call centers (such as airline flight information systems), learning devices, toys, etc.

1.1 Fundamentals of Speech Recognition
Speech recognition is a multileveled pattern recognition task, in which acoustical signals are examined and structured into a hierarchy of subword units (e.g., phonemes), words, phrases, and sentences. Each level may provide additional temporal constraints, e.g., known word pronunciations or legal word sequences, which can compensate for errors or uncertainties at lower levels. This hierarchy of constraints can best be exploited by combining decisions probabilistically at all lower levels, and making discrete decisions only at the highest level. The structure of a standard speech recognition system is illustrated in Figure 1.2. The elements are as follows:

Figure 1.2 - Structure of a standard speech recognition system.

Raw speech. Speech is typically sampled at a high frequency, e.g., 16 kHz over a microphone or 8 kHz over a telephone. This yields a sequence of amplitude values over time.

Signal analysis. Raw speech should be initially transformed and compressed, in order to simplify subsequent processing. Many signal analysis techniques are available which can extract useful features and compress the data by a factor of ten without losing any important information. Among the most popular:

Fourier analysis (FFT) yields discrete frequencies over time, which can be interpreted visually. Frequencies are often distributed using a Mel scale, which is linear in the low range but logarithmic in the high range, corresponding to physiological characteristics of the human ear.
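The Mel scale mentioned above has a standard closed form. The following is a minimal sketch using the commonly quoted 2595/700 constants (the source text does not specify which Mel variant is meant, so this particular formula is an assumption):

```python
import math

def hz_to_mel(f_hz):
    # Mel-scale mapping: roughly linear below ~1 kHz, logarithmic above,
    # mirroring the frequency resolution of the human ear.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, useful for placing filter-bank center frequencies.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

By construction, 1000 Hz maps to roughly 1000 mel, while higher frequencies are compressed: 8000 Hz maps to only about 2840 mel.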


Perceptual Linear Prediction (PLP) is also physiologically motivated, but yields coefficients that cannot be interpreted visually.


Linear Predictive Coding (LPC) yields coefficients of a linear equation that approximate the recent history of the raw speech values.


Cepstral analysis calculates the inverse Fourier transform of the logarithm of the power spectrum of the signal.
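The cepstral definition above (inverse Fourier transform of the log power spectrum) can be sketched directly. This uses a naive O(N²) DFT for clarity; a real system would use an FFT, and the small epsilon guarding the logarithm is an implementation assumption:

```python
import cmath
import math

def dft(x):
    # Naive discrete Fourier transform (O(N^2)); illustrative only.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_cepstrum(frame):
    # Inverse DFT of the log power spectrum of the frame.
    spectrum = dft(frame)
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in spectrum]  # avoid log(0)
    N = len(frame)
    return [sum(log_power[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N
            for n in range(N)]
```

The result is real-valued because the log power spectrum is symmetric; its low-order coefficients summarize the spectral envelope.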


In practice, it makes little difference which technique is used. Afterwards, procedures such as Linear Discriminant Analysis (LDA) may optionally be applied to further reduce the dimensionality of the representation and to decorrelate the coefficients.

Figure 1.3 - Raw speech and its 16-bit coded waveform at 100 fps

Speech frames. The result of signal analysis is a sequence of speech frames, typically at 10 msec intervals, with about 16 coefficients per frame. These frames may be augmented by their own first and/or second derivatives, providing explicit information about speech dynamics; this typically leads to improved performance. The speech frames are used for acoustic analysis.
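The derivative augmentation described above can be sketched as follows. This is a toy illustration using simple central differences with clamped edges; production systems typically fit a regression over several neighbouring frames instead:

```python
def add_deltas(frames):
    # Augment each frame of coefficients with first-order differences
    # computed from the neighbouring frames (edges are clamped).
    out = []
    for i, frame in enumerate(frames):
        prev = frames[max(i - 1, 0)]
        nxt = frames[min(i + 1, len(frames) - 1)]
        delta = [(n - p) / 2.0 for n, p in zip(nxt, prev)]
        out.append(list(frame) + delta)
    return out
```

Each output frame then carries both static coefficients and explicit information about how they are changing over time.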

Acoustic models. In order to analyze the speech frames for their acoustic content, we need a set of acoustic models. There are many kinds of acoustic models, varying in their representation, granularity, context dependence, and other properties.

Figure 1.4 - Acoustic models: template and state representations for the word "cat".

The figure shows two popular representations for acoustic models. The simplest is a template, which is just a stored sample of the unit of speech to be modeled, e.g., a recording of a word. An unknown word can be recognized by simply comparing it against all known templates and finding the closest match. Templates have two major drawbacks: (1) they cannot model acoustic variabilities, except in a coarse way by assigning multiple templates to each word; and (2) in practice they are limited to whole-word models, because it is hard to record or segment a sample shorter than a word, so templates are useful only in small systems which can afford the luxury of using whole-word models. A more flexible representation, used in larger systems, is based on trained acoustic models, or states. In this approach, every word is modeled by a sequence of trainable states, and each state indicates the sounds that are likely to

be heard in that segment of the word, using a probability distribution over the acoustic space. Probability distributions can be modeled parametrically, by assuming that they have a simple shape (e.g., a Gaussian distribution) and then trying to find the parameters that describe it; or non-parametrically, by representing the distribution directly.
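A classical way to compare an unknown utterance against a stored template, despite differences in speaking speed, is dynamic time warping (DTW). The source does not name DTW explicitly, so this is an illustrative sketch over one-dimensional feature values:

```python
def dtw_distance(a, b):
    # Dynamic time warping: align two sequences of feature values so that
    # stretched or compressed portions of speech still match up.
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch template
                                 cost[i][j - 1],      # stretch input
                                 cost[i - 1][j - 1])  # one-to-one match
    return cost[n][m]
```

Recognition then amounts to computing this distance against every stored template and picking the smallest; real systems use vectors per frame rather than single numbers.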


CHAPTER 2 - Theory

In computer science, speech recognition is the translation of spoken words into text. It is also known as "automatic speech recognition" (ASR), "computer speech recognition", "speech to text", or just "STT". Some speech recognition systems use "training", where an individual speaker reads sections of text into the system. These systems analyze the person's specific voice and use it to fine-tune the recognition of that person's speech, resulting in more accurate transcription. Systems that do not use training are called "speaker-independent" systems; systems that use training are called "speaker-dependent" systems.

2.1 How does the ASR technology work?

When a person speaks, compressed air from the lungs is forced through the vocal tract as a sound wave that varies as per the variations in the lung pressure and the vocal tract. This acoustic wave is interpreted as speech when it falls upon a person's ear. In any machine that records or transmits human voice, the sound wave is converted into an electrical analogue signal using a microphone. When we speak into a telephone receiver, for instance, its microphone converts the acoustic wave into an electrical analogue signal that is transmitted through the telephone network. The electrical signal's strength from the microphone varies in amplitude over time and is referred to as an analogue signal or an analogue waveform. If the signal results from speech, it is known as a speech waveform. Speech waveforms have the characteristic of being continuous in both time and amplitude. A listener's ears and brain receive and process the analogue speech waveforms to figure out the speech. ASR-enabled computers, too, work on the same principle by picking up acoustic cues for speech analysis and synthesis. Because it helps to understand the ASR technology better, let us dwell a little more on the acoustic process of the human articulatory system. In the vocal tract the process begins from the lungs. The variations in air pressure cause vibrations in the folds of skin that constitute the vocal cords. The elongated orifice between the vocal cords is called the glottis. As a result of the vibrations, repeated bursts of compressed air are released into the air as sound waves.

Figure 2.1 - A speech waveform and its FFT (fast Fourier transform)

Articulators in the vocal tract are manipulated by the speaker to produce various effects. The vocal cords can be stiffened or relaxed to modify the rate of vibration, or they can be turned off and the vibration eliminated while still allowing air to pass. The velum acts as a gate between the oral and nasal cavities: it can be closed to isolate, or opened to couple, the two cavities. The tongue, jaw, teeth, and lips can be moved to change the shape of the oral cavity. The nature of the sound pressure wave radiating outward from the lips depends upon these time-varying articulations and upon the absorptive qualities of the vocal tract's materials. The sound pressure wave exists as a continually moving disturbance of air. Particles move closer together as the pressure increases, or further apart as it decreases, each influencing its neighbor in turn as the wave propagates at the speed of sound. The amplitude of the wave at any position distant from the speaker is measured by the density of air molecules and grows weaker as the distance increases. When this wave falls upon the ear it is interpreted as sound with discernible timbre, pitch, and loudness. Air under pressure from the lungs moves through the vocal tract and comes into contact with various obstructions, including the palate, tongue, teeth, and lips. Some of its energy is absorbed by these obstructions; most is reflected. Reflections occur in all directions, so parts of waves bounce around inside the cavities for some time, blending with other waves, dissipating energy, and finally finding their way out through the nostrils or past the lips. Some waves resonate inside the tract according to their frequency and the cavity's shape at that moment, combining with other reflections and reinforcing the wave energy before exiting. Energy in waves of other, non-resonant frequencies is attenuated rather than amplified in its passage through the tract.


CHAPTER 3 - Materials and methods

3.1 A speech recognition system involves seven major steps:
1. Converting sounds into electrical signals: when we speak into a microphone, it converts the sound waves into electrical signals. In any machine that records or transmits human voice, the sound wave is converted into an electrical signal using a microphone. When we speak into a telephone receiver, for instance, its microphone converts the acoustic wave into an electrical analogue signal that is transmitted through the telephone network. The electrical signal's strength from the microphone varies in amplitude over time and is referred to as an analogue signal or an analogue waveform.

2. Background noise removal: the ASR program removes the background noise and retains the words that you have spoken.

3. Breaking up words into phonemes: the words are broken down into individual sounds, known as phonemes, which are the smallest discernible sound units. For each short interval of time, feature values are extracted from the wave; in this way the wave is divided into small parts corresponding to phonemes.

4. Matching and choosing character combinations: this is the most complex phase. The program has a large dictionary of common words in the language. Each phoneme is matched against the stored sounds and converted into the appropriate character group. This is where problems begin: the program checks and compares words that sound similar to what it has heard, and all these similar candidates are collected.

5. Language analysis: the program checks whether the language allows a particular syllable to appear after another.

6. Grammar check: after that, the program tries to find out whether or not the combination of words makes any sense.

7. Finally, the recognized words are output as text. Speech recognition programs come with their own word processors, and some can work with other word-processing packages like MS Word and WordPerfect.
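Step 4 above, matching phoneme sequences against a dictionary, can be illustrated with a toy lexicon and edit distance. The lexicon entries and phoneme symbols here are hypothetical, and real recognizers use probabilistic scoring rather than plain edit distance:

```python
def edit_distance(a, b):
    # Levenshtein distance between two phoneme sequences.
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[n][m]

# Hypothetical miniature lexicon; a real system stores thousands of entries.
LEXICON = {
    "hello": ["h", "eh", "l", "ow"],
    "world": ["w", "er", "l", "d"],
    "yes":   ["y", "eh", "s"],
}

def best_word(phonemes):
    # Pick the dictionary word whose phoneme sequence is closest to the input.
    return min(LEXICON, key=lambda w: edit_distance(LEXICON[w], phonemes))
```

A slightly garbled input such as ["y", "ah", "s"] still resolves to "yes", because that entry needs only one substitution.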


3.2 Speech recognition through IC HM2007

3.2.1 About the IC: The HM2007 is a CMOS voice recognition LSI (Large Scale Integration) circuit. The chip contains an analog front end, voice analysis, recognition, and system control functions. The chip may be used stand-alone or connected to a CPU.

Figure 3.1 - Pin configuration of IC HM2007

3.2.2 Speech Acquisition: During speech acquisition, speech samples are obtained from the speaker in real time and stored in memory for preprocessing. Speech acquisition requires a microphone coupled with an analog-to-digital converter (ADC) that has the proper amplification to receive the voice signal, sample it, and convert it into digital speech. The system sends the analog speech through a transducer, amplifies it, and sends it through an ADC; the received samples are stored in RAM. We can easily implement speech acquisition with the HM2007 IC: the microphone input port, together with the audio codec, receives the signal, amplifies it, and converts it into 8-bit PCM digital samples (the HM2007 itself is clocked by a 3.57 MHz crystal). The HM2007 IC requires initial configuration, or training of words, which is performed using a programming board. In the training process the user trains the IC by speaking words into the microphone and assigning a particular value to each word; for example, the word "hello" can be assigned a value such as 02 or 05. The IC can then be connected to a microcontroller for further functions.
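The 8-bit quantization step performed by the codec can be illustrated with a toy converter. This is a hedged sketch: the real ADC lives in hardware, and its exact transfer function is not given in the text, so the mapping below is an assumption:

```python
def to_pcm8(sample):
    # Quantize a normalized sample in [-1.0, 1.0] to an unsigned 8-bit
    # PCM code in [0, 255], as a simple ADC front end might.
    sample = max(-1.0, min(1.0, sample))  # clip out-of-range input
    return int(round((sample + 1.0) / 2.0 * 255))

def from_pcm8(code):
    # Reverse mapping back to an approximate normalized sample.
    return code / 255.0 * 2.0 - 1.0
```

Round-tripping a sample through this converter loses at most one quantization step of precision, which is the cost of using only 8 bits per sample.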

3.2.3 Speech Preprocessing: The speech signal consists of the uttered digit along with a pause period and background noise. Preprocessing reduces the amount of processing required in later stages. Generally, preprocessing involves taking the speech samples as input, blocking the samples into frames, and returning a unique pattern for each sample, as described in the following steps.
1. The system must identify useful or significant samples from the speech signal. To accomplish this, the system divides the speech samples into overlapped frames.
2. The system checks the frames for voice activity using endpoint detection and energy threshold calculations.
3. The speech samples are passed through a pre-emphasis filter.
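The pre-emphasis filter and the energy-based voice activity check from the steps above can be sketched as follows. The filter coefficient 0.95 and the energy threshold are illustrative assumptions, not values taken from the HM2007:

```python
def pre_emphasis(samples, alpha=0.95):
    # First-order high-pass filter y[n] = x[n] - alpha * x[n-1], which
    # boosts the high-frequency content of the speech before analysis.
    return [samples[0]] + [samples[n] - alpha * samples[n - 1]
                           for n in range(1, len(samples))]

def is_speech_frame(frame, threshold=0.01):
    # Crude energy-based check used for endpoint detection: a frame whose
    # mean squared amplitude exceeds the threshold is treated as voiced.
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold
```

Endpoint detection then amounts to scanning the frame sequence for the first and last frames that pass the energy check.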


Figure 3.2 - Programming board for HM2007

3.2.4 Training the IC: An important part of speech-to-text conversion using pattern recognition is training. Training involves creating a pattern representative of the features of a class using one or more test patterns that correspond to speech sounds of the same class. The resulting pattern (generally called a reference pattern) is an exemplar or template, derived from some type of averaging technique. It can also be a model that characterizes the reference pattern statistics. Our system uses speech samples from three individuals during training. A model commonly used for speech recognition is the hidden Markov model (HMM), a statistical model for describing an unknown system from an observed output sequence. The keypad and digital display are used to communicate with and program the HM2007 chip. The keypad is made up of 12 normally open momentary contact switches. When the circuit is turned on, "00" is shown on the digital display, the red LED (READY) is lit, and the circuit waits for a command.
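The averaging technique mentioned above can be sketched minimally: given several equal-length feature patterns of the same word (for instance, one per training speaker), the reference pattern is their element-wise mean. This is a simple illustration; an HMM-based system would instead estimate per-state statistics:

```python
def average_template(patterns):
    # Element-wise mean of equal-length training patterns; the averaged
    # vector serves as the stored reference pattern for one word.
    count = len(patterns)
    return [sum(p[i] for p in patterns) / count
            for i in range(len(patterns[0]))]
```

With three speakers' patterns for one word, the stored template is simply their average, which smooths out speaker-specific quirks at the cost of some detail.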

Training Words for Recognition: Press "1" on the keypad (the display will show "01" and the LED will turn off), then press the TRAIN key (the LED will turn on) to place the circuit in training mode for word one. Say the target word clearly into the onboard microphone (near the LED). The circuit signals acceptance of the voice input by blinking the LED off and then on. The word (or utterance) is now identified as word "01". If the LED did not flash, start over by pressing "1" and then the TRAIN key. You may continue training new words in the circuit: press "2" then TRAIN to train the second word, and so on. The circuit will accept and recognize up to 20 words (numbers 1 through 20). It is not necessary to train all word spaces; if you only require 10 target words, that is all you need to train.


3.2.5 Functional diagram of IC HM2007

The circuit comprises three main blocks:
- IC HM2007: the voice recognition IC
- 8K x 8 SRAM: RAM for storing the digital data
- Latch: for controlling the 7-segment displays

Figure 3.3 - Functional diagram of the speech recognition circuit.


3.2.6 Why IC HM2007
- Up to 20-word vocabulary, with a duration of two seconds each
- Multi-lingual
- Non-volatile memory backup with a 3 V battery onboard: keeps the speech recognition data in memory even after power-off
- Easily interfaced to control external circuits and appliances
- Self-contained stand-alone speech recognition circuit

The main advantage of the IC HM2007 is that it provides a complete package for speech detection purposes.


The internal processes of the HM2007 include: user-defined words, manual training, speech acquisition, speech processing, and sampling (ADC).

Figure 3.4 - Internal processes in HM2007


CHAPTER 4 - Types of speech recognition systems
In choosing a speech recognition system you should consider the degree of speaker independence it offers. Speaker-independent systems can provide high recognition accuracies for a wide range of users without needing to adapt to each user's voice. Speaker-dependent systems require you to train the system to your voice to attain high accuracy. Speaker-adaptive systems, an intermediate category, are essentially speaker-independent but can adapt their templates to each user to improve accuracy.

4.1 ADVANTAGES OF A SPEAKER-INDEPENDENT SYSTEM
The advantage of a speaker-independent system is obvious: anyone can use the system without first training it. However, its drawbacks are not so obvious. One limitation is the work that goes into creating the vocabulary templates. To create reliable speaker-independent templates, someone must collect and process numerous speech samples. This is a time-consuming task, and creating these templates is not a one-time effort: speaker-independent templates are language-dependent, and they are sensitive not only to two dissimilar languages but also to the differences between British and American English. Therefore, as part of your design activity, you would need to create a set of templates for each language or major dialect that your customers use. Speaker-independent systems also have a relatively fixed vocabulary, because of the difficulty of creating a new template in the field at the user's site.

4.2 ADVANTAGES OF A SPEAKER-DEPENDENT SYSTEM
A speaker-dependent system requires the user to train the ASR system by providing examples of his or her own speech. Training can be a tedious process, but the system has the advantage of using templates that refer only to the specific user and not to some vague average voice. The result is language independence: you can say ja, si, or ya during training, as long as you are consistent. The drawback is that the speaker-dependent system must do more than simply match incoming speech to the templates. It must also include resources to create those templates.


WHICH IS BETTER: For a given amount of processing power, a speaker-dependent system tends to provide more accurate recognition than a speaker-independent system. This does not mean the speaker-dependent approach is inherently better: the difference in performance stems from the speaker-independent template having to encompass wide speech variations.


CHAPTER 5 - Persisting problems

- Portability: independence of computing platform.
- Adaptability to changing conditions (different microphone, background noise, new speaker, new task domain, even a new language).
- Language modeling: is there a role for linguistics in improving the language models?
- Confidence measures: better methods to evaluate the absolute correctness of hypotheses.
- Out-of-vocabulary (OOV) words: systems must have some method of detecting OOV words and dealing with them in a sensible way.
- Spontaneous speech: disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem.
- Prosody: stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger).
- Accent, dialect, and mixed language: non-native speech is a huge problem, especially where code-switching is commonplace.



CHAPTER 6 - Pros and cons

ADVANTAGES:
Security: With this technology a powerful interface between man and computer is created. Since the voice recognition system responds only to the pre-recorded voices, there is little scope for tampering with data or breaking codes.

Productivity: It decreases manual work, as operations are carried out through voice recognition; paperwork is reduced to a minimum, and the user can feel relaxed regardless of the workload.

Advantage for the handicapped and blind: This technology is a great boon for blind and handicapped users, as they can utilize voice recognition for their work.

Usability of other languages: Since speech recognition needs only voice, irrespective of the language in which it is delivered, it can be used in any language.

Personal voice macros can be created: Everyday tasks like sending and receiving mail and drafting documents can be done easily, and the speed of many tasks can be increased.


Figure 6.1 - Speech recognition home automation

DRAWBACKS: If the system has to work in noisy environments, background noise may corrupt the original data and lead to misinterpretation. With words that are pronounced similarly (for example, "their" and "there"), this technology faces difficulty in distinguishing between them.



CHAPTER 7 - Conclusion

Voice recognition promises a rosy future and offers a wide variety of services. The next generation of voice recognition technology builds on neural networks, an artificial intelligence technique. These are formed by interconnected nodes which process the input in parallel for fast evaluation. Like human beings, they learn new patterns of speech automatically.


References

1. http://en.wikipedia.org/wiki/Speech_recognition
2. http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speechrecognition.htm
3. www.alldatasheet.com/datasheet-pdf/pdf/129295/.../HM2007.html
4. www.datasheetarchive.com/HM2007-datasheet.html
5. http://www.imagesco.com/articles/hm2007/SpeechRecognitionTutorial01.html
6. http://www.learnartificialneuralnetworks.com/speechrecognition.html
7. http://www.crescendo.com/en/cstudies/wireless-speech-recognition-er.php
8. Report on "Speech recognition technology" by Siddhartha Shety, http://www.scribd.com/doc/48946526/speech-recognition

