
VOICE CONTROLLED ROBOT

(Submitted in partial fulfillment for the award of the Bachelor of Electronics Engineering Degree by the University of Mumbai)
By
Pratik Chopra
Harshad Dange

Under the guidance of

Mr. Shirish S. Halbe


(Asst. Professor & Hobby Centre Co-ordinator)

Department of Electronics Engineering,


K. J. Somaiya College of Engineering,
Vidyavihar, Mumbai - 400077.
2006 - 2007.


K. J. SOMAIYA COLLEGE OF ENGINEERING,
VIDYAVIHAR, MUMBAI - 400077.
DEPARTMENT OF ELECTRONICS

Certificate
This is to certify that Mr. Pratik Chopra of the Electronics Department, bearing the University seat number 8139, has completed the B.E. project on Voice Controlled Robot, which has been accepted and examined in partial fulfillment of the Bachelor of Electronics Engineering Degree by the University of Mumbai.

Prof. Shirish S. Halbe
GUIDE

Prof. Milind Marathe
H. O. D.

Dr. P. P. Parikh
Director / Principal

_______________
Examiner

________________
Date of Examination


ACKNOWLEDGEMENT

We take this opportunity to express our deepest gratitude towards Mr. S. S. Halbe, our project guide, who has been the driving force behind this project and whose guidance and co-operation have been a source of inspiration for us.
We would also like to thank Prof. Samir Mhatre for his valuable support whenever needed.
We are thankful to our professors, our colleagues, and the authors of the various publications we have referred to. We express our sincere appreciation and thanks to all those who have guided us, directly or indirectly, in our project. Much-needed moral support and encouragement was also provided on numerous occasions by our whole division.
Finally, we thank our parents for their immense support.

Contents

1. Introduction--------------------------------------------------------------------5
2. The Task------------------------------------------------------------------------7
3. Speech Recognition Types/Styles-------------------------------------------------9
4. Approaches to Statistical Speech Recognition-----------------------------------11
5. Nature of Problem--------------------------------------------------------------13
6. Solution to Problems-----------------------------------------------------------16
7. Design Approach----------------------------------------------------------------18
   a. Speech Recognition Module---------------------------------------------------19
   b. Microcontroller and Decoder Circuit-----------------------------------------28
   c. RF Module-------------------------------------------------------------------33
   d. Driver Circuit--------------------------------------------------------------35
   e. Buffer----------------------------------------------------------------------35
   f. Batteries-------------------------------------------------------------------35
8. Training and Recognition-------------------------------------------------------36
9. Applications-------------------------------------------------------------------37
10. Components Used---------------------------------------------------------------38
11. Datasheet - HM2007------------------------------------------------------------39
12. Project Progress Report Summary-----------------------------------------------46
13. Bibliography------------------------------------------------------------------47


Chapter 1. INTRODUCTION

When we say voice control, the first term to be considered is speech recognition, i.e. making the system understand the human voice. Speech recognition is a technology by which the system understands the words (not their meaning) given through speech.

Speech is an ideal method for robotic control and communication. The speech-recognition circuit we outline here functions independently of the robot's main intelligence [central processing unit (CPU)]. This is a good thing, because it doesn't take any of the robot's main CPU processing power for word recognition. The CPU must merely poll the speech circuit's recognition lines occasionally to check whether a command has been issued to the robot. We can even improve upon this by connecting the recognition line to one of the robot's CPU interrupt lines. By doing this, a recognized word would cause an interrupt, letting the CPU know a recognized word had been spoken. The advantage of using an interrupt is that polling the circuit's recognition line would no longer be necessary, further reducing CPU overhead.
Another advantage of this stand-alone speech-recognition circuit (SRC) is its programmability. You can program and train the SRC to recognize the unique words you want recognized, and the SRC can be easily interfaced to the robot's CPU.
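Since the report later drives the decoding with an 89S51, the two schemes above can be made concrete with a short sketch. The following is a minimal illustration in Keil-style C for an 8051, not the actual project firmware: the assumption that the SRC recognition line is wired to INT0 (P3.2) and its data lines to port 0 is ours.

    #include <reg51.h>

    volatile unsigned char last_word;     /* last recognized word code   */
    volatile bit word_ready = 0;          /* set when a word is captured */

    /* ISR fires on the (assumed) recognition line wired to INT0. */
    void src_isr(void) interrupt 0
    {
        last_word = P0;                   /* latch the word code         */
        word_ready = 1;
    }

    void main(void)
    {
        IT0 = 1;                          /* INT0: edge-triggered        */
        EX0 = 1;                          /* enable external interrupt 0 */
        EA  = 1;                          /* global interrupt enable     */
        while (1) {
            if (word_ready) {             /* no busy polling of the SRC  */
                word_ready = 0;
                /* act on last_word here */
            }
        }
    }

With this arrangement the main loop only inspects a flag set by the interrupt, instead of repeatedly reading the recognition line itself.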
Controlling and commanding an appliance (computer, VCR, TV, security system, etc.) by speaking to it makes it easier to use, while increasing the efficiency and effectiveness of working with that device. At its most basic level, speech recognition allows the user to perform parallel tasks (i.e. hands and eyes are busy elsewhere) while continuing to work with the computer or appliance.
Robotics is an evolving technology. There are many approaches to building robots, and no one can be sure which method or technology will be used 100 years from now. Like biological systems, robotics is evolving, following the Darwinian model of "survival of the fittest".
Suppose you want to control a menu-driven system. What is the most striking property that you can think of? The first thought that came to our mind is that the range of inputs in a menu-driven system is limited. In fact, by using a menu all we are doing is limiting the input domain space. This is one characteristic which can be very useful in implementing the menu in stand-alone systems. For example, think of the pine e-mail menu or a washing machine menu. How many distinct commands do they require?

Why build robots?


Robots are indispensable in many manufacturing industries. The reason is that the cost per hour to operate a robot is a fraction of the cost of the human labor needed to perform the same function. More than this, once programmed, robots repeatedly perform functions with a high accuracy that surpasses that of the most experienced human operator. Human operators are, however, far more versatile. Humans can switch job tasks easily, whereas robots are built and programmed to be job specific; you wouldn't be able to program a welding robot to start counting parts in a bin. Today's most advanced industrial robots will soon become dinosaurs. Robots are in the infancy stage of their evolution. As robots evolve, they will become more versatile, emulating the human capacity and ability to switch job tasks easily. While the personal computer has made an indelible mark on society, the personal robot hasn't made an appearance. Obviously there's more to a personal robot than a personal computer. Robots require a combination of elements to be effective: sophistication of intelligence, movement, mobility, navigation, and purpose.
Without risking human life or limb, robots can replace humans in some hazardous duty service. Robots can work in all types of polluted environments, chemical as well as nuclear. They can work in environments so hazardous that an unprotected human would quickly die.


Chapter 2. THE TASK


The purpose of this project is to build a robotic car which can be controlled using voice commands. Generally these kinds of systems are known as Speech Controlled Automation Systems (SCAS); our system will be a prototype of the same.
We are not aiming to build a robot which can recognize a lot of words. Our basic idea is to develop some sort of menu-driven control for our robot, where the menu is voice driven. What we are aiming at is to control the robot using the following voice commands, so that it can perform these basic tasks:

1. move forward
2. move back
3. turn right
4. turn left
5. load
6. release
7. stop (stops doing the current job)


SAMPLE INPUT/OUTPUT

INPUT (Speaker speaks)    OUTPUT (Robot does)
forward                   moves forward
back                      moves back
right                     turns right
left                      turns left
load                      lifts the load
release                   releases the load
stop                      stops doing the current task

(The command words are chosen so that they sound as dissimilar as possible.)


Chapter 3. SPEECH RECOGNITION TYPES AND STYLES

Voice-enabled devices basically use the principle of speech recognition. It is the process of electronically converting a speech waveform (as the realization of a linguistic expression) into words (as a best-decoded sequence of linguistic units).
Converting a speech waveform into a sequence of words involves several essential steps:
1. A microphone picks up the signal of the speech to be recognized and converts it into an electrical signal. A modern speech recognition system also requires that the electrical signal be represented digitally by means of an analog-to-digital (A/D) conversion process, so that it can be processed with a digital computer or a microprocessor.
2. This speech signal is then analyzed (in the analysis block) to produce a representation consisting of salient features of the speech. The most prevalent feature of speech is derived from its short-time spectrum, measured successively over short-time windows of length 20-30 milliseconds overlapping at intervals of 10-20 ms. Each short-time spectrum is transformed into a feature vector, and the temporal sequence of such feature vectors thus forms a speech pattern (see the sketch after this list).
3. The speech pattern is then compared to a store of phoneme patterns or models through a dynamic programming process in order to generate a hypothesis (or a number of hypotheses) of the phonemic unit sequence. (A phoneme is a basic unit of speech and a phoneme model is a succinct representation of the signal that corresponds to a phoneme, usually embedded in an utterance.) A speech signal inherently has substantial variations along many dimensions.
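Step 2 above is easy to picture in code. Here is a minimal sketch in C of the short-time analysis it describes, using an assumed 8 kHz sampling rate, a 20 ms window and a 10 ms hop, and computing just one toy feature (log energy) per frame instead of a full spectral vector:

    #include <math.h>

    #define FS   8000              /* assumed sampling rate, Hz   */
    #define WIN  (FS / 50)         /* 20 ms window = 160 samples  */
    #define HOP  (FS / 100)        /* 10 ms hop    =  80 samples  */

    /* Slice signal x[0..n-1] into overlapping frames and emit one
       log-energy feature per frame; returns the number of frames. */
    int frame_log_energy(const short *x, int n, double *feat, int max_frames)
    {
        int f = 0, t, i;
        for (t = 0; t + WIN <= n && f < max_frames; t += HOP) {
            double e = 1e-9;       /* small floor avoids log(0)   */
            for (i = 0; i < WIN; i++)
                e += (double)x[t + i] * x[t + i];
            feat[f++] = log(e);
        }
        return f;
    }

A real front end would replace the energy computation with a short-time spectrum or cepstral coefficients, but the windowing and hopping pattern is the same.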
Before we examine the design of the project, let us first understand speech recognition types and styles. Speech recognition is classified into two categories: speaker dependent and speaker independent.
Speaker-dependent systems are trained by the individual who will be using the system. These systems are capable of achieving a high command count and better than 95% accuracy for word recognition. The drawback to this approach is that the system responds accurately only to the individual who trained it. This is the most common approach employed in software for personal computers.
A speaker-independent system is trained to respond to a word regardless of who speaks. Therefore the system must respond to a large variety of speech patterns, inflections and enunciations of the target word. The command word count is usually lower than for the speaker-dependent approach; however, high accuracy can still be maintained within processing limits. Industrial requirements more often need speaker-independent voice systems, such as the AT&T system used in telephone systems.
A more general form of voice recognition is available through feature analysis, and this technique usually leads to "speaker-independent" voice recognition. Instead of trying to find an exact or near-exact match between the actual voice input and a previously stored voice template, this method first processes the voice input using Fourier transforms or linear predictive coding (LPC), then attempts to find characteristic similarities between the expected inputs and the actual digitized voice input. These similarities will be present for a wide range of speakers, and so the system need not be trained by each new user. The types of speech differences that the speaker-independent method can deal with, but which pattern matching would fail to handle, include accents and varying speed of delivery, pitch, volume, and inflection. Speaker-independent speech recognition has proven to be very difficult, with some of the greatest hurdles being the variety of accents and inflections used by speakers of different nationalities. Recognition accuracy for speaker-independent systems is somewhat less than for speaker-dependent systems, usually between 90 and 95 percent. Speaker-independent systems have the advantage of not requiring training by each user, but they perform with lower accuracy. These systems find applications in telephony, such as dictating a number or a word where many speakers are involved. However, speaker-independent systems need a well-trained database.

Recognition Style
Speech recognition systems have another constraint concerning the style of speech they can recognize. There are three styles of speech: isolated, connected and continuous.
Isolated speech recognition systems can only handle words that are spoken separately. These are the most common speech recognition systems available today. The user must pause between each word or command spoken. Our speech recognition circuit is set up to identify isolated words of 0.96-second length.
Connected speech is a halfway point between isolated-word and continuous speech recognition, and allows users to speak multiple words. The HM2007 can be set up to identify words or phrases 1.92 seconds in length; this reduces the recognition vocabulary to 20 words.
Continuous speech is the natural conversational speech we are used to in everyday life. It is extremely difficult for a recognizer to sift through the speech, as the words tend to merge together; for instance, "Hi, how are you doing?" sounds like "Hi,.howyadoin". Continuous speech recognition systems are on the market and are under continual development.


Chapter 4. APPROACHES TO STATISTICAL SPEECH RECOGNITION


a. Hidden Markov model (HMM)-based speech recognition
Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). An HMM is a statistical model which outputs a sequence of symbols or quantities.
One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal: one can assume that over a short time in the range of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model over many stochastic processes (known as states).
Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model would output a sequence of n-dimensional real-valued vectors, with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech, de-correlating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal-covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes.
The above is a very brief introduction to some of the more central aspects of speech recognition. Modern speech recognition systems use a host of standard techniques which it would be too time-consuming to properly explain, but just to give a flavor, a typical large-vocabulary continuous system would probably have the following parts. It would need context dependency for the phones (so phones with different left and right context have different realizations); to handle unseen contexts it would need tree clustering of the contexts; it would of course use cepstral normalization to normalize for different recording conditions, and, depending on the length of time that the system had to adapt to different speakers and conditions, it might use cepstral mean and variance normalization for channel differences, vocal tract length normalization (VTLN) for male-female normalization, and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have delta and delta-delta coefficients to capture speech dynamics, and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use LDA, followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as maximum likelihood linear transform, MLLT). A serious company with a large amount of training data would probably want to consider discriminative training techniques like maximum mutual information (MMI), MPE, or (for short utterances) MCE, and if a large amount of speaker-specific enrollment data was available, a more wholesale speaker adaptation could be done using MAP or, at least, tree-based maximum likelihood linear regression. Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, with a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining them statically beforehand (the AT&T approach, for which their FSM toolkit might be useful). Those who value their sanity might consider the AT&T approach, but be warned that it is memory hungry.
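As a concrete anchor for the decoding step just mentioned, here is a compact C sketch of the Viterbi algorithm over a tiny discrete HMM. The 3-state, 4-symbol dimensions and the log-domain arithmetic are illustrative choices of ours, not a fragment of any production recognizer:

    #define NS   3                 /* number of hidden states         */
    #define NO   4                 /* number of observation symbols   */
    #define MAXT 64                /* maximum utterance length        */

    /* logA: log transition probs, logB: log emission probs,
       logpi: log initial probs. Fills path[] with the best state
       sequence for obs[0..T-1] and returns its log-probability. */
    double viterbi(const double logA[NS][NS], const double logB[NS][NO],
                   const double logpi[NS], const int *obs, int T, int *path)
    {
        double d[MAXT][NS];        /* best log-prob ending in state j */
        int    bp[MAXT][NS];       /* backpointers                    */
        int s, j, t;

        for (s = 0; s < NS; s++)
            d[0][s] = logpi[s] + logB[s][obs[0]];

        for (t = 1; t < T; t++)
            for (j = 0; j < NS; j++) {
                double best = -1e300; int arg = 0;
                for (s = 0; s < NS; s++) {
                    double v = d[t - 1][s] + logA[s][j];
                    if (v > best) { best = v; arg = s; }
                }
                d[t][j]  = best + logB[j][obs[t]];
                bp[t][j] = arg;
            }

        /* pick the best final state and backtrack */
        {
            double best = -1e300; int arg = 0;
            for (s = 0; s < NS; s++)
                if (d[T - 1][s] > best) { best = d[T - 1][s]; arg = s; }
            path[T - 1] = arg;
            for (t = T - 1; t > 0; t--)
                path[t - 1] = bp[t][path[t]];
            return best;
        }
    }

Real systems run the same dynamic programming over concatenated phoneme models and fold the language-model scores into the transition terms.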

b. Neural network-based speech recognition
Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used in general-purpose speech recognition applications, they can handle low-quality, noisy data and speaker independence. Such systems can achieve greater accuracy than HMM-based systems, as long as there is training data and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, and generally the results are better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.

c. Dynamic time warping (DTW)-based speech recognition
Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics -- indeed, any data which can be turned into a linear representation can be analyzed with DTW.
A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
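A minimal sketch of DTW in C follows; it aligns two one-dimensional feature sequences and returns the accumulated warped distance (smaller means more similar). The absolute-difference local cost and the MAXLEN bound are our simplifying assumptions:

    #include <math.h>

    #define MAXLEN 128             /* assumed maximum sequence length */

    double dtw(const double *a, int n, const double *b, int m)
    {
        static double D[MAXLEN + 1][MAXLEN + 1];
        int i, j;

        for (i = 0; i <= n; i++)
            for (j = 0; j <= m; j++)
                D[i][j] = 1e300;   /* "infinity"                      */
        D[0][0] = 0.0;

        for (i = 1; i <= n; i++)
            for (j = 1; j <= m; j++) {
                double cost = fabs(a[i - 1] - b[j - 1]);
                double best = D[i - 1][j];                     /* insertion */
                if (D[i][j - 1]     < best) best = D[i][j - 1];     /* deletion */
                if (D[i - 1][j - 1] < best) best = D[i - 1][j - 1]; /* match    */
                D[i][j] = cost + best;
            }
        return D[n][m];            /* total cost of the best warping  */
    }

An isolated-word recognizer in this style simply stores one reference sequence per trained word and reports the word whose DTW distance to the input is smallest.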


Chapter 5. NATURE OF PROBLEM

Speech recognition is the process of finding an interpretation of a spoken utterance; typically, this means finding the sequence of words that were spoken. This involves preprocessing the acoustic signal to parameterize it in a more usable and useful form. The input signal must be matched against a stored pattern, and then a decision must be made to accept or reject the match. No two utterances of the same word or sentence are likely to give rise to the same digital signal. This obvious point not only underlies the difficulty of speech recognition but also means that we must be able to extract more than just a sequence of words from the signal.
The different types of problems we are going to face in our project are enumerated below.

DIFFERENCES IN THE VOICES OF DIFFERENT PEOPLE:
The voice of a man differs from the voice of a woman, which again differs from the voice of a baby. Different speakers have different vocal tracts and source physiology. Electrically speaking, the difference is in frequency: women and babies tend to speak at higher frequencies than men.

DIFFERENCES IN THE LOUDNESS OF SPOKEN WORDS:
No two persons speak with the same loudness. One person will constantly speak in a loud manner, while another will speak in a light tone. Even if the same person speaks the same word at two different instants, there is no guarantee that he will speak it with the same loudness each time. The problem of loudness also depends on the distance the microphone is held from the user's mouth. Electrically speaking, this difference is reflected in the amplitude of the generated digital signal.

DIFFERENCE IN TIME:
Even if the same person speaks the same word at two different instants of time, there is no guarantee that he will speak exactly similarly on both occasions. Electrically speaking, there is a problem of difference in time, i.e. indirectly in frequency.


OSCILLOGRAM (WAVEFORM)
Physically, the speech signal (actually all sound) is a series of pressure changes in the medium between the sound source and the listener. The most common representation of the speech signal is the oscillogram, often called the waveform. Here the horizontal axis is time, running from left to right, and the curve shows how the pressure increases and decreases in the signal. The utterance we have used for demonstration is "phonetician". The signal has also been segmented, such that each phoneme in the transcription has been aligned with its corresponding sound event.

[Figure: segmented waveform of the utterance "phonetician"]

SPECTROGRAM
In the spectrogram the horizontal axis is time and the vertical axis is frequency. The third dimension, amplitude, is represented by shades of darkness. Consider the spectrogram to be a number of spectra in a row, looked upon "from above", where the peaks in the spectra are represented by dark spots in the spectrogram. From the picture it is obvious how different the speech sounds are from a spectral point of view.

Now, let's look at the spectrograms of the vowel /i:/ in "three" and "tea".

[Figure: example of the vowel /i:/ in different phonetic contexts]

PROBLEMS DUE TO NOISE:
A machine will have to face many problems when trying to imitate this human ability. The audio range of frequencies varies from 20 Hz to 20 kHz. Some external noises have frequencies that lie within this audio range; these noises pose a problem since they cannot simply be filtered out.

DIFFERENCES IN THE PROPERTIES OF MICROPHONES:
There may be problems due to differences in the electrical properties of different microphones and transmission channels.

DIFFERENCES IN PITCH:
Pitch and other source features such as breathiness and amplitude can be varied independently.

OTHER PROBLEMS:
We have to make sure that the robot does not go out of reach of our voice.
The output of the microphone is very small.
The output of the voice-recognition chip is not compatible with the input required at the motors.
Chapter 6. SOLUTION TO PROBLEMS

After analyzing the problems, we came up with the solutions listed below.

1. Amplitude variation:
Amplitude variation of the electrical signal output of the microphone may occur mainly due to:
a) variation of the distance between the sound source and the transducer;
b) variation of the strength of the sound generated by the source.
To recognize a spoken word, it does not matter whether it has been spoken loudly or softly, because the characteristic features of a spoken word lie in its frequency content and not in its loudness (amplitude). Thus, at a certain stage this amplitude information is suitably normalized (see the sketch after this list).

2. Recognition of a word:
If the same word is spoken twice at different instants, the two utterances sound similar to us; the question arises, what is the similarity between them? It is important to note that it does not matter whether one of the utterances was of different loudness than the other; the difference lies in frequency. Hence, any large frequency variation would cause the system not to recognize the word. In a speaker-independent system, some logic can be implemented to take care of frequency variation. A small frequency variation, i.e. feature variation within tolerable limits, is considered acceptable.

3. Noise:
Along with the speech from the sound source, stray sounds are also picked up by the microphone, degrading the information contained in the signal.

4. Microphone response:
Two different microphones may not have the same response. Hence if the microphone is changed, or the system is installed on a new PC, the recognition success rate may drop due to the different response.

5. So that our voice is recognized by the robot at a distance, we use a wireless microphone. In case the robot does not recognize any word, we make an arrangement such that the robot automatically stops after some time.

6. We use a microphone pre-amplifier circuit; it is built into the HM2007.

7. We use decoding logic and motor-driving circuits so that the chip and motors are made compatible, thereby solving the compatibility problem.
8. One of the important problems which needed to be solved was to provide sufficient current and voltage to the entire assembly when the stages were interfaced together. Since the current drawn from the supply was so large that a 9V battery could not last very long, we used a current buffer IC; in our application we used the 74LS245.
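The normalization mentioned in solution 1 can be as simple as rescaling every utterance to a common RMS level before features are extracted. A minimal sketch in C, with the target level as an assumed parameter:

    #include <math.h>

    /* Scale x[0..n-1] in place so its RMS equals target_rms. */
    void normalize_rms(double *x, int n, double target_rms)
    {
        double energy = 0.0, gain;
        int i;
        for (i = 0; i < n; i++)
            energy += x[i] * x[i];
        if (energy <= 0.0)
            return;                      /* silence: nothing to scale */
        gain = target_rms / sqrt(energy / n);
        for (i = 0; i < n; i++)
            x[i] *= gain;                /* loudness removed, frequency kept */
    }

After this step, two utterances of the same word differ mainly in their frequency content, which is exactly what the recognizer compares.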

Chapter 7. DESIGN APPROACH

The most challenging part of the entire system is designing and interfacing the various stages together. Our approach was to digitize the analog voice signal, and to store the frequency and pitch characteristics of the words in a memory. These stored words are then used for matching against the words spoken. When a match is found, the system outputs the address of the stored word. We therefore have to decode the address, and according to the address sensed, the car performs the required task. Since we wanted the car to be wireless, we used an RF module: the address is decoded by the microcontroller and then applied to the RF module. This, together with the driver circuit at the receiver's end, makes a complete intelligent system.
It must be noted that we did not use a wireless microphone; instead we used an analog RF module which transmitted five different frequencies, one each for right, left, forward, backward, and crane movement.

SYSTEM DESIGN
a. Voice Recognition Module
b. Microcontroller and Decoder
c. RF module
d. Motor Driver Circuit
e. Buffer

Block Diagram:

[Figure: system block diagram]

Voice Recognition Module


The speech recognition module basically consists of:
Voice Recognition Chip: It is the heart of the entire system. The HM2007 is a voice-recognition chip with on-chip analog front end, voice analysis, recognition processing and system control functions. The input voice command is analyzed, processed and recognized, and the result is presented at the chip's output port, which is then decoded, amplified and given to the motors of the robot car.
We initially used an Indian-manufactured voice recognition chip, the AP7003. It is a monolithic user-dependent speech-recognition IC designed for toy applications, and it consists of a microphone amplifier, an A/D converter, a speech processor and an I/O controller. After pre-recording, the AP7003 can recognize up to 12 different sentences, each up to 1.5 s long, with highly programmable I/O. However, it was not very accurate or reliable, so we started looking for an alternative. We found the HM2007 to be the right choice.

The chip provides the option of recognizing either forty 0.96-second words or twenty 1.92-second words; the circuit allows the user to choose either the 0.96-second word length (40-word vocabulary) or the 1.92-second word length (20-word vocabulary). For memory the circuit uses an 8K x 8 static RAM.
The chip has two operational modes: manual mode and CPU mode. The CPU mode is designed to allow the chip to work under a host computer. This is an attractive approach to speech recognition for computers, because the speech recognition chip operates as a co-processor to the main CPU: the jobs of listening and recognition don't occupy any of the computer's CPU time. When the HM2007 recognizes a command, it can signal an interrupt to the host CPU and then relay the command code. HM2007 chips can also be cascaded to provide a larger word recognition library.
The circuit we are building operates in the manual mode. The manual mode allows one to build a stand-alone speech recognition board that doesn't require a host computer and may be integrated into other devices to utilize speech control.
The major components of this design are: a speech recognition chip, memory, a keypad, and a 7-segment LED display. The chip is designed for speaker-dependent (one-user) applications, but can be manipulated to perform speaker-independent (multiple-user) applications. The keypad and 7-segment LED display are used to program and test the voice recognition circuit.

More about the HM2007 chip
The HM2007 is a single-chip complementary metal-oxide semiconductor (CMOS) voice-recognition large-scale integration (LSI) circuit. The chip contains an analog front end, voice analysis, recognition, and system control functions. The chip may be used stand-alone or connected to a CPU.

Features
Single-chip voice-recognition CMOS LSI
Speaker-dependent
External RAM support
Maximum of 40-word recognition
Maximum word length of 1.92 s
Microphone support
Manual and CPU modes available
Response time less than 300 milliseconds (ms)
5 volt (5V) power supply

The system we are building is typically trained as speaker dependent (single user); thus the user will be its real master.

Microphone: It picks up the analog voice commands and sends them to the voice-recognition chip (HM2007) in the form of an electrical signal.
The human ear has an auditory range of roughly 20 Hz to 20 kHz. Sound can be picked up easily using a microphone and amplifier; microphones typically have an auditory range that surpasses that of human hearing.
Microphones are transducers which detect sound signals and produce an electrical image of the sound, i.e., they produce a voltage or a current which is proportional to the sound signal. The most common microphones for musical use are dynamic, ribbon, or condenser microphones. Besides the variety of basic mechanisms, microphones can be designed with different directional patterns and different impedances.

Dynamic Microphones
Principle: sound moves the cone, and the attached coil of wire moves in the field of a magnet. The generator effect produces a voltage which "images" the sound pressure variation - characterized as a pressure microphone.
Advantages:
- Relatively cheap and rugged.
- Can be easily miniaturized.
Disadvantages:
- The uniformity of response to different frequencies does not match that of ribbon or condenser microphones.

Ribbon Microphones
Principle: the air movement associated with the sound moves the metallic ribbon in the magnetic field, generating an imaging voltage between the ends of the ribbon which is proportional to the velocity of the ribbon - characterized as a "velocity" microphone.
Advantages:
- Adds "warmth" to the tone by accenting lows when close-miked.
- Can be used to discriminate against distant low-frequency noise in its most common gradient form.
Disadvantages:
- Accenting lows sometimes produces "boomy" bass.
- Very susceptible to wind noise; not suitable for outside use unless very well shielded.

Condenser Microphones
Principle: sound pressure changes the spacing between a thin metallic membrane and the stationary back plate. The plates are charged to a total charge Q = CV, where C is the capacitance and V the voltage of the biasing battery. A change in plate spacing will cause a change in charge Q and force a current through resistance R. This current "images" the sound pressure, making this a "pressure" microphone.
Advantages:
- Best overall frequency response makes this the microphone of choice for many recording applications.
Disadvantages:
- Expensive.
- May pop and crack when close-miked.
- Requires a battery or external power supply to bias the plates.

Pop filters in front of mics
Some microphones are very sensitive to minor gusts of wind -- so sensitive, in fact, that they will produce a loud pop if you breathe on them. To protect these mics (some of which can actually be damaged by blowing into them), engineers will often mount a nylon screen between the mic and the artist. This is not the most common reason for using pop filters, though: vocalists like to move around when they sing; in particular, they will lean into microphones. If the singer is very close to the mic, any motion will produce drastic changes in level and sound quality. (You have seen this with inexpert entertainers using hand-held mics.) Many engineers use pop filters to keep the artist at the proper distance. The performer may move slightly in relation to the screen, but that is a small proportion of the distance to the microphone.

Keypad: It is used for training/programming the chip. It also allocates definite memory locations to the voice commands. The keypad is made up of 12 switches.

[Figure 2: the 12-switch keypad]

When the circuit is turned on, the HM2007 checks the static RAM. If everything checks out, the board displays "00" on the digital display and lights the red LED (READY). It is then in the "Ready" state, waiting for a command.

7-segment Display: It is used to test the voice recognition circuit.
The 7-segment display is used as a numerical indicator on many types of test equipment. It is an assembly of light-emitting diodes which can be powered individually, and they most commonly emit red light. Powering all the segments displays the number 8; powering segments a, b, c, d and g displays the number 3. The numbers 0 to 9 can be displayed, and the DP segment represents a decimal point.
The one shown is a common-anode display, since all anodes are joined together and go to the positive supply; the cathodes are connected individually to zero volts. Resistors must be placed in series with each diode to limit the current through each diode to a safe value. Common-cathode displays, where all the cathodes are joined, are also available.

Applications and Drivers
A numeral to be displayed on a seven-segment display is usually encoded in BCD form, and a logic circuit drives the proper segments of the display ON or OFF. This logic circuit is also called a decoder. Various decoders are available to drive common-anode and common-cathode displays; among the easily available ones are the 7447 and 7448 TTL decoders. The 7447 has open-collector outputs designed to drive common-anode displays, while the 7448 drives common-cathode displays; both work through external current-limiting resistors.
We used a 7448 decoder chip driving a common-cathode seven-segment display.

Circuit diagram of the voice recognition module:

[Figure: circuit diagram of the voice-recognition module]

8K x 8 RAM: It stores the voice-command patterns produced by the chip at the assigned memory locations.

Output of the voice recognition module
The 8-bit output is taken from the output of the 74LS373 octal data latch. The output is not a standard 8-bit byte; it is broken into two 4-bit binary-coded decimal (BCD) nibbles. BCD code is related to standard binary numbers as the table below illustrates.

Decimal   Binary      BCD
8         0000 1000   0000 1000
9         0000 1001   0000 1001
10        0000 1010   0001 0000
11        0000 1011   0001 0001
15        0000 1111   0001 0101
16        0001 0000   0001 0110

As you can see, the binary and BCD numbers remain the same until reaching decimal 10. At decimal 10, BCD jumps to the upper nibble and the lower nibble resets to zero. The binary numbers continue to decimal 15, then jump to the upper nibble at 16, where the lower nibble resets. If a computer is expecting to read an 8-bit binary number and BCD is provided, this will be a cause of errors. Further, since the module outputs the numbers 55, 66 and 77 as error codes, and we do not want these outputs to drive the car, we use a microcontroller.
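On the microcontroller side, handling this output amounts to converting the two BCD nibbles to a binary word number and screening out the three error codes. A minimal sketch in C (the function names are ours):

    /* Convert a two-nibble BCD byte (e.g. 0x25 for word 25)
       into its binary value. */
    unsigned char bcd_to_bin(unsigned char bcd)
    {
        return (unsigned char)((bcd >> 4) * 10 + (bcd & 0x0F));
    }

    /* Return the trained word number 1..40, or 0 for the chip's
       error codes: 55 (too long), 66 (too short), 77 (no match). */
    unsigned char screen_word(unsigned char bcd)
    {
        unsigned char w = bcd_to_bin(bcd);
        if (w == 55 || w == 66 || w == 77)
            return 0;
        return w;
    }

This is why a microcontroller sits between the latch and the drive electronics: it can treat 55, 66 and 77 as "do nothing" instead of letting them reach the motors.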

Microcontroller and decoder circuit

Decoder: It is the second most important part of the project. The output from the chip is given to the decoder (microcontroller), which acts as a DMC, i.e. a Digital Motor Controller. The DMC senses the output port of the HM2007 chip and produces the proper output as per the commands forward, backward, left, right, load, release and stop. The proper functionality of the system depends on the proper decoding logic.

We use port 0 as the input port and port 1 as the output port: P0.0 to P0.6 take inputs from the seven output pins of the voice recognition module, while P0.7 is kept grounded.

Microcontroller circuit:

[Figure: microcontroller circuit diagram]

The tables below show the output codes generated for the different commands after programming the microcontroller.

Commands    P0.7  P0.6  P0.5  P0.4  P0.3  P0.2  P0.1  P0.0   Code
Stop         1     1     1     1     1     1     1     1      FF
Right        1     1     1     1     0     1     1     1      F7
Left         1     1     1     0     1     1     1     1      EF
Backward     1     1     0     1     1     1     1     1      DF
Forward      1     0     1     1     1     1     1     1      BF
Crane        0     1     1     1     1     1     1     1      7F

(For the wireless car, this is the input to the RF module and then to the motors through the driver circuit.)

Commands    P0.0  P0.1  P0.2  P0.3  P0.4  P0.5  P0.6  P0.7   Code
Stop         0     0     0     0     0     0     0     0      00
Right        0     0     0     0     0     1     0     0      04
Left         0     0     0     0     0     0     0     1      01
Backward     0     0     0     0     1     0     1     0      0A
Forward      0     0     0     0     0     1     0     1      05
Crane        0     1     1     1     1     1     1     0      FC

(For the wired car, this is the input directly to the driver circuit.)
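The decoding logic itself reduces to a lookup from the recognized word number to one of the output codes in the first table. The following Keil-style C sketch is illustrative only: the mapping from word numbers to commands, and the choice of port 1 as the output port, are our assumptions, while the active-low codes FF/F7/EF/DF/BF/7F come from the table above:

    #include <reg51.h>

    /* Drive the (assumed) port-1 outputs from the trained word number. */
    void drive_command(unsigned char word)
    {
        switch (word) {
        case 1:  P1 = 0xBF; break;    /* "forward"              */
        case 2:  P1 = 0xDF; break;    /* "back"                 */
        case 3:  P1 = 0xF7; break;    /* "right"                */
        case 4:  P1 = 0xEF; break;    /* "left"                 */
        case 5:  P1 = 0x7F; break;    /* "load"/crane movement  */
        default: P1 = 0xFF; break;    /* "stop" and anything unrecognized */
        }
    }

Mapping every unrecognized code to the stop pattern gives the fail-safe behavior described earlier: if the robot hears nothing usable, it stops.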

Keil µVision 2

It is software which allows us to write the microcontroller program in the C language, as per user convenience, and then convert it into hex code. This makes programming simpler: there is no need to look up opcodes for commands.

AEC_ISP v3 Programmer

It is used to program the 89S51, 89S52 and 89S53: it reads hex files and programs them into the microcontroller.

Running the software: your code needs to be in Intel hex format. AEC_ISP will open the file you specify and load it into a buffer. You can specify a default file on the command line; e.g., to specify TEST.HEX as the default file, start by typing AEC_ISP TEST.HEX.

RF module:

Let's take a closer look at how a typical RC truck works; we will assume that the exact frequency used is 27.9 MHz. Here's the sequence of events that takes place when you use the RC transmitter:
1. You press a trigger to make the truck go forward.
2. The trigger causes a pair of electrical contacts to touch, completing a circuit connected to a specific pin of an integrated circuit (IC).
3. The completed circuit causes the transmitter to transmit a set sequence of electrical pulses.
Each sequence contains a short group of synchronization pulses, followed by the pulse sequence. For our truck, the synchronization segment -- which alerts the receiver to incoming information -- is four pulses that are 2.1 milliseconds (thousandths of a second) long, with 700-microsecond (millionths of a second) intervals. The pulse segment, which carries the new information, uses 700-microsecond pulses with 700-microsecond intervals.


[Figure: a typical RC signal transmission]

Here are the pulse sequences used in the pulse segment:
1. Forward: 16 pulses
2. Backward: 40 pulses
3. Forward/Left: 28 pulses
4. Forward/Right: 34 pulses
5. U-turn: 52 pulses
6. Crane movement: 46 pulses

The transmitter sends bursts of radio waves that oscillate with a frequency of 27,900,000 cycles per second (27.9 MHz). The truck constantly monitors the assigned frequency (27.9 MHz) for a signal. When the receiver picks up the radio bursts from the transmitter, it sends the signal to a filter that blocks out any signals picked up by the antenna other than 27.9 MHz. The remaining signal is converted back into an electrical pulse sequence.
The pulse sequence is sent to the IC in the truck, which decodes the sequence and starts the appropriate motor. For our example, the pulse sequence is 16 pulses (forward), which means that the IC sends positive current to the motor running the wheels. If the next pulse sequence were 40 pulses (reverse), the IC would invert the current to the same motor to make it spin in the opposite direction.
The motor's shaft actually has a gear on the end of it, instead of connecting directly to the axle. This decreases the motor's speed but increases the torque, giving the truck adequate power from a small electric motor. The truck moves forward.
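At the receiving end, the classification of a burst is again a small lookup, this time on the pulse count. A hedged C sketch, with the pulse-counting hardware abstracted away behind a caller-supplied count:

    typedef enum {
        STOP, FORWARD, BACKWARD, FWD_LEFT, FWD_RIGHT, U_TURN, CRANE
    } action_t;

    /* Map a received pulse count (from the list above) to an action.
       Unknown counts fall back to STOP as a fail-safe. */
    action_t decode_burst(int pulses)
    {
        switch (pulses) {
        case 16: return FORWARD;
        case 40: return BACKWARD;
        case 28: return FWD_LEFT;
        case 34: return FWD_RIGHT;
        case 52: return U_TURN;
        case 46: return CRANE;
        default: return STOP;
        }
    }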
Buffer: We used the IC 74LS245 as the buffer IC; it solved the current supply problem. It is a 3-state octal bus transceiver, designed for asynchronous two-way communication between data buses. The device passes data from the A bus to the B bus or vice versa, depending on the logic level at the direction control (DIR) input. The enable input can be used to disable the device so that the buses are effectively isolated.
Batteries
Batteries are by far the most commonly used electric power supply for robotics. Batteries are so commonplace that it's easy to take them for granted, but an understanding of batteries will help you choose ones that will optimize your robot's design.

Primary batteries
Primary batteries are one-time-use batteries. The batteries we consider here deliver 1.5 V per cell. They are designed to deliver their rated electrical capacity and then be discarded. When building robotic systems, discarding depleted primary batteries can become expensive. However, one advantage of primary batteries is that they typically have a greater electrical capacity than rechargeables. If one is engaged in a function (e.g., a robotic war) that requires the highest power density available for one-shot use, primary batteries may be the way to go.

Secondary batteries
Secondary batteries are rechargeable. The most common rechargeable batteries are NiCd and lead-acid. Secondary batteries, while initially more expensive, are cheaper in the long run. Typically secondary batteries can be recharged 200 to 1000 times.

Chapter 8. TRAINING AND RECOGNITION
To record or train a command, the chip stores the analog signal pattern and amplitude and saves it in the 8K x 8 SRAM. In recognition mode, the chip compares the user-inputted analog signal from the microphone with those stored in the SRAM, and if it recognizes a command, the command identifier is sent to the microprocessor through the D0 to D7 ports of the chip. For training, testing (whether a word is recognized properly) and clearing the memory, the keypad and 7-segment display are used.

To Train:
To train the circuit, begin by pressing the number of the word you want to train on the keypad. Use any number between 1 and 40; for example, press the number "1" to train word number 1. When you press the number(s) on the keypad, the red LED turns off and the number is shown on the digital display. Next press the "#" key for train. When the "#" key is pressed, it signals the chip to listen for a training word and the red LED turns back on. Now speak the word you want the circuit to recognize clearly into the microphone. The LED should blink off momentarily; this is a signal that the word has been accepted. Continue training new words in the circuit using the procedure outlined above: press the "2" key then the "#" key to train the second word, and so on. The circuit will accept up to forty words, but you do not have to enter all 40 words into memory to use the circuit; you can use as many word spaces as you want.

Recognition:
The circuit is continually listening. Repeat a trained word into the microphone, and the number of the word should be shown on the digital display. For instance, if the word "directory" was trained as word number 25, saying the word "directory" into the microphone will cause the number 25 to be displayed.

Error Codes:
The chip provides the following error codes:
55 = word too long
66 = word too short
77 = no match

Chapter 9. APPLICATIONS

We believe such a system would find a wide variety of applications. Menu-driven systems such as e-mail readers, and household appliances like washing machines, microwave ovens, pagers and mobiles, will become voice controlled in the future.
- The robot is useful in places which humans find difficult to reach but where the human voice reaches, e.g. in a small pipeline, in fire situations, or in highly toxic areas.
- The robot can be used as a toy.
- It can be used to bring and place small objects.
- It is one of the important stages of humanoid robots.
- Command and control of appliances and equipment.
- Telephone assistance systems.
- Data entry.
- Speech and voice recognition security systems.

Chapter 10. COMPONENTS USED

Parts list for the speech-recognition circuit:
(1) IC1 HM2007 IC
(1) IC2 SRAM 8K x 8
(1) IC3 74LS373
(2) IC4 and IC5 7448
(1) XTAL 3.57 MHz
(1) Speech-recognition PCB
(1) 12-contact keypad
(2) 7-segment displays
(2) 16-pin, 220-ohm, 1/4-W resistor packs
(1) 22K-ohm, 1/4-W resistor
(1) 5.6K-ohm, 1/4-W resistor
(1) 0.0047-uF capacitor
(1) C2 100-uF, 16V capacitor
(1) C5 0.1-uF capacitor
(1) 7805 voltage regulator
(1) Microphone
(1) 9V battery clip

Parts list for the interface circuit:
(1) Microcontroller 89S51
(1) 74LS373 octal D flip-flop, tri-state
(4) 220-ohm 7-pin resistor banks
(10) Miniature LEDs
(1) L298N
(1) RF module
(1) 40 MHz crystal
(3) DC motors
(1) Antenna
(4) Male-female 7-pin connectors

Parts available from: Images Company, 39 Seneca Loop, Staten Island, NY 10314; http://www.imagesco.com

Chapter 11. DATASHEET - HM2007

Features
Single-chip voice-recognition CMOS LSI
Speaker-dependent
External RAM support
Maximum of 40-word recognition
Maximum word length of 1.92 s
Microphone support
Manual and CPU modes available
Response time less than 300 milliseconds (ms)
5 volt (5V) power supply

[The remainder of this chapter reproduces the HM2007 datasheet as scanned images in the original report.]
Chapter 12. PROJECT PROGRESS REPORT SUMMARY

Calendar year 2006:
June - Work started.
July - Gathered useful information on voice processing techniques and microphone properties. (Chapters 3, 4)
August - We tried another chip, the AP7003-02, manufactured by the Indian company A-plus India. (Page 20)
September - We built a voice recognition module using the AP7003-02.
October - Our attempts with the AP7003-02 did not succeed.
November - Tried to find a better alternative; finally decided to go with the HM2007 and had it imported from the US. (Pages 19-20; Chapter 11)
December - Project work was on hold.

Calendar year 2007:
January - In the 2nd week of January we worked on the voice recognition part, and the circuit was soldered. In the last week we got the desired output from the voice recognition module. (Page 26)
February - We worked on the microcontroller part. With a lot of minor problems solved, we managed to complete it. At the end of February we had some success with our wired model using the proper driver circuit. (Pages 28-32)
March - We reused a discarded toy car, decoded its remote-control logic and matched it with our microcontroller output. Finally, with buffers added between the microcontroller and the RF module, we were able to bring the entire circuit together. At this point we also won a certificate in a project paper presentation. (Pages 33-35)
April - Project complete.

Chapter 13. BIBLIOGRAPHY

Web:
- www.imagesco.com/articles/hm2007/SpeechRecognitionTutorial01.html
- www.migi.com - for selecting motors and other robotic concepts.
- www.migindia.com/modules.php?name=News&file=article&sid=22
- www.datasheetcatalog.com
- www.alldatasheet4u.com
- http://arts.ucsc.edu/ems/music/tech_background/TE-20/teces_20.html#I - for microphone types and properties.
- www.howstuffworks.com - for understanding microphone concepts, RF radio working and other related concepts.

Books:
- The 8051 Microcontroller, Kenneth Ayala, 3rd reprint, 2005; Thomson Asia Ltd., Singapore; Chapters 3, 6, 7 & 8. For programming the 89S51.
- Modern Digital Electronics, R. P. Jain, 3rd edition; Tata McGraw-Hill; Chapters 6 & 10. For A/D converter and 7-segment display connections.

Others:
- Keil µVision 2 software - used for simulating the microcontroller program.
- AEC_ISP v3 - used for burning/programming the microcontroller.
