ABSTRACT:
This project aims to help people with disabilities. I chose this project because I have
always wanted to help disabled people, and I personally feel that working in the field of
Assistive Technology is very rewarding: it enables people with disabilities to
compensate, at least in part, for the impairments they experience. The main idea of
this project is to combine the standard microphone signal and the EMG signal
produced during speech for robust speech recognition in noisy environments. These
signals will then be classified (by a neural network) and used to operate rehabilitative
devices (such as a wheelchair). This project will help people with speech disabilities
as well as physical disabilities (e.g., a paralyzed patient).
1. INTRODUCTION:
Speech is a natural interface for human-machine communication, as well as being
one of the most natural means of human-human communication.
However, environmental robustness is still one of the main barriers to the wide use
of speech recognition. Speech recognition performance degrades significantly
under varying environmental conditions for many application areas. In order to
minimize environmental effects and high levels of acoustic noise, researchers have
developed a variety of techniques to minimize the impact of noise, ranging from
adaptive noise cancellation to throat microphones. Throat microphones record
speech signal in the form of vibrations through skin-attached piezo-electric sensors
and are significantly more robust to environmental noise conditions than acoustic
microphone recordings. However, they represent a lower bandwidth speech signal
content compared to open-air acoustic recordings. Increasingly, researchers are
experimenting with the measurement and analysis of bioelectric signals associated
with speech in an effort to further minimize or even completely eliminate the
degrading effects of acoustic noise. Such techniques, either on their own or fused
with other modalities, hold promise for improving human communication and
human-computer interaction. One possible approach to increasing the robustness
of speech recognizers in noisy environments is to complement the standard
microphone signal with additional robust signals from alternative speech sensors,
which can be better isolated from environmental noise. In this work, the bioelectric
technique of electromyography will be used for this purpose.
Electromyography is the study of muscle function through its electrical properties.
Electrical activity emanating from muscles associated with speech can be detected
by non-invasive surface sensors mounted in the region of the face and neck. For
this approach to be effective, two problems must be addressed: how to combine
information from both the signals so as to improve noisy-speech recognition, and
how to obtain acoustic models for the combined features, since only a small
amount of simultaneous measurements may be available.
Voiced sounds are produced when the vocal cords vibrate open and closed, thus
interrupting the flow of air from the lungs to the vocal tract and producing quasi-periodic pulses of air as the excitation. The rate of opening and closing gives
the pitch of the sound. This can be adjusted by varying the shape of, and the
tension in, the vocal cords, and the pressure of the air behind them. Voiced
sounds show a high degree of periodicity at the pitch period, which is typically
between 2 and 20 ms.
Unvoiced sounds result when the excitation is a noise-like turbulence produced
by forcing air at high velocities through a constriction in the vocal tract while the
glottis is held open. Such sounds show little long-term periodicity, although
short-term correlations due to the vocal tract are still present.
Plosive sounds result when a complete closure is made in the vocal tract, and air
pressure is built up behind this closure and released suddenly.
Some sounds cannot be considered to fall into any one of the three classes above, but
are a mixture. For example, voiced fricatives result when both vocal cord vibration and
a constriction in the vocal tract are present.
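
Since voiced sounds are quasi-periodic with a pitch period of roughly 2 to 20 ms, that period can be estimated from the peak of a frame's short-time autocorrelation. Below is a minimal Python/numpy sketch of this idea; the sampling rate and the synthetic tone standing in for a voiced frame are illustrative assumptions, not values taken from this project.

```python
import numpy as np

def estimate_pitch_period(frame, fs, min_ms=2.0, max_ms=20.0):
    """Estimate the pitch period of a voiced frame from its autocorrelation peak."""
    frame = frame - np.mean(frame)
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(fs * min_ms / 1000.0)   # shortest plausible pitch period (2 ms)
    hi = int(fs * max_ms / 1000.0)   # longest plausible pitch period (20 ms)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag / fs                   # estimated pitch period in seconds

# Toy usage: a synthetic 100 Hz "voiced" frame sampled at 8 kHz.
fs = 8000
t = np.arange(0, 0.03, 1.0 / fs)
frame = np.sin(2 * np.pi * 100.0 * t)
print(estimate_pitch_period(frame, fs))   # ~0.01 s, i.e. a 100 Hz pitch
```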
Although there are many possible speech sounds which can be produced, the shape of
the vocal tract and its mode of excitation change relatively slowly, and so speech can
be considered to be quasi-stationary over short periods of time (of the order of 20 ms).
Speech signals show a high degree of predictability, due partly to the quasi-periodic vibrations of the vocal cords and partly to the resonances of the vocal tract.
Speech coders attempt to exploit this predictability in order to reduce the data rate
necessary for good quality voice transmission.
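
As a rough illustration of that predictability, the sketch below fits a low-order linear predictor to a single 20 ms frame using the autocorrelation normal equations, the same idea underlying LPC-based coders. The frame content, sampling rate, and predictor order are illustrative assumptions only.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Fit a linear predictor x[n] ~ sum_k a[k] * x[n-k] via the autocorrelation method."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz normal equations R a = r built from autocorrelation lags 0..order.
    R = np.array([[ac[abs(i - j)] for j in range(order)] for i in range(order)])
    r = ac[1:order + 1]
    return np.linalg.solve(R, r)

# One 20 ms frame at 8 kHz: a 150 Hz tone plus a little noise (placeholder for speech).
fs = 8000
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 150.0 * np.arange(int(0.02 * fs)) / fs)
frame = frame + 0.01 * rng.standard_normal(len(frame))

a = lpc_coefficients(frame, order=10)
pred = np.convolve(frame, np.concatenate(([0.0], a)))[:len(frame)]
residual = frame - pred
print(np.mean(residual[11:] ** 2) / np.mean(frame ** 2))   # small ratio => highly predictable
```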
inflections and enunciations of the target word. The command word count is
usually lower than for speaker-dependent systems; however, high accuracy can still
be maintained within processing limits. Industrial requirements more often call for
speaker independent voice systems, such as the AT&T system used in the
telephone systems.
A more general form of voice recognition is available through feature analysis and this
technique usually leads to "speaker-independent" voice recognition. Instead of trying
to find an exact or near-exact match between the actual voice input and a previously
stored voice template, this method first processes the voice input using "Fourier
transforms" or "linear predictive coding (LPC)", then attempts to find characteristic
similarities between the expected inputs and the actual digitized voice input. These
similarities will be present for a wide range of speakers, and so the system need not be
trained by each new user. The types of speech differences that the speaker-independent
method can deal with, but which pattern matching would fail to handle, include
accents and variations in speed of delivery, pitch, volume, and inflection. Speaker-independent
speech recognition has proven to be very difficult, with some of the greatest hurdles
being the variety of accents and inflections used by speakers of different nationalities.
Recognition accuracy for speaker independent systems is somewhat less than for
speaker-dependent systems, usually between 90 and 95 percent. As an advantage,
speaker-independent systems do not require each user to train the system, but they
perform with lower accuracy. These systems find applications in telephony, such as
dictating a number or a word, where many different speakers are involved. However,
speaker-independent systems require a well-constructed training database.
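
A minimal sketch of the Fourier-based feature analysis mentioned above: each frame is windowed and reduced to a coarse log-magnitude spectrum that can be compared across speakers. The frame length, hop, and band count are illustrative assumptions, not parameters taken from this project.

```python
import numpy as np

def logspec_features(signal, fs, frame_ms=25, hop_ms=10, n_bands=20):
    """Frame the signal and reduce each frame to a coarse log-magnitude spectrum."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame_len] * window))
        # Average the spectrum into n_bands bands, then take logs for compression.
        bands = [np.mean(b) for b in np.array_split(spectrum, n_bands)]
        feats.append(np.log(np.array(bands) + 1e-8))
    return np.array(feats)                      # shape: (n_frames, n_bands)

# Toy usage on one second of placeholder "speech" (random noise).
fs = 8000
signal = np.random.default_rng(0).standard_normal(fs)
print(logspec_features(signal, fs).shape)       # e.g. (98, 20)
```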
2.2.1. Recognition Style:
Speech recognition systems have another constraint concerning the style of speech they
can recognize. There are three styles of speech: isolated, connected, and continuous.
Isolated speech recognition systems can just handle words that are spoken
separately. This is the most common type of speech recognition system available today.
The user must pause between each word or command spoken. The speech
recognition circuit is set up to identify isolated words of 0.96-second length.
Connected speech recognition is a halfway point between isolated-word and
continuous speech recognition. It allows users to speak multiple words. The HM2007
can be set up to identify words or phrases 1.92 seconds in length. This reduces the
word recognition vocabulary to 20.
Continuous speech is the natural conversational speech we are used to in everyday
life. It is extremely difficult for a recognizer to sift through such speech, as the words
tend to merge together. For instance, "Hi, how are you doing?" sounds like
for different recording conditions and, depending on the length of time that the system
had to adapt to different speakers and conditions, it might use cepstral mean and
variance normalization for channel differences, vocal tract length normalization
(VTLN) for male-female normalization, and maximum likelihood linear regression
(MLLR) for more general speaker adaptation. The features would have delta and
delta-delta coefficients to capture speech dynamics and in addition might use
heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and
delta-delta coefficients and use LDA followed perhaps by heteroscedastic linear
discriminant analysis or a global semi-tied covariance transform (also known as
maximum likelihood linear
transform (MLLT)). A serious company with a large amount of training data would
probably want to consider discriminative training techniques like maximum mutual
information (MMI), MPE, or (for short utterances) MCE, and if a large amount of
speaker-specific enrollment data were available, a more wholesale speaker adaptation
could be done using MAP or, at least, tree-based maximum likelihood linear regression.
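
To make the normalization and dynamics steps concrete, a short sketch applying per-utterance cepstral mean and variance normalization and appending first (delta) and second (delta-delta) differences is given below; the 13-dimensional feature matrix is a random placeholder for whatever front end is actually used.

```python
import numpy as np

def cmvn(feats):
    """Cepstral mean and variance normalization over the utterance (per dimension)."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def add_deltas(feats):
    """Append first (delta) and second (delta-delta) differences along time."""
    delta = np.gradient(feats, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([feats, delta, delta2])

# Example: 100 frames of 13-dimensional cepstral features (random stand-in data).
feats = np.random.default_rng(0).standard_normal((100, 13))
full = add_deltas(cmvn(feats))
print(full.shape)                             # (100, 39)
```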
Decoding of the speech (the term for what happens when the system is presented with
a new utterance and must compute the most likely source sentence) would probably use
the Viterbi algorithm to find the best path, but there is a choice between dynamically
creating a combination hidden Markov model that includes both the acoustic and
language model information, or combining them statically beforehand (the AT&T
approach, for which their FSM toolkit might be useful). Those who value their sanity
might consider the AT&T approach, but be warned that it is memory hungry.
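
For reference, a small log-domain Viterbi sketch over a toy two-state HMM is shown below; the initial, transition, and emission probabilities are made-up values, not part of any real acoustic or language model.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path given log-domain HMM scores."""
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]               # best log-score ending in each state
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans        # cand[i, j]: best path into i, then i -> j
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + log_emit[t]
    path = [int(np.argmax(score))]               # trace back from the best final state
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 4-frame example.
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
print(viterbi(log_init, log_trans, log_emit))    # -> [0, 0, 1, 1]
```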
b. Neural network-based speech recognition
Another approach in acoustic modeling is the use of neural networks. They are capable
of solving much more complicated recognition tasks, but do not scale as well as HMMs
when it comes to large vocabularies. Rather than being used in general-purpose speech
recognition applications, they are well suited to handling low-quality, noisy data and
to achieving speaker independence. Such systems can achieve greater accuracy than
HMM-based systems, as long as there is sufficient training data and the vocabulary is
limited. A more general approach
using neural networks is phoneme recognition. This is an active field of research, but
generally the results are better than for HMMs. There are also NN-HMM hybrid
systems that use the neural network part for phoneme recognition and the hidden
Markov model part for language modeling.
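
In such a hybrid arrangement, the network's job is to turn each feature frame into phoneme posteriors that the HMM then strings into words. A single-layer numpy sketch of that scoring step follows; the weights and frames are random placeholders rather than a trained model, and the class count of 40 phonemes is an assumption.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((39, 40))          # placeholder weights: 39-dim frame -> 40 phonemes
b = np.zeros(40)

frames = rng.standard_normal((100, 39))          # placeholder per-frame feature vectors
posteriors = softmax(frames @ W + b)             # one phoneme distribution per frame
print(posteriors.shape, posteriors.sum(axis=1)[:3])   # (100, 40), rows sum to 1
```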
2.3. Electromyography
Electromyography is the study of muscle function via its electrical properties (i.e.,
the electrical signal emanating from the muscle during muscle activation). In 1848,
the Frenchman DuBois-Reymond was the first to report the detection of electrical
signals voluntarily elicited from human muscles. By placing his fingers in a saline
solution and contracting his hand and forearm, he produced a measurable
deflection in a galvanometer. His dedication to his work is beyond question:
correctly surmising that the skin presented a high impedance to the flow of current,
on at least two separate occasions he deliberately blistered his forearm, removed
the skin, and exposed the open wound to the saline, thereby producing a
substantially greater deflection in the galvanometer during muscle contraction.
Electromyography has continued to develop since the time of DuBois-Reymond.
Substantial research interest was generated during the 1960s in the use of
electromyography as a mechanism for the control of prostheses. The arrival of
inexpensive digital computing in the 1980s furthered development, with many
research groups investigating digital techniques for control and communication,
including groups focused on EMG-based speech recognition. These speech efforts
are surveyed in the next section of this paper. The electrophysiology of muscles is
complex. Muscle action originates in the central and peripheral nervous systems.
Nerve impulses are carried from anterior horn cells of the spinal column to the end
of the nerve via motor neurons. As the axon of the motor neuron approaches
FIG-2.3 Motor neuron innervation of muscle fibers. Note that adjacent fibers are
not necessarily innervated by the same motor neuron.
overall correct classification rate in the presence of ambient acoustic noise. At the
NASA Ames Research Center, members of the Neuro-Engineering Laboratory have
done work on subvocal speech recognition. In 2003, Jorgensen et al. collected six
words from three subjects using surface Ag-AgCl sensors and a single EMG
channel. Data were collected at the rate of 2000 samples/channel/s. A variety of
techniques were tested for feature extraction, including short-time Fourier
transforms, linear predictive coding, and several different wavelet transforms.
Classification was performed using a neural network and an average correct
recognition rate of 92% was achieved. In later work, Jorgensen and Binsted applied
a similar signal processing architecture to seventeen vowel phonemes and twenty-three consonant phonemes collected from two subjects. Average correct recognition
exceeded 33% for the entire vocabulary (and exceeded 50% when certain alveolars
were removed). In 2003, NTT DoCoMo researchers Manabe et al. used a novel
surface sensor mounting configuration for EMG-based speech recognition. Three
channels of sensors were mounted on the subject's hand, then the hand was held to
the face during speech. Analog filtering restricted the EMG signal to the range
20–450 Hz with a sampling rate of 1000 samples/channel/s. Recognition was performed
using a three-layer neural network, where the inputs to the network were the root-mean-squared (RMS) EMG values during pronunciation of a vowel. Over three
subjects, each using a vocabulary of five Japanese vowels, the average correct
classification rate exceeded 90%. In later work, Manabe and Zhang made use of
HMMs to classify the ten Japanese digits collected from ten subjects; accuracies as
high as 64% were achieved. In 2004, Kumar et al. used three EMG channels for
speech recognition. Channels were sampled at 250 samples/channel/s, with RMS
EMG values used as feature inputs to a neural network classifier. Using three
subjects and five English vowels, an average recognition rate of up to 88% was
achieved.
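
The RMS features used in several of the studies above are straightforward to compute. A windowed-RMS sketch follows, with the window length chosen arbitrarily and random numbers standing in for real EMG samples.

```python
import numpy as np

def rms_features(emg, fs, win_ms=50):
    """Root-mean-squared EMG amplitude in consecutive non-overlapping windows."""
    win = int(fs * win_ms / 1000)
    n = len(emg) // win
    frames = emg[:n * win].reshape(n, win)
    return np.sqrt(np.mean(frames ** 2, axis=1))

fs = 1000                                        # samples/channel/s, as in Manabe et al.
emg = np.random.default_rng(0).standard_normal(fs)   # 1 s of placeholder EMG
print(rms_features(emg, fs))                     # one RMS value per 50 ms window (20 values)
```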
3. PROJECT:
3.1. Current Work
In this semester I have mostly studied literature (research papers, journals, and
articles) related to this project. As this project requires deep knowledge of areas
such as signal processing and neural networks, I have read various books and
watched video tutorials related to these fields.
I have designed the microphone preamplifier circuit. The next step of this Project
will be Data Acquisition and Signal processing. The core idea behind these steps is
the correct recognition of the spoken words, so that we can interact with the
wheelchair through speech signals. I have also addressed the different types of
problems which may occur during speech recognition.
Speech recognition is the process of finding an interpretation of a spoken
utterance; typically, this means finding the sequence of words that were spoken.
This involves preprocessing the acoustic signal to parameterize it in a more usable
and useful form. The input signal must be matched against a stored pattern, and
then a decision is made to accept or reject the match. No two utterances of the
same word or sentence are likely to give rise to the same digital signal. This obvious
point not only underlies the difficulty in speech recognition but also means that we
may be able to extract more than just a sequence of words from the signal.
Data Acquisition:
Speech signal data will be collected using a data acquisition board through a standard
microphone. The general words used in the operation of the wheelchair are forward,
backward, right, left, and stop. In the offline phase, 100 examples of each word will
be collected from the subject and used to train the classifier.
Signal Processing:
The signal processing activity has two distinct phases. In the first, a training set will
be used to produce a classifier. In the second, the trained classifier will be
presented with previously unseen samples, either for the purpose of testing the
classifier or for producing some end effect. The first three stages are common to
both phases:
1. Signal acquisition
2. Activity detection
3. Feature extraction
Activity detection refers to the process of segmenting an isolated word out of the
continuous signal stream, and feature extraction is the process of reducing the
dimensionality of the data so as to facilitate subsequent classification.
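
One simple way to implement activity detection is to threshold short-time energy and keep the contiguous region above the threshold. The sketch below does this; the window length and threshold ratio are illustrative assumptions rather than values fixed by the project.

```python
import numpy as np

def detect_activity(signal, fs, win_ms=20, threshold_ratio=0.1):
    """Segment an isolated word by thresholding short-time energy."""
    win = int(fs * win_ms / 1000)
    n = len(signal) // win
    energy = np.array([np.sum(signal[i * win:(i + 1) * win] ** 2) for i in range(n)])
    active = np.flatnonzero(energy > threshold_ratio * energy.max())
    if active.size == 0:
        return None
    start, end = active[0] * win, (active[-1] + 1) * win
    return signal[start:end]                      # samples containing the detected word

# Toy usage: 0.5 s silence, 1 s "speech" (noise), 0.5 s silence at 8 kHz.
fs = 8000
rng = np.random.default_rng(0)
sig = np.concatenate([np.zeros(fs // 2), rng.standard_normal(fs), np.zeros(fs // 2)])
print(len(detect_activity(sig, fs)) / fs)         # ~1.0 s of detected activity
```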
After activity detection and feature extraction, features from the training set will
be used to train a neural network classifier. 70% of the collected samples will be
used for training and the remaining 30% will be set aside for generalization testing.
Once trained, the neural network will be inserted into the real-time system for
purposes of testing. The subject will use the real-time system to operate a
wheelchair. The network will be used by the same subject for whom it was trained.
Before being inserted into the real-time system, the network will be checked
against its generalization set to ensure that it was not an outlier in terms of
recognition rate.
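
The 70/30 split and generalization check could be carried out as in the following sketch, which assumes scikit-learn is available and uses random placeholder features and labels in place of the real collected data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 39))               # placeholder feature vectors
y = rng.integers(0, 5, size=500)                 # forward/backward/right/left/stop

# 70% for training, 30% held out for generalization testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)

net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))                 # generalization accuracy on the 30% set
```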
In the next stage I will take EMG signals produced from subvocal speech and, after
activity detection and feature extraction, use them to train a neural network
classifier. Once the classifier is trained, the subject will use the real-time system to
communicate through speech EMG.
To increase the robustness of speech recognition in noisy environments, I will
complement the standard microphone signal with this EMG signal. In the final stage
of this project, I will therefore perform a joint analysis of the standard microphone
signal and the speech EMG signal.
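
A simple form of that joint analysis is to time-align the acoustic and EMG feature streams and concatenate them frame by frame before classification. The sketch below shows this with placeholder feature matrices and illustrative dimensions; it is only one possible fusion scheme, not the project's fixed design.

```python
import numpy as np

def fuse_features(acoustic_feats, emg_feats):
    """Concatenate time-aligned acoustic and EMG feature frames into joint vectors."""
    n = min(len(acoustic_feats), len(emg_feats))  # crude alignment by truncation
    return np.hstack([acoustic_feats[:n], emg_feats[:n]])

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((100, 39))         # e.g. cepstral + delta features per frame
emg = rng.standard_normal((100, 4))               # e.g. windowed RMS per EMG channel
print(fuse_features(acoustic, emg).shape)          # (100, 43): joint vectors for the classifier
```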
REFERENCES: