
OBJECTIVE:

To design and develop a speech-signal-based biopotential amplifier for
Rehabilitative Aids (e.g. a wheelchair).

ABSTRACT:
This project aims to help people with disabilities. I chose this project because I have
always wanted to help disabled people, and I personally feel that working in the field of
Assistive Technology is rewarding: it enables people with disabilities to compensate,
at least in part, for the impairments they experience. The main idea of this project is to
combine the standard microphone signal and the EMG signal produced during speech for
robust speech recognition in noisy environments. These signals will then be classified
(by a neural network) and used to operate rehabilitative devices (such as a wheelchair).
This project will therefore help people with speech disabilities as well as physical
disabilities (e.g. a paralyzed patient).

1. INTRODUCTION:
Speech is a natural interface for human-machine communication, as well
as being one of the most natural interfaces for human-human communication.
However, environmental robustness is still one of the main barriers to the wide use
of speech recognition. Speech recognition performance degrades significantly
under varying environmental conditions for many application areas. In order to
minimize environmental effects and high levels of acoustic noise, researchers have
developed a variety of techniques to minimize the impact of noise, ranging from
adaptive noise cancellation to throat microphones. Throat microphones record
speech signal in the form of vibrations through skin-attached piezo-electric sensors
and are significantly more robust to environmental noise conditions than acoustic
microphone recordings. However, they represent a lower bandwidth speech signal
content compared to open-air acoustic recordings. Increasingly, researchers are
experimenting with the measurement and analysis of bioelectric signals associated
with speech in an effort to further minimize or even completely eliminate the
degrading effects of acoustic noise. Such techniques, either on their own or fused
with other modalities, hold promise for improving human communication and
human-computer interaction. One possible approach to increasing the robustness
of speech recognizers in noisy environments is to complement the standard
microphone signal with additional robust signals from alternative speech sensors,
which can be more isolated from environmental noises. In this work, for this
purpose, the bioelectric technique, electromyography will be used.
Electromyography is the study of muscle function through its electrical properties.
Electrical activity emanating from muscles associated with speech can be detected
by non-invasive surface sensors mounted in the region of the face and neck. For
this approach to be effective, two problems must be addressed: how to combine
information from both signals so as to improve noisy-speech recognition, and
how to obtain acoustic models for the combined features, since only a small
amount of simultaneous measurements may be available.

2. BACKGROUND AND RELATED RESEARCH:


2.1. Speech Signal:
Speech is produced when air is forced from the lungs through the vocal cords and along
the vocal tract. The vocal tract extends from the opening in the vocal cords (called the
glottis) to the mouth, and in an average man is about 17 cm long. It introduces short-term correlations (of the order of 1 ms) into the speech signal, and can be thought of as
a filter with broad resonances called formants. The frequencies of these formants are
controlled by varying the shape of the tract, for example by moving the position of the
tongue. An important part of many speech codecs is the modelling of the vocal tract as
a short term filter. As the shape of the vocal tract varies relatively slowly, the transfer
function of its modelling filter needs to be updated only relatively infrequently
(typically every 20 ms or so).
The vocal tract filter is excited by air forced into it through the vocal cords. Speech
sounds can be broken into three classes depending on their mode of excitation.

Voiced sounds are produced when the vocal cords vibrate open and closed, thus
interrupting the flow of air from the lungs to the vocal tract and producing quasi-periodic pulses of air as the excitation. The rate of the opening and closing gives
the pitch of the sound. This can be adjusted by varying the shape of, and the
tension in, the vocal cords, and the pressure of the air behind them. Voiced
sounds show a high degree of periodicity at the pitch period, which is typically
between 2 and 20 ms.
Unvoiced sounds result when the excitation is a noise-like turbulence produced
by forcing air at high velocities through a constriction in the vocal tract while the
glottis is held open. Such sounds show little long-term periodicity, although
short-term correlations due to the vocal tract are still present.
Plosive sounds result when a complete closure is made in the vocal tract, and air
pressure is built up behind this closure and released suddenly.

Some sounds cannot be considered to fall into any one of the three classes above, but
are a mixture. For example voiced fricatives result when both vocal cord vibration and
a constriction in the vocal tract are present.
Although there are many possible speech sounds which can be produced, the shape of
the vocal tract and its mode of excitation change relatively slowly, and so speech can
be considered to be quasi-stationary over short periods of time (of the order of 20 ms).
Speech signals show a high degree of predictability, due in part to the quasi-periodic vibrations of the vocal cords and in part to the resonances of the vocal tract.

Speech coders attempt to exploit this predictability in order to reduce the data rate
necessary for good quality voice transmission.
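The short-term filter view of the vocal tract described above is what linear predictive coding exploits: each frame of roughly 20 ms is fitted with an all-pole filter whose resonances approximate the formants. The following is only a minimal illustrative sketch in Python/NumPy (the frame length, model order and function name are my own assumptions, not part of this project's design), using the Levinson-Durbin recursion.

    import numpy as np

    def lpc_coefficients(frame, order=10):
        # Fit an all-pole (LPC) model of the vocal tract to one speech frame
        # (e.g. 20 ms of samples) using the Levinson-Durbin recursion.
        frame = frame * np.hamming(len(frame))
        # Autocorrelation for lags 0 .. len(frame)-1
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

        a = np.zeros(order + 1)
        a[0] = 1.0
        error = r[0]
        for i in range(1, order + 1):
            acc = r[i]
            for j in range(1, i):
                acc += a[j] * r[i - j]
            k = -acc / error                  # reflection coefficient
            a_prev = a.copy()
            for j in range(1, i):
                a[j] = a_prev[j] + k * a_prev[i - j]
            a[i] = k
            error *= (1.0 - k * k)            # residual (excitation) energy
        return a, error

The returned coefficients describe the broad spectral resonances (formants) of the frame; because the tract shape changes slowly, they only need to be recomputed every 20 ms or so, as noted above.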

2.2. Speech Recognition Types and Styles


Voice-enabled devices basically use the principle of speech recognition. It is the process
of electronically converting a speech waveform (as the realization of a linguistic
expression) into words (as a best-decoded sequence of linguistic units).
Converting a speech waveform into a sequence of words involves several essential
steps:
1. A microphone picks up the signal of the speech to be recognized and converts it into
an electrical signal. A modern speech recognition system also requires that the electrical
signal be represented digitally by means of an analog-to-digital (A/D) conversion
process, so that it can be processed with a digital computer or a microprocessor.
2. This speech signal is then analyzed (in the analysis block) to produce a representation
consisting of salient features of the speech. The most prevalent feature of speech is
derived from its short-time spectrum, measured successively over short-time windows
of length 20-30 milliseconds overlapping at intervals of 10-20 ms. Each short-time
spectrum is transformed into a feature vector, and the temporal sequence of such feature
vectors thus forms a speech pattern.
3. The speech pattern is then compared to a store of phoneme patterns or models
through a dynamic programming process in order to generate a hypothesis (or a number
of hypotheses) of the phonemic unit sequence. (A phoneme is a basic unit of speech
and a phoneme model is a succinct representation of the signal that corresponds to a
phoneme, usually embedded in an utterance.) A speech signal inherently has substantial
variations along many dimensions.
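As a concrete illustration of step 2, the sketch below (an assumed Python/NumPy implementation; the window and hop sizes are simply example values from the 20-30 ms and 10-20 ms ranges quoted above) frames a digitized signal and turns each window into a log-magnitude spectral feature vector.

    import numpy as np

    def short_time_features(signal, fs=16000, win_ms=25, hop_ms=10):
        # Convert a digitized speech waveform into a temporal sequence of
        # short-time spectral feature vectors (one vector per analysis window).
        win_len = int(fs * win_ms / 1000)    # e.g. 400 samples at 16 kHz
        hop_len = int(fs * hop_ms / 1000)    # e.g. 160 samples at 16 kHz
        window = np.hamming(win_len)

        features = []
        for start in range(0, len(signal) - win_len + 1, hop_len):
            frame = signal[start:start + win_len] * window
            spectrum = np.abs(np.fft.rfft(frame))       # short-time magnitude spectrum
            features.append(np.log(spectrum + 1e-10))   # log compresses the dynamic range
        return np.array(features)

The resulting temporal sequence of feature vectors forms the speech pattern that is compared against phoneme models in step 3.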
Speech recognition is classified into two categories, speaker dependent and speaker
independent.
Speaker dependent systems are trained by the individual who will be using the
system. These systems are capable of achieving a high command count and better
than 95% accuracy for word recognition. The drawback to this approach is that
the system responds accurately only to the individual who trained the
system. This is the most common approach employed in software for personal
computers.
Speaker independent is a system trained to respond to a word regardless of who
speaks. Therefore the system must respond to a large variety of speech patterns,
inflections and enunciations of the target word. The command word count is
usually lower than for the speaker-dependent approach; however, high accuracy can
still be maintained within processing limits. Industrial applications more often need
speaker-independent voice systems, such as the AT&T systems used in telephone
networks.
A more general form of voice recognition is available through feature analysis and this
technique usually leads to "speaker-independent" voice recognition. Instead of trying
to find an exact or near-exact match between the actual voice input and a previously
stored voice template, this method first processes the voice input using "Fourier
transforms" or "linear predictive coding (LPC)", then attempts to find characteristic
similarities between the expected inputs and the actual digitized voice input. These
similarities will be present for a wide range of speakers, and so the system need not be
trained by each new user. The types of speech differences that the speaker-independent
method can deal with, but which pattern matching would fail to handle, include accents,
and varying speed of delivery, pitch, volume, and inflection. Speaker-independent
speech recognition has proven to be very difficult, with some of the greatest hurdles
being the variety of accents and inflections used by speakers of different nationalities.
Recognition accuracy for speaker independent systems is somewhat less than for
speaker-dependent systems, usually between 90 and 95 percent. Speaker-independent
systems have the advantage of not requiring each user to train the system, but they
perform with lower accuracy. These systems find applications in telephony, such as
recognizing a dictated number or word from many different callers. However,
speaker-independent systems require a well-constructed training database.
2.2.1. Recognition Style:
Speech recognition systems have another constraint concerning the style of speech they
can recognize. There are three styles of speech: isolated, connected and continuous.
Isolated speech recognition systems can only handle words that are spoken
separately. These are the most common speech recognition systems available today.
The user must pause between each word or command spoken. The speech
recognition circuit is set up to identify isolated words of 0.96-second length.
Connected speech recognition is a halfway point between isolated-word and
continuous speech recognition. It allows users to speak multiple words. The HM2007
can be set up to identify words or phrases 1.92 seconds in length. This reduces the
word recognition vocabulary to 20.
Continuous speech is the natural conversational speech we are used to in everyday
life. It is extremely difficult for a recognizer to sift through it, as the words tend
to merge together. For instance, "Hi, how are you doing?" sounds like
"Hi, howyadoin". Continuous speech recognition systems are on the market and
are under continual development.

2.2.2. Approaches of Statistical Speech Recognition:

a. Hidden Markov model (HMM)-based speech recognition


Modern general-purpose speech recognition systems are typically based on hidden
Markov models (HMMs). An HMM is a statistical model which outputs a sequence of
symbols or quantities.
One possible reason why HMMs are used in speech recognition is that a speech signal
could be viewed as a piece-wise stationary signal or a short-time stationary signal. That
is, one could assume in a short-time in the range of 10 milliseconds, speech could be
approximated as a stationary process. Speech can thus be thought of as a Markov model
for many stochastic processes (known as states).
Another reason why HMMs are popular is because they can be trained automatically
and are simple and computationally feasible to use. In speech recognition, to give the
very simplest setup possible, the hidden Markov model would output a sequence of n
dimensional real-valued vectors with n around, say, 13, outputting one of these every
10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral
coefficients, which are obtained by taking a Fourier transform of a short-time window
of speech and de-correlating the spectrum using a cosine transform, then taking the first
(most significant) coefficients. The hidden Markov model will tend to have, in each
state, a statistical distribution called a mixture of diagonal covariance Gaussians which
will give likelihood for each observed vector. Each word, or (for more general speech
recognition systems), each phoneme, will have a different output distribution; a hidden
Markov model for a sequence of words or phonemes is made by concatenating the
individual trained hidden Markov models for the separate words and phonemes.
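The cepstral front end just described (a Fourier transform of a short window, a log, a cosine transform, then keeping the first coefficients) can be written compactly as below. This is a simplified illustration in Python/SciPy under my own assumptions (no mel filterbank; the value 13 simply mirrors the figure quoted above), not a prescribed implementation.

    import numpy as np
    from scipy.fft import dct

    def cepstral_vector(frame, n_coeffs=13):
        # One cepstral feature vector for a short (about 10-25 ms) speech frame:
        # FFT -> log magnitude -> cosine transform -> keep the first coefficients.
        windowed = frame * np.hamming(len(frame))
        log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
        cepstrum = dct(log_spectrum, type=2, norm="ortho")   # de-correlates the spectrum
        return cepstrum[:n_coeffs]                           # most significant coefficients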
The above is a very brief introduction to some of the more central aspects of speech
recognition. Modern speech recognition systems use a host of standard techniques
which it would be too time-consuming to properly explain, but just to give a flavor, a
typical large-vocabulary continuous system would probably have the following parts.
It would need context dependency for the phones (so phones with different left and right
context have different realizations); to handle unseen contexts it would need tree
clustering of the contexts; it would of course use cepstral normalization to normalize
for different recording conditions and depending on the length of time that the system
had to adapt on different speakers and conditions it might use cepstral mean and
variance normalization for channel differences, vocal tract length normalization
(VTLN) for male-female normalization and maximum likelihood linear regression
(MLLR) for more general speaker adaptation. The features would have delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic
linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients
and use LDA followed perhaps by heteroscedastic linear discriminant analysis or a
global semi-tied covariance transform (also known as maximum likelihood linear
transform (MLLT)). A serious company with a large amount of training data would
probably want to consider discriminative training techniques like maximum mutual
information (MMI), MPE, or (for short utterances) MCE, and if a large amount of
speaker-specific enrollment data was available a more wholesale speaker adaptation
could be done using MAP or, at least, tree based maximum likelihood linear regression.
Decoding of the speech (the term for what happens when the system is presented with
a new utterance and must compute the most likely source sentence) would probably use
the Viterbi algorithm to find the best path, but there is a choice between dynamically
creating combination hidden Markov models which includes both the acoustic and
language model information, or combining it statically beforehand (the AT&T
approach, for which their FSM toolkit might be useful). Those who value their sanity
might consider the AT&T approach, but be warned that it is memory hungry.
b. Neural network-based speech recognition
Another approach in acoustic modeling is the use of neural networks. They are capable
of solving much more complicated recognition tasks, but do not scale as well as HMMs
when it comes to large vocabularies. Rather than being used in general-purpose speech
recognition applications, they are better suited to handling low-quality, noisy data and
speaker-independent tasks. Such systems can achieve greater accuracy than HMM-based
systems, as long as sufficient training data is available and the vocabulary is limited. A more general approach
using neural networks is phoneme recognition. This is an active field of research, but
generally the results are better than for HMMs. There are also NN-HMM hybrid
systems that use the neural network part for phoneme recognition and the hidden
Markov model part for language modeling.

c. Dynamic time warping (DTW)-based speech recognition


Dynamic time warping is an algorithm for measuring similarity between two sequences
which may vary in time or speed. For instance, similarities in walking patterns would
be detected, even if in one video the person was walking slowly and if in another they
were walking more quickly, or even if there were accelerations and decelerations during
the course of one observation. DTW has been applied to video, audio, and graphics;
indeed, any data which can be turned into a linear representation can be analyzed with
DTW.
A well-known application has been automatic speech recognition, to cope with different
speaking speeds. In general, it is a method that allows a computer to find an optimal
match between two given sequences (e.g. time series) with certain restrictions, i.e. the
sequences are "warped" non-linearly to match each other. This sequence alignment
method is often used in the context of hidden Markov models.
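A minimal dynamic-programming implementation of DTW, aligning two feature sequences (for instance a stored template word and a new utterance), is sketched below. The Euclidean local distance and the function name are illustrative choices of my own, not a fixed design decision of this project.

    import numpy as np

    def dtw_distance(seq_a, seq_b):
        # Accumulated alignment cost between two feature sequences of shapes
        # (len_a, dim) and (len_b, dim); a lower cost means a better match.
        len_a, len_b = len(seq_a), len(seq_b)
        cost = np.full((len_a + 1, len_b + 1), np.inf)
        cost[0, 0] = 0.0

        for i in range(1, len_a + 1):
            for j in range(1, len_b + 1):
                local = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # local distance
                # Non-linear warping: match, insertion, or deletion
                cost[i, j] = local + min(cost[i - 1, j - 1],
                                         cost[i - 1, j],
                                         cost[i, j - 1])
        return cost[len_a, len_b]

In a template-matching recognizer, an unknown word would be assigned the label of the stored template with the smallest DTW cost.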

2.3. Electromyography
Electromyography is the study of muscle function via its electrical properties (i.e.,
the electrical signal emanating from the muscle during muscle activation). In 1848,
the Frenchman DuBois-Reymond was the first to report the detection of electrical
signals voluntarily elicited from human muscles. By placing his fingers in a saline
solution and contracting his hand and forearm, he produced a measurable
deflection in a galvanometer. His dedication to his work was beyond question:
correctly surmising that the skin presented a high impedance to the flow of current,
on at least two separate occasions he deliberately blistered his forearm, removed
the skin, and exposed the open wound to the saline, thereby producing a
substantially greater deflection in the galvanometer during muscle contraction.
Electromyography has continued to develop since the time of DuBois-Reymond.
Substantial research interest was generated during the 1960s in the use of
electromyography as a mechanism for the control of prostheses. The arrival of
inexpensive digital computing in the 1980s furthered development, with many
research groups investigating digital techniques for control and communication,
including groups focused on EMG-based speech recognition. These speech efforts
are surveyed in the next section of this paper. The electrophysiology of muscles is
complex. Muscle action originates in the central and peripheral nervous systems.
Nerve impulses are carried from anterior horn cells of the spinal column to the end
of the nerve via motor neurons. As the axon of the motor neuron approaches
individual muscle fibers, it branches, meaning a single motor neuron innervates
several muscle fibers, terminating at a neuro-muscular junction (also known as an
endplate). When a nerve impulse reaches the endplate, the neurotransmitter
acetylcholine is released. This in turn causes sodium and potassium cation channels
to open in the muscle fiber. Once an excitation threshold is achieved, an action
potential propagates in both directions from the endplate to the muscle-tendon
junction. This movement of cations establishes an electromagnetic field in the
vicinity of the muscle fiber. The time-varying potential recorded by an electrode
placed in the field is known as an electromyogram. Of course, such an electrode
measures the superposition of several such fields arising from separate motor
units. That, coupled with spatially varying tissue filtering effects, makes the EMG
signal highly complex.

FIG-2.3 Motor neuron innervation of muscle fibers. Note that adjacent fibers are
not necessarily innervated by the same motor neuron.

2.4. EMG based Speech Recognition:

Basmajian and De Luca describe electromyography research done before 1985
related to the muscles of the mouth, pharynx, larynx, face, and neck. As it pertains
to speech, the goal of research during that period seems to have been understanding
muscle processes associated with phonation in normal subjects and subjects with
disability. Investigations were carried out predominantly through fine-needle
indwelling electrodes on animals and humans. Although no explicit references have
been found prior to 1985 to attempts at EMG-based speech recognition, the concept
almost surely occurred to researchers of the time; the state of digital computing
and of non-invasive sensing may have been the limiting factors. The first efforts at
performing EMG-based speech recognition seem to have occurred independently
and in parallel in Japan and the United States around 1985-86. In Japan, Sugie et al.
used three channels of silver- silver chloride (Ag- AgCl) surface sensors with a
sampling rate of 1250 samples/channel/s. A threshold-and-counting scheme was used
to produce a three-bit number every 10 ms. These numbers were then fed into a finite
automaton for vowel discrimination. Three subjects were asked to repeat 50
Japanese monosyllables. The overall correct classification rate was reported as 64%.
It is interesting to note that the researchers developed a pilot real-time system as part
of this effort. Simultaneously in the United States, Morse, and Morse and O'Brien,
used four channels of stainless steel surface electrodes (with a light coating of
electrode gel) and a sampling rate of 5120 samples/channel/s. Analog filtering was
used to restrict the bandwidth of the EMG signal to the 100-1000 Hz range. An
average magnitude technique was used to reduce the signal dimensionality to 20
points/channel/s. Two subjects were studied with several different word sets, one of
which was the English words "zero" to "nine". Subjects were asked to repeat each
word twenty times. A maximum likelihood technique was used for classification.
For the ten-digit word set, a correct classification rate exceeding 60% was observed.
In later work in 1991, Morse et al. applied a neural network to a similar data set and
achieved roughly the same correct classification rate of 60%. In 2001, the Canadian
researchers Chan et al. reported EMG-based speech recognition results that were
motivated by the need to communicate in acoustically harsh environments (in this
case the cockpit of a fighter aircraft). Five channels of surface Ag-AgCl sensors were
used with each channel bandlimited to 100-500 Hz and sampled at a rate of 1000
samples/channel/s. A variety of transforms (including a wavelet transform) and
principal component analysis (PCA) were used to reduce the data to thirty features
per word on a ten-word vocabulary (the ten English digits). Classification was
performed using linear discriminant analysis (LDA). On an experiment in which
words were randomly presented to two subjects, recognition rates as high as 93%
were achieved. In later work, a hidden Markov model (HMM) was used as the
classification engine and achieved results similar to the LDA technique. In 2002,
Chan et al. used evidence theory to combine results from a conventional automatic
speech recognition system and an EMG-based one, maintaining a high overall correct
classification rate even in the presence of ambient acoustic noise. At the
NASA Ames Research Center, members of the Neuro-Engineering Laboratory have
done work on subvocal speech recognition. In 2003, Jorgensen et al. collected six
words from three subjects using surface Ag-AgCl sensors and a single EMG
channel. Data were collected at the rate of 2000 samples/channel/s. A variety of
techniques were tested for feature extraction, including short-time Fourier
transforms, linear predictive coding, and several different wavelet transforms.
Classification was performed using a neural network and an average correct
recognition rate of 92% was achieved. In later work, Jorgensen and Binsted applied
a similar signal processing architecture to seventeen vowel phonemes and twenty-three consonant phonemes collected from two subjects. Average correct recognition
exceeded 33% for the entire vocabulary (and exceeded 50% when certain alveolars
were removed). In 2003, NTT DoCoMo researchers Manabe et al. used a novel
surface sensor mounting configuration for EMG-based speech recognition. Three
channels of sensors were mounted on the subject's hand, then the hand was held to
the face during speech. Analog filtering restricted the EMG signal to the range
20-450 Hz with a sampling rate of 1000 samples/channel/s. Recognition was performed
using a three-layer neural network, where the inputs to the network were the root-mean-squared (RMS) EMG values during pronunciation of a vowel. Over three
subjects, each using a vocabulary of five Japanese vowels, the average correct
classification rate exceeded 90%. In later work, Manabe and Zhang made use of
HMMs to classify the ten Japanese digits collected from ten subjects; accuracies as
high as 64% were achieved. In 2004, Kumar et al. used three EMG channels for
speech recognition. Channels were sampled at 250 samples/channel/s, with RMS
EMG values used as feature inputs to a neural network classifier. Using three
subjects and five English vowels, an average recognition rate of up to 88% was
achieved.
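Several of the studies surveyed above (e.g. Manabe et al. and Kumar et al.) fed frame-wise RMS EMG values into a neural network. The sketch below shows one plausible way to compute such features in Python/SciPy; the band edges, sampling rate and frame length are illustrative values drawn from the ranges reported above, not a fixed specification for this project.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def rms_emg_features(emg, fs=1000, band=(20, 450), frame_ms=100):
        # emg: array of shape (num_samples, num_channels).
        # Band-limit each channel, then compute frame-wise RMS values per channel.
        b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, emg, axis=0)             # zero-phase band-pass filter

        frame_len = int(fs * frame_ms / 1000)
        num_frames = len(filtered) // frame_len
        rms = np.empty((num_frames, emg.shape[1]))
        for k in range(num_frames):
            frame = filtered[k * frame_len:(k + 1) * frame_len]
            rms[k] = np.sqrt(np.mean(frame ** 2, axis=0))  # RMS per channel
        return rms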

2.5. Communication in Acoustically Harsh Environments


People have long had an interest in communicating in acoustically noisy
environments. Military needs have driven research and development in this area for
many decades. Much research was done before and during the Second World War
on techniques to allow pilot voice communication in airplanes, resulting in the
development of devices such as throat microphones (e.g., the T-30 throat
microphone, manufactured by Shure Inc. under a 1941 contract). Interest in these
devices continues to this day, particularly when used as part of a multi-modality
speech recognition system. Military research in the area continues, with the United
States Defense Advanced Research Projects Agency (DARPA) sponsoring research
in sensors and techniques appropriate for communicating in noisy environments.
Unfortunately, in many cases first responders have yet to benefit from these
advanced techniques. For many fire departments, voice communication is still done
by shouting through the mask of the SCBA into a shoulder mounted or hand-carried
radio. Some alternatives have been developed and targeted at first responders (e.g.,
bone conduction microphones and in-mask boom microphones) but have yet to
receive wide deployment. Our study suggests that bioelectric techniques also hold
promise for this community, whether on their own or fused with other modalities.
Minimizing the impact of acoustic noise and the potential for covert communication
would make bioelectrics appeal to different segments of the community. It is
important to remember, however, that the first responder community is one that
values dependable and robust equipment and that is almost always forced to be
extremely cost conscious.

3. PROJECT:
3.1. Current Work
In this semester I have mostly studied literature (Research Papers, Journals and
Articles) related to this project. As this project requires deep knowledge in areas
such as Signal Processing and Neural Networks, I have read various books and
watched video tutorials related to these fields.
I have designed the microphone preamplifier circuit. The next step of this Project
will be Data Acquisition and Signal processing. The core idea behind these steps is
the correct recognition of the spoken words, so that we can interact with
wheelchair through speech signal. I have also addressed the different types of
problems which may occur during speech recognition.
Speech recognition is the process of finding an interpretation of a spoken
utterance; typically, this means finding the sequence of words that were spoken.
This involves preprocessing the acoustic signal to parameterize it in a more usable
and useful form. The input signal must be matched against a stored pattern, and a
decision then made to accept or reject the match. No two utterances of the
same word or sentence are likely to give rise to the same digital signal. This obvious
point not only underlies the difficulty of speech recognition but also means that we
must be able to extract more than just a sequence of words from the signal.

The different types of problems are:


DIFFERENCES IN THE VOICES OF DIFFERENT PEOPLE:
The voice of a man differs from that of a woman, which again differs from that of a
baby. Different speakers have different vocal tracts and source physiology.
Electrically speaking, the difference is in frequency: women and babies tend to
speak at higher frequencies than men.

DIFFERENCES IN THE LOUDNESS OF SPOKEN WORDS:


No two persons speak with the same loudness. One person will consistently speak
loudly while another will speak softly. Even if
the same person speaks the same word on two different instants, there is no
guarantee that he will speak the word with the same loudness at the different
instants. The problem of loudness also depends on the distance the microphone is
held from the user's mouth.
Electrically speaking, the problem of difference is reflected in the amplitude of the
generated digital signal.
DIFFERENCE IN THE TIME:
Even if the same person speaks the same word at two different instants of time,
there is no guarantee that he will speak exactly similarly on both the occasions.
Electrically speaking, there is a difference in time, and thus indirectly in
frequency.

PROBLEMS DUE TO NOISE:


A machine faces many problems when trying to imitate this human ability. The audio
range of frequencies extends from 20 Hz to 20 kHz. Some external
noises have frequencies that may be within this audio range. These noises pose a
problem since they cannot be filtered out.

DIFFERENCES IN THE PROPERTIES OF MICROPHONES:


There may be problems due to differences in the electrical properties of different
microphones and transmission channels.

DIFFERENCES IN THE PITCH:


Pitch and other source features such as breathiness and amplitude can be varied
independently.

3.2. Future Work


The future work on this project will comprise its most significant steps: Data
Acquisition and Signal Processing.

Data Acquisition:
Speech signal data will be collected through a standard microphone using a data
acquisition board. The words generally used in the operation of a wheelchair are
"forward", "backward", "right", "left" and "stop". In the offline phase, 100 examples
of each word will be collected from the subject and used to train the classifier.
Signal Processing:
The signal processing activity has two distinct phases. In the first, a training set will
be used to produce a classifier. In the second, the trained classifier will be
presented with previously unseen samples, either for the purpose of testing the
classifier or for producing some end effect. The first three stages are common to
both phases:
1. Signal acquisition
2. Activity detection
3. Feature extraction
Activity detection refers to the process of segmenting an isolated word out of the
continuous signal stream, and feature extraction is the process of reducing the
dimensionality of the data in order to facilitate subsequent classification.
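One simple way to realize the activity detection step is to threshold the short-time energy of the incoming stream and keep the contiguous region above the threshold. The sketch below is only an illustrative assumption (the threshold strategy, frame size and names are my own), not the detector fixed for this project.

    import numpy as np

    def detect_word(signal, fs=16000, frame_ms=20, threshold_ratio=0.1):
        # Segment one isolated word out of a continuous signal stream by
        # short-time energy thresholding; returns (start_sample, end_sample).
        frame_len = int(fs * frame_ms / 1000)
        num_frames = len(signal) // frame_len
        energy = np.array([
            np.sum(signal[k * frame_len:(k + 1) * frame_len] ** 2)
            for k in range(num_frames)
        ])

        threshold = threshold_ratio * energy.max()      # fraction of peak frame energy
        active = np.where(energy > threshold)[0]
        if active.size == 0:
            return None                                  # no word detected
        return active[0] * frame_len, (active[-1] + 1) * frame_len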
After activity detection and feature extraction, features from the training set will
be used to train a neural network classifier. 70% of the collected samples will be
used for training and the remaining 30% will be set aside for generalization testing.
Once trained, the neural network will be inserted into the real-time system for
purposes of testing. The subject would use the real-time system to operate a
wheelchair. The network will be used by the same subject for whom it was trained.
Before being inserted into the real-time system, the network will be checked
against its generalization set to ensure that it was not an outlier in terms of
recognition rate.
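A hedged sketch of the planned 70/30 training and generalization test, using scikit-learn's multilayer perceptron as one possible neural network classifier (the actual architecture and toolkit for this project are not decided here; features and labels stand for the outputs of the previous stages), could look as follows.

    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    def train_word_classifier(features, labels):
        # 70% of the collected samples for training, the remaining 30% set
        # aside for generalization testing, as described above.
        X_train, X_test, y_train, y_test = train_test_split(
            features, labels, test_size=0.30, stratify=labels, random_state=0)

        classifier = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
        classifier.fit(X_train, y_train)

        accuracy = classifier.score(X_test, y_test)   # recognition rate on unseen samples
        return classifier, accuracy

Only if the accuracy on the held-out set is acceptable would the trained network be inserted into the real-time system, matching the check described above.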
In the next stage I will take EMG signals produced from subvocal speech and, after
activity detection and feature extraction, use them to train a neural network
classifier. Once the classifier is trained, the subject will use the real-time system
to communicate through speech EMG.
To increase the robustness of speech recognition in noisy environments, I will
complement the standard microphone signal with this EMG signal. The final stage of
this project will therefore be a joint analysis of the standard microphone signal and
the speech EMG signal.

REFERENCES:

1. B. J. Betts and C. Jorgensen, "Small Vocabulary Recognition Using Surface
Electromyography in an Acoustically Harsh Environment", Interacting with
Computers, Vol. 18, Issue 6, 2006, pp. 1242-1259.
2. C. Jorgensen and K. Binsted, "Web Browser Control Using EMG Based Sub Vocal
Speech Recognition", AI Magazine, Vol. 21, No. 1, 2000, pp. 57-66.
3. Y. Nakajima, H. Kashioka, K. Shikano, and N. Campbell, "Non-Audible Murmur
Recognition Input Interface Using Stethoscopic Microphone Attached to the Skin",
in Proc. ICASSP, Hong Kong, 2003, pp. 708-711.
4. L. C. Ng, G. C. Burnett, J. F. Holzrichter, and T. J. Gable, "Denoising of human
speech using combined acoustic and EM sensor signal processing", in Proc. ICASSP,
Istanbul, Turkey, 2000, pp. I229-I232.
5. S. Kumar et al., "EMG based voice recognition", in Proc. 2004 Intelligent Sensors,
Sensor Networks and Information Processing Conference, 2004, pp. 593-597.
6. W. Gevaert, G. Tsenov, and V. Mladenov, "Neural Networks used for Speech
Recognition", Journal of Automatic Control, Vol. 20, No. 1, pp. 1-7, 2010.
7. A. Lipeika, J. Lipeika, and L. Telksnys, "Development of Isolated Word Speech
Recognition System", Vol. 30, No. 1, pp. 37-46, 2002.
8. S.-C. Jou, T. Schultz, and A. Waibel, "Whispery speech recognition using adapted
articulatory features", in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.
(ICASSP'05), 2005, Vol. 1, pp. 1009-1012.
9. L. Deng, A. Acero, P. Plumpe, and X. Huang, "Large-vocabulary speech
recognition under adverse acoustic environments", in Proc. Int. Conf. Spoken Lang.
Processing (ICSLP'00), Oct. 2000, pp. 806-809.
10. http://neuralnetworksanddeeplearning.com
11. http://www-mobile.ecs.soton.ac.uk/speech_codecs/speech_properties.html
