Voice Recognition System Using Wavelet Transform and Neural Networks
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 13
Abstract— Speech is the natural way people interact with each other, and by voice they can perform, and remotely control, many tasks. This study aims to make a voice recognition system more efficient by transforming the original data through seven levels, each level being one application of the wavelet transform, and then examining which of the seven levels gives the best result. The system is applied to 40 samples that represent eight words. The research is based on recognizing spoken words with Neural Networks over a limited dictionary. The paper begins with an introduction, then presents related work, explains the experiment, and finally presents the conclusion and future work.
Abdulsalam Alarabeyyat is with Al-balqa Applied University, Salt, Jordan.
Moh'd Rasoul Al-Hadidi is with Al-balqa Applied University, Salt, Jordan.
Bayan Alsaaidah is with Al-balqa Applied University, Salt, Jordan.
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 5, MAY 2011, ISSN 2151-9617

1 INTRODUCTION
Every day many people come into this world and make their first sound as they begin their lives, unaware that this sound will later make their lives more comfortable and make it easy to communicate with people and machines.

Speech recognition is the process by which a computer identifies spoken words. When you talk to your computer, it should correctly recognize what you are saying.

Moreover, voice recognition [1] "is the technology by which sounds, words or phrases are spoken by humans that are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned". Sound recognition can be more general than voice recognition, but in this paper we focus on the human voice because it is the most common and most natural means of communicating with humans and machines.

Speech generation and recognition are used for communication between humans and machines. Rather than using our hands and eyes, we can use our mouth and ears. This is very convenient when our hands and eyes are busy with something else, such as driving a car, performing surgery, or firing weapons at the enemy [2].

These days we need to do everything quickly, save time, and work without using our hands, especially when they are busy with something else. To achieve this, we can give an order to a system simply by speaking it, and the order will be carried out. This can be achieved with voice recognition, a process by which human words are converted into electrical signals and these signals are transformed into coding patterns to which meaning has been assigned [3].

Using the voice as an input to a computer is difficult because human speech differs from the traditional forms of computer input. Each human has a different voice, and the same words can have different meanings when spoken in different contexts. Many techniques and methods can be used to overcome these difficulties in a voice recognition system; one of them is the artificial neural network.

Artificial Neural Networks (ANNs) are computer systems made from collections of artificial neurons, modeled on what we know about the nervous system in the human body. They accept a vector of inputs and produce a vector of outputs, and they compute their results in constant time [4]. They are trained by presenting them with input datasets and corresponding correct outputs, and working to minimize the recognition errors by adjusting the weights of the
network [4]. Neural networks are used to design many applications, and speech recognition is one of the most important of them.

The main significant contribution of this research study is the use of neural network tools to design a system based on voice recognition technology, where the voice is processed using the wavelet transform. Neural networks are applicable to many problems and have become popular in recent years. They are a friendly tool in the MATLAB environment when the grammar rules are not known.

Accordingly, the main objective is to build a speech recognition system that reduces some features of the voice, trains neural networks to identify the spoken words, and then finds which wavelet transform level gives the best recognition. In a speech recognition system, each input typically represents one feature of the captured speech signal.

The combination of the voice feature strengths results in an output vector that shows, for example, the likelihood that these inputs represent various phonemes under consideration [4]. The neural network technique is based on training a model to recognize certain voice patterns, so that any word applied to the model that has the same pattern is recognized.

Pattern recognition is the basis of today's voice recognition software. For any application the voice is converted into digital data, which is then compared to information stored in the program's database [5].

The comparison process of the recognition system uses algorithms based on statistical techniques for predictive modeling, such as the Hidden Markov Model (HMM), the Neural Network, or other approaches. The process makes educated guesses about the audio pattern of the voice to predict the words that the user might have spoken [5].

The Discrete Wavelet Transform (DWT) is an orthogonal transform that can be applied to a finite group of data. The DWT and the Discrete Fourier Transform (DFT) are similar in several ways: the transforming function is orthogonal, a signal passed through the transform and then its inverse is unchanged, the input signal is assumed to be a set of discrete-time samples, and both transforms are convolutions [6].

The DWT also indicates where in the signal each frequency component occurs, which is a weakness of the DFT. A wavelet is a little piece of a wave. Whereas the Fourier transform uses sinusoidal waves that repeat to infinity, a wavelet exists only within a finite domain, and its value is zero elsewhere [7].

2 RELATED WORKS
Many studies have addressed voice recognition systems using Artificial Neural Networks (ANNs); the following introduces some of them.

In 1988 Murdock et al. improved speech recognition and synthesis for disabled individuals using a fuzzy neural network. Their system involves three stages: dynamic word wrap matching is used to detect and align candidate words; fuzzy neural-net word recognition is applied to input spectrogram patterns; and a voice synthesizer is used to complete the interactive loop. The system has a recognition accuracy of 95-98% [8].

In 1989 Nakamura and Shikano proposed a speaker adaptation system that appears to update the previous works. The algorithm was applied to Hidden Markov Models (HMMs) and Neural Networks and evaluated using a database of 216 phonetically balanced words and 5240 important Japanese words uttered by three speakers. The HMM speaker-adapted recognition rate for b, d, g was 79.5%. The average recognition rate for the three choices was about 91%. Applying the algorithm to neural networks resulted in almost the same performance [9].

In 1990 Hampshire and Waibel proposed a single-speaker and multispeaker recognition system for the voiced-stop consonants b, d, g using Time-Delay Neural Networks (TDNNs) with a number of enhancements, including a new objective function for training these networks called the Classification Figure of Merit (CFM) [10].

In 1994 speech recognition using neural networks was used to control a robot, as described in [11]. Zhou et al. activated a robot arm controller using the VoiceCommander, which is based on neural networks.

In 1996 Nava and Taylor proposed a system with a Neuro-Fuzzy Classifier (NFC) with excellent classification accuracy to solve the problems of speaker-independent systems. According to the results, the NFC performs better than several existing methods [12].

In 2003 a 2-D phoneme sequence pattern recognition method using a fuzzy neural network was proposed by Kwan and Dong. They used the self-organizing map and learning vector quantization to organize the phoneme
feature vectors of short and long phonemes segmented from speech samples to obtain the phoneme maps. They formed the 2-D phoneme response sequences of the speech samples on the phoneme maps using the Viterbi search algorithm. They then used these 2-D phoneme response sequence curves as inputs to the fuzzy neural network for training and recognition of 0-9 digit-voice utterances [13].

Toyoda et al. proposed a multi-layered perceptron neural network system for environmental sound recognition. Environmental sound recognition depends more on the task of the robot's computer system. The input data was the one-dimensional combination of the instantaneous spectrum at the power peak and the power pattern in the time domain. Two experiments were conducted, one using an original database and one using a created database. The recognition rate for 45 environmental sound data sets was about 92%. They found that the new method is fast and simple compared to HMM-based methods, and suitable for an on-board system of a robot for home use, e.g. a security monitoring robot or a home-helper robot [14].

In 2007, Soltani and Ainon proposed an experimental study on six emotions: happiness, sadness, anger, fear, neutral and boredom. The experiment used speech fundamental frequency, formants, energy and voicing rate as extracted features. The features were selected manually for different experiments in order to get the best results. These features were combined into feature vectors of different sizes as input for different neural network classifiers. The database used for this experiment is the Berlin Database of Emotional Speech [15].

Al-Alaoui et al. implemented a new pattern classification method in which Neural Networks are trained using the Al-Alaoui Algorithm. The proposed speech recognition system was part of the Teaching and Learning Using Information Technology (TLIT) project, which would implement a set of reading lessons to assist adult illiterates in developing better reading capabilities. They compared two different methods for automatic Arabic speech recognition for isolated words and sentences. The results showed that the Al-Alaoui Algorithm is better than the HMM in the prediction of both words and sentences [16].

Onishi et al. proposed their system in 2009. They constructed an individual identification system with three-layered neural networks. The voice signals were preprocessed by the Fast Fourier Transform (FFT) and then used as input data for the neural networks with a back-propagation learning algorithm. The study concluded that the performance of the neural network depended on pronunciation, and that three-layered neural networks were effective for individual identification using voice patterns [17].

In 2010 Shahgoshtasbi proposed a system that improves the quality of speech recognition systems. The system has two parts. The first part filters the input signal and packs it, then takes the average of three packets as an identification of the signal and sends it to the second part. The second part, which is based on the human auditory cortex, is an associative neural network that maps the input set to a desired output set. In experiments this system was able to recognize a word even when it was noisy [18].

3 EXPERIMENT
The design of the proposed system is based on preprocessing the wave signal with the wavelet transform, and on the ANNs that are designed to train the system to recognize the samples.

3.1 Recording the Voice
The first step of the voice recognition system is recording the words that will be recognized by the system. The recording can be made in several ways. The first is the Sound Recorder found in the Windows accessories. The second is the audio recorder in any program that takes the inputs (N, Fs, CH), that is, it records N audio samples at Fs Hertz from CH input channels and produces a WAVE recording as output. The third is recording the voice in the MATLAB environment by entering a list of commands in the command window. In this system we record 8 words; each one is recorded 5 times, so we have 40 voice samples. Table 1 summarizes the recorded words in both Arabic and English.

TABLE 1
THE WORDS IN THE RECOGNITION SYSTEM
English words    Arabic words
Open             eftah
Close            egleg
Right            yameen
Left             yasar

3.2 Analysis of voice signal
The designed system works on a limited vocabulary that consists of 8 words. Each word is recorded with the input parameters (44100 Hz, 16 bits, stereo). It was read with this
Figure 2: The topology of one ANN
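The conclusion of this paper states that each ANN is a multilayer perceptron trained with the backpropagation algorithm. A minimal sketch of that kind of network is given here for illustration only. The layer sizes, learning rate, sigmoid activation, squared-error loss and toy data are all assumptions, since the surviving text does not state the parameters used, and the code is plain Python rather than the MATLAB tooling used in the study.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class MLP:
    """One-hidden-layer perceptron trained with plain backpropagation (illustrative)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds one neuron's weights; the last entry is its bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.w1]
        y = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in self.w2]
        return h, y

    def train_step(self, x, target, lr=0.5):
        h, y = self.forward(x)
        # Output-layer error terms for squared error with sigmoid units.
        d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
        # Hidden-layer error terms, propagated back through w2 before it is updated.
        d_hid = [hj * (1.0 - hj) * sum(d_out[k] * self.w2[k][j] for k in range(len(y)))
                 for j, hj in enumerate(h)]
        for k, row in enumerate(self.w2):
            for j, v in enumerate(h + [1.0]):
                row[j] -= lr * d_out[k] * v
        for j, row in enumerate(self.w1):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * d_hid[j] * v
        # Squared error measured before this update.
        return sum((yk - tk) ** 2 for yk, tk in zip(y, target))

# Hypothetical toy data: 8 "words", each with an 8-value feature vector (not the paper's data).
inputs = [[1.0 if i == k else 0.0 for i in range(8)] for k in range(8)]
targets = [[1.0 if i == k else 0.0 for i in range(8)] for k in range(8)]

net = MLP(n_in=8, n_hidden=10, n_out=8)
first_loss = sum(net.train_step(x, t) for x, t in zip(inputs, targets))
for _ in range(300):
    last_loss = sum(net.train_step(x, t) for x, t in zip(inputs, targets))

_, scores = net.forward(inputs[3])       # one score per word
predicted = scores.index(max(scores))    # index of the highest-scoring word
```

The output vector has one entry per word, and the recognized word is taken as the entry with the highest score. This matches the description of an ANN as mapping a vector of inputs to a vector of outputs while training adjusts the weights to reduce the recognition error.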
Figure 5: The Original Sample
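The system applies the DWT to each original sample seven times, producing a smaller signal at every level. The sketch below illustrates that repeated decomposition in plain Python. The Haar wavelet is an assumption made for simplicity, since the surviving text does not name the wavelet family, and the 1024-sample tone is a hypothetical stand-in for a recorded word.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:
        # Pad odd-length input by repeating the last sample.
        signal = signal + [signal[-1]]
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_levels(signal, levels=7):
    """Return the approximation coefficients after each of `levels` passes."""
    out = []
    current = list(signal)
    for _ in range(levels):
        current, _detail = haar_step(current)
        out.append(current)
    return out

# Hypothetical input: 1024 samples of a 440 Hz tone at 44100 Hz (not the paper's data).
fs = 44100
sample = [math.sin(2 * math.pi * 440 * n / fs) for n in range(1024)]
levels = haar_levels(sample)
lengths = [len(a) for a in levels]
print(lengths)  # [512, 256, 128, 64, 32, 16, 8]
```

Each level halves the number of coefficients, so seven levels reduce N samples to about N/128. In the study, the coefficients at each level feed that level's own network.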
Figure 4: The Regression of ANN
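A regression plot for a neural network usually compares the network outputs with the target values and summarizes the fit with a correlation coefficient R. Assuming that is what Figure 4 reports, R can be computed as sketched below. The output and target values are hypothetical and are not taken from the paper.

```python
import math

def pearson_r(outputs, targets):
    """Pearson correlation coefficient between network outputs and targets."""
    n = len(outputs)
    mean_o = sum(outputs) / n
    mean_t = sum(targets) / n
    cov = sum((o - mean_o) * (t - mean_t) for o, t in zip(outputs, targets))
    spread_o = math.sqrt(sum((o - mean_o) ** 2 for o in outputs))
    spread_t = math.sqrt(sum((t - mean_t) ** 2 for t in targets))
    return cov / (spread_o * spread_t)

# Hypothetical values for illustration only.
targets = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
outputs = [0.1, 0.9, 0.2, 0.8, 0.7, 0.1]
r = pearson_r(outputs, targets)
```

A value of R close to 1 means the outputs track the targets closely.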
4 CONCLUSION
In this study, the multilayer perceptron is used as the structure of the ANN, and the backpropagation training algorithm is used to train the developed ANN.

The proposed system applied the DWT to the original data seven times to convert it to smaller data. Each level of this transformation has an individual network with the same parameters except for the input. Recognition is performed with the MATLAB function sim, which simulates the trained network on a sample and so indicates how closely the sample matches the trained samples.

The testing process shows that the seventh level gives the highest recognition rate. This confirms that the DWT is an efficient approach: it compresses the data and reduces its features, which makes the recognition process faster and improves the accuracy of the system. The accuracy of this system is 80%-100% depending on the sample, and the overall accuracy is 90%. The performance of the ANN at the seventh level is very high, at 99%.

5 FUTURE WORKS
Several directions are recommended to enhance the voice recognition system using ANNs. The first is improving the accuracy of the system by training the ANNs on more data, or by taking a specific duration of the voice sample to reduce the data and eliminate all unnecessary durations. The second is using sentences, not only words, in the training and recognition processes. The third is comparing the accuracy of the system when applied to female voices with its accuracy when applied to male voices. Finally, the voice recognition system could be applied to another type of ANN.

REFERENCES
[1] B. Juang, L. Rabiner, "Fundamentals of speech recognition", PTR Prentice-Hall, Inc., A Simon and Schuster company, 1993.
[2] www.physics.otago.ac.nz/internal/elec401/dsp-smith/ch01.pdf. Accessed on 10-12-2010.
[3] R. Adams, "Sourcebook of automatic identification and data collection", Van Nostrand Reinhold, New York, 1990.
[4] D. Colton, "Automatic speech recognition tutorial", 2003.
[5] Clareity, "Voice Recognition Technology The Perfect Computer Interface for the Real Estate Industry", Clareity Consulting & Communications, Inc., 2004. Accessed on 22-1-2011.
[6] T. Edwards, "Discrete wavelet transforms: Theory and implementation", Technical report, Stanford University, 1991.
[7] www.thepolygoners.com/tutorials/dwavelet/dwttut.html. Accessed on 20-2-2011.
[8] R. Murdock, J. Husseiny, A. Liang, E. Abolrous, S. Rodriguez, "Improvement on speech recognition and synthesis for disabled individuals using fuzzy neural net retrofits", Neural Networks, IEEE International Conference on 24-27 Jul, 1988.
[9] S. Shikano, K. Nakamura, "Speaker adaptation applied to HMM and neural networks", Acoustics, Speech, and Signal Processing, ICASSP-89., International Conference on 23-26 May, 1, 1989.
[10] J.B. Hampshire II, A.H. Waibel, "A novel objective function for improved phoneme recognition using time-delay neural networks", Neural Networks, IEEE Transactions, V 2(216-228), 1990.
[11] K. Ng, Y. Zhou, R. Ng, "A voice controlled robot using neural network", Intelligent Information Systems. Second Australian and New Zealand Conference on 29 Nov-2 Dec, 1994.
[12] P.A. Taylor, J.M. Nava, "Speaker independent voice recognition with a fuzzy neural network", Fuzzy Systems, Proceedings of the Fifth IEEE International Conference on 8-11 Sep, 3, 1996.
[13] H. Dong, X. Kwan, "Phoneme sequence pattern recognition using fuzzy neural network", Neural Networks and Signal Processing, Proceedings of the 2003 International Conference on 14-17 Dec., 1, 2003.
[14] S. Ding, Y. Liu, Y. Toyoda, J. Huang, "Environmental sound recognition by multilayered neural networks", Computer and Information Technology, CIT '04. The Fourth International Conference on 14-16 Sept., 2004.
[15] K. Ainon, R. Soltani, "Speech emotion detection based on neural networks", Signal Processing and Its Applications. ISSPA 2007. 9th International Symposium on 12-15 Feb., 2007.
[16] J. Azar, E. Yaacoub, M. Al-Alaoui, L. Al-Kanj, "Speech recognition using artificial neural networks and hidden markov models", In IMCL2008 Conference, 2008.
[17] A. Hasegawa, H. Kinoshita, K. Kishida, S. Onishi, S. Tanaka, "Construction of individual identification system using voice in three-layered neural networks", Intelligent Signal Processing and Communication Systems. ISPACS 2009. International Symposium on 7-9 Jan., 2009.
[18] D. Shahgoshtasbi, "A biological speech recognition system by using associative neural networks", World Automation Congress (WAC), 2010.
[19] V. Sandrasegaran, K. Venayagamoorthy, G.K. Moonasar, "Voice recognition using neural Networks", Communications and Signal Processing, COMSIG '98. Proceedings of the 1998 South African Symposium on 7-8 Sep, 1998.