
Speech Recognition using Neural Network

1st Lakshya Jain
Dept. of Electronics
K.J. Somaiya College of Engineering
Mumbai, India
lakshya.jain@somaiya.edu
1812082

2nd Neha Mane
Dept. of Electronics
K.J. Somaiya College of Engineering
Mumbai, India
neha.mane@somaiya.edu
1812088

Abstract—Speech Recognition is a process that enables computers to identify and respond to human speech sounds. This project analyzes feature extraction techniques applied in speech recognition; even with the presence of numerous techniques, the accuracy rate remains a central issue in speech recognition. In this article we present some well-known extraction techniques, such as LPC, MFCC, RASTA, PCA, LDA, and PLP, and identify the most commonly used feature extraction techniques in the speech recognition process.
Index Terms—Speech Recognition, Feature Extraction, LPC, RASTA, MFCC, PCA, LDA, PLP

I. INTRODUCTION

Speech is one of the most ancient ways to express ourselves. Today, speech signals are also used in biometric recognition technologies and for communicating with machines. These speech signals are slowly time-varying (quasi-stationary) signals. When examined over a sufficiently short period of time (5-100 msec), their characteristics are fairly stationary; but when the signal characteristics change over a longer period, they reflect the different speech sounds being spoken. The information in a speech signal is actually represented by the short-term amplitude spectrum of the speech waveform. This allows us to extract features from speech based on the short-term amplitude spectrum. Speech recognition is the ability of a machine or program to identify words and phrases from spoken language and convert them into a machine-readable format. The main intention of the speech recognition area is to evolve techniques and systems for speech input to machines. Speech is the primary way of communication among human beings, and it is the dominance of this medium that motivates research efforts to allow speech to become a viable means of human-computer interaction. Feature extraction is the process of removing unwanted and redundant information while retaining only the information useful for speaker-independent automatic speech recognition.

Fig. 1. Feature Extraction

II. LITERATURE SURVEY

Ganga Banavath and Sreedhar Potla aimed to describe an efficient technique or method resulting in effective speech processing applications. The performance of a classifier is a function of the length of the speech sample, the environment, etc. They carried out their work using Mel frequency cepstral coefficients (MFCC) along with a Gaussian Mixture Model (GMM) classifier.

Kishori R. Ghule and R. R. Deshmukh recognized that speech recognition follows two steps: speech feature extraction and word recognition. After feature extraction, feature matching is performed for word recognition. Their paper describes different feature extraction techniques such as MFCC, LPC, LPCC, and DWT.

P. V. Naresh examined feature extraction techniques applied in speech recognition; even with the existence of many techniques, the accuracy percentage is a key issue in speech recognition. In their article they presented some well-known extraction techniques such as LPC, MFCC, RASTA, PCA, LDA, and PLP, and identified the most used feature extraction techniques in the speech recognition process.

Urmila Shrawankar observed that features should describe each segment in such a characteristic way that other similar segments can be grouped together by comparing their features. There are enormously many interesting and exceptional ways to describe the speech signal in terms of parameters.

Anup Vibhute focused on a survey of various feature extraction techniques in speech processing, such as Fast Fourier Transforms, Linear Predictive Coding, Mel Frequency Cepstral Coefficients, Discrete Wavelet Transforms, Wavelet Packet Transforms, and the hybrid DWPD algorithm, together with their applications in speech processing.
III. FEATURE EXTRACTION TECHNIQUES

A. Linear prediction coding (LPC)
LPC is one of the good signal analysis methods for linear prediction in the speech recognition process. Feature extraction techniques find the basic parameters of speech, and LPC is a powerful method for determining these basic parameters and a computational model of speech. The idea behind LPC is that a speech sample can be approximated as a linear combination of past speech samples. [1]
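To make this concrete, the following is a minimal sketch (our illustration, not code from the paper) of the autocorrelation method with the Levinson-Durbin recursion in NumPy; the predictor order and input frame are placeholders.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    # Autocorrelation method: r[0..order] of a (windowed) speech frame.
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    # Levinson-Durbin recursion for A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        prev = a[1:i].copy()
        a[1:i] = prev + k * prev[::-1]
        a[i] = k
        err *= 1.0 - k * k  # prediction error shrinks at each order
    # Predictor weights w such that s[n] is approximated by sum_i w[i-1]*s[n-i].
    return -a[1:]
```

In practice the frame would be a 20-40 ms Hamming-windowed segment of the speech signal.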
B. Mel frequency Cepstral Coefficient (MFCC)
MFCC is the most popular feature extraction technique. Frequency bands are placed logarithmically here, so it approximates the response of the human auditory system more closely than any other system. Owing to the low complexity of implementing the feature extraction algorithm, only sixteen MFCC coefficients, corresponding to the Mel scale frequencies of the speech cepstrum, are extracted from the spoken word samples in the database. [2][3]

Fig. 2. Block Diagram of MFCC
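As an illustration, a sixteen-coefficient MFCC matrix can be obtained in a few lines; this sketch assumes the librosa package is available, and the file name is hypothetical.

```python
import librosa

# Load a spoken-word sample (file name is a placeholder) at 16 kHz mono.
y, sr = librosa.load("word.wav", sr=16000)
# Sixteen MFCCs per frame, matching the setup described above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=16)
print(mfcc.shape)  # (16, number_of_frames)
```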
C. Linear prediction cepstral coefficient (LPCC)
Feature extraction is used to represent the speech signal by a finite number of measures of the signal. LPC coding is used to obtain the LPCC coefficients, and LPCC is implemented using the autocorrelation method. The main drawback of LPCC is that the coefficients are highly sensitive to quantization noise. [4]

Fig. 3. Block Diagram of LPCC
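The conversion from LPC predictor coefficients to cepstral coefficients is commonly done with the standard recursion c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k}; the sketch below is our illustration of that recursion, not the paper's code.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    # a[k-1] are predictor weights from s[n] ~ sum_k a[k-1] * s[n-k].
    p = len(a)
    c = np.zeros(n_ceps)
    for m in range(1, n_ceps + 1):
        acc = sum((k / m) * c[k - 1] * a[m - k - 1]
                  for k in range(max(1, m - p), m))
        c[m - 1] = (a[m - 1] if m <= p else 0.0) + acc
    return c
```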
D. Discrete Wavelet Transform (DWT)
The DWT can be considered a filtering process achieved by a low-pass scaling filter and a high-pass wavelet filter. The transform decomposition separates the lower-frequency and higher-frequency contents of the original signal. The lower-frequency contents provide a sufficient approximation of the signal, while the finer details of the variation are contained in the higher-frequency contents. [3]
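A single DWT level is a one-liner with the PyWavelets package; the signal and the db4 wavelet here are placeholders.

```python
import numpy as np
import pywt

signal = np.random.randn(16000)   # stand-in for one second of speech
# Low-pass approximation (cA) and high-pass detail (cD) coefficients.
cA, cD = pywt.dwt(signal, "db4")
# Multi-level decomposition: the approximation is split repeatedly.
coeffs = pywt.wavedec(signal, "db4", level=3)
```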
E. Wavelet Packet Decomposition (WPD)
Wavelet packet decomposition is a generalization of the wavelet transform: in WPD the signal is passed through more filters than in the discrete wavelet transform. Wavelet packets are linear combinations of wavelets. The coefficients in the linear combinations are computed by a recursive algorithm, making each newly computed wavelet packet coefficient sequence the root of its own analysis tree. [5]
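For comparison with the DWT sketch above, a wavelet packet tree in PyWavelets splits the detail branches as well; again the signal and wavelet are placeholders.

```python
import numpy as np
import pywt

signal = np.random.randn(16000)
wp = pywt.WaveletPacket(data=signal, wavelet="db4", maxlevel=3)
leaves = wp.get_level(3, order="freq")   # all 2**3 = 8 subbands at level 3
subbands = [node.data for node in leaves]
```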
F. Perceptual Linear Prediction (PLP)
The Perceptual Linear Prediction (PLP) technique was developed by Hermansky. PLP removes unwanted information from the speech and thus improves the speech recognition rate. PLP is identical to LPC except that its spectral characteristics have been transformed to match the characteristics of the human auditory system.

IV. SPEECH RECOGNITION TECHNIQUES

A. Acoustic phonetic approach
This approach to speech recognition is based on finding speech sounds and providing appropriate labels to these sounds. The term acoustic deals with the different sounds in speech, while phonetics is the study of the phonemes of a language. The acoustic phonetic approach rests on the fact that there exist finite and exclusive phonemes in a spoken language, and that these phonemes are broadly characterized by a set of acoustic properties demonstrated in the speech signal over time. [6]

B. Pattern recognition approach
The pattern recognition technique is one of the most important and actively researched branches of artificial intelligence. Pattern training and pattern comparison are the two steps involved in the pattern recognition approach. The essential feature of this approach is the use of a well-formulated mathematical framework that establishes consistent speech pattern representations for pattern comparison from a set of labeled training samples via a formal training algorithm. [7][8]

C. Artificial intelligence approach
Artificial intelligence involves two basic ideas: studying the thought processes of humans, and representing those processes via machines. Because the machine behaves like a human being, it is called artificial intelligence; AI makes machines smarter and more useful. The artificial intelligence approach is a combination of the pattern recognition approach and the acoustic phonetic approach, which is why it is called a hybrid approach; its recognition procedure follows the way a person applies intelligence to a set of measured acoustic features. [9]
V. METHODOLOGY

We followed these steps to implement our model:

1) We imported the necessary modules and dependencies such as pathlib and os.
2) We downloaded a portion of the Speech Commands dataset. The original dataset consists of over 105,000 WAV audio files of people saying thirty different words. We use a portion of the dataset to save time with data loading.
3) We selected audio files of the words "down", "go", "left", "no", "right", "stop", "up" and "yes", spoken in different accents.
4) We checked the basic statistics and commands of the dataset, extracted the audio files into a list, and shuffled it.
5) We then split the files into training, validation and test sets using an 80:10:10 ratio, respectively.
6) Each audio file is initially a binary file, which we have to convert into a numerical tensor. We use a function which returns the WAV-encoded audio as a tensor together with the sample rate (see the sketch after this list).
7) We then examined a few audio waveforms with their corresponding labels (Fig. 4).
8) We converted each waveform into a spectrogram, which shows frequency changes over time and can be represented as a 2D image. This is done by applying the short-time Fourier transform (STFT) to convert the audio into the time-frequency domain (also shown in the sketch after this list).
9) Next we explored the data and compared the waveform, the spectrogram and the actual audio of one example from the dataset (Fig. 5).
10) We then built and trained a model. For the model, we use a simple convolutional neural network (CNN), since we have transformed the audio files into spectrogram images (the sketch after this list includes an illustrative architecture).
11) We ran the model on the test set and checked its performance, observing an accuracy of 90%.
12) Finally, we verified the model's prediction output using an input audio file of someone saying "no". We can see that our model clearly recognized the audio command as "no". Because the "no" and "go" audio commands are very similar in terms of syllables, the rate of predicting "no" as "go" is high; even humans can make mistakes differentiating between the two.

Fig. 4. Audio waveforms with their labels

Fig. 5. Spectrogram of the Audio files

Fig. 6. Training and validation loss curve
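The sketch below illustrates steps 6, 8 and 10 under stated assumptions (TensorFlow 2.x, 16 kHz one-second clips, eight labels); the STFT parameters follow common practice and the layer sizes are illustrative, not necessarily the exact model we trained.

```python
import tensorflow as tf

def decode_audio(file_path):
    # Step 6: read the binary WAV file and decode it into a float32
    # waveform tensor in [-1, 1], returned together with its sample rate.
    audio_binary = tf.io.read_file(file_path)
    waveform, sample_rate = tf.audio.decode_wav(audio_binary, desired_channels=1)
    return tf.squeeze(waveform, axis=-1), sample_rate

def get_spectrogram(waveform, length=16000):
    # Step 8: zero-pad every clip to one second, apply the STFT and keep
    # the magnitude, yielding a 2-D time-frequency "image".
    waveform = tf.cast(waveform[:length], tf.float32)
    zero_padding = tf.zeros([length] - tf.shape(waveform), dtype=tf.float32)
    waveform = tf.concat([waveform, zero_padding], 0)
    spectrogram = tf.abs(tf.signal.stft(waveform, frame_length=255, frame_step=128))
    return spectrogram[..., tf.newaxis]  # add a channel axis for the CNN

def build_model(input_shape, num_labels=8):
    # Step 10: a simple CNN over spectrogram images.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_labels),  # logits over the eight commands
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    return model
```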
VI. RESULTS AND ANALYSIS

We trained the model on a dataset containing 6,400 audio files, and the model gives good results: we reached a peak accuracy of 90%. This accuracy was achieved with a basic convolutional neural network. We plotted the training curves, which show that the accuracy rate increases while the loss rate decreases steadily (Fig. 6). We can experiment further with the model by increasing the number of epochs, finding more optimal learning rates, applying more data augmentation techniques, and so on.

Fig. 7. Confusion Matrix

Fig. 8. Predictions for "NO"

VII. CONCLUSION

We have successfully designed a speech recognition model trained to identify eight different words spoken by the user. We observe that the model gives an accuracy of about 90% when given around 8,000 audio files. This model will eventually give better results when provided with more audio files and more time to train.
REFERENCES

[1] Leena R. Mehta, S. P. Mahajan, and Amol S. Dabhade, "Comparative Study of MFCC and LPC for Marathi Isolated Word Recognition System," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 2, no. 6, June 2013.
[2] Vimal Krishnan V. R. and Babu Anto P., "Features of Wavelet Packet Decomposition and Discrete Wavelet Transform for Malayalam Speech Recognition," International Journal of Recent Trends in Engineering, vol. 1, no. 2, May 2009.
[3] Hazrat Ali, Nasir Ahmad, Xianwei Zhou, Khalid Iqbal, and Sahibzada Muhammad Ali, "DWT features performance analysis for automatic speech recognition of Urdu," SpringerPlus, 2014.
[4] Nidhi Desai, Kinnal Dhameliya, and Vijayendra Desai, "Feature Extraction and Classification Techniques for Speech Recognition: A Review," International Journal of Emerging Technology and Advanced Engineering, vol. 3, no. 12, December 2013.
[5] M. A. Anusuya and S. K. Katti, "Comparison of Different Speech Feature Extraction Techniques with and without Wavelet Transform to Kannada Speech Recognition," International Journal of Computer Science and Information Security, vol. 6, no. 3, 2010.
[6] Santosh K. Gaikwad and Bharti W. Gawali, "A Review on Speech Recognition Technique," International Journal of Computer Applications, vol. 10, no. 3, November 2010.
[7] Shivanker Dev Dhingra, Geeta Nijhawan, and Poonam Pandit, "Isolated Speech Recognition Using MFCC and DTW," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 2, no. 8, August 2013.
[8] M. A. Anusuya, "Speech Recognition by Machine," International Journal of Computer Science and Information Security, vol. 6, no. 3, 2009.
[9] Nidhi Srivastava and Harsh Dev, "Speech Recognition using MFCC and Neural Networks," International Journal of Modern Engineering Research (IJMER), March 2007.

