
Automatic Speech Recognition using Correlation Analysis

Arnab Pramanik¹, Rajorshee Raha²
¹·²Heritage Institute of Technology, Kolkata
¹arnabpramanik.ece@gmail.com
²rajorshee87@gmail.com



Abstract: The growth in wireless communication and mobile devices has supported the development of speech recognition systems. For any speech recognition system, feature extraction and pattern matching are two very significant tasks. In this paper we develop a simple algorithm for matching patterns to recognize speech, using Mel frequency cepstral coefficients (MFCCs) as the features of the recorded speech. The algorithm is implemented simply by using the principle of correlation. All the simulation experiments were carried out using MATLAB, where the method produced relatively good results. This paper gives a detailed introduction to recorded speech processing, design considerations and evaluation results.

Keywords: Automatic Speech Recognition (ASR), Mel-frequency cepstral coefficients (MFCCs), Discrete Fourier Transform (DFT), Fast Fourier Transform (FFT), Mel-frequency cepstrum (MFC).

I. INTRODUCTION
The basic principle of an Automatic Speech Recognition (ASR) system is to recognize spoken words irrespective of the speaker and to convert them to text. The growth in wireless communication and mobile devices has given us broad access to a pool of different information resources and services, and ASR is a key component of such access. Present mobile devices have limited memory and processing capacity, which adds several challenges to ASR. As a result, ASR systems executing on mobile devices support only low-complexity recognition tasks such as simple name dialing.
In this paper we present a simple, low-complexity algorithm for recognizing spoken words. Much research has been devoted to pattern recognition for ASR; in this paper we use Mel frequency cepstral coefficients as the features of the speech sample, and the proposed algorithm uses the principle of correlation to recognize the spoken word.
In communication and signal processing, cross-correlation is a measure of the similarity of two waveforms as a function of a time lag applied to one of them. This is also known as a sliding dot product or sliding inner product. It is commonly used for searching a long signal for a shorter, known feature. It also has applications in pattern recognition, single particle analysis, electron tomographic averaging, cryptanalysis, and neurophysiology.
For continuous functions f and g, the cross-correlation is defined as

(f ⋆ g)(τ) = ∫ f*(t) g(t + τ) dt

where f*(t) denotes the complex conjugate of f(t) and the integral runs over all t.
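The discrete analogue of this definition is what our pattern matching relies on. The following sketch illustrates the sliding inner product and how its peak locates a known feature inside a longer signal (the paper's experiments were done in MATLAB; this is an equivalent NumPy sketch with hypothetical toy signals):

```python
import numpy as np

# Hypothetical short signals used only to illustrate the sliding-product idea.
f = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
g = np.array([1.0, 2.0, 1.0])

# np.correlate slides g over f; mode="full" returns every lag, which is the
# discrete analogue of the integral definition above (real signals, so the
# complex conjugate has no effect here).
xcorr = np.correlate(f, g, mode="full")

# The lag with the maximum value is where g best aligns with f.
best_lag = np.argmax(xcorr) - (len(g) - 1)
print(xcorr, best_lag)
```

The maximum of the correlation sequence marks the alignment of the known feature, which is exactly the property exploited by the recognizer later in this paper.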



Keeping this principle in mind, we implemented our pattern recognition algorithm to recognize spoken words efficiently but in a simple manner.
• Automatic speech recognition implementation:
A speech recognition system can be roughly divided into three parts, namely data acquisition and processing, feature extraction, and pattern matching, as shown in Fig. 1.









Fig. 1 Basic block diagram of the Speech Recognition System

• Data acquisition and processing:
The first step is data acquisition, which consists of recording a voice signal band-limited to 300 Hz–3400 Hz. The sampling rate taken is 8000 Hz, in keeping with the Nyquist sampling theorem. After the data acquisition, the recorded signal is filtered using a band pass filter (see Fig. 4). After that, noise cancellation is applied to denoise the filtered signal; we have used the wavelet transform to denoise the signal.
After the noise cancellation, the uttered speech is selected using the end point detection technique. End point detection determines the position of the spoken speech signal in the time series. In the end-point detection method, it is often
670 978-1-4673-4805-8/12/$31.00 © 2012 IEEE
assumed that during several frames at the beginning of the incoming speech signal the speaker has not said anything. Those frames therefore capture only the silence or the background noise.
To detect the speech over the background noise, the concept of thresholding is used, often based on a power threshold which is a function of time. For this, frame powers are calculated and the threshold is set from the noise frames. The frames above the calculated threshold are kept and the others discarded, leaving only the high-powered frames which contain the speech. Proper end point detection requires proper calculation of the threshold.
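A minimal NumPy sketch of this power-threshold end-point detection, assuming 10 msec frames at 8 kHz and a hypothetical safety factor k on the measured noise power (the paper does not state its exact threshold rule):

```python
import numpy as np

def endpoint_detect(x, frame_len=80, noise_frames=10, k=3.0):
    """Keep frames whose power exceeds a threshold estimated from the first
    `noise_frames` frames, assumed to contain only background noise.
    frame_len=80 is 10 ms at 8 kHz; k is an assumed safety factor."""
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    power = (frames ** 2).mean(axis=1)          # per-frame power
    threshold = k * power[:noise_frames].mean() # set from the noise frames
    return power > threshold                    # boolean mask of speech frames

# Toy example: silence, then a louder "utterance", then silence again.
rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(1600),
                      0.5 * rng.standard_normal(800),
                      0.01 * rng.standard_normal(1600)])
mask = endpoint_detect(sig)
print(mask.sum())  # number of frames kept
```

Only the ten loud frames in the middle survive the threshold, which is the behaviour the text describes.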
After that, the signal is pre-emphasized to avoid overlooking the high frequencies. In this step the pre-emphasized speech signal is segmented into 10 msec frames with 5 msec overlaps for short-time spectral analysis. Each frame is then multiplied by a fixed-length window. Window functions are signals that are concentrated in time, often of limited duration N₀. Here the Hamming window is used to taper the signals to quite small values (nearly zero) at the beginning and end of each frame, minimizing the signal discontinuities at the edges of each frame.
The Hamming window is defined as:

h[n] = 0.54 − 0.46 cos(2πn / (N₀ − 1)), 0 ≤ n ≤ N₀ − 1  (1)

The output of the Hamming window can be expressed as

Y[n] = h[n] · s[n], 0 ≤ n ≤ N₀ − 1  (2)
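The framing and windowing step can be sketched as follows, assuming 10 msec frames with a 5 msec hop at 8 kHz; numpy.hamming implements exactly the h[n] of equation (1):

```python
import numpy as np

fs = 8000
frame_len = int(0.010 * fs)     # 10 ms frames, as in the text
hop = int(0.005 * fs)           # 5 ms overlap -> 5 ms hop
window = np.hamming(frame_len)  # 0.54 - 0.46*cos(2*pi*n/(N0-1))

def frame_and_window(x):
    """Split x into overlapping frames and taper each with a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    # Index matrix: one row of sample indices per frame
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * window

frames = frame_and_window(np.ones(800))  # 100 ms of dummy signal
print(frames.shape)
```

Each row of the result is one tapered frame, ready for the FFT of the next step.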
After that, the speech signal is converted from the time domain to the frequency domain (see Fig. 5) by using the Fast Fourier Transform (FFT), which computes

X[k] = Σ_{n=0}^{N−1} x[n] e^(−j2πkn/N),  k = 0, 1, …, N − 1  (3)

If the number of FFT points, N, is larger than the frame size N₀, then N − N₀ zeros are usually inserted after the N₀ speech samples.
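The zero insertion happens automatically when an FFT routine is given a transform size larger than the frame, as this small sketch shows (the frame of ones is a placeholder):

```python
import numpy as np

frame = np.ones(80)   # a 10 ms frame at 8 kHz (N0 = 80 samples)
N = 256               # FFT size larger than the frame

# np.fft.fft zero-pads the frame to N points, i.e. appends N - N0 zeros,
# exactly as described in the text.
spectrum = np.fft.fft(frame, n=N)
print(spectrum.shape)
print(abs(spectrum[0]))  # the DC bin is the sum of the samples
```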
• Feature Extraction:
After the FFT, the FFT coefficients pass through the mel filterbank. A popular formula to convert f hertz into m mel is:

m = 2595 log₁₀(1 + f / 700)  (4)

and the reverse is:

f = 700 (10^(m/2595) − 1)  (5)

The filter bank has M filters (m = 1, 2, …, M), where filter m is the triangular filter given by:

H_m(k) = 0                                  for k < f(m−1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1))     for f(m−1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m))     for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0                                  for k > f(m+1)  (6)

If we let fl and fh be the lowest and highest frequencies of the filterbank in Hz, Fs the sampling frequency in Hz, M the number of filters and N the size of the FFT, the center frequency f(m) of the m-th filter is

f(m) = (N / Fs) B⁻¹( B(fl) + m (B(fh) − B(fl)) / (M + 1) )  (7)

where B(·) is the hertz-to-mel mapping of (4) and B⁻¹(·) its inverse (5).
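Equations (4) and (5) translate directly into code; a quick sanity check is that 1000 Hz maps to approximately 1000 mel by construction of the scale:

```python
import numpy as np

def hz_to_mel(f):
    # Eq. (4): m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Eq. (5): f = 700 * (10**(m / 2595) - 1)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))            # approximately 1000 mel
print(mel_to_hz(hz_to_mel(440.0)))  # the round trip recovers 440 Hz
```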
A comparison of the normal hertz scale with the Mel scale is shown in the next figure (Fig. 2).

Fig 2. The x axis shows the frequency range (fl and fh taken as 0 Hz and 4 kHz) over which the Mel frequency is calculated, and also the Mel filter bank frequency range.

The Mel filter bank gives its coefficients in a two-dimensional matrix, which is in sparse form. The number of filters taken in the filter bank is 30; the coefficient matrix of the filter bank is plotted in image form, where the square blocks show the non-zero values. A column-wise plot is also shown in Fig 3.
2012 World Congress on Information and Communication Technologies 671

Fig 3. A column-wise plot of the Mel Filter Bank
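A sketch of the triangular filterbank of equations (6) and (7), using the same parameters as the text (M = 30 filters, fl = 0 Hz, fh = 4 kHz, Fs = 8 kHz); the rounding of center frequencies to FFT bins is an assumption, since implementations differ on this detail:

```python
import numpy as np

def mel_filterbank(M=30, N=256, fs=8000, fl=0.0, fh=4000.0):
    """Build the M x (N//2 + 1) triangular mel filterbank described above."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)        # eq. (4)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)     # eq. (5)
    # M + 2 equally spaced points on the mel scale -> boundary/center freqs,
    # converted back to Hz and then to FFT bin numbers (rounding assumed).
    mels = np.linspace(mel(fl), mel(fh), M + 2)
    bins = np.floor((N + 1) * imel(mels) / fs).astype(int)
    fb = np.zeros((M, N // 2 + 1))
    for m in range(1, M + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):            # rising edge of the triangle
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):            # falling edge of the triangle
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # one row per filter, one column per FFT bin
```

Each row is one triangular filter; plotted column-wise this reproduces the shape of Fig 3.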

• Mel frequency cepstral coefficients:
The Mel frequency cepstral coefficients (see Fig. 6) are calculated by passing the FFT coefficients through the mel filterbank frame by frame, with a frame size of 10 msec and a 5 msec overlap. The MFCC spectrum is the proposed feature of the speech in our model. In the MFCC spectrum plot, the colormap is chosen according to the value of the coefficients. Here we extract 13 MFCC coefficients from each time frame; the number of coefficients extracted is a design choice, but it is usually kept at 10 to 13 per frame.
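Putting the pieces together, the per-frame MFCC computation (FFT, mel filterbank, logarithm, then a DCT-II to decorrelate) can be sketched as below. The DCT-II is written out directly, and the uniform filterbank passed in at the end is only a placeholder so the sketch is self-contained:

```python
import numpy as np

def mfcc_frame(frame, fb, n_coeffs=13):
    """MFCCs for one windowed frame: FFT -> mel filterbank -> log -> DCT-II.
    `fb` is a filterbank matrix such as the one built above; 13 coefficients
    per frame, as in the text."""
    # Power spectrum on the same grid as the filterbank columns
    spec = np.abs(np.fft.rfft(frame, n=2 * (fb.shape[1] - 1))) ** 2
    energies = np.log(fb @ spec + 1e-10)  # small floor avoids log(0)
    M = len(energies)
    n = np.arange(M)
    # DCT-II decorrelates the log filterbank energies into cepstral coeffs
    return np.array([np.sum(energies * np.cos(np.pi * k * (n + 0.5) / M))
                     for k in range(n_coeffs)])

# Hypothetical 10 ms frame of a 1 kHz tone at 8 kHz sampling
t = np.arange(80) / 8000.0
coeffs = mfcc_frame(np.sin(2 * np.pi * 1000 * t) * np.hamming(80),
                    np.ones((30, 129)) / 129)  # placeholder filterbank
print(coeffs.shape)
```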
• Cepstral Weighting:
A weighting window (called a liftering window) is applied after decorrelating the cepstral coefficients. In this work, the sinusoidal lifter is utilized to minimize sensitivities by de-emphasizing the higher and lower cepstral coefficients:

w[n] = 1 + (L / 2) sin(πn / L), 0 ≤ n < L  (8)
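The sinusoidal lifter of equation (8) is a one-liner; the value L = 22 below is a common choice, not one stated in the paper:

```python
import numpy as np

def lifter(cepstra, L=22):
    """Sinusoidal cepstral lifter, eq. (8): w[n] = 1 + (L/2)*sin(pi*n/L)."""
    n = np.arange(cepstra.shape[-1])
    w = 1.0 + (L / 2.0) * np.sin(np.pi * n / L)
    return cepstra * w

c = lifter(np.ones(13))
print(c[0])  # the first coefficient is unchanged, since w[0] = 1
```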
• Pattern recognition:

In this paper we have come up with a simple, efficient algorithm for the pattern recognition, based on the simple principle of correlation, which gives us satisfactory results.
For pattern recognition the most important thing is the database. While dealing with the database, we tried to keep it simple in order to test our system; that is why we made our database with four simple words:
CAR, BALL, MOM, GOLD
However, this model can be further extended to a larger database and computation.
The procedure of our pattern recognition process is as follows:
• All the Mel frequency cepstral coefficients of the specified words are loaded to make the database.
• The coefficient vectors of each word are stored in the rows of a matrix. We have taken 13 samples of each word from different speakers, so the number of rows is the number of available speech samples of each word. Likewise, we get four different matrices for the four words.
• The MFCC vector of the test speech sample is loaded.
• The MFCC vector of the test speech sample is correlated with each row vector of each matrix, and the maximum correlation value from each row is stored in a column matrix. Likewise, four column matrices, one for each word, are computed.
• The maximum value from each column matrix is stored in an array. This array holds four correlation values, one for each of the four words.
• Finally, the maximum of these four values decides which word is recognized (see Tables 1-3).
We developed this algorithm considering four words only; the basic aim of using a small database is to test our algorithm thoroughly. The algorithm is equally applicable to a larger database, and any word or sound from any environment can be recognized using our speech recognition model.
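The recognition steps above can be sketched end to end; the four 'MFCC' rows below are hypothetical stand-ins chosen only to exercise the procedure:

```python
import numpy as np

def best_match(test_vec, database):
    """Correlation-based matching following the steps above: for each word,
    cross-correlate the test MFCC vector with every stored row, keep the
    maximum value per row, then the maximum per word; the word with the
    overall largest value is the recognized one."""
    scores = {}
    for word, mat in database.items():
        scores[word] = max(np.correlate(test_vec, row, mode="full").max()
                           for row in mat)
    return max(scores, key=scores.get)

# Hypothetical stand-in 'MFCC' rows (one speaker sample per word here;
# the paper stores 13 rows per word).
db = {
    "CAR":  np.array([[1.0, 0.0, 0.0, 0.0]]),
    "BALL": np.array([[0.0, 1.0, 0.0, 0.0]]),
    "MOM":  np.array([[3.0, 1.0, 3.0, 1.0]]),
    "GOLD": np.array([[0.0, 0.0, 0.0, 1.0]]),
}
print(best_match(np.array([3.0, 1.0, 3.0, 1.0]), db))  # -> MOM
```

The zero-lag correlation of the test vector with its own word's row dominates all other scores, which is exactly the decision rule described in the list above.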
II. RESULTS

Fig 4: The recorded speech sample and filtered speech sample respectively

Fig 5: The uttered speech and its corresponding FFT plot respectively


Fig 6: Image plot of the Mel frequency cepstral coefficients of the word
BALL

The following are the results for some inputs:

Table 1: Input speech: CAR

WORD                  CAR         BALL        MOM         GOLD
Max correlated value  9.7504e+03  9.5711e+03  9.7400e+03  9.2974e+03

Output: The selected word is CAR: HIT

Table 2: Input speech: MOM

WORD                  CAR         BALL        MOM         GOLD
Max correlated value  9.8656e+03  9.4671e+03  9.9263e+03  9.3485e+03

Output: The selected word is MOM: HIT

Table 3: Input speech: GOLD

WORD                  CAR         BALL        MOM         GOLD
Max correlated value  1.1170e+04  1.1009e+04  1.1123e+04  1.0842e+04

Output: The selected word is CAR: MISS

III. APPLICATION
In the health care domain, even in the wake of improving
speech recognition technologies, medical transcriptionists
have not yet become obsolete. The services provided may be
redistributed rather than replaced. Speech recognition can be
implemented in front-end or back-end of the medical
documentation process. Application of ASR is in the military
area also such as High-performance fighter aircraft,
Helicopters; Battle management; Training air traffic
controllers; Telephony and other domains.
IV. CONCLUSION
This paper has described a simple and efficient pattern recognition method to recognize words after the extraction of the Mel frequency cepstral coefficients from the recorded and processed speech samples. As algorithm development and testing were the main areas of concern, the greatest compromise we made is the size of the database. We tested our system with fifty samples of four words; for 75% of the total cases the detection was successful. This algorithm can also be implemented with a large database in future, and we expect the efficiency to increase accordingly. We plan to implement our system on a TMS320C6713 DSP kit and, with its own set of functions and modules, to improve our model in future.
ACKNOWLEDGEMENT
The authors are grateful to Prof. Prabir Banerjee,
Associate Professor and Prof. Siladitya Sen, Head of the
Department of Electronics and Communication
Engineering (ECE), Heritage Institute of Technology,
Kolkata for giving them tremendous support and
valuable suggestions. They are also grateful to Heritage
institute of Technology, Kolkata for allowing them to
use the facilities in the Project laboratory of ECE
Department.
