
DESIGN OF AN AUTOMATIC SPEECH RECOGNITION SYSTEM USING MATLAB

Jamel Price, Sophomore Student
Department of Engineering and Aviation Sciences
University of Maryland Eastern Shore, Princess Anne, MD 21853
Research Adviser: Dr. Ali Eydgahi

Progress Report for: Chesapeake Information Based Aeronautics Consortium, August 2005

Review of Speech Recognition Process


MATLAB can easily and effectively be used to construct and test a speech recognition system. With its built-in routines, the many procedures which make up a particular speech recognition algorithm are easily mimicked. A waveform can be sampled, in the time domain, into MATLAB using the wavread command. After a waveform has been read into a temporary buffer, it has to be reduced to a fingerprint. A fingerprint represents the basic but unique characteristics of the sound file in the frequency domain. The fingerprint is merely a vector of numbers where each number represents the magnitude of sound that was heard in a particular frequency band. This vector is then stored in a database as a reference to be tested against in future sessions. To recognize a word, the same fingerprinting technique is performed on the unknown sound file, and the resulting vector is compared with the references stored in the database by computing the Euclidean distance. The Euclidean distance is a procedure in which the element-wise differences between the unknown vector and a reference vector are found, squared, and then summed. The smaller the Euclidean distance, the better the match. If, for example, there are forty reference fingerprints in a database, the fingerprint which has the smallest Euclidean distance when compared against the unknown fingerprint will be recognized as the match.
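As a minimal sketch of this matching step (the variable names are placeholders, not the project's actual code: refs holds one reference fingerprint per row, and unknown is the fingerprint of the incoming sound file):

    % Squared Euclidean distance between the unknown fingerprint and every reference
    dists = sum((refs - repmat(unknown, size(refs, 1), 1)) .^ 2, 2);
    % The smallest distance marks the recognized reference
    [best, idx] = min(dists);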

Mel-frequency Cepstrum Coefficient


The Mel-frequency Cepstrum Coefficient (MFCC) technique is often used to create the fingerprint of the sound files. MFCCs are based on the known variation of the human ear's critical bandwidth with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the important characteristics of speech. Studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the Mel scale. The Mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 Mels (Do 6). The following formula is used to compute the Mels for a particular frequency:
mel(f) = 2595 * log10(1 + f/700)
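In MATLAB this formula translates directly. For instance (a simple illustration, not part of the project scripts):

>> f = [500 1000 2000 4000];           % frequencies in Hz
>> mels = 2595 * log10(1 + f / 700);   % corresponding pitches in Mels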

[Figure 1: Block diagram of the MFCC process]
A block diagram of the MFCC process is shown in Figure 1. The function of each block was discussed in the previous report, but to summarize: in the frame blocking sequence, the speech waveform is cropped to remove silence or acoustical interference that may be present at the beginning or end of the sound file. The windowing block minimizes the discontinuities of the signal by tapering the beginning and end of each frame to zero. The FFT block converts each frame from the time domain to the frequency domain. In the Mel-frequency wrapping block, the signal is plotted against the Mel spectrum to mimic human hearing. In the final step, the cepstrum, the Mel-spectrum scale is converted back to the standard frequency scale. This spectrum provides a good representation of the spectral properties of the signal, which is key for representing and recognizing characteristics of the speaker. The fingerprint that results is also referred to as an acoustic vector; it is the vector referred to in the earlier section, and it will be stored as a reference in the database. When an unknown sound file is imported into MATLAB, a fingerprint is created for it as well, and its resultant vector is compared against those in the database, again using the Euclidean distance technique, and a suitable match is determined. This process is referred to as feature matching.
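To make the blocks concrete, the following rough MATLAB sketch walks a signal through the same stages (the frame length, hop size, and the simple triangular filterbank are assumptions for illustration; the project itself uses the ready-made mfcc.m function shown in the next section):

    % Illustrative MFCC pipeline (assumes the Signal Processing Toolbox)
    x = wavread('s8.wav');                       % time-domain waveform
    fs = 12000; N = 256; hop = 120;              % assumed frame length; hop of 120 samples = 100 frames/s
    frames = buffer(x, N, N - hop);              % frame blocking
    frames = frames .* repmat(hamming(N), 1, size(frames, 2));  % windowing
    spec = abs(fft(frames)) .^ 2;                % FFT: power spectrum of each frame
    spec = spec(1:N/2 + 1, :);                   % keep the non-negative frequencies
    % Simple triangular Mel filterbank built from the formula above (20 filters)
    nfilt = 20;
    hz2mel = @(f) 2595 * log10(1 + f / 700);     % Hz -> Mel
    mel2hz = @(m) 700 * (10 .^ (m / 2595) - 1);  % Mel -> Hz
    edges = mel2hz(linspace(0, hz2mel(fs / 2), nfilt + 2));  % filter edge frequencies
    bins = round(edges / (fs / 2) * (N / 2)) + 1;            % FFT bin indices
    melfb = zeros(nfilt, N/2 + 1);
    for k = 1:nfilt
        melfb(k, bins(k):bins(k+1))   = linspace(0, 1, bins(k+1) - bins(k) + 1);
        melfb(k, bins(k+1):bins(k+2)) = linspace(1, 0, bins(k+2) - bins(k+1) + 1);
    end
    melspec = melfb * spec;                      % Mel-frequency wrapping
    ceps = dct(log(melspec + eps));              % cepstrum: DCT of the log Mel spectrum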

MATLAB and the MFCC Process


The following section is a list of commands, along with brief descriptions, used to implement the MFCC algorithm.

>> tap = wavread('s8.wav');

Reads the .wav file entitled s8 into a buffer and assigns it to the reference variable tap

>> plot(tap);

Plots s8 in the time domain

[Figure: time-domain plot of s8; amplitude roughly -0.8 to 0.6 over about 15,000 samples]

>> [ceps,freqresp,fb,fbrecon,freqrecon] = mfcc(tap,12000,100);

The input arguments passed to mfcc.m are: tap, the reference variable of the sound file; 12000, the sampling rate of this particular sound file; and 100, the frame rate used in the process. The outputs are: ceps, the cepstral coefficients; freqresp, the fast Fourier transform of each frame; fb, the data converted to the Mel-frequency scale; fbrecon, the data converted back to the standard frequency scale; and freqrecon, which can be used to resample the data in order to recreate the original spectrogram.

>> imagesc(flipud(freqresp));

produces FFT spectrogram of data file

[Figure: FFT spectrogram of the data file]

>> imagesc(flipud(fb));

spectrogram of data plotted against Mel-frequency scale

[Figure: spectrogram of the data on the Mel-frequency scale]

>> imagesc(flipud(fbrecon));

cepstral conversion: the Mel-scale data mapped back to the standard frequency scale

[Figure: reconstructed spectrogram]

User-Defined Script Files


As of this writing, I have created two scripts entitled train.m and test.m. The train.m script file creates an Excel database containing a fingerprint of each sound file read into the system. The acoustic_data.xls database file it creates serves as the reference base of the speech recognition system. Using the test.m script, the user can attempt to identify a previously recorded unknown sound file. The two script files I have created are based on the mfcc.m and dtw.m functions, respectively. The data.zip file contains two folders, train and test, which were used to test the system. The train folder contains eight sound files which, as the name of the folder implies, are used to train the system. The test folder also contains eight sound files, each of which is used to test the system. There were eight female speakers, labeled S1 to S8. All speakers uttered the same single digit "zero" once in a training session and once in a testing session later on. The sessions were held six months apart to simulate voice variation over time. This database can be downloaded from http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html.
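A minimal sketch of the idea behind train.m might look as follows (the file layout matches the database described below, but the code itself is illustrative, not the actual script; collapsing each utterance to one averaged acoustic vector is a simplification, since the actual system keeps the full frame sequence for DTW):

    % Illustrative training step: fingerprint each file, save the vectors to Excel
    database = [];
    for k = 1:8
        x = wavread(sprintf('train/s%d.wav', k));   % k-th training utterance
        c = mfcc(x, 12000, 100);                    % cepstral coefficients
        database = [database; mean(c, 2)'];         % one averaged vector per speaker
    end
    xlswrite('acoustic_data.xls', database);        % write the reference database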

The Speech Database


The speech database used to build and test the system was downloaded from the Swiss Federal Institute of Technology website (http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html). After unzipping the file you will find two folders, train and test. In the train folder are eight female voices, labeled s1 through s8, each saying the digit zero. In the test folder are those same females saying the digit zero, but these sound files were recorded six months later to take into account the many changes that occur in a speaker's voice (time, health, etc.). The voices in the test folder are labeled s1_1 through s8_8.

The Recognition Phase


To verify a speaker's identity, the same fingerprinting process is applied to the incoming speech as was used on the reference voiceprints that make up the library. This project uses the Mel-frequency Cepstrum Coefficient (MFCC) procedure to create the aforementioned voiceprints, as detailed in an earlier paper. The newly attained feature vector (voiceprint) is then compared against the reference vectors created and stored during the training phase. This procedure is known as feature matching. The most common recognition techniques used in speech recognition include, but are not limited to, Dynamic Time Warping (DTW) and Hidden Markov Modeling (HMM).

The Dynamic Time Warping Process


The reference voiceprint with the lowest distance measure from the input pattern is the recognized word. The best match (lowest distance measure) is found using Dynamic Time Warping (DTW). There are two underlying concepts which comprise the DTW procedure: (1) Features: the information in each signal has to be represented in some manner (here, by MFCC). (2) Distances: some form of distance metric has to be used in order to obtain a match. Under the distance-metric concept exist two additional concepts:

Local: a computational difference between a feature of one signal and a feature of another. Global: an overall computational difference between one entire signal and another signal. The distance metric most commonly used within DTW is the Chebychev distance measure. The Chebychev distance between two points is the maximum distance between the points in any single dimension. The distance between points X = (X1, X2, ...) and Y = (Y1, Y2, ...) is computed using the formula max_i |Xi - Yi|, where Xi and Yi are the values of the ith variable at points X and Y, respectively. Speech is a time-dependent process. Hence utterances of the same word will have different durations, and utterances of the same word with the same duration will differ in the middle, due to different parts of the word being spoken at different rates. To obtain a global distance between two speech patterns (represented as sequences of vectors) a time alignment must be performed. The best matching template is the one for which there is the lowest-distance path aligning the input pattern to the template. A simple global distance score for a path is simply the sum of the local distances that make up the path [4].
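The alignment itself can be sketched compactly. The following is a generic textbook DTW, not the project's dtw.m; it uses the Chebychev local distance described above, with A and B holding one feature vector per column (saved as its own function file):

    function gd = dtw_sketch(A, B)
    % Generic DTW sketch: global distance between two feature sequences.
    n = size(A, 2);  m = size(B, 2);
    D = inf(n + 1, m + 1);                 % accumulated-distance table
    D(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            local = max(abs(A(:, i) - B(:, j)));   % Chebychev local distance
            % extend the cheapest of the three allowed predecessor paths
            D(i + 1, j + 1) = local + min([D(i, j), D(i, j + 1), D(i + 1, j)]);
        end
    end
    gd = D(n + 1, m + 1);                  % global distance of the best alignment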

Hidden Markov Modeling


The Hidden Markov Modeling algorithm is a very involved process, and the following information represents my most basic understanding of the procedure; in the coming weeks I hope to fully understand every aspect of the process. Hidden Markov processes are part of a larger group known as statistical models: models in which one tries to characterize the statistical properties of the signal, with the underlying assumption that a signal can be characterized as a random parametric signal whose parameters can be estimated in a precise, well-defined manner. In order to implement an isolated-word recognition system using HMM, the following steps must be taken: (1) For each reference word, a Markov model must be built using parameters that optimize the observations of the word. (2) A calculation of model likelihoods for all possible reference models against the unknown model must be completed using the Viterbi algorithm, followed by the selection of the reference with the highest model likelihood value. I likewise have only a basic understanding of the Viterbi algorithm, and in the coming weeks I wish to gain a better understanding of this process as well. With the Viterbi algorithm, we take a particular HMM and determine from an observation sequence the most likely sequence of underlying hidden states that might have generated it. For example, by examining the observation sequence of the s1_1 test file, one would determine that the s1 training HMM is most likely the model that created it, thus returning the highest likelihood value.
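As a toy illustration of the Viterbi idea, the following sketch uses a generic discrete-observation HMM with made-up probabilities (far simpler than the Gaussian-mixture HMMs in the project):

    % Toy Viterbi sketch: most likely state path and model likelihood
    p0 = [0.6; 0.4];                       % made-up initial state probabilities
    A = [0.7 0.3; 0.4 0.6];                % made-up transition matrix
    B = [0.9 0.2 0.1; 0.1 0.8 0.9];        % B(s,t): likelihood of the t-th observation in state s
    [S, T] = size(B);
    V = zeros(S, T);  back = zeros(S, T);
    V(:, 1) = p0 .* B(:, 1);
    for t = 2:T
        for s = 1:S
            [best, back(s, t)] = max(V(:, t - 1) .* A(:, s));  % best predecessor state
            V(s, t) = best * B(s, t);
        end
    end
    [likelihood, path(T)] = max(V(:, T));  % the final max is the model likelihood value
    for t = T - 1:-1:1
        path(t) = back(path(t + 1), t + 1);  % trace the best state path backwards
    end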

Experimental Results
After writing test.m and train.m scripts which utilize the Hidden Markov Model procedure, I ran a number of tests comparing this new system against the Dynamic Time Warping system I developed weeks ago. I used the built-in MATLAB commands tic and toc to measure the time which elapses between the moment the user inputs the name of the unknown sound file and the moment the system returns a reference value. There are no input arguments for the Dynamic Time Warping function, but there are three input arguments for the Hidden Markov Model procedure: the number of Gaussian mixtures, the number of HMM states, and the number of iterations, i.e., HMM = model(x,y,z), with the variables on the right side representing each input argument respectively. The number of states may range from 2 to 5 and the number of Gaussian mixture components from 1 to 3. I chose to keep the number of iterations constant at 5. With this range of arguments there are twelve possible combinations, each of which was tested.
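The timing harness for the twelve configurations could be sketched like this (model stands for the HMM routine described above; the loop itself is hypothetical):

    % Hypothetical harness: time each of the twelve HMM configurations
    for mixtures = 1:3
        for states = 2:5
            tic;                                 % start the timer
            HMM = model(mixtures, states, 5);    % 5 iterations, held constant
            fprintf('mixtures=%d states=%d time=%.1f s\n', mixtures, states, toc);
        end
    end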

Preliminary Results
As can be seen from the data, the most successful system was the one based on the Hidden Markov Model pattern recognition procedure. More specifically, the system using three states and one Gaussian mixture was the most successful overall. This system had an 87.5% success rate with an average running time of 7.9 seconds; compared to the 22.1-second average running time and 25% success rate of the Dynamic Time Warping system, the Hidden Markov system is the obvious choice.

Future Goals
My goals in the upcoming weeks are to better understand the Hidden Markov algorithm and to add a graphical user interface to the system. The interface would utilize the built-in MATLAB command audiorecorder, which will allow me to read the user's voice data into MATLAB in real time. I will use either the built-in MATLAB program called GUIDE or the third-party application JMatLink to add a graphical user interface. JMatLink is a program which allows the user to use MATLAB commands and functions within a Java applet (http://www.held-mueller.de/JMatLink/).
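A small example of the audiorecorder idea (assuming a MATLAB version that provides it; the parameters are illustrative):

    % Record two seconds of speech at 12 kHz, 16 bits, mono, then fetch the samples
    rec = audiorecorder(12000, 16, 1);
    record(rec, 2);          % recording runs asynchronously for 2 seconds
    pause(2.5);              % wait for the recording to finish
    x = getaudiodata(rec);   % samples as a column vector, ready for fingerprinting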

REFERENCES
[1] Digital Signal Processing Mini-Project: An Automatic Speaker Recognition System. 14 June 2005. <http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html>

[2] ECE341 Speech Recognition: A Simple Speech Recognition Algorithm. 15 April 2003. 1 July 2005. <http://www.eecg.toronto.edu/~aamodt/ece341/speech-recognition/>

[4] Isolated Word Speech Recognition using Dynamic Time Warping. 14 June 2005. <http://www.cnel.ufl.edu/~kkale/dtw.html>

[5] Speech Recognition by Dynamic Time Warping. 20 April 1998. 06 July 2005. <http://www.dcs.shef.ac.uk/~stu/com326/>

[6] J. Bilmes, "A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," Technical Report ICSI-TR-97-021, U.C. Berkeley, April 1998. 01 August 2005. <http://crow.ee.washington.edu/people/bulyko/papers/em.pdf>

