Sciences, University of Maryland Eastern Shore, Princess Anne, MD 21853
Research Adviser: Dr. Ali Eydgahi

Progress Report for: Chesapeake Information Based Aeronautics Consortium, August 2005
Figure 1. Block diagram of the MFCC processes.
A block diagram of the MFCC processes is shown in Figure 1. The function of each block was discussed in the previous report, but to summarize: in the frame blocking sequence, the speech waveform is cropped to remove silence or acoustical interference that may be present at the beginning or end of the sound file. The windowing block minimizes the discontinuities of the signal by tapering the beginning and end of each frame to zero. The FFT block converts each frame from the time domain to the frequency domain. In the Mel-frequency wrapping block, the signal is plotted against the Mel spectrum to mimic human hearing. In the final step, the cepstrum, the Mel-spectrum scale is converted back to the standard frequency scale. This spectrum provides a good representation of the spectral properties of the signal, which is key for representing and recognizing characteristics of the speaker. The fingerprint created by this process is also referred to as an acoustic vector; this is the vector referred to in the earlier section, and it will be stored as a reference in the database. When an unknown sound file is imported into MATLAB, a fingerprint will be created for it as well, and its resultant vector will be compared against those in the database, again using the Euclidean distance technique, so that a suitable match can be determined. This process is referred to as feature matching.
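The pipeline above can be sketched end to end. The report's implementation uses the MATLAB routine mfcc.m; the following is a simplified, illustrative sketch in Python with NumPy, where the frame length, filter count, and coefficient count are assumed values chosen for the example, not the parameters of mfcc.m:

```python
import numpy as np

def mfcc_fingerprint(signal, sample_rate, frame_len=256, frame_step=100,
                     n_mel=40, n_ceps=12):
    """Compute a simplified MFCC 'fingerprint' (one acoustic vector per frame)."""
    # Frame blocking: split the waveform into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    frames = np.stack([signal[i * frame_step : i * frame_step + frame_len]
                       for i in range(n_frames)])
    # Windowing: taper each frame toward zero at its edges (Hamming window).
    frames = frames * np.hamming(frame_len)
    # FFT: convert each frame from the time domain to the frequency domain.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Mel-frequency wrapping: weight the spectrum with triangular Mel filters.
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_pts = np.linspace(0.0, mel_max, n_mel + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((frame_len + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_mel, power.shape[1]))
    for m in range(1, n_mel + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    mel_energy = np.log(power @ fbank.T + 1e-10)
    # Cepstrum: the DCT of the log Mel energies yields the MFCC acoustic vectors.
    n = np.arange(n_mel)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mel)))
    return mel_energy @ dct.T   # shape: (n_frames, n_ceps)
```

Each row of the returned matrix is one acoustic vector; a file's fingerprint is the whole matrix, which is what the database stores for matching.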
>> plot(tap);

[Figure: time-domain waveform of the tap sound file; amplitude roughly -0.8 to 0.6 over samples 0 to 15000]
>> [ceps,freqresp,fb,fbrecon,freqrecon] = mfcc(tap,12000,100);

The parameters of mfcc.m are as follows: ceps holds the cepstrum coefficients; freqresp is the fast Fourier transform; fb is the conversion to the Mel-frequency scale; fbrecon is the conversion back to the standard frequency scale; freqrecon can be used to resample the data in order to recreate the original spectrogram. tap is the reference variable of the sound file, 12000 is the sampling rate of this particular sound file, and 100 is the frame rate used in the process.
>> imagesc(flipud(freqresp));

[Figure: FFT frequency response of the tap sound file]
>> imagesc(flipud(fb));

[Figure: Mel-frequency scale conversion of the spectrum]
>> imagesc(flipud(fbrecon));

[Figure: Cepstral Conversion, reconstruction on the standard frequency scale]
Local: the computational difference between a feature of one signal and a feature of another. Global: the overall computational difference between one entire signal and another signal.

The distance metric most commonly used within DTW is the Chebyshev distance measure. The Chebyshev distance between two points is the maximum distance between the points in any single dimension. The distance between points X = (X1, X2, ...) and Y = (Y1, Y2, ...) is computed using the formula max_i |Xi - Yi|, where Xi and Yi are the values of the ith variable at points X and Y, respectively.

Speech is a time-dependent process. Hence utterances of the same word will have different durations, and utterances of the same word with the same duration will differ in the middle, due to different parts of the word being spoken at different rates. To obtain a global distance between two speech patterns (each represented as a sequence of vectors), a time alignment must be performed. The best matching template is the one for which there is the lowest-distance path aligning the input pattern to the template. A simple global distance score for a path is simply the sum of the local distances that make up the path [4].
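The alignment described above can be sketched with a standard dynamic-programming recurrence. The report's system is written in MATLAB; this illustrative sketch is in Python with NumPy, assuming Chebyshev local distances and the usual match/insertion/deletion predecessor steps:

```python
import numpy as np

def chebyshev(x, y):
    """Chebyshev distance: the maximum difference in any single dimension."""
    return float(np.max(np.abs(np.asarray(x) - np.asarray(y))))

def dtw_distance(a, b):
    """Global DTW distance between two sequences of feature vectors.

    D[i][j] holds the lowest-cost alignment of the first i frames of a
    with the first j frames of b; the global distance is the sum of the
    local (Chebyshev) distances along the best warping path.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = chebyshev(a[i - 1], b[j - 1])
            # Allowed predecessors: diagonal match, insertion, deletion.
            D[i, j] = local + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return float(D[n, m])
```

The template with the smallest dtw_distance to the input pattern is taken as the best match; note that a stretched copy of a sequence still aligns with zero cost, which is exactly the time-warping property the text describes.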
sequence of underlying hidden states that might have generated it. For example, by examining the observation sequence of the s1_1 test HMM, one would determine that the s1 train HMM is most likely the voiceprint that created it, as it returns the highest likelihood value.
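The likelihood comparison described above can be sketched with the forward algorithm, which computes P(observation sequence | model); the trained model returning the highest value is declared the match. This Python sketch assumes a discrete-observation HMM with illustrative parameter names, not the actual model(x,y,z) routine used in the report:

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(observation sequence | HMM).

    obs   : sequence of discrete observation symbols (integers)
    start : initial state probabilities, shape (S,)
    trans : state transition matrix, shape (S, S)
    emit  : per-state emission probabilities, shape (S, V)
    """
    alpha = start * emit[:, obs[0]]           # forward probabilities at t = 0
    log_p = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]  # propagate one time step
        scale = alpha.sum()                   # rescale to avoid underflow
        log_p += np.log(scale)
        alpha = alpha / scale
    return float(log_p)
```

To identify a speaker, this score would be computed against every trained voiceprint model, and the model with the highest log-likelihood chosen.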
Experimental Results
After writing the test.m and train.m scripts, which utilize the Hidden Markov Model procedure, I ran a number of tests comparing this new system against the Dynamic Time Warping system I developed weeks ago. I used the built-in MATLAB commands tic and toc to measure the time that elapses between the moment the user inputs the name of the unknown sound file and the moment the system returns a reference value. There are no input arguments for the Dynamic Time Warping function, but there are three input arguments for the Hidden Markov Model procedure: the number of Gaussian mixtures, the number of HMM states, and the number of iterations, i.e., HMM = model(x,y,z), with the variables on the right side representing each input argument respectively. The number of states may range from 2 to 5 and the number of Gaussian mixture components from 1 to 3. I chose to keep the number of iterations constant at 5. With this range of arguments there are twelve possible combinations, each of which was tested.
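A harness for such a sweep can be sketched as follows. This Python sketch assumes a placeholder run_model callable standing in for the actual model(x,y,z) routine, with time.perf_counter playing the role of MATLAB's tic/toc:

```python
import itertools
import time

def run_grid(run_model, states=range(2, 6), mixtures=range(1, 4), iterations=5):
    """Time every (states, mixtures) combination, mimicking tic/toc.

    run_model is a placeholder for the HMM train/test routine, called as
    run_model(num_states, num_mixtures, num_iterations).
    """
    results = []
    for x, y in itertools.product(states, mixtures):
        t0 = time.perf_counter()            # tic
        run_model(x, y, iterations)
        elapsed = time.perf_counter() - t0  # toc
        results.append(((x, y), elapsed))
    return results
```

With states 2 through 5 and mixtures 1 through 3, the product yields exactly the twelve combinations tested in the experiments.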
Preliminary Results
As can be seen from the data, the most successful system was the one based on the Hidden Markov Model pattern recognition procedure. More specifically, the system using three states and one Gaussian mixture was the most successful overall, with an 87.5% success rate and an average running time of 7.9 seconds. Compared to the 25% success rate and 22.1-second average running time of the Dynamic Time Warping system, the Hidden Markov system is the obvious choice.
Future Goals
My goals in the upcoming weeks are to better understand the Hidden Markov algorithm and to add a graphical user interface to the system. The interface will utilize the built-in MATLAB command audiorecorder, which will allow me to read the user's voice data into MATLAB in real time. To build the graphical user interface I will use either the built-in MATLAB tool GUIDE or the third-party application JMatLink, a program which allows the user to use MATLAB commands and functions within a Java applet (http://www.held-mueller.de/JMatLink/).
REFERENCES
[1] Digital Signal Processing Mini Project: An Automatic Speaker Recognition System. Accessed 14 June 2005. <http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html>
[2] ECE341 Speech Recognition: A Simple Speech Recognition Algorithm. 15 April 2003. Accessed 1 July 2005. <http://www.eecg.toronto.edu/~aamodt/ece341/speech-recognition/>
[4] Isolated Word Speech Recognition using Dynamic Time Warping. Accessed 14 June 2005. <http://www.cnel.ufl.edu/~kkale/dtw.html>
[5] Speech Recognition by Dynamic Time Warping. 20 April 1998. Accessed 6 July 2005. <http://www.dcs.shef.ac.uk/~stu/com326/>
[6] J. Bilmes, "A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," Technical Report ICSI-TR-97-021, University of Berkeley, April 1998. Accessed 1 August 2005. <http://crow.ee.washington.edu/people/bulyko/papers/em.pdf>