Sciences University of Maryland Eastern Shore Princess Anne, MD 21853 Research Adviser: Dr. Ali Eydgahi
Progress Report for: Chesapeake Information Based Aeronautics Consortium August 2005
Review of Speech Recognition Process

MatLab can easily and effectively be used to construct and test a speech recognition system. With its built-in routines, the many procedures which make up a particular speech recognition algorithm are easily mimicked. A waveform can be sampled, in the time domain, into MatLab using the wavread command. After a waveform has been input into a temporary buffer, it has to be simplified into a fingerprint. A fingerprint represents the basic but unique characteristics of the sound file in the frequency domain. The fingerprint is merely a vector of numbers, where each number represents the magnitude of the sound heard in a particular portion of the signal. This vector is then stored in a database as a reference to be tested against in future sessions.

To recognize a word, the same fingerprinting technique is performed on the unknown sound file, and the resulting vector is compared with the references stored in the database by computing the Euclidean distance. The Euclidean distance is found by taking the element-wise difference between the unknown vector and a reference vector, squaring each difference and summing the squares (taking the square root of the sum gives the distance itself, but does not change which reference is closest). The smaller the Euclidean distance, the better the match. If, for example, there are forty reference fingerprints in a database, the fingerprint which has the smallest Euclidean distance when compared against the unknown fingerprint will be recognized as the match.

Mel-frequency Cepstrum Coefficient

The Mel-frequency Cepstrum Coefficient (MFCC) technique is often used to create the fingerprint of the sound files. The MFCC are based on the known variation of the human ear's critical bandwidth frequencies, with filters spaced linearly at low frequencies and logarithmically at high frequencies used to capture the important characteristics of speech. Studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the Mel scale. The Mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 Mels (Do 6). The following formula is used to compute the Mels for a particular frequency:

mel(f) = 2595 * log10(1 + f / 700)
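As a minimal sketch of the two computations described in this section, the MATLAB lines below evaluate the Mel formula at the 1 kHz reference point and then pick the closest reference fingerprint by summed squared difference. The three-entry database, the fingerprint values and the variable names are made up for illustration; they are not taken from the project's data or scripts.

```matlab
% Mel conversion from the report: mel(f) = 2595 * log10(1 + f / 700)
mel = @(f) 2595 * log10(1 + f / 700);
melAt1kHz = mel(1000);            % close to 1000 Mels, the stated reference point

% Hypothetical database: each ROW is one stored reference fingerprint.
refs = [0.9 0.1 0.4;
        0.2 0.8 0.5;
        0.6 0.6 0.1];
unknown = [0.25 0.75 0.45];       % hypothetical fingerprint of the unknown sound

% Element-wise difference, squared, then summed along each row.
diffs = refs - repmat(unknown, size(refs, 1), 1);
dist2 = sum(diffs .^ 2, 2);

% The reference with the smallest value is reported as the match.
[smallest, match] = min(dist2);
```

With these made-up values the second row is the closest reference. Replacing dist2 with sqrt(dist2) would give the Euclidean distances themselves without changing which row is selected.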
Figure 1

A block diagram of the MFCC processes is shown in Figure 1. The function of each block was discussed in the previous report, but to summarize: in the frame blocking sequence, the speech waveform is cropped to remove silence or acoustical interference that may be present at the beginning or end of the sound file, and is divided into short frames. The windowing block minimizes the discontinuities of the signal by tapering the beginning and end of each frame to zero. The FFT block converts each frame from the time domain to the frequency domain. In the Mel-frequency wrapping block, the signal is plotted against the Mel spectrum to mimic human hearing. In the final step, the Cepstrum, the logarithm of the Mel spectrum is converted back to the time domain, which yields the cepstrum coefficients. The cepstrum provides a good representation of the spectral properties of the signal, which is key for representing and recognizing characteristics of the speaker.

After the fingerprint is created, what you will have is also referred to as an acoustic vector. This vector, the one referred to in the earlier section, will be stored as a reference in the database. When an unknown sound file is imported into MatLab, a fingerprint will also be created from it, and the resultant vector will be compared against those in the database, again using the Euclidean distance technique, and a suitable match will be determined. This process is referred to as feature matching.

MatLab and the MFCC Process

The following section is simply a list of the commands used to implement the MFCC algorithm, each with a brief description.

>> tap = wavread('s8.wav');
Reads the .wav file entitled 's8' into a buffer and assigns it the reference variable 'tap'.
>> plot(tap);
Plots 's8' in the time domain.

[Figure: time-domain plot of 's8']

>> [ceps, freqresp, fb, fbrecon, freqrecon] = mfcc(tap, 12000, 100);
The values returned by and passed to 'mfcc.m' are as follows:
ceps: cepstrum coefficients.
freqresp: fast Fourier transform of each frame.
fb: the spectrum converted to the Mel-frequency scale.
fbrecon: the Mel-frequency spectrum reconstructed from the cepstrum coefficients.
freqrecon: can be used to resample the data in order to recreate the original spectrogram on the standard frequency scale.
'tap': reference variable of the sound file.
'12000': the sampling rate of this particular sound file.
'100': the frame rate used in the process.

>> imagesc(flipud(freqresp));
Produces the FFT spectrogram of the data file.
[Figure: FFT spectrogram of 's8']

>> imagesc(flipud(fb));
Produces the spectrogram of the data plotted against the Mel-frequency scale.

[Figure: Mel-frequency spectrogram of 's8']

>> imagesc(flipud(fbrecon));
Cepstral Conversion
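The frame blocking, windowing and FFT blocks summarized earlier in this report can be sketched in a few lines of plain MATLAB. This is only an illustration under stated assumptions, not the code inside mfcc.m: the frame length of 256 samples, the hop of 100 samples, the Hamming window and the synthetic 1 kHz test tone are all assumed values chosen so that the example runs without a sound file.

```matlab
fs = 12000;                          % sampling rate quoted in the report
t  = (0:fs-1)' / fs;                 % one second of sample times
x  = sin(2 * pi * 1000 * t);         % synthetic 1 kHz tone (stand-in for speech)

N = 256;                             % frame length in samples (assumed)
M = 100;                             % hop between frame starts (assumed)
nFrames = floor((length(x) - N) / M) + 1;

% Hamming window: tapers both ends of each frame toward zero.
w = 0.54 - 0.46 * cos(2 * pi * (0:N-1)' / (N - 1));

S = zeros(N/2 + 1, nFrames);         % one magnitude spectrum per column
for k = 1:nFrames
    idx  = (k - 1) * M + (1:N)';     % sample indices of frame k
    spec = abs(fft(x(idx) .* w));    % windowed frame -> frequency domain
    S(:, k) = spec(1:N/2 + 1);       % keep the non-negative frequencies
end

% Frequency (Hz) of the strongest bin in the first frame.
[peakVal, peakBin] = max(S(:, 1));
peakHz = (peakBin - 1) * fs / N;
```

The strongest bin falls within one bin width (fs / N) of the 1 kHz tone. Calling imagesc(flipud(S)) on the result displays it in the same orientation as the report's FFT spectrogram.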
[Figure: Cepstral Conversion, the output of imagesc(flipud(fbrecon))]

User-Defined Script Files

At the time this information was recorded, I had created two scripts entitled train.m and test.m. The two script files are based on the mfcc.m and dtw.m functions respectively. The train.m script file creates an Excel database containing each sound file read into the system. This database file, data.xls, serves as the reference base of the speech recognition system. Using the test.m script, the user will attempt to identify a previously recorded "unknown" sound file.

The acoustic_data.zip file contains two folders, train and test, which were used to test the system. This database can be downloaded from http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html. The train folder contains eight sound files, labeled from S1 to S8, which, as the name of the folder implies, are used to train the system. The test folder also contains eight sound files, each of which is used to test the system. There were eight female speakers. All speakers uttered the same single digit "zero" once in a training session and once in a testing session later on. The sessions were held six months apart to simulate the variation of the voice over time.
The Speech Database

The speech database used to build and test the system was downloaded from the Swiss Federal Institute of Technology website (http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project). After unzipping the file you will find two folders, "train" and "test." In the "train" folder are eight female voices, labeled s1 through s8, each saying the digit "zero." In the "test" folder are those same females saying the digit "zero," but these sound files were recorded six months later to take into account the many changes that occur in a speaker's voice (time, health, etc.). The voices in the "test" folder are labeled s1_1 through s8_8.

The Recognition Phase

To verify a speaker's identity, the same fingerprinting process is applied to the incoming speech as was used on the reference voiceprints that make up the library. This project uses the Mel Frequency Cepstrum Coefficient (MFCC) procedure to create the aforementioned voiceprints, as detailed in an earlier paper. The newly obtained feature vector (voiceprint) is compared against the reference vectors created and stored during the training phase. This procedure is known as feature matching. The most common recognition techniques used in speech recognition include, but are not limited to, Dynamic Time Warping (DTW) and Hidden Markov Modeling (HMM).

The Dynamic Time Warping Process

The reference voiceprint with the lowest distance measure from the input pattern is the recognized word. The best match (lowest distance measure) is determined by Dynamic Time Warping (DTW). Two underlying concepts comprise the DTW procedure:
(1) Features: the information in each signal has to be represented in some manner (MFCC).
(2) Distances: some form of distance metric* has to be used in order to obtain a match.
Under the distance metric heading there are two additional concepts:
Local: a computational difference between a feature of one signal and a feature of another.
Global: an overall computational difference between an entire signal and another signal.

Speech is a time-dependent process. Hence utterances of the same word will have different durations, and utterances of the same word with the same duration will differ in the middle, due to different parts of the word being spoken at different rates. To obtain a global distance between two speech patterns (each represented as a sequence of vectors), a time alignment must be performed. The best matching template is the one for which there is the lowest distance path aligning the input pattern to the template. A simple global distance score for a path is the sum of the local distances that make up the path.

The distance metric most commonly used within DTW is the *Chebychev Distance Measure.

*The Chebychev distance between two points is the maximum distance between the points in any single dimension. The distance between points X = (X1, X2, etc.) and Y = (Y1, Y2, etc.) is computed using the formula:

max_i |Xi - Yi|

where Xi and Yi are the values of the ith variable at points X and Y, respectively.

Hidden Markov Modeling

The Hidden Markov Modeling algorithm is a very involved process. The following information represents my most basic understanding of the procedure. In the coming weeks I hope to fully understand every aspect of the process. Hidden Markov Processes are part of a larger group known as statistical models: models in which one tries to characterize the statistical properties of the signal, under the assumption that the signal can be characterized as a random parametric signal whose parameters can be estimated in a precise, well-defined manner. To implement an isolated word recognition system using HMM, the following steps must be taken:
(1) For each reference word, a Markov model must be built using parameters that optimize the observations of the word.
(2) A calculation of model likelihoods for all possible reference models against the unknown model must be completed using the Viterbi algorithm*, followed by the selection of the reference with the highest model likelihood value.

I also have only a very basic understanding of the Viterbi algorithm. In the coming weeks I wish to gain a better understanding of this process as well.

*With the Viterbi algorithm, we take a particular HMM and determine from an observation sequence the most likely sequence of underlying hidden states that might have generated it. For example, by examining the observation sequence of the s1_1 test HMM, one would determine that the s1 train HMM is most likely the voiceprint that created it, thus returning the highest likelihood value.

Experimental Results

After writing the "test.m" and "train.m" scripts which utilize the Hidden Markov Model procedure, I ran a number of tests comparing this new system against the Dynamic Time Warping system I developed weeks ago. I used the built-in MATLAB commands "tic" and "toc" to measure the time which elapses between the user entering the name of the unknown sound file and the system returning a reference value. There are no input arguments for the Dynamic Time Warping function, but there are three input arguments for the Hidden Markov Model procedure: number of Gaussian Mixtures, number of HMM states and number of iterations, i.e., HMM = model(x, y, z), with the variables on the right side representing each input argument respectively. The number of states may range from 2 to 5 and the number of Gaussian Mixture components from 1 to 3. I chose to keep the number of iterations constant at 5. With this range of arguments there are twelve possible combinations, each of which was tested.

Preliminary Results

As you can see from the data, the most successful system was the one based on the Hidden Markov Model pattern recognition procedure. More specifically, the system using three states and one Gaussian Mixture was the most successful overall. This system had an 87.5% success rate with an average running time of 7.9 seconds. Compared to the average running time of 22.1 seconds and the 25% success rate of Dynamic Time Warping, the Hidden Markov system is the obvious choice.

Future Goals

My goals in the upcoming weeks are to better understand the Hidden Markov algorithm and to add a graphical user interface to the system, which would utilize the built-in MATLAB command audiorecorder to read the user's voice data into MATLAB in real time. I will use either the built-in MATLAB program called GUIDE or the third-party application jMATLINK to add the graphical user interface. jMATLINK is a program which allows the user to use MATLAB commands and functions within a JAVA applet (http://www.held-mueller.de/JMatLink/).
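The Dynamic Time Warping match described in this report can be sketched as follows: Chebychev local distances between frames, a cumulative cheapest path for the global distance, selection of the template with the lowest score, and tic/toc around the matching. This is a simplified illustration and not the dtw.m function used in the project; the feature sequences, the variable names and the unconstrained three-way step rule are assumptions.

```matlab
% Toy feature sequences: rows are features, columns are frames (made-up values).
unknown   = [1 1 2 3 3 4;
             1 1 1 2 2 2];
templates = { [1 2 3 4; 1 1 2 2], ...   % same pattern, spoken faster
              [4 3 2 1; 2 2 1 1] };     % a different pattern

score = zeros(1, numel(templates));
tic;                                    % start timing, as done in the report
for k = 1:numel(templates)
    ref = templates{k};
    n = size(ref, 2);
    m = size(unknown, 2);

    % Local distances: Chebychev distance between every pair of frames.
    d = zeros(n, m);
    for i = 1:n
        for j = 1:m
            d(i, j) = max(abs(ref(:, i) - unknown(:, j)));
        end
    end

    % Global distance: cheapest cumulative path from the first frame pair
    % to the last frame pair, summing the local distances along the way.
    D = inf(n + 1, m + 1);
    D(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            D(i + 1, j + 1) = d(i, j) + min([D(i, j), D(i + 1, j), D(i, j + 1)]);
        end
    end
    score(k) = D(n + 1, m + 1);
end
elapsed = toc;                          % seconds spent on the matching
[bestScore, best] = min(score);         % lowest global distance wins
```

The first template is a time-compressed copy of the unknown sequence, so the alignment absorbs the difference in duration and its global distance is zero, while the second template scores higher and is rejected.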
REFERENCES

"Digital Signal Processing Mini Project." An Automatic Speaker Recognition System. <http://lcavwww.epfl.ch/~minhdo/asr_project/asr_project.html>.

"ECE3Speech Recognition." A Simple Speech Recognition Algorithm. <http://www.eecg.toronto.edu/~aamodt/ece341/speech-recognition/>.

"Isolated Word, Speech Recognition using Dynamic Time Warping." Dynamic Time Warping. <http://www.cnel.ufl.edu/~kkale/dtw.html>.

"Speech Recognition by Dynamic Time Warping." Speech Recognition by Dynamic Time Warping. <http://www.dcs.shef.ac.uk/~stu/com326/>.

J. Bilmes: A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report, University of Berkeley, ICSI-TR-97-021, April 1998. 01 August 2005 <http://crow.ee.washington.edu/people/bulyko/papers/em.pdf>.

The web sources above carry the dates 20 April 1998 and 15 April 2003, and were accessed on 14 June 2005, 1 July 2005 and 06 July 2005.