METHODOLOGY

Acquire Speech Signal → Preprocessing → Feature Extraction → Feature Matching → Recognized Command

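Voiced excitation: pulse train P(f)
Unvoiced excitation: white noise N(f)

$X(f) = v \cdot P(f) + u \cdot N(f)$

$S(f) = X(f) \cdot H(f) \cdot R(f)$

Here v and u weight the voiced and unvoiced excitation, X(f) is the resulting excitation spectrum, H(f) is the spectral shaping of the vocal tract, R(f) is the emission characteristic of the lips, and S(f) is the spectrum of the speech signal.

Spectral Shaping H(f)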

Changing the shape of the vocal tract changes the spectral shape of the speech signal, thus articulating different speech sounds. The most valuable information for a speech recognizer is contained in the way the spectral shape of the speech signal changes over time. Computing the power spectrum directly from the speech signal gives a spectrum containing ripples caused by the excitation spectrum X(f). A smooth spectral shape, without the ripples that represent X(f), therefore has to be estimated.

Cepstral Transformation

$S(f) = X(f) \cdot H(f) \cdot R(f) = H(f) \cdot U(f)$, where $U(f) = X(f) \cdot R(f)$

$\log|S(f)| = \log(|H(f)| \cdot |U(f)|) = \log|H(f)| + \log|U(f)|$

Interpret this log-spectrum as a time signal. The ripples caused by U(f) would then have a high "frequency", so a kind of low-pass filtering yields the smooth spectral shape. The inverse Fourier transform of the log-spectrum brings us back to the time domain, giving the so-called cepstrum. Low-pass filtering is done by setting the higher-order cepstral coefficients to zero and then transforming back to the frequency domain. Filtering in the cepstral domain is called liftering.
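The liftering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LabVIEW implementation: the helper names `dft` and `smooth_log_spectrum` and the parameter `n_keep` are hypothetical, and a naive DFT is used so the sketch needs no external library.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), enough for a short frame."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def smooth_log_spectrum(signal, n_keep):
    """Smooth the log-magnitude spectrum by liftering in the cepstral domain."""
    # Log-magnitude spectrum; the small offset avoids log(0).
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(signal)]
    # Inverse transform of the log-spectrum gives the (real) cepstrum.
    cepstrum = [v.real for v in dft(log_mag, inverse=True)]
    n = len(cepstrum)
    # Keep only the low-order coefficients (and their mirror images, so the
    # result stays real); everything else is set to zero.
    liftered = [c if (q < n_keep or q > n - n_keep) else 0.0
                for q, c in enumerate(cepstrum)]
    # Transform back to the frequency domain: the smoothed log-spectrum.
    return [v.real for v in dft(liftered)]
```

Keeping fewer coefficients gives a smoother envelope; keeping only the zeroth coefficient collapses the log-spectrum to its mean value.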

The human ear does not have a linear frequency resolution; instead it forms several groups of frequencies and integrates the spectral energy within each group. The mid-frequencies and bandwidths of these groups are nonlinearly distributed. This nonlinear warping of the frequency axis is modeled by the mel scale, on which the frequency groups are assumed to be linearly distributed:

$f_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)$, with f in Hz
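As a quick check of the mel formula, a minimal Python sketch of the mapping and its inverse follows. The function names `hz_to_mel` and `mel_to_hz` are illustrative, and the base-10 logarithm is assumed.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz onto the mel scale (base-10 log form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, from mel back to Hz."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)
```

With these constants 1000 Hz maps to roughly 1000 mel, and equal steps in mel correspond to increasingly wide steps in Hz at higher frequencies.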

A common way to do mel-frequency warping is to use triangle-shaped filters in the spectral domain, building a weighted sum over the power spectrum coefficients that lie within each window. This gives a new set of coefficients known as the mel spectral coefficients. The cepstral transformation is then performed on them to extract the Mel Frequency Cepstral Coefficients (MFCC). The MFCC are used directly for recognition instead of being transformed back to the frequency domain.
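A triangular filter bank of this kind might be built as in the sketch below. It is an assumption-laden illustration: the number of filters, the bin layout (bins evenly covering 0 Hz to half the sampling rate) and the names `mel_filterbank` and `mel_spectrum` are not taken from the project.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies: each filter spans three neighbours.
    edges = [mel_to_hz(top * k / (n_filters + 1)) for k in range(n_filters + 2)]
    # Centre frequency (Hz) of every power-spectrum bin.
    bin_hz = [b * (sample_rate / 2.0) / (n_bins - 1) for b in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mel_spectrum(power_spectrum, bank):
    """Weighted sum of the power spectrum under each triangular filter."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
```

With short frames the lowest filters can be narrower than one FFT bin, so in practice the filter count has to be chosen with the frame length in mind.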

Feature Matching

Distance calculation using Dynamic Time Warping

Each utterance is divided into frames of 20 ms. The MFCC of each frame are computed and represented by a vector, so each utterance is represented by a vector sequence $X = \{x_0, x_1, \ldots, x_{T_x-1}\}$. The distance between individual vectors is found using the Euclidean distance formula.

DTW Algorithm

Finding the optimal alignment path

Key points for finding the optimal path:

A grid point (i, j) on the optimal path can have the predecessors (i-1, j), (i-1, j-1) and (i, j-1).

Bellman's Principle: If Popt is the optimal path through the matrix of grid points beginning at (0, 0) and ending at (Tw-1, Tx-1), and grid point (i, j) is part of path Popt, then the partial path from (0, 0) to (i, j) is also part of Popt.

The accumulated distance matrix is therefore created grid point by grid point, each entry building on the best of its allowed predecessors.

The accumulated distance at the grid point (Tw-1, Tx-1) is the distance between the vector sequences W and X.
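The procedure can be sketched in Python as follows. The accumulation formula used on the slide is not reproduced in the text, so this sketch assumes the plain unweighted recursion D(i, j) = d(w_i, x_j) + min of the three predecessors; the names `euclidean` and `dtw_distance` are illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dtw_distance(W, X):
    """Accumulated DTW distance between vector sequences W and X."""
    Tw, Tx = len(W), len(X)
    INF = float("inf")
    D = [[INF] * Tx for _ in range(Tw)]
    for i in range(Tw):
        for j in range(Tx):
            d = euclidean(W[i], X[j])
            if i == 0 and j == 0:
                D[i][j] = d          # the path starts at (0, 0)
                continue
            # Allowed predecessors: (i-1, j), (i-1, j-1), (i, j-1).
            best = min(
                D[i - 1][j] if i > 0 else INF,
                D[i - 1][j - 1] if i > 0 and j > 0 else INF,
                D[i][j - 1] if j > 0 else INF,
            )
            D[i][j] = d + best       # assumed unweighted accumulation
    return D[Tw - 1][Tx - 1]
```

Two utterances that differ only in timing, such as one with a repeated frame, come out with distance zero, which is the behaviour the time warping is meant to provide.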

Front Panel

Block Diagram

Step 1: Acquire Speech Signal

The input speech signal is acquired using the LabVIEW Acquire Sound Express VI for 3 s at a sampling rate of 11025 Hz. An array of LEDs on the front panel indicates the progress of the acquisition.

Step 2: Pre-processing

Preprocessing of the input speech signal consists of the following steps:

2.1 Pre-Emphasis

The goal of pre-emphasis is to compensate for the high-frequency part of the signal that was suppressed by the human sound production mechanism. The speech signal is therefore passed through an FIR high-pass filter, which increases the magnitude of the higher frequencies relative to the lower ones and hence improves the overall signal-to-noise ratio.

$y[n] = x[n] - 0.95 \, x[n-1]$
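A minimal Python sketch of this first-order filter is shown here. The function name `pre_emphasis` is illustrative, and passing the first sample through unchanged is an assumption about the boundary handling.

```python
def pre_emphasis(x, alpha=0.95):
    """First-order FIR high-pass: y[n] = x[n] - alpha * x[n-1]."""
    if not x:
        return []
    # The first sample has no predecessor, so it is kept as-is (assumption).
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]
```

A constant (zero-frequency) input is attenuated to 5% of its level, while a signal alternating in sign from sample to sample is almost doubled.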

2.2 Framing

The input speech signal is segmented into frames of 20 ms length, each overlapping its adjoining frames by 50% to preserve continuity.
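The framing step might look like the following sketch. The name `frame_signal` is illustrative; rounding the frame length down to a whole number of samples and dropping any incomplete trailing frame are assumptions.

```python
def frame_signal(x, sample_rate=11025, frame_ms=20.0, overlap=0.5):
    """Split a signal into fixed-length frames with fractional overlap."""
    frame_len = int(sample_rate * frame_ms / 1000.0)   # 220 samples at 11025 Hz
    hop = max(1, int(frame_len * (1.0 - overlap)))     # 110 samples for 50%
    # Incomplete trailing samples are dropped (assumption).
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]
```

Under these assumptions a 3 s recording at 11025 Hz (33075 samples) yields 299 frames of 220 samples each.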

2.3 Windowing

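Each frame is multiplied by a Hamming window in the time domain. This helps to reduce the discontinuity at the start and end of each frame.

$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right)$, $0 \le n \le N-1$, where N is the number of samples in a frame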
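In Python the window and its application could be written as below; `hamming` and `apply_window` are illustrative names.

```python
import math

def hamming(N):
    """Hamming window of length N: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    """Multiply a frame sample-by-sample with a Hamming window."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]
```

The window tapers to 0.08 at both ends of the frame and reaches 1.0 at the centre of an odd-length frame.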

To detect the start of the utterance within the 3 s input speech signal, the energy of each frame is calculated and stored in an array, so the size of the energy array equals the total number of frames. The array is sorted in ascending order, and the mean of its first 15 elements is taken as the noise energy. The threshold was set to 10 times the noise energy.

Once the threshold has been calculated, all the elements in the energy array which are greater than the threshold are replaced by 1 and the rest by 0. Thus a Boolean array of the following form is obtained.
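The thresholding described above can be sketched as follows. The names `frame_energy` and `voiced_flags` are illustrative, and frame energy is assumed to be the sum of squared samples.

```python
def frame_energy(frame):
    """Energy of a frame, taken here as the sum of squared samples."""
    return sum(s * s for s in frame)

def voiced_flags(frames, n_noise=15, factor=10.0):
    """Return 1 for frames whose energy exceeds the noise-based threshold."""
    energies = [frame_energy(f) for f in frames]
    if not energies:
        return []
    # Noise energy: mean of the n_noise quietest frames.
    quietest = sorted(energies)[:n_noise]
    noise = sum(quietest) / len(quietest)
    threshold = factor * noise
    return [1 if e > threshold else 0 for e in energies]
```

This relies on the recording containing at least 15 frames of background noise, which holds when the command is much shorter than the 3 s acquisition window.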

Sometimes spikes due to external noise cross the threshold and contribute a 1 to the Boolean array. To remove these spikes, a Median Filter VI in LabVIEW with left and right rank of 3 is used. The median filter replaces the i-th element of the Boolean array with the median of the elements {i-3, i-2, i-1, i, i+1, i+2, i+3}. Hence the median filter smooths the Boolean array.
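The effect of the median filter can be reproduced with a short Python sketch. It is not the LabVIEW VI: the name `median_filter` is illustrative, and truncating the window at the array edges is an assumption that may differ from the VI's own edge handling.

```python
import statistics

def median_filter(flags, rank=3):
    """Replace each element by the median of its (2 * rank + 1)-wide window."""
    n = len(flags)
    out = []
    for i in range(n):
        # The window is truncated at the array edges (assumption).
        lo, hi = max(0, i - rank), min(n, i + rank + 1)
        out.append(int(statistics.median_low(flags[lo:hi])))
    return out
```

An isolated 1 surrounded by zeros is removed, while a run of ones longer than the rank survives unchanged.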

Now the Peak Detector VI in LabVIEW is used to find the indices of the start and end of the utterance. Using these indices, the corresponding frames containing the utterance are extracted.

N.B.: In my project, all commands were shorter than 0.6 s. Sometimes spikes due to noise remained even after median filtering, so the end index was not detected accurately. The start index was detected accurately most of the time, so I extracted 0.6 s of sound after the start index.

Step 3: Feature Extraction

An FFT is computed for each frame of the utterance, and only half of the symmetric spectrum is kept. The spectrum of each frame is warped onto the mel scale to obtain the mel spectral coefficients. A discrete cosine transform is applied to the mel spectral coefficients of each frame to obtain the MFCC. The first 2 coefficients of the obtained MFCC are removed, as they varied significantly between different utterances of the same word. Liftering is done by replacing all MFCC except the first 14 with zero. The first MFCC of each frame was replaced by the log energy of that frame. Delta and acceleration coefficients are derived from the MFCC to increase the dimension of the feature vector of each frame, thereby increasing the accuracy.
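The DCT step alone can be sketched in Python as follows. This is a partial illustration: it takes the logarithm of the mel spectral coefficients before the DCT (as the cepstral transformation implies), uses an unnormalized DCT-II, and leaves out the removal of the first coefficients and the log-energy substitution. The names `dct_ii` and `mfcc_from_mel` are hypothetical.

```python
import math

def dct_ii(v):
    """Unnormalized DCT-II of a real sequence."""
    N = len(v)
    return [sum(v[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(N)]

def mfcc_from_mel(mel_spec, n_keep=14):
    """Log of the mel spectrum, DCT, then keep the first n_keep coefficients."""
    # The small offset avoids log(0) for empty filter outputs.
    log_mel = [math.log(m + 1e-12) for m in mel_spec]
    # Truncating after n_keep coefficients plays the role of liftering.
    return dct_ii(log_mel)[:n_keep]
```

A perfectly flat mel spectrum puts all of its energy into the zeroth coefficient, which is one way to see why the low-order coefficients describe the overall spectral envelope.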

Delta coefficients are found from the following equation. Value of p chosen was 1.

Acceleration coefficients are found by replacing the MFCC in the above equation with the delta coefficients. The feature vector is normalized by subtracting its mean from each element. Thus each frame of the utterance is converted into a feature vector of dimension 35.
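The delta equation itself is not reproduced in the text, so the sketch below assumes the commonly used regression formula, d_t = sum over k = 1..p of k (c_{t+k} - c_{t-k}) divided by 2 times the sum of k squared, which for p = 1 reduces to (c_{t+1} - c_{t-1}) / 2. Repeating the first and last frames at the boundaries is also an assumption, and `deltas` is an illustrative name.

```python
def deltas(frames, p=1):
    """Delta coefficients per frame via the (assumed) regression formula."""
    T = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, p + 1))
    out = []
    for t in range(T):
        row = []
        for c in range(len(frames[t])):
            num = 0.0
            for k in range(1, p + 1):
                # Boundary frames are repeated (assumption).
                after = frames[min(T - 1, t + k)][c]
                before = frames[max(0, t - k)][c]
                num += k * (after - before)
            row.append(num / denom)
        out.append(row)
    return out
```

Acceleration coefficients then follow by applying the same function to its own output, for example `deltas(deltas(mfcc_frames))`.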

Step 4: Feature Matching

A dictionary with six sets has been created. Each set stores the feature vector sequences of the words to be recognized. The feature vector sequence of the test utterance is compared with each word in each set using DTW, and the best match in each set is output. The mode of the six best matches is taken as the recognized command. A threshold is set so that a random speech signal does not result in a match with the commands.
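The voting scheme could be sketched as below. How the threshold is applied is not spelled out in the text; this sketch assumes a set only votes when its best distance is within the threshold. The name `recognize`, the dictionary layout (one word-to-template mapping per set) and the `distance` callback are all hypothetical; in the project the distance would be the DTW distance.

```python
from collections import Counter

def recognize(test_seq, dictionary_sets, distance, threshold):
    """Pick the best word in each set, then return the most frequent one."""
    votes = []
    for templates in dictionary_sets:            # one mapping per set
        word, dist = min(
            ((w, distance(tpl, test_seq)) for w, tpl in templates.items()),
            key=lambda pair: pair[1],
        )
        if dist <= threshold:                    # assumed rejection rule
            votes.append(word)
    if not votes:
        return None                              # no command recognized
    return Counter(votes).most_common(1)[0][0]   # mode of the votes
```

Taking the mode over several independently recorded sets makes a single poorly recorded template less likely to decide the result.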

http://www.youtube.com/watch?v=aEqa-t_TWiY

Limitations

Environment Dependent

The feature vectors of the input speech are compared with the feature vectors in the dictionary, which were recorded in a particular environment. When the system is used in a different environment, the recognition accuracy decreases unless the threshold and the dictionary are updated accordingly.

Speaker Dependent

As the dictionary is trained by a particular user, the VI gives consistent results only when it is used by that same user.


