
Number Recognition System Using Mel-Cepstral Coefficients and Dynamic Time Warping
Abstract:
Numbers spoken by a human can be extracted and analyzed efficiently by a system,
generally a computer, to produce an output. The objective is to recognize the numbers
as they are spoken and display each of them using a graphical user interface in Python.
A training sequence of numbers is first recorded from the user by sampling. Every
later utterance is matched against these recordings and the result is displayed accordingly.
Both the training and the test voice signals are divided into short frames, typically
25 ms long with an overlap of 10 ms. Each frame is then multiplied by a Hamming window,
which tapers the frame toward zero at both ends and so reduces the discontinuity of the
speech signal at the frame edges.
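
The framing and windowing step can be sketched as follows. This is a minimal illustration in plain Python, not the original implementation: the function names, the 8 kHz sample rate in the usage note, and the reading of "overlap of 10 ms" as the number of milliseconds shared by consecutive frames are all assumptions. Note that the standard Hamming coefficients (0.54 and 0.46) scale the edge samples to 0.08 of their amplitude rather than to exactly zero.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=25, overlap_ms=10):
    """Split a list of samples into overlapping frames of equal length.

    Assumption: overlap_ms is the duration shared by consecutive frames,
    so the step between frame starts is frame length minus overlap.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = frame_len - int(sample_rate * overlap_ms / 1000)
    frames = []
    # Trailing samples that do not fill a whole frame are dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def hamming(n):
    """Return the n-point Hamming window (n > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def window_frames(frames):
    """Multiply every frame by a Hamming window of matching length."""
    if not frames:
        return []
    window = hamming(len(frames[0]))
    return [[s * w for s, w in zip(frame, window)] for frame in frames]
```

For example, at an assumed sample rate of 8000 Hz a 25 ms frame holds 200 samples, and a 10 ms overlap gives a step of 120 samples between frame starts.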

Using the Fast Fourier Transform (FFT) algorithm, the spectral coefficients of
each speech frame are estimated. To mimic the human ear's lower sensitivity to
higher frequencies than to lower frequencies, the spectrum of each frame is
then multiplied by the mel filter bank, which is spaced according to the mel scale:
$$f_{\mathrm{Mel}} = 2595 \, \log_{10}\left(1 + \frac{f}{700}\right)$$
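
The mel mapping above can be written directly in code. The sketch below is illustrative only: the inverse mapping and the helper `mel_center_frequencies` (which places filter centers at equal steps on the mel scale) are assumptions about how a filter bank is usually laid out, not details given in the text.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies (Hz) of n_filters equally spaced on the mel scale.

    Hypothetical helper: the band edges f_low and f_high are excluded,
    so the centers lie strictly between them.
    """
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(n_filters)]
```

Because the spacing is uniform in mels, the centers sit close together at low frequencies and spread apart at high frequencies, which is what gives the filter bank its coarser resolution in the upper part of the spectrum.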
The discrete cosine transform (DCT) of the log-magnitude of the mel-filtered spectrum
gives the mel-cepstral coefficients.
The first thirteen of these coefficients are extracted and stored in an array for each
frame of the signal.
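
The final step, from filter-bank outputs to thirteen coefficients per frame, might look like the sketch below. It assumes the mel filter-bank energies of one frame are already available as a list; the unnormalized DCT-II, the natural logarithm, the small floor that avoids taking the log of zero, and the function names are illustrative choices rather than details from the text.

```python
import math

def dct_ii(values):
    """Unnormalized type-II discrete cosine transform of a list."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def mfcc_from_filterbank(energies, n_coeffs=13, floor=1e-10):
    """Return the first n_coeffs cepstral coefficients for one frame.

    energies: mel filter-bank outputs for the frame (non-negative numbers).
    floor: assumed lower bound applied before the logarithm.
    """
    log_energies = [math.log(max(e, floor)) for e in energies]
    return dct_ii(log_energies)[:n_coeffs]
```

Calling `mfcc_from_filterbank` once per frame and collecting the results yields an array with one row of thirteen coefficients per frame.
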
Dynamic Time Warping:
A person never says the same word in exactly the same way twice. For example, a
vowel may be held longer in one utterance than in the previous one. Dynamic time
warping, a non-linear alignment method, compensates for this variation by stretching
and compressing different parts of the test signal so that it lines up with a given
training signal. The training signal with the minimum warped distance to the test
signal determines the number that was actually spoken.
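
A minimal sketch of this matching step is shown below. It uses the textbook dynamic-programming form of dynamic time warping with a Euclidean distance between feature vectors. The function names, the choice of local distance, and the `templates` dictionary that maps each number label to its training feature sequence are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best warping path between two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            # A frame may be repeated (stretch) or advanced on both sides.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(test_features, templates):
    """Return the label whose training sequence is closest under DTW."""
    return min(templates, key=lambda label: dtw_distance(test_features, templates[label]))
```

In this form a test utterance with a lengthened vowel simply repeats some frames of the matching template along the warping path, so its distance to that template stays small even though the two sequences differ in length.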

Reference: http://research.ijcaonline.org/volume40/number3/pxc3877167.pdf
