
MEL SCALE

The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one
another. The reference point between this scale and normal frequency measurement is defined by
assigning a perceptual pitch of 1000 mels to a 1000 Hz tone, 40 dB above the listener's threshold.
Above about 500 Hz, larger and larger intervals are judged by listeners to produce equal pitch
increments. As a result, four octaves on the hertz scale above 500 Hz are judged to comprise about
two octaves on the mel scale.
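The text above does not give a conversion formula. As a rough sketch, we assume the commonly used
approximation m = 2595 * log10(1 + f / 700); the function names hz_to_mel and mel_to_hz are our own.
Under that assumption a 1000 Hz tone lands very close to 1000 mels, and the four octaves from 500 Hz
to 8000 Hz shrink to a little over two octaves in mels.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Hz -> mel using an ASSUMED formula (not stated in the text): 2595*log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of the assumed formula above."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Reference point: 1000 Hz should map to roughly 1000 mels.
print(hz_to_mel(1000.0))

# Octave comparison for the range 500 Hz .. 8000 Hz.
octaves_hz = math.log2(8000.0 / 500.0)                         # four octaves in hertz
octaves_mel = math.log2(hz_to_mel(8000.0) / hz_to_mel(500.0))  # roughly two in mels
print(octaves_hz, octaves_mel)
```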

Mel Frequency Cepstral Coefficient (MFCC) tutorial

The first step in any automatic speech recognition system is to extract features, i.e. identify the
components of the audio signal that are good for identifying the linguistic content and discard all
the other stuff which carries information like background noise, emotion etc. The shape of the vocal
tract manifests itself in the envelope of the short time power spectrum, and the job of MFCCs is to
accurately represent this envelope.
We frame the signal into 20-40 ms frames; 25 ms is standard. We start with a speech signal, which
we'll assume is sampled at 16 kHz. This means the frame length for a 16 kHz signal is
0.025 * 16000 = 400 samples. The frame step is usually something like 10 ms (160 samples), which
allows the frames to overlap. The first 400-sample frame starts at sample 0, the next 400-sample
frame starts at sample 160, and so on until the end of the speech file is reached. If the speech
file does not divide into a whole number of frames, pad it with zeros so that it does.
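The framing step can be sketched as follows. This is a minimal illustration using the 400-sample
frame length and 160-sample step from the text; the function name frame_signal and the choice to pad
only at the end of the signal are our own assumptions.

```python
import math

def frame_signal(samples, frame_len=400, frame_step=160):
    """Split samples into overlapping frames, zero-padding the tail so the last frame is full."""
    n = len(samples)
    if n <= frame_len:
        num_frames = 1
    else:
        # One frame at sample 0, plus one for every (possibly partial) step after it.
        num_frames = 1 + math.ceil((n - frame_len) / frame_step)
    padded_len = (num_frames - 1) * frame_step + frame_len
    padded = list(samples) + [0.0] * (padded_len - n)
    return [padded[i * frame_step : i * frame_step + frame_len] for i in range(num_frames)]

# One second of a 16 kHz signal; the values are just sample indices so frames are easy to inspect.
signal = [float(i) for i in range(16000)]
frames = frame_signal(signal)
print(len(frames), frames[1][0], frames[-1][-1])
```

Here the second frame starts at sample 160, and the last frame ends in padding zeros because 16000
samples do not divide into a whole number of frames.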
The next steps are applied to every single frame; one set of 12 MFCC coefficients is extracted for
each frame. We calculate the power spectrum of each frame. Our periodogram estimate identifies
which frequencies are present in the frame. The periodogram spectral estimate still contains a lot of
information not required for Automatic Speech Recognition (ASR). In particular the cochlea cannot
discern the difference between two closely spaced frequencies. This effect becomes more
pronounced as the frequencies increase. For this reason we take clumps of periodogram bins and
sum them up to get an idea of how much energy exists in various frequency regions. This is
performed by our Mel filterbank: the first filter is very narrow and gives an indication of how much
energy exists near 0 Hertz. As the frequencies get higher our filters get wider as we become less
concerned about variations. We are only interested in roughly how much energy occurs at each spot.
The Mel scale tells us exactly how to space our filterbanks and how wide to make them.
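A rough sketch of the periodogram and the Mel filterbank for one frame is given below. Several
details are assumptions rather than anything stated in the text: a 512-point DFT, triangular filter
shapes, the Hz-to-mel formula 2595 * log10(1 + f / 700), the bin mapping floor((nfft + 1) * f / rate),
and all function names. The 26 filters match the 26 coefficients mentioned later. No window function
is applied, and the DFT is a slow direct sum so that the example needs only the standard library.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)   # assumed conversion formula

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame, nfft=512):
    """Periodogram estimate |X(k)|^2 / N for bins 0..nfft/2 (frame is implicitly zero-padded)."""
    n = len(frame)
    spectrum = []
    for k in range(nfft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / nfft) for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(num_filters=26, nfft=512, sample_rate=16000):
    """Triangular filters whose edges are equally spaced on the (assumed) mel scale."""
    high_mel = hz_to_mel(sample_rate / 2)
    mel_points = [i * high_mel / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((nfft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = []
    for j in range(1, num_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (nfft // 2 + 1)
        for k in range(left, centre):      # rising edge, 0 at left up towards 1
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge, 1 at centre down towards 0
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank

# A 400-sample frame holding a 1000 Hz tone sampled at 16 kHz.
frame = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(400)]
spec = power_spectrum(frame)
bank = mel_filterbank()
# Each filterbank energy is a weighted sum of periodogram bins.
energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
strongest = energies.index(max(energies))
print(spec.index(max(spec)), strongest)
```

The lowest filters span only a couple of bins while the highest span dozens, which is the
narrow-to-wide behaviour described above.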
Once we have the filterbank energies, we take the logarithm of them. This is also motivated by
human hearing: we don't hear loudness on a linear scale. This compression operation makes our
features match more closely what humans actually hear. The logarithm allows us to use cepstral
mean subtraction, which is a channel normalisation technique.
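To see why the logarithm makes mean subtraction work, here is a small sketch with made-up numbers.
A fixed channel multiplies each filterbank energy by a constant gain; after the log that gain becomes
a constant offset, which subtracting the per-filter mean over all frames removes. For brevity the
subtraction is applied to the log energies directly rather than to cepstra (the DCT is linear, so the
same cancellation carries over). The function names, the energy floor and the gain values are all
hypothetical.

```python
import math

def log_energies(frames_energies, floor=1e-10):
    """Natural log of each filterbank energy; the floor avoids log(0) and is our own choice."""
    return [[math.log(max(e, floor)) for e in frame] for frame in frames_energies]

def mean_subtract(features):
    """Subtract the mean over all frames from each feature dimension."""
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in features]

# Three frames of three made-up filterbank energies.
clean = [[1.0, 4.0, 2.0], [2.0, 1.0, 8.0], [4.0, 2.0, 1.0]]
# Hypothetical channel: a fixed gain per filter, identical in every frame.
channel = [0.5, 3.0, 10.0]
filtered = [[e * g for e, g in zip(frame, channel)] for frame in clean]

a = mean_subtract(log_energies(clean))
b = mean_subtract(log_energies(filtered))
max_diff = max(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))
print(max_diff)   # tiny: the channel gain has been normalised away
```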
The final step is to compute the DCT of the log filterbank energies. There are 2 main reasons this is
performed. Because our filterbanks are all overlapping, the filterbank energies are quite correlated
with each other. The DCT decorrelates the energies which means diagonal covariance matrices can
be used to model the features in e.g. a HMM classifier. But notice that only 12 of the 26 DCT
coefficients are kept. This is because the higher DCT coefficients represent fast changes in the
filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we
get a small improvement by dropping them.
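The last step can be sketched like this. The text does not say which DCT variant is used or exactly
which 12 coefficients are kept, so we assume an unnormalised DCT-II and simply keep the 12
lowest-order coefficients (some implementations also skip the very first one). The function names and
the input values are our own.

```python
import math

def dct_ii(x):
    """Unnormalised DCT-II: c[k] = sum_n x[n] * cos(pi * k * (2n + 1) / (2N))."""
    n_points = len(x)
    return [
        sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * n_points)) for n in range(n_points))
        for k in range(n_points)
    ]

def mfcc_from_log_energies(log_fbank, num_keep=12):
    """Keep only the low-order DCT coefficients; the higher ones (fast changes) are dropped."""
    return dct_ii(log_fbank)[:num_keep]

# 26 made-up log filterbank energies: a slow trend plus a small alternating ripple.
log_fbank = [0.1 * i + 0.05 * (-1) ** i for i in range(26)]
coeffs = mfcc_from_log_energies(log_fbank)
print(len(coeffs))

# Sanity check: a flat input puts everything into coefficient 0.
flat = dct_ii([2.0] * 26)
print(flat[0], max(abs(c) for c in flat[1:]))
```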
