
Text Independent Speaker Recognition

Prepared by: Pravin Gondaliya [08BEC029], Surendra Jalu [08BEC034]

Guided by: Dr. Tanish H. Zave

Our Goal:
To understand digital speech processing and implement it on the Spartan-3A DSP kit.

Today's Agenda:
- Basics of speech processing
- What is speech enhancement?
- Speech enhancement algorithm
- Spartan 3A DSP kit
- ISE tool for designing

Introduction to speech processing

Speech processing is the application of digital signal processing (DSP) techniques to the processing and/or analysis of speech signals.
Applications of speech processing include:
- Speech coding
- Speech recognition
- Speaker verification/identification
- Speech enhancement
- Speech synthesis (text-to-speech conversion)

The figure shows a schematic diagram of the speech production/speech perception process in human beings.
The speech production process begins when the talker formulates a message in his/her mind to transmit to the listener via speech.
The next step in the process is the conversion of the message into a language code. This corresponds to converting the message into a set of phoneme sequences corresponding to the sounds that make up the words, along with prosody markers denoting the duration, loudness, and pitch associated with the sounds.

Information Rate of the speech Signal

The discrete symbol information rate in the raw message text is rather low (about 50 bits per second, corresponding to about 8 sounds per second, where each sound is one of about 50 distinct symbols). After the language code conversion, with the inclusion of prosody information, the information rate rises to about 200 bps.
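As a rough check of the "about 50 bps" figure, 8 sounds per second, each drawn from roughly 50 distinct symbols, carry about 8 × log2(50) bits per second. A one-line sketch (Python here is only for illustration; the slides' own toolchain is not specified):

```python
import math

# ~8 sounds per second, each one of ~50 distinct symbols,
# so each sound carries about log2(50) bits of information.
bits_per_sound = math.log2(50)          # ~5.64 bits
sounds_per_second = 8
rate = sounds_per_second * bits_per_sound
print(round(rate))                      # ~45 bps, i.e. "about 50 bps"
```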

The mechanism of Speech production

In order to apply DSP techniques to speech processing problems, it is important to understand the fundamentals of the speech production process.
Speech signals are composed of a sequence of sounds, and these sounds are produced as a result of acoustical excitation of the vocal tract when air is expelled from the lungs.

Speech Production Mechanism


The vocal tract begins at the opening between the vocal cords and ends at the lips. In the average male, the total length of the vocal tract is about 17 cm.

The cross-sectional area of the vocal tract, determined by the positions of the tongue, lips, jaw, and velum, varies from zero (complete closure) to about 20 cm².

Classification of Speech Sounds

In speech processing, speech sounds are divided into TWO broad classes, depending on the role of the vocal cords in the speech production mechanism.
- VOICED speech is produced when the vocal cords play an active role (i.e. vibrate) in the production of a sound. Examples: /a/, /e/, /i/
- UNVOICED speech is produced when the vocal cords are inactive. Examples: /s/, /f/

Voiced Speech

Voiced speech occurs when air flows through the vocal cords into the vocal tract in discrete puffs rather than as a continuous flow.

The vocal cords vibrate at a particular frequency, called the fundamental frequency of the sound:
- 50-200 Hz for male speakers
- 150-300 Hz for female speakers
- 200-400 Hz for child speakers

Unvoiced speech
For unvoiced speech, the vocal cords are held open and air flows continuously through them. The vocal tract, however, is narrowed, resulting in a turbulent flow of air along the tract. Examples include the unvoiced fricatives /f/ and /s/. Unvoiced speech is characterized by high-frequency components.

Other Sound classes


Nasal Sounds
- Vocal tract coupled acoustically with the nasal cavity through the velar opening
- Sound radiated from the nostrils as well as the lips
- Examples: /m/, /n/, /ng/

Plosive Sounds
- Characterized by complete closure/constriction towards the front of the vocal tract
- Build-up of pressure behind the closure, then sudden release
- Examples: /p/, /t/, /k/

Speech Enhancement
Speech enhancement is concerned with improving some perceptual aspect of speech that has been degraded by additive noise. Different kinds of noise affect the quality of speech. Different speech enhancement techniques are used to improve the quality of speech and to reduce specific noise coming from different sources at different SNRs.

Block Diagram of MFCC algorithm

Preprocessing & Frame Blocking


Continuous human speech is recorded and preprocessed. In preprocessing, silence detection and amplification take place. The preprocessed output is then fed to the frame blocking section. In frame blocking, the continuous speech signal is blocked into frames of a fixed number of samples. This process continues until all the speech is accounted for within one or more frames.
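A minimal sketch of frame blocking in Python/NumPy (the frame length and hop size here are illustrative choices; the slides do not fix them):

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=100):
    """Block a 1-D speech signal into (possibly overlapping) frames.
    frame_len and hop are illustrative values, not from the slides."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.zeros((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = x[i * hop : i * hop + frame_len]
    return frames

x = np.arange(1000, dtype=float)   # stand-in for a recorded signal
frames = frame_signal(x)
print(frames.shape)                # (8, 256)
```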

Windowing

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The idea is to reduce spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N-1, where N is the number of samples in each frame, then the resulting signal is

y(n) = x(n) w(n), 0 ≤ n ≤ N-1

Typically the Hamming window is used, which has the form

w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1
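As a sketch, the Hamming windowing step can be written in NumPy (the frame length of 256 samples is an illustrative choice):

```python
import numpy as np

N = 256                                              # samples per frame (illustrative)
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))    # Hamming window
# np.hamming(N) produces the same curve
frame = np.random.randn(N)                           # stand-in speech frame
y = frame * w                                        # y(n) = x(n) w(n)
print(w[0], w[N - 1])                                # both 0.08: edges tapered
```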

Mel Frequency Cepstrum


The power cepstrum of a signal is the squared magnitude of the Fourier transform of the logarithm of the squared magnitude of the Fourier transform of the signal.

Mathematically: power cepstrum = |F{log(|F{y(t)}|²)}|²
Algorithmically: signal → FT → abs() → square → log → FT → abs() → square → power cepstrum

The cepstrum can be seen as information about the rate of change in the different spectrum bands. It was originally invented for characterizing the seismic echoes resulting from earthquakes and bomb explosions, and has also been used to determine the fundamental frequency of human speech and to analyze radar signal returns. Cepstrum pitch determination is particularly effective because the effects of the vocal excitation (pitch) and the vocal tract (formants) are additive in the logarithm of the power spectrum and thus clearly separate.

The independent variable of a cepstral graph is called the quefrency. The quefrency is a measure of time, though not in the sense of a signal in the time domain. For example, if the sampling rate of an audio signal is 44100 Hz and there is a large peak in the cepstrum whose quefrency is 100 samples, the peak indicates the presence of a pitch of 44100/100 = 441 Hz. This peak occurs in the cepstrum because the harmonics in the spectrum are periodic, and the period corresponds to the pitch.

The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. So our next step is the FFT (Fast Fourier Transform) of the speech signal, which is then fed to the mel filter bank.
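The 441 Hz quefrency example can be reproduced with a short NumPy sketch; the harmonic test signal and the quefrency search range are illustrative choices, not from the slides:

```python
import numpy as np

fs = 44100                      # sampling rate from the slide's example
f0 = 441.0                      # pitch; period = fs/f0 = 100 samples
t = np.arange(fs) / fs          # one second of signal
# Harmonic-rich periodic signal: fundamental plus four harmonics
x = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 6))

# Power cepstrum: |F{ log(|F{x}|^2) }|^2
spectrum = np.abs(np.fft.fft(x)) ** 2
cepstrum = np.abs(np.fft.fft(np.log(spectrum + 1e-12))) ** 2

# The largest peak away from quefrency 0 sits at the pitch period
q = 50 + np.argmax(cepstrum[50:200])
print(q, fs / q)                # peak near quefrency 100 -> pitch ~441 Hz
```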

Difference between normal and mel cepstrum

Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound.

Why MEL scale?


Psychophysical studies have shown that human perception of the frequency content of sounds, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.

MEL scale
The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The name mel comes from the word melody, to indicate that the scale is based on pitch comparisons. A popular formula to convert f hertz into m mel is:

m = 2595 log10 {1+(f/700)}
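A quick sketch of this conversion, together with its inverse (obtained by solving the same formula for f):

```python
import math

def hz_to_mel(f):
    """m = 2595 * log10(1 + f/700)"""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000))   # ~1000 mel: 1000 Hz is the scale's reference point
```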

MFCC
MFCCs are commonly derived as follows:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
The MFCCs are the amplitudes of the resulting spectrum.
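The four steps above can be sketched in Python/NumPy. This is an illustrative reimplementation, not the project's actual code; the sampling rate, filter count, and coefficient count are assumed values:

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=8000, n_filters=20, n_coeffs=12):
    """Sketch of the four MFCC steps for one windowed frame.
    fs, n_filters and n_coeffs are illustrative choices."""
    N = len(frame)
    # 1. Fourier transform -> power spectrum (positive frequencies)
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(N, 1.0 / fs)
    # 2. Triangular overlapping filters, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0, None)
        down = np.clip((hi - freqs) / (hi - mid), 0, None)
        fbank[i] = np.sum(np.minimum(up, down) * power)
    # 3. Log of the mel-band energies
    log_e = np.log(fbank + 1e-12)
    # 4. DCT-II of the log energies; keep the first few coefficients
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct @ log_e

coeffs = mfcc_frame(np.random.randn(256) * np.hamming(256))
print(coeffs.shape)   # (12,)
```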

Implementation

For each speaker, we record 5 samples of speech. Each speech sample undergoes mel-frequency cepstral analysis, and the MFCCs are calculated for each sample. The computed values are then stored in the DB.mat database.

Pattern matching then takes place. The system asks the user to enter his/her speech for testing and compares the computed MFCCs of this test speech with those in the DB.mat database. If a match is found, the user is identified.

Pattern matching
In this process, the centroid of the values of the five samples is computed, as shown in the figure. Then, for each speaker, the test speech is compared with each of the samples, including the centroid. The best match is selected on the basis of the maximum number of values matched in a particular sample. So if, for any speaker, any one of the five samples matches the test speech, that user is identified.
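As an illustration of this matching idea (not the project's actual code: the speaker names, the 12-coefficient vectors, and the Euclidean distance measure are all hypothetical), a nearest-template search over the five samples plus their centroid might look like:

```python
import numpy as np

# Hypothetical database: 5 stored MFCC vectors per speaker.
rng = np.random.default_rng(0)
db = {name: rng.normal(loc=i, size=(5, 12))          # 5 samples x 12 MFCCs
      for i, name in enumerate(["alice", "bob", "carol"])}

def identify(test_vec, db):
    """Return the speaker whose stored samples (or their centroid)
    lie closest to the test vector."""
    best_name, best_dist = None, np.inf
    for name, samples in db.items():
        templates = np.vstack([samples, samples.mean(axis=0)])  # + centroid
        d = np.linalg.norm(templates - test_vec, axis=1).min()
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name

test = db["bob"][2] + rng.normal(scale=0.1, size=12)  # noisy copy of a bob sample
print(identify(test, db))   # the noisy copy stays closest to bob's samples
```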

Waiting for Your Valuable Suggestions

Thank You

Resonant Frequencies of Vocal Tract

The vocal tract is a non-uniform acoustic tube, terminated at one end by the vocal cords and at the other end by the lips. The cross-sectional area of the vocal tract is determined by the positions of the tongue, lips, jaw, and velum. The spectrum of the vocal tract response consists of a number of resonant frequencies of the vocal tract, called formants. Three to four formants are present below 4 kHz in speech.

Formant Frequencies
Speech normally exhibits one formant frequency in every 1 kHz. For VOICED speech, the magnitudes of the lower formant frequencies are successively larger than those of the higher formant frequencies. For UNVOICED speech, the magnitudes of the higher formant frequencies are successively larger than those of the lower formant frequencies.
