Speaker Recognition Using MFCC and Vector Quantization Model

A Major Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of


Department of Electrical Engineering Electronics & Communication Engineering Program Institute of Technology, NIRMA UNIVERSITY
AHMEDABAD 382 481 May 2011

This is to certify that the Major Project Report entitled “Speaker Recognition Using MFCC and Vector quantization model” submitted by Pravin Gareta (Roll No. 08BEC156) and Darshan Mandalia (Roll No. 07BEC042) as the partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Electronics & Communication Engineering, Institute of

Technology, Nirma University is the record of work carried out by them under my supervision and guidance. The work submitted in our opinion has reached a level required for being accepted for the examination.

Date: 17/05/2011

Prof. RACHNA SHARMA Project Guide

Prof. A. S. Ranade HOD (Electrical Engineering) Nirma University, Ahmedabad

In order to achieve better performance, we should have to learn our outside environment. There are lots of forces which will acting upon us to get the better result, but for that we have to change our attitude to see them. There are lots of problems we facing, but due to it we don’t stop. It is not a good human being ,if he is stop. Yes, we are sometimes frustrated due to the problems we can’t solve it. But the next day we act on the problem with same efficiency and strength we have. Moreover, We are highly thankful to our parents and teachers were there all the times, backing us up and with their undue support has been the pushing drive for us to complete project within time. We would deeply like to express our sincere gratitude towards project guide Prof. Rachna Sharma and faculty members of our panel who have guided until the completion of the project. We also extend our thanks towards our Head of the Department Prof. A.S.Ranade and all the staff, Department of Electronics and Communication Engineering, the Institute of Technology, Nirma University for their assistance during the project whether it was a technical help or concerned with providing facilities for internet, implementing and simulating the ideas of the project. Their excessive support has been the source of motivation to perform our best regarding the project. We would like to express our gratitude towards our parents who have been there not only in this project but through all our entire life. If their helping hand, moral as well as financial support, had not been there we wouldn’t have been able to finish in such a proficient way. We are grateful for their aid and support.

Pravin Gareta (08BEC156) Darshan Mandalia (07BEC042)


e. However. database access services. There are many types of features. This is often referred as the signal-processing front end. it is short time stationary which is studied and analyzed using short time. another key characteristic of speech is quasi-stationarity. The key is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. ii . we have first made a comparative study of the MFCC approach. and others.Abstract Speech Recognition is the process of automatically recognizing a certain word spoken by a particular speaker based on individual information included in speech waves. voice mail and remote access to computers. This project presents one of the techniques to extract the feature set from a speech signal. such as Linear Prediction Coding (LPC). To achieve this. The optimum feature set is still not yet decided though the vast efforts of researchers. MelFrequency Cepstrum Coefficients (MFCC). i. voice based dialing. A particular speaker utters the password once in the training session so as to train and store the features of the access word. which are derived differently and have good impact on the recognition rate. A wide range of possibilities exist for parametrically representing the speech signal for the speaker recognition task. MFCC is perhaps the best known and most popular. MFCCs are based on the known variation of the human ear‟s critical bandwidths with frequency filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. Signal processing front end for extracting the feature set is an important stage in any speech recognition system. and these will be used in this project. This technique makes it possible to use the speaker‟s voice to verify his/her identity and provide controlled access to services like voice based biometrics. frequency domain analysis. Later in the testing session the user utters the password again in order to achieve recognition if there is a match. At this stage an intruder can also test the system to test the inherent security feature by uttering the same word . The feature vectors unique to that speaker are obtained in the training phase and this is made use of later on to grant authentication to the same speaker who once again utters the same word in the testing phase. which can be used in speech recognition systems. The voice based biometric system is based on isolated or single word recognition.

I Ii Iii V Basic Acoustic and Speech Signal 2.3 4.3 3.2 4.1.1 4.1.2 1.1.4 3.2 2.4 4.1 Processing 4.1. Acknowledgement Abstract Index List of Figures 1 Introduction 1.6 Introduction Speech Recognition Basics Classification of ASR system Why is Automatic Speaker Recognition Hard? Speech Analyzer Speech Classifier 10 12 14 15 16 18 4 Feature Extraction 4.5 Frame Blocking Windowing Fast Fourier Transform Mel Frequency Warping Cepstrum iii 21 22 22 22 22 23 .4 2 Introduction Motivation Objective Outline of thesis 1 2 4 4 Title Page No.2 3.1 3.3 1.1 2.1 1.1.Index Chapter No.5 3.3 The Speech Signal Speech production Properties of Human Voice 6 8 9 3 Automatic Speech Recognition System(ASR) 3.

1.m test.3.3 Matlab code for VQ Approach 7.1 Clustering the training vector 6 Sample Training and Recognition Session GUI 6.m Conclusion Applications Scope for Future work References iv .3.1 7.1 7.1 7.2 7.3 FFT Approach Using VQ 5.6 Train.3.3.2 5.1 Mel Frequency Cepstrum Co-efficient 25 MFCC Approach 5.3 7.6 5 Algorithm 5.4 6.3.m disteu.2 vqlbg.m mfcc.2 Training Code Testing Code 45 45 48 51 51 51 57 58 58 58 59 60 61 63 63 65 66 Matlab code for FFT Approach 7.4 Main Menu GUI of MFCC method GUI of FFT method GUI of VQ method 38 39 42 43 7 Source Code 7.1.1 Matlab code for MFCC Approach 7.2 7.2 Voice Recording Matlab Code Training and Testing Code 7.1 MFCC Approach Algorithm 27 31 32 33 35 5.5 7.1.1 6.m melfb.

1 3.1 4.3 Feature Extraction Steps Filter bank in Mel Frequency Scale Mel Frequency Scale 21 23 26 27 28 29 29 29 30 33 5.2 4. Title Page No.6 5.3 5. 1.4 5.1 MFC MFCC Approach 5.2 2.1 2.2 5.List of Figures Fig.2 Speaker Identification Training Speaker Identification Testing Schematic Diagram of the Speech Production/Perception Process Human Vocal Mechanism Utterance of “HELLO” Conceptual diagram illustrating vector quantization codebook formation. 3 3 6 8 12 20 4.2 3.1 1.7 The word “”Hello” taken for analysis The word “”Hello” after silence detection The word “”Hello” after windowing The word “”Hello” after FFT The word “”Hello” after Mel-warping The vector generated from training before VQ v . No.5 5.

3 6.4 6.6 6.10 37 38 39 39 40 40 41 42 42 43 44 vi .5 6.8 6.9 6.1 6.9 The representative feature vector result after VQ Conceptual diagram illustrating vector quantization codebook formation.10 6.One speaker can be discriminated from another based of the location of centroids.2 6.7 6.8 5.5. Flow diagram of the LBG algorithm Main Menu of Speech recognition Application Training Menu in MFCC Approach Waveform of Training Session Testing Session GUI Final Result(1) Final Result(2) Create Database GUI User Authentication GUI GUI of Database Creation GUI of User matching 33 35 5.

Institute of technology. In this thesis.1 Introduction Speech is the most natural way to communicate for humans. and recognition [4]. The concept of a machine than can recognize the human voice has long been an accepted feature in Science Fiction.” . Apart from very short notes. enhancement. synthesis.) hails Automatic Speaker Recognition (ASR) as one of the most important innovations for future computer operating systems. Bill Gates (co-founder of Microsoft Corp. From a technological perspective it is possible to distinguish between two broad types of ASR: direct voice input‟ (DVI) and „large vocabulary continuous speech recognition‟ (LVCSR). LVCSR systems involve vocabularies of perhaps hundreds of thousands of words. From „Star Trek‟ to George Orwell‟s „1984‟ . In both cases the underlying technology is more or less the same. it was usual to dictate everything into the speakwriter.It has been commonly assumed that one day it will be possible to converse naturally with an advanced computer-based system. the invention and widespread use of the telephone. Also. audio-phonic storage media.Chapter 1 Introduction 1. at least one vendor has offered a telephone-based dictation service in which the transcribed document is e-mailed back to the user. DVI systems are typically configured for small to medium sized vocabularies (up to several thousand words) and might employ word or phrase spotting techniques. radio. and are typically configured to transcribe continuous speech. While this has been true since the dawn of civilization. DVI systems are usually required to respond immediately to a voice command. and television has given even further importance to speech communication and speech processing [2]. Also. Indeed in his book „The Road Ahead‟. the issue of speech recognition is studied and a speech recognition system is developed for Isolated word using Vector quantization model. Nirma University Page 1 . Electronics & communication Eng. whereas LVCSR systems are used for form filling or voice-based document creation. The advances in digital signal processing technology has led the use of speech processing in many different application areas like speech compression.for example.“Actually he was not used to writing by hand. LVCSR need not be performed in real-time . DVI devices are primarily aimed at voice command-and-control.

e. i. semantic and pragmatic. many other forms of knowledge. must be built into the recognizer.1): feature extraction and feature matching. We will discuss each module in detail in later sections. a convenient and desirable mode of communication with machines. a complete model of the human. therefore. using syntactic and semantic information as well very powerful low level pattern classification and processing. it do not effectively discriminate between classes and. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. linguistic. it is man’s principle means of communication and is. or simply from the fact that talking can be faster than typing.2 Motivation The motivation for ASR is simple. further. i. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. however. all speaker recognition systems contain two main modules (refer to Fig 1. that the better the features the easier is the classification task. It is the case. Electronics & communication Eng.g. Speech communication has evolved to be efficient and robust and it is clear that the route to computer based speech recognition is the modeling of the human system.e. Nor. the classifier itself must have a considerable degree of sophistication. a good set of features to be used in a pattern classifier). Nirma University Page 2 . At the highest level. the benefits of using ASR derive from providing an extra communication channel in hands-busy eyes-busy human-machine interaction (HMI). Institute of technology. 1. is it sufficient merely to generate “a good” representation of speech (i. e. and the practical. even at a lower level of sophistication. Unfortunately from pattern recognition point of view.From an application viewpoint.e. the tools that science and technology provide and that costs allow. in the final analysis. Powerful classification algorithms and sophisticated front ends are. human recognizes speech through a very complex interaction between many levels of processing. Automatic speech recognition is therefore an engineering compromise between the ideal. not enough.

In case of speaker verification systems. Speaker Identification Testing All Recognition systems have to serve two different phases. 1. In the training phase.Fig. in addition. Speaker Identification Training Fig. The first one is referred to the enrollment sessions or training phase while the second one is referred to as the operation sessions or testing phase. a speaker-specific threshold is also computed from the training Electronics & communication Eng.2. Nirma University Page 3 . 1. Institute of technology.1. each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker.

Speech recognition is a difficult task and it is still an active research area. During the testing (operational) phase (see Figure 1. at present. speaking rates. the speaker has a cold). the capacities of adapting to unknown conditions on machines are greatly poorer than that of ours. The principle source of variance is the speaker himself. Introduction about automatic speaker recognition system is discussed. Examples of these are acoustical noise and variations in recording environments (e.2). the system is called “Robust”. that present a challenge to speech recognition technology. Institute of technology. Speech signals in training and testing sessions can be greatly different due to many facts such as people voice change with time. In chapter 3. they always hope it can achieve as good recognition performance as human's ears do which can constantly adapt to the environment characteristics such as the speaker. There are also other factors. etc. Unfortunately. However this task has been challenged by the highly variant of input speech signals.3 Objective The objective of the project is to Design a Speaker recognition model using MFCC extraction technique and also with Vector quantization model. the input speech is matched with stored reference model(s) and recognition decision is made.g. speaker uses different telephone handsets). beyond speaker variability.samples. Automatic speech recognition works based on the premise that a person’s speech exhibits characteristics that are unique to the speaker. So what characterizes a “Robust System”? When people use an automatic speech recognition (ASR) system in real environment. the background noise and the transmission channels. health conditions (e. In chapter 2.g. Introduction about basic speech signal. 1. If the recognition accuracy does not degrade very much under mismatch conditions. Outline of thesis This thesis is organized as follows. Nirma University Page 4 . The challenge would be make the system “Robust”. Electronics & communication Eng. the performance of speech recognition systems trained with clean speech may degrade significantly in the real world because of the mismatch between the training and testing environments. In fact.

In chapter 4. In chapter 5. Algorithm used in this thesis is discussed. Institute of technology. GUI of all three methods will be given. Feature extraction method is discussed. application of these project and Scope for future work will be discussed. conclusion. In Chapter 6. Electronics & communication Eng. In chapter 7. Source code. Nirma University Page 5 .

this chapter intends to discuss how the speech signal is produced and perceived by human beings. The waveform is transferred via the air (C. This is an essential subject that has to be considered before one can pursue and decide which approach to use for speech recognition. See Figure 2. Nirma University Page 6 . 2. E. B. Acoustic air. This formulation is used by the human vocal mechanism (B. Speech formulation) is associated with the formulation of the speech signal in the talker’s mind. Schematic Diagram of the Speech Production/Perception Process Five different elements. Speech comprehension. 2. C.1. The first element (A.1 The Speech Signal Human communication is to be seen as a comprehensive diagram of the process from speech production to speech perception between the talker and listener. D. Human vocal mechanism) to produce the actual speech waveform. Electronics & communication Eng. A. Fig.Chapter 2 Basic Acoustics and Speech Signal As relevant background to the field of speech recognition. Institute of technology. Speech formulation. Perception of the ear. Human vocal mechanism.1.

May have high pitch. During this transfer the acoustic wave can be affected by external sources. One issue with speech recognition is to “simulate” how the listener process the speech produced by the talker. distinct units (phonemes). There are several actions taking place in the listeners head and hearing system during the process of speech signals. When the wave reaches the listener’s hearing system (the ears) the listener percepts the waveform (D. Composed of known. Perception of the ear) and the listener’s mind (E. or varying in speed. May be fast. Speech comprehension) starts processing this waveform to comprehend its content so the listener understands what the talker is trying to tell him. The basic theoretical unit for describing how to bring linguistic meaning to the formed speech. in the mind. Phonemes can be grouped based on the properties of either the time waveform or frequency characteristics and classified in different sounds produced by the human vocal tract. Is different for every speaker. resulting in a more complex waveform. is called phonemes. or be whispered. Institute of technology. for example noise. The perception process can be seen as the inverse of the speech production process. Speech is: • • • • • • • • • • Time-varying signal.Acoustic air) to the listener. May not have distinct boundaries between units (phonemes). Has an unlimited number of words. Nirma University Page 7 . Depends on known physical movements. slow. Has widely-varying types of environmental noise. Electronics & communication Eng. Well-structured communication process. low pitch.

The most important parts of the human vocal mechanism are the vocal tract together with nasal cavity. the nasal cavity is coupled together with the vocal tract to formulate the desired speech signal.3 Properties of Human Voice One of the most important parameter of sound is its frequency. 2.2. Sound waves are the waves that occur from vibration of the materials. Institute of technology. the sound gets high-pitched and irritating. When the frequency of a sound increases. Fig.2. see Figure II. lips.2 Speech Production To be able to understand how the production of speech is performed one need to know how the human’s vocal mechanism is constructed. the sound gets deepen. The sounds are discriminated from each other by the help of their frequencies.2 . The crosssectional area of the vocal tract is limited by the tongue. When the velum is lowered. jaw and velum and varies from 0-20 cm2. When the frequency of a sound decreases. The velum is a trapdoor-like mechanism that is used to formulate nasal sounds when needed. Nirma University Page 8 . The highest value Electronics & communication Eng. which begins at the velum. Human Vocal Mechanism 2.

This frequency interval changes for every person. • • Regional accents are the differences in resonant frequencies. and pitch. Nirma University Page 9 . and children’s speech are different. And a frequency change of 0. And the lowest value is about 70 Hz. makes recognition of other speaker types much worse. • Due to the differences in vocal tract length. A normal human speech has a frequency interval of 100Hz .3200Hz and its magnitude is in the range of 30 dB . These are the maximum and minimum values. female.90 dB. Speaker Characteristics. A human ear can perceive sounds in the frequency range between 16 Hz and 20 kHz. Electronics & communication Eng.of the frequency that a human can produce is about 10 kHz. durations.5 % is the sensitivity of a human ear. male. Individuals have resonant frequency patterns and duration patterns that are unique (allowing us to identify speaker). And the magnitude of a sound is expressed in decibel (dB). Institute of technology. Training on data from one type of speaker automatically “learns” that group or person’s characteristics.

1 Introduction Speech processing is the study of speech signals and the processing methods of these signals. besides the actual literal content. but subjectively annoying to the listener. Nirma University Page 10 . In addition. also speaker identity. It should be emphasized that the intelligibility of speech includes. The techniques used are similar to that in audio data compression and audio coding where knowledge in psychoacoustics is used to transmit only data that is relevant to the human auditory system. Natural language generation systems convert information from computer databases into normal-sounding human language. The signals are usually processed in a digital representation whereby speech processing can be seen as the interaction of digital signal processing and natural language processing. intonation. since it is possible that degraded speech is completely intelligible. timbre etc. Natural language processing is a subfield of artificial intelligence and linguistics. emotions. It studies the problems of automated generation and understanding of natural human languages. speech coding differs from audio coding in that there is a lot more statistical information available about the properties of speech. Speech coding: It is the compression of speech (into a code) for transmission with speech codecs that use audio signal processing and speech processing techniques. in narrow band speech coding. only information in the frequency band of 400 Hz to 3500 Hz is transmitted but the reconstructed signal is still adequate for intelligibility.Chapter 3 Automatic Speech Recognition System(ASR) 3. that are all important for perfect intelligibility. However. with a constrained amount of transmitted data. and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate. In speech coding. For example. The more abstract concept of pleasantness of degraded speech is a different property than intelligibility. the most important criterion is preservation of intelligibility and "pleasantness" of speech. Institute of technology. Electronics & communication Eng. some auditory information which is relevant in audio coding can be unnecessary in the speech coding context.

and by its ability to be understood. Alternatively. For specific usage domains. the storage of entire words or sentences allows for high-quality output. Movements in the vocal cords are rapid. A text-to-speech (TTS) system converts normal language text into speech. Institute of technology. a system that stores phones or diaphones provides the largest output range. other systems render symbolic linguistic representations like phonetic transcriptions into speech. the speech sound is recorded outside the mouth and then filtered by a mathematical method to remove the effects of the vocal tract. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Systems differ in the size of the stored speech units. Imaging methods such as x-rays or ultrasounds do not work because the vocal cords are surrounded by cartilage which distorts image quality. In inverse filtering methods. However. fundamental frequencies are usually between 80 and 300 Hz. Synthesized speech can also be created by concatenating pieces of recorded speech that are stored in a database. The location of the vocal cords effectively prohibits direct measurement of movement. Many computer operating systems have included speech synthesizers since the early 1980s. Nirma University Page 11 . thus preventing usage of ordinary video. but may lack clarity. Most important indirect methods are inverse filtering of sound recordings and electroglottographs (EGG). The quality of a speech synthesizer is judged by its similarity to the human voice. This method produces an estimate of the waveform of the pressure pulse which again inversely Electronics & communication Eng. Voice analysis: Voice problems that require voice analysis most commonly originate from the vocal cords since it is the sound source and is thus most actively subject to tiring.Speech synthesis: Speech synthesis is the artificial production of human speech. High-speed videos provide an option but in order to see the vocal cords the camera has to be positioned in the throat which makes speaking rather difficult. analysis of the vocal cords is physically difficult. a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.

Fig. It thus yields one-dimensional information of the contact area. Speech recognition: Speech recognition is the process by which a computer (or other type of machine) identifies spoken words. which operates with electrodes attached to the subject’s throat close to the vocal cords. This is the key to any speech related application.2 Speech Recognition Basics Utterance An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the computer.1. a few words. Institute of technology. there are a number ways to do this but the basic principle is to somehow extract certain key features from the uttered speech and then treat those features as the key to recognizing the word when it is uttered again. 3. a sentence. Utterances can be a single word. Changes in conductivity of the throat indicate inversely how large a portion of the vocal cords are touching each other. Neither inverse filtering nor EGG is thus sufficient to completely describe the glottal movement and provide only indirect evidence of that movement. or even multiple sentences. Basically. Utterance of “HELLO” Electronics & communication Eng. and having it correctly recognize what you are saying. The other kind of inverse indication is the electroglottographs. it means talking to your computer. 3. As shall be explained later. Nirma University Page 12 .indicates the movements of the vocal cords.

Training a recognizer usually improves its accuracy. Unlike normal dictionaries. As long as the speaker can consistently repeat an utterance. An ASR system is trained by having the speaker repeat standard or common phrases and adjusting its comparison algorithms to match that particular speaker. Vocabularies Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the SR system. They assume the speaker will speak in a consistent voice and tempo. Good ASR systems have an accuracy of 98% or more! The acceptable accuracy of a system really depends on the application. When the system has this ability. but much less accurate for other speakers. ASR systems with training should be able to adapt Electronics & communication Eng. Adaptive systems usually start as speaker independent systems and utilize training techniques to adapt to the speaker to increase their recognition accuracy.or how well it recognizes utterances. Smaller vocabularies can have as few as 1 or 2 recognized utterances (e. Training Some speech recognizers have the ability to adapt to a speaker. it may allow training to take place. Speaker independent systems are designed for a variety of speakers.g. each entry doesn't have to be a single word. or pronouncing certain words. They generally are more accurate for the correct speaker. They can be as long as a sentence or two. Training can also be used by speakers that have difficulty speaking. smaller vocabularies are easier for a computer to recognize.” Wake Up"). while very large vocabularies can have a hundred thousand or more! Accuracy The ability of a recognizer can be examined by measuring its accuracy . Generally. while larger vocabularies are more difficult. Nirma University Page 13 . Institute of technology.Speaker Dependence Speaker dependent systems are designed around a specific speaker. This includes not only correctly identifying an utterance but also identifying if the spoken utterance is not in its vocabulary.

while the computer determines the content. depending on which mode they're using. Speech recognition systems can be separated in several different classes by describing what types of utterances they have the ability to recognize. but allow separate utterances to be 'run-together' with a minimal pause between them. Isolated Words Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on BOTH sides of the sample window. for small/large vocabulary. Basically. It doesn't mean that it accepts single words. Connected Words Connect word systems (or more correctly 'connected utterances') are similar to Isolated words. but does require a single utterance at a time. Most packages can fit into more than one class. Nirma University Page 14 . Continuous Speech Continuous recognition is the next step. Institute of technology. isolated/continuous speech recognition. Often. Electronics & communication Eng. where they require the speaker to wait between utterances (usually doing processing during the pauses). These classes are based on the fact that one of the difficulties of ASR is the ability to determine when a speaker starts and finishes an utterance.3. these systems have "Listen/Not-Listen" states. Recognizers with continuous speech capabilities are some of the most difficult to create because they must utilize special methods to determine utterance boundaries. Isolated Utterance might be a better name for this class. it's computer dictation.3 Classification of ASR System A speech recognition system can operate in many different conditions such as speaker dependent/independent. Continuous speech recognizers allow users to speak almost naturally.

Spontaneous Speech
There appears to be a variety of definitions for what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

Speaker Dependence
ASR engines can be classified as speaker dependent and speaker independent. Speaker Dependent systems are trained with one speaker and recognition is done only for that speaker. Speaker Independent systems are trained with one set of speakers. This is obviously much more complex than speaker dependent recognition. A problem of intermediate complexity would be to train with a group of speakers and recognize speech of a speaker within that group. We could call this speaker group dependent recognition.

3.4 Why is Automatic Speaker Recognition hard?
There are a few problems in speech recognition that haven‟t yet been discovered. However there are a number of problems that have been identified over the past few decades most of which still remain unsolved. Some of the main problems in ASR are:

Determining word boundaries
Speech is usually continuous in nature and word boundaries are not clearly defined. One of the common errors in continuous speech recognition is the missing out of a minuscule gap between words. This happens when the speaker is speaking at a high speed.

Varying Accents
People from different parts of the world pronounce words differently. This leads to errors in ASR. However this is one problem that is not restricted to ASR but which plagues human listeners too.

Electronics & communication Eng. Institute of technology, Nirma University

Page 15

Large vocabularies
When the number of words in the database is large, similar sounding words tend to cause a high amount of error i.e. there is a good probability that one word is recognized as the other.

Changing Room Acoustics Noise is a major factor in ASR. In fact it is in noisy conditions or in changing room acoustic that the limitations of present day ASR engines become prominent.

Temporal Variance
Different speakers speak at different speeds. Present day ASR engines just cannot adapt to that.

3.5 Speech Analyzer
Speech analysis, also referred to as front-end analysis or feature extraction, is the first step in an automatic speech recognition system. This process aims to extract acoustic features from the speech waveform. The output of front-end analysis is a compact, efficient set of parameters that represent the acoustic properties observed from input speech signals, for subsequent utilization by acoustic modeling. There are three major types of front-end processing techniques, namely linear predictive coding (LPC), mel-frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP), where the latter two are most commonly used in state-of-the-art ASR systems.

Linear predictive coding
LPC starts with the assumption that a speech signal is produced by a buzzer at the end of a tube (voiced sounds), with occasional added hissing and popping sounds. Although apparently crude, this model is actually a close approximation to the reality of speech production. The glottis (the space between the vocal cords) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances, which are called formants. Hisses and pops are generated by the action of the tongue, lips and throat during sibilants and plosives.

Electronics & communication Eng. Institute of technology, Nirma University

Page 16

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered modeled signal is called the residue.The numbers which describe the intensity and frequency of the buzz, the formants, and the residue signal, can be stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing the process: use the buzz parameters and the residue to create a source signal, use the formants to create a filter (which represents the tube), and run the source through the filter, resulting in speech. Because speech signals vary with time, this process is done on short chunks of the speech signal, which are called frames; generally 30 to 50 frames per second give intelligible speech with good compression.

Mel Frequency Cepstrum Coefficients
These are derived from a type of cepstral representation of the audio clip (a "spectrum-of-aspectrum"). The difference between the cepstrum and the Mel-frequency cepstrum is that in the MFC, the frequency bands are positioned logarithmically (on the mel scale) which approximates the human auditory system's response more closely than the linearly-spaced frequency bands obtained directly from the FFT or DCT. This can allow for better processing of data, for example, in audio compression. However, unlike the sonogram, MFCCs lack an outer ear model and, hence, cannot represent perceived loudness accurately. MFCCs are commonly derived as follows: 1. Take the Fourier transform of (a windowed excerpt of) a signal 2. Map the log amplitudes of the spectrum obtained above onto the Mel scale, using triangular overlapping windows. 3. Take the Discrete Cosine Transform of the list of Mel log-amplitudes, as if it were a signal. 4. The MFCCs are the amplitudes of the resulting spectrum.

Electronics & communication Eng. Institute of technology, Nirma University

Page 17

it can be also referred to as feature matching. even Electronics & communication Eng. is based on the short-term spectrum of speech. The classes here refer to individual speakers. similar to LPC analysis. PLP analysis is more consistent with human hearing. Since the classification procedure in our case is applied on extracted features. The objects of interest are generically called patterns and in our case are sequences of acoustic vectors that are extracted from an input speech using the techniques described in the previous section. Hidden Markov Modeling (HMM). similarities in walking patterns would be detected. In comparison with conventional linear predictive (LP) analysis. Dynamic Time Warping Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance. 3. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. (2) The equal-loudness curve. In contrast to pure linear predictive analysis of speech. perceptual linear prediction (PLP) modifies the short-term spectrum of the speech by several psychophysically based transformations. Institute of technology.6 Speech Classifier The problem of ASR belongs to a much broader topic in scientific and engineering so called pattern recognition. and Vector Quantization (VQ). The state-of-the-art in feature matching techniques used in speaker recognition includes Dynamic Time Warping (DTW).Perceptual Linear Prediction Perceptual linear prediction. and (3) The intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model. This technique uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum: (1) The critical-band spectral resolution. Nirma University Page 18 .

Electronics & communication Eng. any data which can be turned into a linear representation can be analyzed with DTW. Nirma University Page 19 . Such a probabilistic system would be more efficient than just cepstral analysis as these is some amount of flexibility in terms of how the words are uttered by the users. After the utterance of a particular phoneme. audio. i. to cope with different speaking speeds. in the recognition phase when the user utters a word it is split up into phonemes as done before and it‟s HMM is created.0). Each region is called a cluster and can be represented by its center called a centroid. Hidden Markov Model The basic principle here is to characterize words into probabilistic models wherein the various phonemes which contribute to the word represent the states of the HMM while the transition probabilities would be the probability of the next phoneme being uttered (ideally 1. This chain from one phoneme to another continues and finally at some point we have the most probable word out of the stored words which the user would have uttered and thus recognition is brought about in a finite vocabulary system. and graphics -indeed.g. A well known application has been automatic speech recognition. In general. the most probable phoneme to follow is found from the models which had been created by comparing it with the newly formed model. DTW has been applied to video. or even if there were accelerations and decelerations during the course of one observation. Vector Quantization VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. time series) with certain restrictions. Institute of technology.e.if in one video the person was walking slowly and if in another they were walking more quickly. Models for the words which are part of the vocabulary are created in the training phase. the sequences are "warped" nonlinearly to match each other. Now. it is a method that allows a computer to find an optimal match between two given sequences (e. The collection of all codewords is called a codebook. This sequence alignment method is often used in the context of hidden Markov models.

a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. In the recognition phase. Institute of technology. The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. In the figure. Fig.Fig 3. One speaker can be discriminated from another based of the location of centroids. Nirma University Page 20 . Conceptual diagram illustrating vector quantization codebook formation. an input utterance of an unknown voice is “vector-quantized” using each trained codebook and the total VQ distortion is computed.2 shows a conceptual diagram to illustrate this recognition process. The speaker corresponding to the VQ codebook with smallest total distortion is identified. only two speakers and two dimensions of the acoustic space are shown. 3. The result codewords (centroids) are shown in Figure by black circles and black triangles for speaker 1 and 2. In the training phase. Electronics & communication Eng.2. respectively. The circles refer to the acoustic vectors from the speaker 1 while the triangles are from the speaker 2.

Fig. Feature Extraction is used in both training and recognition phases. Windowing 3. FFT (Fast Fourier Transform) 4. The schematic diagram of the steps is depicted in Figure 4. Processing Obtaining the acoustic characteristics of the speech signal is referred to as Feature Extraction. Electronics & communication Eng. Institute of technology. It comprise of the following steps: 1. 4.1. Frame Blocking 2.Chapter 4 Feature Extraction 4. Feature Extraction Steps. Nirma University Page 21 . Cepstrum (Mel Frequency Cepstral Coefficients) Feature Extraction This stage is often referred as speech processing front end.1. Mel-Frequency Wrapping 5.1. The main goal of Feature Extraction is to simplify recognition by summarizing the vast amount of speech data without losing the acoustic properties that defines the speech [12].

Nirma University Page 22 . f. Thus for each tone with an actual frequency. As a reference point. 4. If the windowing function is defined as w(n). As mentioned above. This is done in order to eliminate discontinuities at the edges of the frames. Each frame overlaps its previous frame by a predefined size. psychophysical studies have shown that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. For this reason.1. Generally hamming windows are used [12]. the pitch of a 1 kHz tone. The Mel-Scale (Melody Scale) filter bank which characterizes the human ear perceiveness of frequency. Institute of technology.1.3 Fast Fourier Transform The next step is to take Fast Fourier Transform of each frame. It is divided into frames with sizes generally between 30 and 100 milliseconds. 4. the resulting signal will be.1. 0 < n < N-1 where N is the number of samples in each frame.1. This transformation is a fast way of Discrete Fourier Transform and it changes the domain from time to frequency [12]. is defined as 1000 mels. It is used as a band pass filtering for this stage of identification.1 Frame Blocking Investigations show that speech signal characteristics stays stationary in a sufficiently short period of time interval (It is called quasi-stationary).4. 4.4 Mel Frequency Warping The human ear perceives the frequencies non-linearly. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz: Electronics & communication Eng. speech signals are processed in short time intervals. measured in Hz. 40 dB above the perceptual hearing threshold. Researches show that the scaling is linear up to 1 kHz and logarithmic above that. The goal of the overlapping scheme is to smooth the transition from frame to frame [12]. a subjective pitch is measured on a scale called the „mel‟ scale. y(n) = x(n)w(n). The signals for each frame is passed through Mel-Scale band pass filter to mimic the human ear [17][12][18].2 Windowing The second step is to window all frames.

The number of mel cepstral coefficients.One approach to simulating the subjective spectrum is to use a filter bank. Fig. and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. Filter Bank in Mel frequency scale 4. That filter bank has a triangular bandpass frequency response. therefore it simply amounts to taking those triangle-shape windows in the Fig 4. A useful and efficient way of implementing this is to consider these triangular filters in the Mel scale where they would in effect be equally spaced filters. Nirma University Page 23 .2 on the spectrum. 4. is typically chosen as 20. A useful way of thinking about this mel-warped filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain. Mathematically we can say Cepstrum of signal = FT(log(FT(the signal))+j2IIm) Electronics & communication Eng. K.5 Cepstrum Cepstrum name was derived from the spectrum by reversing the first four letters of spectrum. Institute of technology. The modified spectrum of S() thus consists of the output power of these filters when S() is the input. Note that this filter bank is applied in the frequency domain.1.2. We can say cepstrum is the Fourier Transformer of the log with unwrapped phase of the Fourier Transformer. one filter for each desired mel-frequency component.

which allows the reconstruction of the signal. Once we convolved the quickly varying part and slowly varying part it makes difficult to separate the two parts.FT .phase unwrapping .Where m is the interger required to properly unwrap the angle or imaginary part of the complex log function. Some of them need a phase-warping algorithm. The equation for the cepstrum is given below: Electronics & communication Eng. As we discussed in the Framing and Windowing section that speech signal is composed of quickly varying part e(n) excitation sequence convolved with slowly varying part (n) vocal system impulse response. While for defining the complex values whereas the complex cepstrum uses the complex logarithm function. Algorithmically we can say – Signal . Nirma University Page 24 . cepstrum is introduced to separate this two parts.FT Cepstrum. where as complex cepstrum holds information about both magnitude and phase of the initial spectrum. For defining the real values real cepstrum uses the logarithm function. Figure below shows the pipeline from signal to Cepstrum.log . We can calculate the cepstrum by many ways. Institute of technology. others do not. The real cepstrum uses the information of the magnitude of the spectrum.

Nirma University Page 25 . whereas in the Fourier Transform the frequency bands are not positioned logarithmically. It is derived from the Fourier Transform of the audio clip. As the frequency bands are positioned logarithmically in MFCC. it approximates the human system response more closely than any other system. Mel frequency Cepstral Coefficients are coefficients that represent audio based on perception. Electronics & communication Eng.1. In the Mel Frequency Cepstral Coefficients the calculation of the Mel Cepstrum is same as the real Cepstrum except the Mel Cepstrum’s frequency scale is warped to keep up a correspondence to the Mel scale. In this technique the frequency bands are positioned logarithmically.The multiplication becomes the addition by taking the logarithm of the spectral magnitude The domain of the signal cs(n) is called the quefrency-domain. Institute of technology. These coefficients allow better processing of data. 4.6 Mel Frequency Cepstrum Coefficient In this project we are using Mel Frequency Cepstral Coefficient. This coefficient has a great success in speaker recognition application.

The Mel scale was projected by Stevens. Then this frequency labelled 2000 Mel. Volkmann and Newman in 1937. In this test the listener or test person started out hearing a frequency of 1000 Hz. The scale is divided into the units mel. Fig. Institute of technology.3. Then the listeners were asked to change the frequency till it reaches to the frequency twice the reference frequency. to get back the normal frequency. 4. Figure below shows the example of normal frequency is mapped into the Mel frequency.Mel frequency scale The equation (1) above shows the mapping the normal frequency into the Mel frequency and equation (2) is the inverse. Nirma University Page 26 . and so on. and labelled it 1000 Mel for reference. The Mel scale is mainly based on the study of observing the pitch or frequency perceived by the human. then this frequency labelled as 500 Mel. On this basis the normal frequency is mapped into the Mel frequency. The Mel scale is normally a linear mapping below 1000 Hz and logarithmically spaced above 1000 Hz. Electronics & communication Eng. The same procedure repeated for the half the frequency.

MFCC approach: A block diagram of the structure of an MFCC processor is as shown in Fig 4.1. But. so. It was observed from our experiment that the actual uttered speech eliminating the static portions came up to about 2500 samples. In addition.1. Nirma University Page 27 . This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. Fig. rather than the speech waveforms themselves. by using a simple threshold technique we carried out the silence detection to extract the actual uttered speech. which cover most energy of sounds that are generated by humans. It is clear that what we wanted to achieve was a voice based biometric system capable of recognizing isolated words.Chapter 5 Algorithm We have make ASR system using three methods namely: (1) MFCC approach (2) FFT approach (3) Vector quantization 5.MFCC Approch We first stored the speech signal as a 10000 sample vector. The main purpose of the MFCC processor is to mimic the behavior of the human ears. MFCC‟s are shown to be less susceptible to mentioned variations. The speech input is typically recorded at a sampling rate above 10000 Hz.1. when we passed this speech signal through a MFCC processor. Institute of technology.1. 5. Electronics & communication Eng. As our experiments revealed almost all the isolated words were uttered within 2500 samples. These sampled signals can capture all frequencies up to 5 kHz.

Institute of technology. This algorithm however. As explained earlier the key to this approach is using the energies within each triangular window. Thus when we convert this into the frequency domain we just have about 250 spectrum values under each window. we eliminated the block which did the Mel warping. It was seen from the experiments that because of the prominence given to energy. The simulation was carried out in MATLAB. We obtained the energy within each triangular window. Nirma University Page 28 . The various stages of the simulation have been represented in the form of the plots shown. This implied that converting it to the Mel scale would be redundant as the Mel scale is linear till 1000 Hz. We directly used the overlapping triangular windows in the frequency domain. however. The word “Hello” taken for analysis Electronics & communication Eng. has a drawback. as this takes the summation of the energy within each triangular window it would essentially give the same value of energy irrespective of whether the spectrum peaks at one particular frequency and falls to lower values around it or whether it has an equal spread within the window. Also. The input continuous speech signal considered as an example for this project is the word “HELLO”. followed by the DCT of their logarithms to achieve good compaction within a small number of coefficients as described by the MFCC approach.it spilt this up in the time domain by using overlapping windows each with about 250 samples.2. 5. So. This is why we decided not to go ahead with the implementation of the MFCC approach. Fig. this may not be the best approach as was discovered. this approach failed to recognize the same word uttered with different energy.

Fig. 5. 5. Institute of technology. Nirma University Page 29 . The word “Hello” after FFT Electronics & communication Eng.5.4. 5.3. The word “Hello” after windowing using Hamming window Fig. The word “Hello” after silence detection Fig.

Nirma University Page 30 .Fig. Institute of technology. The word “Hello” after Mel-warping Electronics & communication Eng. 5.6.

5 Convert toMel Frequency Cepstrum Determine DCT of Spectrum energy Determine Mean Square Error Defining Overlapping Triangle window Determine energy within each window Determine DCT of Spectrum energy Electronics & communication Eng. Institute of technology. Nirma University YES Page 31 .1.5. MFCC approach Algorithm: Record Voice of Person for Testing Start Do the Windowing Record your Voice For Training Silence detection Silence Detection Convert Speech into FFT End Windowing Defining overlapping Triangle Window Same user NO Not Same user Convert individual column into frequency domain Determine energy within each window If it is <1.

2.5. Nirma University Page 32 . Institute of technology. FFT approach: Electronics & communication Eng.

3.5. 5. the representative feature vectors resulted after VQ Electronics & communication Eng. with a process called vector quantization.7.8. Institute of technology. since these distributions are defined over a high-dimensional space. Fig. It is often easier to start by quantizing each feature vector to one of a relatively small number of template vectors. VQ is a process of taking a large set of feature vectors and producing a smaller set of measure vectors that represents the centroids of the distribution. the vectors generated from training before VQ Fig. 5. Using VQ: A speaker recognition system must able to estimate probability distributions of the computed feature vectors. Storing every single vector that generate from the training mode is impossible. Nirma University Page 33 .

the VQ approach will be used. Figure 5. Furthermore.The technique of VQ consists of extracting a small number of representative feature vectors as an efficient means of characterizing the speaker specific features. Hidden Markov Modeling (HMM). The circles refer to the acoustic vectors from the speaker 1 while the triangles are from the speaker 2. By means of VQ. The remaining patterns are then used to test the classification algorithm. storing every single vector that we generate from the training is impossible. then one has a problem in supervised pattern recognition. the data from the tested speaker is compared to the codebook of each speaker and measure the difference. we label each input speech with the ID of the speaker (S1 to S8). The problem of speaker recognition belongs to a much broader topic in scientific and engineering so called pattern recognition. Nirma University Page 34 . In the training Electronics & communication Eng. due to ease of implementation and high accuracy. If the correct classes of the individual patterns in the test set are also known. it can be also referred to as feature matching. This is exactly our case since during the training session. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. These patterns comprise the training set and are used to derive a classification algorithm. Institute of technology. then one can evaluate the performance of the algorithm.7 shows a conceptual diagram to illustrate this recognition process. In this project. Since the classification procedure in our case is applied on extracted features. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The collection of all codewords is called a codebook. and Vector Quantization (VQ). In the figure. only two speakers and two dimensions of the acoustic space are shown. The state-of-the-art in feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW). if there exists some set of patterns that the individual classes of which are already known. The classes here refer to individual speakers. These differences are then use to make the recognition decision. Each region is called a cluster and can be represented by its center called a codeword. The objects of interest are generically called patterns and in our case are sequences of acoustic vectors that are extracted from an input speech using the techniques described in the previous section. these patterns are collectively referred to as the test set. In the recognition stage. By using these training data features are clustered to form a codebook for each speaker.

In the recognition phase. One speaker can be discriminated from another based of the location of centroids. Institute of technology. Conceptual diagram illustrating vector quantization codebook formation. Buzo and Gray. 5. for clustering a set of L training vectors into a set of M codebook vectors. 5. There is a well-know algorithm. a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The algorithm is formally implemented by the following recursive procedure: 1. As described above. this is the centroid of the entire set of training Electronics & communication Eng. Design a 1-vector codebook.phase. The speaker corresponding to the VQ codebook with smallest total distortion is identified. the acoustic vectors extracted from input speech of a speaker provide a set of training vectors. Fig.1 Clustering the training vector After the enrolment session. an input utterance of an unknown voice is “vector-quantized” using each trained codebook and the total VQ distortion is computed. namely LBG algorithm [Linde. The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. The result codewords (centroids) are shown in Figure 5 by black circles and black triangles forspeaker 1 and 2. respectively. 1980].9. Nirma University Page 35 .3.

It starts first by designing a 1-vector codebook. Iteration 2: repeat steps 2. 2. “Compute D (distortion)” sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged. Double the size of the codebook by splitting each current codebook yn according to the rule where n varies from 1 to the current size of the codebook. 4. find the codeword in the Current codebook that is closest (in terms of similarity measurement). and assign that vector to the corresponding cell (associated with the closest codeword). in a flow diagram. “Find centroids” is the centroid update procedure. and continues the splitting process until the desired M-vector codebook is obtained. the detailed steps of the LBG algorithm. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell. and e is a splittingparameter (we choose e =0. 5.01). Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset Threshold 6. 3. the LBG algorithm designs an M-vector codebook in stages. 3 and 4 until a codebook size of M is designed Intuitively. then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook. Electronics & communication Eng. Figure 5.10 shows. Institute of technology. Nearest-Neighbor Search: for each training vector. “Cluster vectors” is the nearest-neighbor search procedure which assigns each training vector to a cluster associated with the closest codeword.vectors (hence. no iteration is required here). Nirma University Page 36 .

Fig.10. Nirma University Page 37 . Institute of technology. 1993) Electronics & communication Eng. Flow diagram of the LBG algorithm (Adapted from Rabiner and Juang. 5.

Main Menu Fig. Electronics & communication Eng. So. Institute of technology. Main Menu of the Speech Recognition Application This is the main menu of our code form this menu you can select any method from which you want to recognize the person.1. we go one by one method first we look about mfcc method. Nirma University Page 38 .Chapter 6 Sample training and Recognition Session with Screenshot 6. 6.1.

Training Menu in MFCC approch Here you just click the button and record your voice and also see your speech plot by clicking show plot button. 6. Institute of technology. 6. Waveform of Training Session Electronics & communication Eng. Nirma University Page 39 . Fig.2.2 GUI of MFCC method : Fig.3.6.

Testing Session GUI If the speaker is identified then.After clicking the go to testing session you can now check whether the speaker is identified or not. Institute of technology. Fig. Fig. 6.5. Nirma University Page 40 . 6. Final Result(1) Electronics & communication Eng.4.

Fig.If it is not identified then. Final Result(2) Electronics & communication Eng. Nirma University Page 41 . Institute of technology.6. 6.

Institute of technology. 6. 6. Fig. Create Database GUI Here First of all you have to create the database of the user for that we are going to record user voice 10 times.8.6. Nirma University Page 42 . User authentication GUI Electronics & communication Eng.7. Then we going for the testing phase.3 GUI of FFT method : Fig.

Electronics & communication Eng.3 GUI of VQ method : Fig. GUI of Database creation Here first we have to decide number of speaker and their training paths in the computer we have to enter this paths then codebook will be created now after we have to go for the testing phase in the testing phase again we have to store the voice of user for authentication in the computer and give the path and VQ will matches the speaker to speaker from training to the testing phase. Nirma University Page 43 . Institute of technology. 6.9.6.

Nirma University Page 44 . 6. Institute of technology.Fig.10. GUI of User matching Electronics & communication Eng.

win1 = [tri . tri . Electronics & communication Eng. tri . win19 = [zeros(900.1) . win12 = [zeros(550.1)]. zeros(9050.1) .Matlab Code for MFCC approach : 7.1) . tri .1)]. win15 = [zeros(700.1) .1)]. tri . tri . win4 = [zeros(150. zeros(9200.1)].1)].1)].1) . zeros(9650.1. % Sampling Frequency t = hamming(4000).1) . zeros(9150. tri . tri .Chapter 7 Source Code 7.1)]. Institute of technology.1)].1) .1. % Defining overlapping triangular windows for win2 = [zeros(50.1)]. win5 = [zeros(200. % frequency domain analysis win3 = [zeros(100. win17 = [zeros(800.1)].1) . zeros(9000. win6 = [zeros(250.1 Training Code: fs = 10000. win7 = [zeros(300. f = (1:10000).1) . zeros(9300. tri . tri . zeros(9500. mel(f) = 2595 * log(1 + f / 700).1)]. tri .1)]. zeros(9100.1) . win14 = [zeros(650.1) . win10 = [zeros(450. tri .1)].1)].1)].1)]. zeros(9900. zeros(9750.1) . win8 = [zeros(350. tri . tri . % Linear to Mel frequency scale conversion tri = triang(100). tri . win16 = [zeros(750.1) . zeros(9350.1) . tri . zeros(9850.1) . win11 = [zeros(500. zeros(9550. tri .1) . zeros(9600.1)]. tri . win9 = [zeros(400.1)]. win13 = [zeros(600. zeros(9250.1) .1) .1)]. Nirma University Page 45 . zeros(6000. zeros(9450.1)]. zeros(9700. zeros(9800. win18 = [zeros(850. zeros(9400. tri . % Hamming window to smooth the speech signal w = [t .

* win13.* w.* win11. 'double'). nx9 = nx.* win8. nx14 = nx. Institute of technology. tri .* win10.* win1. Electronics & communication Eng. nx3 = nx.* win3.* win9. zeros(8950. nx10 = nx.* win7.* win15. end x(1 : i) = []. fs. nx2 = nx. nx11 = nx. % Transform to frequency domain nx = abs(mx(floor(mel(f)))). i = 1. % Mel warping nx = nx. nx7 = nx. nx5 = nx. nx8 = nx.05 % Silence detection i = i + 1. x1 = x.1) ./ max(nx). nx4 = nx.* win2.* win12. nx13 = nx. x = wavrecord(1 * fs. nx6 = nx. mx = fft(x1).* win5.* win4. Nirma University Page 46 . while abs(x(i)) <0. % Record and store the uttered speech plot(x). nx12 = nx.1)]. nx15 = nx.win20 = [zeros(950. nx1 = nx. wavplay(x). x(6000 : 10000) = 0.* win14.* win6.

* win16. sx8 = sum(nx8.^ 2). sx19.^ 2). sx2. % Determine DCT of Log of the spectrum energies Electronics & communication Eng.^ 2).* win19.^ 2). sx8.^ 2).^ 2).^ 2).* win20. sx = [sx1. sx11 = sum(nx11. sx13 = sum(nx13.^ 2). Institute of technology. sx4 = sum(nx4. sx7 = sum(nx7.^ 2). sx1 = sum(nx1. nx20 = nx.nx16 = nx.^ 2).^ 2). sx16. nx18 = nx.^ 2).^ 2). sx19 = sum(nx19. % Determine the energy of the signal within each window sx2 = sum(nx2. sx6 = sum(nx6. sx7. sx10. sx18 = sum(nx18.* win18. sx18. sx10 = sum(nx10. Nirma University Page 47 . sx9. sx20 = sum(nx20.^ 2). sx13. nx17 = nx. sx12 = sum(nx12.^ 2). sx6. % by summing square of the magnitude of the spectrum sx3 = sum(nx3.^ 2). sx5. sx20].^ 2). sx14 = sum(nx14. sx14.* win17. sx9 = sum(nx9.^ 2).^ 2). sx = log(sx). sx3. sx5 = sum(nx5. sx11. sx15 = sum(nx15. sx15. sx16 = sum(nx16. sx17. nx19 = nx. sx12. sx4.^ 2). sx17 = sum(nx17. dx = dct(sx).

zeros(9550. win14 = [zeros(650. 'w'). zeros(9750. mel(f) = 2595 * log(1 + f / 700). zeros(9150.1)]. win7 = [zeros(300. win5 = [zeros(200.1) . % frequency domain analysis win3 = [zeros(100. zeros(9200. zeros(9800. tri . zeros(9600. tri .1) .1) .1) . zeros(9450. zeros(9400.1. tri .1) .1) . tri .fid = fopen('sample.1)]. % Hamming window to smooth the speech signal w = [t . win6 = [zeros(250. 7. tri . tri . win13 = [zeros(600.1)]. win1 = [tri .1)].1)]. tri . tri .dat'.1)]. win17 = [zeros(800.1) . Nirma University Page 48 . zeros(9350.1)]. win10 = [zeros(450.1) . zeros(9900. zeros(9700. zeros(6000. zeros(9500. tri . zeros(9650. zeros(9850.1) . Electronics & communication Eng.1)].1) . win8 = [zeros(350.1) . win9 = [zeros(400. zeros(9050.1) . % Defining overlapping triangular windows for win2 = [zeros(50. 'real*8').1)]. % Sampling Frequency t = hamming(4000).2 Testing Code: fs = 10000. win11 = [zeros(500.1)]. tri . zeros(9250. f = (1:10000).1) .1)].1)]. % Linear to Mel frequency scale conversion tri = triang(100).1)]. tri . zeros(9100.dat file fclose(fid).1) . win4 = [zeros(150. tri . win18 = [zeros(850.1)].1) .1)]. fwrite(fid.1) . % Store this feature vector as a .1) . tri . tri .1)]. dx.1)]. win16 = [zeros(750. win12 = [zeros(550.1)]. tri . Institute of technology. tri .1)]. zeros(9300. tri . win15 = [zeros(700.

ny5 = ny.* win5. ny4 = ny. tri .win19 = [zeros(900. zeros(9000. Electronics & communication Eng. ny6 = ny. ny10 = ny. y = wavrecord(1 * fs. y(6000 : 10000) = 0. ny2 = ny.1)]. my = fft(y1).* win1. ny13 = ny.* win14. ny15 = ny.* win11. zeros(8950.1) . ny14 = ny.* win15. 'double'). fs.* win8. tri . ny16 = ny.* win12. end y(1 : i) = []. ny9 = ny.* win3.05 % Silence Detection i = i + 1. ny3 = ny. ny12 = ny. win20 = [zeros(950. ny1 = ny.1)].* win13. % Transform to frequency domain ny = abs(my(floor(mel(f)))).* w. ny11 = ny./ max(ny).* win10.1) . ny7 = ny.* win2. y1 = y.* win4. % Mel warping ny = ny. Institute of technology.* win6. ny8 = ny. %Store the uttered password for authentication i = 1.* win9. Nirma University Page 49 .* win7.* win16. while abs(y(i)) < 0.

* win20. ny19 = ny.* win17. sy19. Nirma University Page 50 . ny18 = ny. sy17 = sum(ny17.^ 2).* win18.^ 2). sy14 = sum(ny14. sy = [sy1.^ 2).^ 2). sy13. sy7 = sum(ny7. % Determine the energy of the signal within each window sy11 = sum(ny11. sy5. sy18. Institute of technology. ny20 = ny.ny17 = ny. sy9. sy5 = sum(ny5. sy4.^ 2).^ 2). sy20]. sy16 = sum(ny16.^ 2).^ 2). sy17.^ 2). sy8 = sum(ny8. sy8. sy7. sy3 = sum(ny3. sy = log(sy). sy18 = sum(ny18. sy14.dat'.^ 2). sy12. % by summing square of the magnitude of the spectrum sy12 = sum(ny12. sy10. sy20 = sum(ny20.^ 2). sy2.^ 2).^ 2). sy19 = sum(ny19.^ 2).^ 2).* win19. Electronics & communication Eng.^ 2). sy2 = sum(ny2. sy10 = sum(ny10. sy15. sy16.^ 2). sy6. sy13 = sum(ny13. % Determine DCT of Log of the spectrum energies fid = fopen('sample. sy6 = sum(ny6. sy1 = sum(ny1. sy3.^ 2). dy = dct(sy).^ 2). sy9 = sum(ny9. sy15 = sum(ny15. sy4 = sum(ny4.^ 2).'r'). sy11.

'. input('You have 2 seconds to say your name. Institute of technology.wav').dx = fread(fid. % “Access Granted” is output if within threshold %wavplay(Grant).'s').44100. ytemp = zeros (88200.>'. % evaluated in the training phase dx = dx.file). y = wavrecord(88200.^ 2)) / 20. % Obtain the feature vector for the password fclose(fid). 'real*8').5 fprintf('\n\nYou are the same user\n\n'). 20.wav').wav'.1 Voice Recording Matlab code : for i = 1:10 file = sprintf('%s%d. Press enter when ready to record--> ').20). %Deny=wavread('Deny.Matlab Code for FFT approach : 7.'g'. end 7.dy). else fprintf('\n\nYou are not a same user\n\n'). %Grant=wavread('Grant. % Determine the Mean squared error if MSE<1. Nirma University Page 51 .44100). wavwrite(y. % “Access Denied” is output in case of a failure %wavplay(Deny).1). r = zeros (10. end 7.2.44100). for j = 1:10 Electronics & communication Eng. sound(y.i).2 Training and Testing Code : name = input ('Enter the name that must be recognized -.2. MSE=(sum((dx .2.

1 && i <=7000 start = 1.2 * j) = t (start:last). last = 88200. for i = 1:88200 if s (i) >=. ytemp (1: last . Electronics & communication Eng.start + 1. if s (k)>=. Institute of technology.j).'g'. fs] = wavread(file). s = abs (t).wav'. break end if s (k)>= . break end end for i = 1:88200 k = 88201-i. break end end r (j) = last-start.file = sprintf('%s%d.1 && k <81200 last = k + 7000. [t. break end if s (i) >=.1 && i > 7000 start = i-7000. Nirma University Page 52 .1 && k>=81200 last = 88200. start = 1.

y = zeros (min (r).20). for i = 1:20 fn (1:600.ytemp (1: last . for i = 1:20 y (:. end % Convert the individual columns into frequency % domain by applying the Fast Fourier Transform.i) = fy (1:600. Nirma University Page 53 . Institute of technology.(2*j .1).1)) = t (start:last).i).20).start + 1.i) = ytemp (1:min (r). fy = fft (y). % Normalize the spectra of each recording and place into the matrix fn.i)/sqrt(sum (abs (fy (1:600. for i = 1:20 pu = pu + fn (1:600.i)). %Then take the modulus squared of all the entries in the matrix. end % The rows of the matrix are truncated to the smallest length % of the 10 recordings. fy = fy. end % Find the average vector pu pu = zeros (600. Electronics & communication Eng.^2)). fn = zeros (600. %Only frequiencies upto 600 are needed to represent the speech of most % humans.*conj (fy).i).

end std = sqrt (std/19). % Find the Standard Deviation std = 0.^2)).^2). rec = input ('Are you happy with this recording? \nPress 1 to record again or just press enter to proceed--> ').44100). sound (usertemp.44100). sound (usertemp. Press enter when ready') % Record Voice and confirm if the user is happy with their recording usertemp = wavrecord (88200.i)-tn). Nirma University Page 54 . Institute of technology. %%%%%%%% Verification Process %%%%%%%% input ('You will have 2 seconds to say your name. ''.44100). for i = 1:20 std = std + sum (abs (fn (1:600. input ('You will have 2 seconds to say your name.end pu = pu/20. % Normalize the average vector tn = pu/sqrt(sum (abs (pu). Electronics & communication Eng. Press enter when ready') usertemp = wavrecord (88200. while rec == 1 rec = 0.44100).

break end if s (k)>= . last = 88200.1 && k <83200 last = k + 5000. Nirma University Page 55 . for i = 1:88200 if s (i) >=.1 && k>=83200 last = 88200. start = 1. break end end for i = 1:88200 k = 88201-i. Institute of technology. break end if s (i) >=.1 && i <=5000 start = 1. if s (k)>=.1 && i > 5000 start = i-5000. end % Crop recording to a window that just contains the speech s = abs (usertemp). break end Electronics & communication Eng.rec = input ('Are you happy with this recording? \nPress 1 to record again or just press enter to proceed--> ').

plot (userfn) title ('Normalized Frequency Spectra Of Recording') subplot (2.1). userftemp = fft (user).'.' !!!!').2).' !!!!').name. if s < 2*std name = strcat ('HELLO----'.1. name else name = strcat ('YOU ARE NOT---.*conj (userftemp).end % Transform the recording with FFT and normalize user = usertemp (start:last).tn). Nirma University Page 56 .1. Institute of technology. userfn = userf/sqrt(sum (abs (userf). userftemp = userftemp. title ('Normalized Frequency Spectra of Average') % Confirm weather user's voice is within two standard deviations of mean.name. s = sqrt (sum (abs (userfn . plot (tn). subplot (2. name end Electronics & communication Eng. userf = userftemp (1:600).^2)).^2)). % Plot the spectra of the recording along with the average normal vector hold on.

k).Matlab Code for VQ approch : 7.n) / m) + 1.2 mfcc. Institute of technology. disp(file). n = 256.1 Train. for i = 1:n for j = 1:nbFrame Electronics & communication Eng. traindir. i). l = length(s).wav'. Nirma University Page 57 . fs) m = 100.3. fs] = wavread(file). nbFrame = floor((l . n) k = 16. % number of centroids required for i = 1:n % train a VQ codebook for each speaker file = sprintf('%s%d.m function r = mfcc(s.m function code = train(traindir.7. fs). % Compute MFCC's code{i} = vqlbg(v. [s. end % Train VQ codebook 7.3. v = mfcc(s.3.

^2.i) = fft(M2(:.3. j) = s(((j . n. if (M ~= M2) error('Matrix dimensions do not match. :)). y) [M. Nirma University Page 58 . [M2. for i = 1:nbFrame frame(:. N] = size(x). tmax = l / fs.1) * m) + i).3 disteu. M2 = diag(h) * M. z = m * abs(frame(1:n2. 7. r = dct(log(z)).M(i. end end %size(M) h = hamming(n).') Electronics & communication Eng.m function d = disteu(x. n2 = 1 + floor(n / 2). end t = n / 2. m = melfb(20. Institute of technology. fs). P] = size(y). i)).

^2.m function m = melfb(p. 1)'.N).5/f0) / (p+1). n. fn2 = floor(n/2). fs) f0 = 700 / fs.P).^2. Nirma University Page 59 . Institute of technology.end d = zeros(N.:) = sum((x(:.4 melfb. end else copies = zeros(1. 1).y) . for n = 1:N d(n. Electronics & communication Eng.1)). if (N < P) copies = zeros(1.^0. for p = 1:P d(:.y(:. lr = log(1 + 0. b1 = floor(bl(1)) + 1.5. p+copies)) . P). % convert to fft bin numbers with 0 for DC term bl = n * (f0 * (exp([0 1 p p+1] * lr) . end end d = d.3. n+copies) . 7.p) = sum((x .

pf = log(1 + (b1:b4)/n/f0) / lr. []. m = sparse(r. 2). v = 2 * [1-pm(b2:b4) pm(1:b3)]. j) = mean(d(:. r = mean(d. c = [b2:b4 1:b3] + 1.fp. Electronics & communication Eng. b3 = floor(bl(3)). 2). p. for i = 1:log2(k) r = [r*(1+e). Institute of technology. 2). b4 = min(fn2. for j = 1:2^i r(:. find(ind == j)). while (1 == 1) z = disteu(d. r = [fp(b2:b4) 1+fp(1:b3)]. r). Nirma University Page 60 . ceil(bl(4))) . r*(1-e)]. c. 7.m function r = vqlbg(d. pm = pf .1. fp = floor(pf).b2 = ceil(bl(2)).5 vqlbg. dpr = 10000. [m.k) e = . v.3.0003. 1+fn2).ind] = min(z. t = 0.

fs] = wavread(file). % each trained codebook. find(ind == j)). v = mfcc(s. code{l}).wav'. for l = 1:length(code) d = disteu(v.3. % Compute MFCC's distmin = inf. end end end 7. j)). k1 = 0.6 test.[]. Nirma University Page 61 . if dist < distmin Electronics & communication Eng.1). code) for k = 1:n % read test sound file of each speaker file = sprintf('%ss%d. r(:. testdir.t)/t) < e) break.x = disteu(d(:. [s. else dpr = t. n. Institute of technology.2)) / size(d. fs).m function test(testdir. end end if (((dpr . for q = 1:length(x) t = t + x(q). k). compute distortion dist = sum(min(d.

Institute of technology. end Electronics & communication Eng. disp(msg). end end msg = sprintf('Speaker %d matches with speaker %d'. k. k1). Nirma University Page 62 .distmin = dist. k1 = l.

In this method. ample memory and an ever-present power supply. a distortion measure which based on the minimizing the Euclidean distance was used when matching an unknown speaker with the speaker database. During this project. with the rapid evolvement of hardware and software technologies. Single purpose command and control system such as voice dialing for cellular. By investigating the extracted features of the unknown speech and then compare them to the stored extracted features for each different speaker in order to identify the unknown speaker. the LBG algorithm is used to do the clustering. home. ASR has become more and more expedient as an alternative human-to-machine interface that is needed for the following application areas: Stand-alone consumer devices such as wrist watch. In these years. toys and hands-free mobile phone in car where people are unable to use other interfaces or big input platforms like keyboards are not available. Nirma University Page 63 . The feature extraction is done by using MFCC (Mel Frequency Cepstral Coefficients). The speaker was modeled using Vector Quantization (VQ). we have found out that the VQ based clustering approach provides us with the faster speaker identification process than only mfcc approach or FFT approach. However.Conclusion The goal of this project was to create a speaker recognition system. most state-of-the-art ASR systems run on desktop with powerful microprocessors. In the recognition stage. speech recognition technology has reached a relatively high level. A VQ codebook is generated by clustering the training feature vectors of each speaker and then stored in the speaker database. and office phones where multi-function computers (PCs) are redundant. and apply it to a speech of an unknown speaker. Applications After nearly sixty years of research. Institute of technology. Electronics & communication Eng.

The system not only verifies if the correct password has been said but also focused on the authenticity of the speaker. Here the user initially trains the system by uttering the digits from 0 to 9. Electronics & communication Eng. Iris. She/he has to initially speak the right password to gain access to a system. The key focus of this application is to aid the physically challenged in executing a mundane task like telephone dialing. Nirma University Page 64 . Institute of technology. The ultimate goal is do have a system which does a Speech. Thus the inherent security provided by the system was confirmed. Typically a user has two levels of check. The algorithm is run on a particular speaker and the MFCC coefficients determined. the system can recognize the digits uttered by the user who trained the system. Once the system has been trained. Now the algorithm is applied to a different speaker and the mismatch was clearly observed. Presently systems have also been designed which incorporate Speech and Speaker Recognition. Fingerprint Recognition to implement access control. This system can also add some inherent security as the system based on cepstral approach is speaker dependent.Some of the applications of speaker verification systems are: Time and Attendance Systems Access Control Systems Telephone-Banking/Broking Biometric Login to telephone aided shopping systems Information and Reservation Services Security control for confidential information Forensic purposes Voice based Telephone dialing is one of the applications we simulated.

But we feel the idea can be extended to “Continuous Word Recognition” and ultimately create a Language Independent Recognition System based on algorithms which make these systems robust. The use of Statistical Models like HMMs. Nirma University Page 65 . the greater the recognition accuracy. The size of the training data i. the same words spoken by male/female speakers and the word being spoken under different conditions say under conditions in which the speaker may have a sore throat etc. Our VQ code takes a very long time for the recognition of the actual voice averagely half and hour we have found this can be decreased using some other algorithm. Here we used LBG algorithm. The error rate of determining the beginning and ending of speech segments will greatly increase which directly influence the recognition performance at the pattern recognition part. But someone can try k-means algorithm also. the code book can be increased in VQ as it is clearly proven that the greater the size of the training data. One of these methods could be to use the statistical way to find a distribution which can separate the noise and speech from each other.e. This field is very vast and research is also done for that purpose. This training data could incorporate aspects like the different ways via the accents in which a word can be spoken. Institute of technology.Scope for future work This project focused on “Isolated Word Recognition”. we should try to use some effective way to do detection. Electronics & communication Eng. So. Some other aspects which can be looked into are: The detection used in this work is only based on the frame energy in MFCC which is not good for a noisy environment with low SNR. This would make the system much tolerant to variations like accent and extraneous conditions like noise and associated residues and hence make it less error prone. GMMs or learning models like Neural Networks and other associated aspects of Artificial Intelligence can also be incorporated in this direction to improve upon the present project.

Turkey. [10] Mengüsoglu. University of Arizona.A. IISc Bangalore. Chiu-Sing Choy and Kong-Pang Pun – „An Efficient MFCC Extraction Method in Speech Recognition’. E. Alsteris and Kuldip K. IIT. Abdulla – „Auditory Based Feature Vectors for Speech Recognition Systems’. (1996). MS Thesis.: “Fundamentals of Speech Synthesis and Speech Recognition”. Jan 30 -Feb 1. for Graduate Studies in Pure and Applied Sciences. [8] Markowitz. IEEE Electronics & communication Eng. M. IEEE – ISCAS. J. (1999). The University of Auckland [5] Pradeep Kumar P and Preeti Rao – „A Study of Frequency-Scale Warping for Speaker Recognition’. Cheong-Fat Chan. John Wiley & Sons.time Fourier Phase Spectra’. [12] Woszczyna. Inst. 2004 [6] Beth Logan – „Mel Frequency Cepstral Coefficients for Music Modeling’. Paliwal – „ASR on Speech Reconstructed from Short. New York. Department of Electronic Engineering. School of Microelectronic Engineering Griffth University. Nirma University Page 66 . USA. The Chinese University of Hong Kong. Institute of technology. Institute of Engineering and Science. 2006 [3] Leigh D. (1994).: “A Large Vocabulary Speech Recognition System for Turkish“. Australia.Bombay. MS Thesis. Ankara. Department of Electrical Engineering in the Graduate College.References [1] Lawrence Rabiner. Electrical & Electronic Engineering Department. Hong. C. Bilkent University. Ankara.2004 [4] Waleed H. Brisbane. Prentice Hall. USA. Biing-Hwang Juang – „Fundamentals of Speech Recognition’ [2] Wei Han. (1999).: “Speech Recognition Using Neural Networks”. ICLSP . MS Thesis. [11] Zegers. [9] Yılmaz. E. Hacettepe University. Compaq Computer Corporation [7] Keller.: “JANUS 93: Towards Spontaneous Speech Translation”. NCC 2004. National Conference on Communications.: “Using Speech Recognition”.: “Rule Based Design and Implementation of a Speech Recognition System for Turkish Language”. Dept of Electrical Engineering. Turkey. Arizona. P. Cambridge Research Laboratory. (1998).

Ejnarsson.ac.:. Department of Telecommunications and Signal Processing. Blekinge Institute of Technology. Institute of technology. M. [13] Somervuo.: “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences”. PhD Thesis. G. Helsinki University of Technology. M.: “Triphone Clustering in Continuous Speech Recognition”.com/zipped. M. (2002). (2004). www. [18] Brookes. Oflazer. Electronics & communication Eng. [17] www. K. Mermelstein. Tur.: “Speech Recognition using context vectors and multiple feature streams”. [15] Hakkani-Tur. Technical Report.: “Biologically Inspired Noise-Robust Speech Recognition for Both Man and Machine”.: “Speech Recognition Using HMM: Performance Evaluation in Noisy Environments”. P. [19] Davis. (1996). (2000). Bilkent University. P.. (2002). (2001).. (2003).: “VOICEBOX: a MATLAB toolbox for speech processing”.D. [20] Skowronski.html.dspguide. Speech. The Graduate School of the University of Florida. [14] Nilsson. MS Thesis.ic. MS Thesis. Nirma University Page 67 .uk/hp/staff/dmb/voicebox/voicebox. vol. D. and Signal Processing.ee. M. [21] MATLAB Product Help: “MATLAB Compiler: Introducing the MATLAB Compiler: Why Compile M-Files?”. M. MS Thesis. Department of Computer Science.. 4 (1980).htm: “The Scientist and Engineer's Guide to Digital Signal Processing” (Access date: March 2005). S.Proceedings Conference on Neural Networks.. “Statistical Morphological Disambiguation for Agglutinative Languages”. [16] Ursin. IEEE Transactions on Acoustics. (1994).

Sign up to vote on this title
UsefulNot useful