(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 6, 2010
Performance Evaluation of Speaker Identification for Partial Coefficients of Transformed Full, Block and Row Mean of Speech Spectrogram using DCT, WALSH and HAAR
Dr. H. B. Kekre
Senior Professor, MPSTME, SVKM's NMIMS University, Mumbai 400-056, India
hbkekre@yahoo.com

Dr. Tanuja K. Sarode
Assistant Professor, Thadomal Shahani Engg. College, Bandra (W), Mumbai 400-050, India
tanuja_0123@yahoo.com

Shachi J. Natu
Lecturer, Thadomal Shahani Engg. College, Bandra (W), Mumbai 400-050, India
shachi_natu@yahoo.com

Prachi J. Natu
Assistant Professor, GVAIET, Shelu, Karjat 410201, India
prachi.natu@yahoo.com
Abstract - In this paper an attempt has been made to provide simple techniques for speaker identification using transforms such as DCT, WALSH and HAAR, along with the use of spectrograms instead of raw speech waves. The spectrograms form an image database here. This image database is then subjected to different transformation techniques applied in different ways: on the full image, on image blocks, and on the Row Mean of an image and of image blocks. In each method, results have been observed for partial feature vectors of the image. From the results it has been observed that a transform on image blocks is better than a transform on the full image in terms of identification rate and computational complexity. A further increase in identification rate and decrease in computations has been observed when the transforms are applied on the Row Mean of an image and of image blocks. Use of a partial feature vector further reduces the number of comparisons needed to find the most appropriate match.

Keywords - Speaker Identification, DCT, WALSH, HAAR, Image blocks, Row Mean, Partial feature vector.
I. INTRODUCTION

To provide security in a multiuser environment, it has become crucial to identify users and to grant access only to those users who are authorized. Apart from the traditional login-and-password method, the use of biometric technology for the authentication of users is becoming increasingly popular. Biometrics comprises methods for uniquely recognizing humans based upon one or more intrinsic physical or behavioral traits. Biometric characteristics can be divided into two main classes: physiological characteristics, which are related to the shape of the body (examples include fingerprint, face recognition, DNA, hand and palm geometry, and iris recognition), and behavioral characteristics, which are related to the behavior of a person (examples include typing rhythm, gait and voice). Techniques like face recognition, fingerprint recognition and retinal blood-vessel patterns have their own drawbacks: to be identified by these methods, an individual should be willing to undergo the tests and should not get upset by the procedures. Speaker recognition allows non-intrusive monitoring and also achieves high accuracy rates that conform to most security requirements. Speaker recognition is the process of automatically recognizing who is speaking based on unique characteristics present in the speaker's voice [2]. There are two major applications of speaker recognition technologies and methodologies: speaker identification and speaker verification.

In the speaker identification task, a speech utterance from an unknown speaker is analyzed and compared with speech models of known speakers. The unknown speaker is identified as the speaker whose model best matches the input utterance. In speaker verification, an identity is claimed by an unknown speaker, and an utterance of this unknown speaker is compared with a model for the speaker whose identity is being claimed. If the match is good enough, that is, above a threshold, the identity claim is accepted. The fundamental difference between identification and verification is the number of decision alternatives [3]. In identification, the number of decision alternatives is equal to the size of the population, whereas in verification there are only two choices, acceptance or rejection, regardless of the population size. Therefore, speaker identification performance decreases as the size of the population increases, whereas speaker verification performance approaches a constant, independent of the population size, unless the distribution of physical characteristics of speakers is extremely biased.

Speaker identification can be further categorized into text-dependent and text-independent speaker identification based on the relevance to speech contents [2, 4]. Text Dependent Speaker Identification requires the speaker to say exactly the enrolled or given password/speech. Text Independent Speaker Identification is the process of verifying the identity without any constraint on the speech content.
It has no advance knowledge of the speaker's utterance and is more flexible in situations where the individuals submitting the sample may be unaware of the collection or unwilling to cooperate, which presents a more difficult challenge. Compared to Text Dependent Speaker Identification, Text Independent Speaker Identification is more convenient because the user can speak freely to the system. However, it requires longer training and testing utterances to achieve good performance. Text Independent Speaker Identification is also a more difficult problem than Text Dependent Speaker Identification because the recognition system must be prepared for an arbitrary input text.

The speaker identification task can be further classified into closed-set and open-set identification. In the closed-set problem, from N known speakers, the system finds the speaker whose reference template has the maximum degree of similarity with the template of the input speech sample of the unknown speaker. This unknown speaker is assumed to be one of the given set of speakers; thus, in the closed-set problem, the system makes a forced decision by choosing the best-matching speaker from the speaker database. In open-set text-dependent speaker identification, a matching reference template for an unknown speaker's speech sample may not exist, so the system must have a predefined tolerance level such that the similarity degree between the unknown speaker and the best-matching speaker is within this tolerance.

In the proposed method, speaker identification is carried out with spectrograms and transformation techniques such as DCT, WALSH and HAAR [15-18]. Thus an attempt is made to formulate a digital signal processing problem as pattern recognition of images.

The rest of the paper is organized as follows: Section II presents related work carried out in the field of speaker identification. Section III presents our proposed approach. Section IV elaborates the experiments conducted and the results obtained. Analysis of computational complexity is presented in Section V. The conclusion is outlined in Section VI.

II. RELATED WORK

At the highest level, all speaker recognition systems contain two modules: feature extraction and feature matching. Feature extraction is the process of extracting a subset of features from voice data that can later be used to identify the speaker; the basic idea behind feature extraction is that the entire feature set is not always necessary for the identification process. Feature matching is the actual procedure of identifying the speaker by comparing the extracted voice data with a database of known speakers, on the basis of which a suitable decision is made.

There are many techniques used to parametrically represent a voice signal for the speaker recognition task. One of the most popular among them is Mel-Frequency Cepstrum Coefficients (MFCC) [1]. The MFCC parameter, as proposed by Davis and Mermelstein [5], describes the energy distribution of the speech signal in the frequency domain. Wang Yutai et al. [6] have proposed a speaker recognition system based on dynamic MFCC parameters. This technique combines the speaker information obtained by MFCC with the pitch to dynamically construct a set of Mel-filters; these Mel-filters are then used to extract the dynamic MFCC parameters, which represent characteristics of the speaker's identity.

Sleit, Serhan and Nemir [7] have proposed a histogram-based speaker identification technique which uses a reduced set of features generated using the MFCC method. For these features, histograms are created using a predefined interval length. In the first approach, histograms are generated for all data in the feature set of every speaker; in the second approach, histograms are generated for each feature column in the feature set of each speaker.

Another widely used method for feature extraction is the use of Linear Prediction Coefficients (LPC). LPCs capture information about the short-time spectral envelope of speech and represent important speech characteristics such as formant speech frequency and bandwidth [8].

Vector Quantization (VQ) is yet another approach to feature extraction [19-22]. In Vector Quantization based speaker recognition systems, each speaker is characterized with several prototypes known as code vectors [9]. Speaker recognition based on non-parametric vector quantization was proposed by Pati and Prasanna [10]. Speech is produced due to excitation of the vocal tract; in this approach, the excitation information is captured using LP analysis of the speech signal and is called the LP residual. This LP residual is then subjected to non-parametric Vector Quantization to generate codebooks of sufficiently large size. Combining non-parametric Vector Quantization on excitation information with the vocal tract information obtained by MFCC was also introduced by them.
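The proposed method in this paper does not use MFCCs, but for concreteness, the MFCC extraction surveyed above can be sketched in a few lines. This is only an illustration: the librosa library, the coefficient count and the file name are assumptions, not anything used by the cited works.

```python
# Illustrative MFCC extraction (survey context only; not the paper's
# proposed method). The librosa library and file name are assumptions.
import librosa

# Load a speech sample at its native sampling rate.
signal, sr = librosa.load("speaker01_sentence1.wav", sr=None)

# 13 Mel-frequency cepstral coefficients per frame, a common default.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```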
 
III. PROPOSED METHODS

In the proposed methods, we first converted the speech samples collected from various speakers into spectrograms [11]. The spectrograms were created using the Short Time Fourier Transform (STFT) method, as discussed below.

In the STFT approach, the digitally sampled data are divided into chunks of a specific size, say 128 or 256 samples, which usually overlap. The Fourier transform of each chunk is then taken to calculate the magnitude of its frequency spectrum. Each chunk then corresponds to a vertical line in the image: a measurement of magnitude versus frequency for a specific moment in time (a code sketch of this conversion appears later in this section).

Thus we converted the speech database into an image database. Different transformation techniques, such as the Discrete Cosine Transform [12], the WALSH transform and the HAAR transform, are then applied to these images in three different ways to obtain their feature vectors:

1. Transform on the full image.
2. Transform on image blocks obtained by dividing an image into four equal and non-overlapping blocks.
3. Transform on the Row Mean of an image and on the Row Mean of image blocks.

From these feature vectors, the identification rate is again obtained for various portions selected from the feature vector, i.e. for partial feature vectors [15, 23, 24]. Two different database sets were generated: the first set contains 60% of the total images as trainee images and the remaining 40% as test images; the second set contains 80% of the images as trainee images and the remaining 20% as test images. The Euclidean distance between a test image and a trainee image is used as the measure of similarity.
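The spectrogram conversion described at the start of this section can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the 256-sample chunk size, the 50% overlap, the grey-level mapping and the file names are all assumptions.

```python
# Minimal sketch of the STFT-based spectrogram conversion; chunk size,
# overlap and file names are assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
from PIL import Image

rate, samples = wavfile.read("speaker01_sentence1.wav")
samples = samples.astype(np.float64)  # assume a mono recording

# Each column of Sxx is the magnitude spectrum of one overlapping chunk,
# i.e. one vertical line of the spectrogram image.
f, t, Sxx = spectrogram(samples, fs=rate, nperseg=256, noverlap=128,
                        mode="magnitude")

# Map log-magnitude to 8-bit grey levels and save the image that enters
# the image database (low frequencies at the bottom).
db = 20 * np.log10(Sxx + 1e-10)
grey = (255 * (db - db.min()) / (db.max() - db.min())).astype(np.uint8)
Image.fromarray(np.flipud(grey)).save("speaker01_sentence1.png")
```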
The Euclidean distance between the points X = (X1, X2, ..., Xn) and Y = (Y1, Y2, ..., Yn) is calculated using the formula shown in equation (1):

D = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}   (1)
The trainee image with the smallest Euclidean distance to the test image is the most probable match for the speaker. The algorithms for the transformation technique on the full image and on image blocks are given below.
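As an illustration of this matching rule, a minimal sketch follows; the function and variable names and the NumPy data layout are assumptions, not the authors' code.

```python
# Hedged sketch of the nearest-match rule of equation (1): the trainee
# feature vector with the smallest Euclidean distance identifies the
# speaker.
import numpy as np

def identify_speaker(test_fv, trainee_fvs, speaker_ids):
    """test_fv: 1-D feature vector of a test spectrogram.
    trainee_fvs: 2-D array with one trainee feature vector per row.
    speaker_ids: label of the speaker behind each trainee row."""
    dists = np.sqrt(((trainee_fvs - test_fv) ** 2).sum(axis=1))  # eq. (1)
    return speaker_ids[int(np.argmin(dists))]
```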
A. Transformation techniques on full image [27, 28]

In the first method, the 2-D DCT / WALSH / HAAR transform is applied to the full image, resized to 256*256. Further, instead of the full feature vector of an image, only some portion of the feature vector, i.e. a partial feature vector, is selected for identification. This selection, illustrated in Fig. 1, is based on the number of rows and columns selected from the feature vector of an image. For example, first the full feature vector (i.e. 256*256) is selected, and then partial feature vectors of size 192*192, 128*128, 64*64, 32*32, 20*20 and 16*16 are selected from the feature vector. For these different sizes, the identification rate was obtained.

Fig. 1: Selection of partial feature vector
 
The algorithm for this method is as follows:

Step 1. For each trainee image in the database, resize the image to 256*256.
Step 2. Apply the transformation technique (DCT / WALSH / HAAR) to the resized image to obtain its feature vector.
Step 3. Save these feature vectors for further comparison.
Step 4. For each test image in the database, resize the image to 256*256.
Step 5. Apply the transformation technique (DCT / WALSH / HAAR) to the resized image to obtain its feature vector.
Step 6. Save these feature vectors for further comparison.
Step 7. Calculate the Euclidean distance between the feature vector of each test image and that of each trainee image corresponding to the same sentence.
Step 8. Select the trainee image which has the smallest Euclidean distance to the test image and declare the speaker corresponding to this trainee image as the identified speaker.

Repeat Step 7 and Step 8 for the partial feature vectors obtained from the full feature vector.
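Steps 1-2 and the partial selection of Fig. 1 might look as follows. The DCT stands in for all three transforms, and the choice of the top-left (low-frequency) k*k corner as the retained portion is our assumption based on Fig. 1.

```python
# Sketch of the full-image method: resize to 256*256, apply a 2-D
# transform (DCT here; WALSH/HAAR would replace dctn), then keep a
# partial k*k feature vector. Top-left selection is an assumption.
import numpy as np
from PIL import Image
from scipy.fft import dctn

def full_image_feature(path, size=256):
    img = np.asarray(Image.open(path).convert("L").resize((size, size)),
                     dtype=np.float64)
    return dctn(img, norm="ortho")  # 256*256 feature vector

def partial_feature(fv, k):
    """k = 192, 128, 64, 32, 20 or 16, as in the paper."""
    return fv[:k, :k]
```

Steps 7-8 would then apply the identify_speaker sketch above to the flattened (partial) feature vectors.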
B. Transformation technique on image blocks [27, 29]

In this second method, the resized image of size 256*256 is divided into four equal parts as shown in Fig. 2, and then the 2-D DCT / WALSH / HAAR transform is applied to each part.

Fig. 2: Image divided into four equal non-overlapping blocks (I, II, III, IV)

Thus, when an N*N image is divided into four equal and non-overlapping blocks, blocks of size N/2*N/2 are obtained. The feature vectors of the blocks, appended as columns, form the feature vector of the image; the size of the feature vector of an image in this case is therefore 128*512. Again, the Euclidean distance is used as the measure of similarity, and the identification rate has also been obtained using partial feature vectors. Partial feature vectors of size 96*384, 64*256, 32*128, 16*64 and 8*32 have been selected to find the identification rate. The detailed steps are explained in the algorithm given below:
Step 1. For each trainee image in the database, resize the image to 256*256.
Step 2. Divide the image into four equal and non-overlapping blocks as explained in Fig. 2.
Step 3. Apply the transformation technique (DCT / WALSH / HAAR) to each block obtained in Step 2.
Step 4. Append the feature vectors of the blocks one after the other to get the feature vector of the image.
Step 5. For each test image in the database, resize the image to 256*256.
Step 6. Divide the image into four equal and non-overlapping blocks as shown in Fig. 2.
Step 7. Apply the transformation technique (DCT / WALSH / HAAR) to each block obtained in Step 6.
Step 8. Append the feature vectors of the blocks one after the other to get the feature vector of the image.
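Steps 2-4 of this block method could be sketched as follows, again with the DCT standing in for all three transforms; the column-wise appending yields the 128*512 feature vector described above.

```python
# Sketch of the block method: split the resized 256*256 image into four
# non-overlapping 128*128 blocks (I-IV of Fig. 2), transform each, and
# append the results column-wise into a 128*512 feature vector.
import numpy as np
from PIL import Image
from scipy.fft import dctn

def block_feature(path, size=256):
    img = np.asarray(Image.open(path).convert("L").resize((size, size)),
                     dtype=np.float64)
    h = size // 2
    blocks = [img[:h, :h], img[:h, h:],   # blocks I and II
              img[h:, :h], img[h:, h:]]   # blocks III and IV
    return np.hstack([dctn(b, norm="ortho") for b in blocks])  # 128*512
```

Partial feature vectors of size 96*384, 64*256 and so on would then be obtained by keeping the corresponding corner of each block before appending.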