(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, 2011
Comparative Analysis of Speaker Identification using Row Mean of DFT, DCT, DST and Walsh Transforms
Dr. H B Kekre
Senior Professor, Computer Department, MPSTME, NMIMS University, Mumbai, India
hbkekre@yahoo.com

Vaishali Kulkarni
Associate Professor, Electronics & Telecommunication, MPSTME, NMIMS University, Mumbai, India
Vaishalikulkarni6@yahoo.com
Abstract—In this paper we propose speaker identification using four different transform techniques. The feature vectors are the row means of the transforms for different groupings. Experiments were performed on the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), Discrete Sine Transform (DST) and Walsh Transform (WHT). All the transforms give an accuracy of more than 80% for the different groupings considered. Accuracy increases as the number of samples grouped is increased from 64 onwards, but for groupings of more than 1024 samples the accuracy starts decreasing again. The results show that DST performs best; the maximum accuracy obtained for DST is 96% for a grouping of 1024 samples while taking the transform.
Keywords - Euclidean distance, Row mean, Speaker Identification, Speaker Recognition
I. INTRODUCTION
Human speech conveys an abundance of information, from the language and gender to the identity of the person speaking. The purpose of a speaker recognition system is thus to extract the unique characteristics of a speech signal that identify a particular speaker [1, 2, 3]. Speaker recognition systems are usually classified into two subdivisions: speaker identification and speaker verification. Speaker identification (also known as closed-set identification) is a 1:N matching process where the identity of a person must be determined from a set of known speakers [3-5]. Speaker verification (also known as open-set identification) serves to establish whether the speaker is who he claims to be [6]. Speaker recognition can be further classified into text-dependent and text-independent systems. In a text-dependent system, the system knows what utterances to expect from the speaker. However, in a text-independent system, no assumptions about the text can be made, and the system must be more flexible than a text-dependent system.

Speaker recognition technology has made it possible to use the speaker's voice to control access to restricted services, for example, giving commands to a computer, phone access to banking, database services, shopping or voice mail, and access to secure equipment. Speaker recognition systems have been developed for a wide range of applications [7-10]. Although many new techniques have been developed, widespread deployment of applications and services is still not possible, since none of these systems gives accurate and reliable results. We have proposed speaker recognition using vector quantization in the time domain by using the LBG (Linde-Buzo-Gray), KFCG (Kekre's Fast Codebook Generation) and KMCG (Kekre's Median Codebook Generation) algorithms [11], [12], [13], and in the transform domain using DFT, DCT and DST [14].

The concept of the row mean of the transform has been used for content based image retrieval (CBIR) [15-18]. This technique has also been applied to speaker identification by first converting the speech signal into a spectrogram [19]. For the purposes of this paper, we consider a speaker identification system that is text-dependent. For identification, the feature vectors are extracted by taking the row mean of the transforms (which is a column vector). The technique is illustrated in Figure 1; a short code sketch is given at the end of this section. Here a speech signal of 15 samples is divided into 3 blocks of 5 samples each, and these 3 blocks form the columns of the matrix whose transform is taken. Then the mean of the absolute value of each row of the transform matrix is taken, and this forms the column vector of means.

The rest of the paper is organized as follows: Section 2 explains feature generation using the transform techniques, Section 3 deals with feature matching, the results are explained in Section 4, and the conclusion is given in Section 5.
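Before describing the individual transforms, the row-mean computation of Figure 1 can be sketched in NumPy as follows; this is a minimal sketch, where the 15-sample toy signal and the choice of the DFT as the transform are purely illustrative assumptions:

```python
import numpy as np

def row_mean_feature(signal, block_size, transform=np.fft.fft):
    """Split the signal into blocks of block_size samples, arrange the blocks
    as columns of a matrix, apply the transform to each column, and return
    the mean of the absolute value of each row as the feature vector."""
    n_blocks = len(signal) // block_size
    matrix = np.reshape(signal[:n_blocks * block_size],
                        (n_blocks, block_size)).T      # block_size x n_blocks
    transformed = transform(matrix, axis=0)            # transform each column
    return np.mean(np.abs(transformed), axis=1)        # row mean (one value per row)

# Figure 1 example: a 15-sample signal split into 3 blocks of 5 samples each.
speech = np.arange(1.0, 16.0)
print(row_mean_feature(speech, block_size=5))          # 5-element row-mean vector
```

The same helper applies unchanged to the 64- to 4096-sample groupings used in the experiments; only the transform function changes.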
 
II. TRANSFORM TECHNIQUES

A. Discrete Fourier Transform
Spectral analysis is the process of identifying component frequencies in data. For discrete data, the computational basis of spectral analysis is the discrete Fourier transform (DFT). The DFT transforms time- or space-based data into frequency-based data. The DFT allows you to efficiently estimate component frequencies in data from a discrete set of values sampled at a fixed rate. If the speech signal is represented by y(t), then the DFT of the time series or samples $y_0, y_1, y_2, \ldots, y_{N-1}$ is defined as:
$$Y_k = \sum_{n=0}^{N-1} y_n \, e^{-2j\pi k n / N} \qquad (1)$$

where $y_n = y_s(n\Delta t)$; $k = 0, 1, 2, \ldots, N-1$; and $\Delta t$ is the sampling interval.
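Equation (1) is exactly the sum computed by NumPy's FFT routine, so a single block can be transformed as follows; the 440 Hz test tone and the 64-sample block length are illustrative assumptions:

```python
import numpy as np

fs = 8000                                   # sampling rate of the recordings (Hz)
n = np.arange(64)                           # one block of N = 64 samples
y = np.sin(2 * np.pi * 440.0 * n / fs)      # hypothetical 440 Hz test tone

Y = np.fft.fft(y)                           # Y_k = sum_n y_n * exp(-2j*pi*k*n/N)
print(np.abs(Y)[:8])                        # magnitudes, later averaged into the row mean
```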
 
Figure 1. Row Mean Generation Technique
B. Discrete Cosine Transform
A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies.

$$y(k) = w(k) \sum_{n=1}^{N} x(n)\,\cos\!\left(\frac{\pi (2n-1)(k-1)}{2N}\right), \qquad k = 1, \ldots, N \qquad (2)$$

where y(k) is the cosine transform and the weight w(k) is

$$w(k) = \begin{cases} 1/\sqrt{N}, & k = 1 \\ \sqrt{2/N}, & 2 \le k \le N \end{cases}$$
The DCT is closely related to the discrete Fourier transform. You can often reconstruct a sequence very accurately from only a few DCT coefficients, a useful property for applications requiring data reduction [20-22].
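A sketch of this behaviour using SciPy's type-II DCT with orthonormal scaling, which applies the same w(k) weighting as equation (2); the random block contents and the number of retained coefficients are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dct, idct

block = np.random.randn(64)                  # one block of N = 64 samples
y = dct(block, type=2, norm='ortho')         # y(k) as in equation (2)

# Energy compaction: keep only the first 16 coefficients and reconstruct.
y_truncated = np.zeros_like(y)
y_truncated[:16] = y[:16]
approx = idct(y_truncated, type=2, norm='ortho')
print(np.linalg.norm(block - approx) / np.linalg.norm(block))  # relative error
```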
C. Discrete Sine Transform
A discrete sine transform (DST) expresses a sequence of finitely many data points in terms of a sum of sine functions.

$$y(k) = \sum_{n=1}^{N} x(n)\,\sin\!\left(\frac{\pi k n}{N+1}\right), \qquad k = 1, \ldots, N \qquad (3)$$

where y(k) is the sine transform.
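A minimal sketch using SciPy's type-I DST, which for the default normalization equals twice the sum in equation (3); the constant factor only rescales the row-mean features, and the random block is an illustrative assumption:

```python
import numpy as np
from scipy.fft import dst

block = np.random.randn(64)        # one block of N = 64 samples
y = dst(block, type=1)             # 2 * sum_n x(n) * sin(pi*k*n/(N+1))
print(np.abs(y)[:8])               # absolute values feed the row-mean feature
```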
D. Walsh Transform

The Walsh transform or Walsh-Hadamard transform is a non-sinusoidal, orthogonal transformation technique that decomposes a signal into a set of basis functions. These basis functions are Walsh functions, which are rectangular or square waves with values of +1 or -1. The Walsh-Hadamard transform returns sequency values. Sequency is a more generalized notion of frequency and is defined as one half of the average number of zero-crossings per unit time interval. Each Walsh function has a unique sequency value. You can use the returned sequency values to estimate the signal frequencies in the original signal. The Walsh-Hadamard transform is used in a number of applications, such as image processing, speech processing, filtering, and power spectrum analysis. It is very useful for reducing bandwidth storage requirements and spread-spectrum analysis. Like the FFT, the Walsh-Hadamard transform has a fast version, the fast Walsh-Hadamard transform (fwht). Compared to the FFT, the FWHT requires less storage space and is faster to calculate because it uses only real additions and subtractions, while the FFT requires complex values. The FWHT is able to represent signals with sharp discontinuities more accurately using fewer coefficients than the FFT. FWHT_h is a divide-and-conquer algorithm that recursively breaks down a WHT of size N into two smaller WHTs of size N/2. This implementation follows the recursive definition of the Hadamard matrix $H_N$:

$$H_N = \frac{1}{\sqrt{2}} \begin{bmatrix} H_{N/2} & H_{N/2} \\ H_{N/2} & -H_{N/2} \end{bmatrix} \qquad (4)$$

The normalization factors for each stage may be grouped together or even omitted. The sequency-ordered, also known as Walsh-ordered, fast Walsh-Hadamard transform, FWHT_w, is obtained by computing FWHT_h as above and then rearranging the outputs [23].
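A minimal recursive sketch of the Hadamard-ordered FWHT described above, following the recursion of equation (4) with the normalization omitted (the 64-sample random block is an illustrative input; block lengths must be powers of two):

```python
import numpy as np

def fwht_h(x):
    """Hadamard-ordered fast Walsh-Hadamard transform: a WHT of size N is
    split into two WHTs of size N/2, using only real additions and
    subtractions. Normalization factors are omitted, as the text allows."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n == 1:
        return x
    half = n // 2
    top = fwht_h(x[:half] + x[half:])      # H_{N/2} applied to the sum
    bottom = fwht_h(x[:half] - x[half:])   # H_{N/2} applied to the difference
    return np.concatenate([top, bottom])

block = np.random.randn(64)                # one block of N = 64 samples
print(np.abs(fwht_h(block))[:8])           # magnitudes used for the row mean
```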
 
III. FEATURE EXTRACTION
The procedure for feature vector extraction is given below; a code sketch of the complete procedure follows the discussion of Figure 2.

1. The speech signal is divided into groups of n samples (where n can take the values 64, 128, 256, 512, 1024, 2048, and 4096).
2. These blocks are then arranged as columns of a matrix and then the different transforms given in Section II are taken.
3. The mean of the absolute values of the rows of the transform matrix is then calculated.
4. These row means form a column vector (1 × n, where n is the number of rows in the transform matrix).
5. This column vector forms the feature vector for the speech sample.
6. The feature vectors for all the speech samples are calculated for different values of n and stored in the database.

Figure 2 shows the row mean generated for the four transforms for a grouping of 64 samples for one of the speech signals in the database. These 64 row means form the feature vector for the particular sample considered. In a similar fashion, the feature vectors for the other speech signals were also calculated. This process was repeated for all values of n. As can be seen from Figure 2, the 64 mean values form a 1 × 64 feature vector.
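The code below sketches steps 1 through 6 under illustrative assumptions: a random array stands in for one recorded 8 kHz utterance, and the Walsh transform is omitted since it would reuse the FWHT sketch from Section II-D.

```python
import numpy as np
from scipy.fft import dct, dst

def row_mean_feature(signal, n, transform):
    """Steps 1-4: blocks of n samples become columns of a matrix, the matrix
    is transformed column-wise, and the mean of the absolute value of each
    row gives an n-element feature vector."""
    blocks = len(signal) // n
    matrix = np.reshape(signal[:blocks * n], (blocks, n)).T
    return np.mean(np.abs(transform(matrix)), axis=1)

# Column-wise transforms (axis=0 applies them to each block independently).
transforms = {
    'DFT': lambda m: np.fft.fft(m, axis=0),
    'DCT': lambda m: dct(m, type=2, norm='ortho', axis=0),
    'DST': lambda m: dst(m, type=1, axis=0),
}

# Stand-in for one recorded utterance (the real database uses 8 kHz speech).
speech = np.random.randn(5 * 8000)

# Steps 5-6: one feature vector per transform and grouping; in the full
# system these vectors are stored in the database for every speaker.
groupings = [64, 128, 256, 512, 1024, 2048, 4096]
features = {(name, n): row_mean_feature(speech, n, t)
            for name, t in transforms.items() for n in groupings}
print(features[('DCT', 64)].shape)       # (64,) -> the 1 x 64 feature vector
```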
 
IV. RESULTS
 A.
 
 Basics of speech signal
The speech samples used in this work are recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8-bit, mono PCM samples). Table I shows the database description. The samples are collected from different speakers. Samples are taken from each speaker in two sessions so that training and testing data can be created. Twelve samples per speaker are taken. The samples recorded in the first session are kept in the database and the samples recorded in the second session are used for testing.
Figure 2. Row Mean Generation for a grouping of 64 samples for one of the speech signals. The four panels plot the mean of the absolute value of each row of the transform matrix (amplitude versus row index) for the DFT, DST, Walsh, and DCT.