Abstract: Text-independent speaker recognition is a research hotspot. In the proposed system, 12 MFCCs and the centroid of each frame are extracted as audio features of speakers, and codebooks of speakers are obtained by the LBG-VQ algorithm. Finally, speaker identification is performed according to the Chebyshev distance, rather than the Euclidean distance, between the codebooks of the testing sequence and each training sequence. The experiments show that the proposed system is simple and workable for speaker identification at high speed.

I INTRODUCTION

Speaker recognition identifies a particular person, or verifies a person's claimed identity, from the audio features of speakers, and it is an important part of speech signal processing. It is currently one of the research hotspots in biometric recognition, since it has good prospects for applications in police work, military affairs, electronic banking, information services, and so on.

A speaker recognition system can be divided into two main tasks: feature extraction and speaker classification [1]. The selection of features determines the separability of speakers, and it also has a large influence on the classification step. Several analytical approaches have been applied to the task of speaker classification. Dynamic Time Warping (DTW), Vector Quantization (VQ) and Hidden Markov Models (HMM) are three of the most common approaches [2]. VQ is a coding technique applied to speech data to form a representative set of features. This set, or codebook, can be used to represent a given speaker. VQ is essentially time independent, giving identical results even if the time sequence of the testing features were randomly shuffled.

In this paper*, we propose a new speaker recognition system based on VQ, after extracting Mel Frequency Cepstral Coefficients (MFCCs) and centroids. It is shown that this system achieves very high recognition accuracy.

II PREPROCESSING OF SPEECH SIGNAL

In order to improve system performance, preprocessing of the speech signal is needed, which usually includes pre-emphasis, framing and windowing.

A. Pre-emphasis

Pre-emphasis is traditionally used to compensate for the -6 dB/octave spectral slope of the speech signal by giving a +6 dB/octave lift in the appropriate range, so that the measured spectrum has a similar dynamic range across the entire frequency band. This step consists in filtering the signal with a first-order high-pass filter described in (1). It weakens the influence of the fundamental voice frequency and strengthens the higher frequencies.

    H(z) = 1 − k·z⁻¹,  k ∈ [0.9, 1]                      (1)

Here k is usually set to 0.95 or 0.97. In this paper, the value of k is set to 0.95.

B. Framing

In this step, the whole speech signal is divided into shorter signal segments, called frames. The length of these frames is usually about 10-20 ms. Sometimes the frames may overlap each other, which shortens the step from the 10-20 ms frame length to a smaller value (see Fig.1).

C. Windowing

Each frame has to be multiplied by a Hamming window in order to keep the continuity of the first and the last points in the frame. If the signal in a frame is denoted by s(n), n = 0,…,N-1, then the signal after Hamming windowing is s(n)·w(n), where w(n) is the Hamming window defined by:

    w(n, α) = (1 − α) − α·cos(2πn/(N − 1)),  n ∈ [0, N − 1]      (2)

Different values of α correspond to different curves of the Hamming window. In practice, α is set to 0.46.

III FEATURE EXTRACTION

Feature extraction is one of the most important issues in the field of speaker recognition, since the extracted features are representative of the speech. In this part, we extract the Mel Frequency Cepstral Coefficients (MFCC for short) and the centroid.

A. MFCC

For speaker recognition, the most commonly used acoustic feature is the MFCC. This feature takes human perceptual sensitivity with respect to frequencies into consideration, and is therefore well suited to speaker recognition.
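As a concrete illustration, the pre-emphasis filter of (1) with k = 0.95 can be sketched in a few lines of NumPy; the signal values below are our own toy example, not data from the paper:

```python
import numpy as np

def pre_emphasis(signal, k=0.95):
    """First-order high-pass filter H(z) = 1 - k*z^-1, applied in the time domain."""
    # y[n] = x[n] - k * x[n-1]; the first sample is passed through unchanged.
    return np.append(signal[0], signal[1:] - k * signal[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])   # a constant (low-frequency) toy signal
y = pre_emphasis(x)                  # [1.0, 0.05, 0.05, 0.05]
```

As expected of a high-pass filter, the constant component is strongly attenuated after the first sample.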
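Framing and Hamming windowing per (2) can be combined into one short sketch. The 20 ms frame length, 10 ms shift and 16 kHz sampling rate match the experimental settings reported in the simulation section; the function and variable names are our own:

```python
import numpy as np

def frame_and_window(signal, fs, frame_ms=20, shift_ms=10, alpha=0.46):
    """Split the signal into overlapping frames and apply the Hamming window (2)."""
    N = int(fs * frame_ms / 1000)      # samples per frame
    step = int(fs * shift_ms / 1000)   # frame shift; overlap = N - step
    n = np.arange(N)
    w = (1 - alpha) - alpha * np.cos(2 * np.pi * n / (N - 1))  # w(n, 0.46)
    num_frames = 1 + (len(signal) - N) // step
    frames = np.stack([signal[i * step : i * step + N] for i in range(num_frames)])
    return frames * w                  # each row is one windowed frame

fs = 16000                             # 16 kHz, as in the experiments
x = np.random.randn(fs)                # one second of dummy audio
frames = frame_and_window(x, fs)
print(frames.shape)                    # (99, 320): 99 frames of 320 samples each
```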
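The MFCC computation itself is not detailed in this excerpt. The following is a common textbook recipe (power spectrum, triangular mel filterbank, log energies, DCT-II) and should be read as an assumption about the pipeline, not as the paper's exact implementation:

```python
import numpy as np

def mfcc(frame, fs, n_filters=20, n_coeffs=12):
    """Compute 12 MFCCs of one windowed frame via a triangular mel filterbank."""
    N = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    mel = lambda f: 2595 * np.log10(1 + f / 700)      # Hz -> mel
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)  # mel -> Hz
    # Filter edge frequencies, evenly spaced on the mel scale, mapped to FFT bins.
    pts = inv_mel(np.linspace(0, mel(fs / 2), n_filters + 2))
    bins = np.floor((N + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, len(spec)))
    for i in range(n_filters):                        # triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logE = np.log(fb @ spec + 1e-10)                  # log filterbank energies
    m = np.arange(n_filters)
    # DCT-II, keeping coefficients 1..12 (c0 is conventionally dropped).
    dct = np.cos(np.pi * np.outer(np.arange(1, n_coeffs + 1), m + 0.5) / n_filters)
    return dct @ logE

coeffs = mfcc(np.random.randn(320), 16000)
print(coeffs.shape)                                   # (12,)
```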
* Funded by the Natural Science Foundation of Shandong Province (No.Y2007G44)

Fig.1 Framing
system, it is better than the Euclidean distance.

V SIMULATION RESULTS

The speaker identification experiments described in this section are 'closed-set' tests in which all test speakers are included in the training database. Training data of 40 speakers were supplied in 16 kHz, 16-bit PCM format. The frame length is 20 ms and the frame shift is 10 ms. The length of each speech sequence is about 10 s. In the classification process, the number of code vectors in each codebook is 32.

Table I: ACCURACY RATE

                      Euclidean distance   Chebyshev distance
  12 MFCCs                  92.5%                97.5%
  12 MFCCs+Centroid         95%                  100%

Table II: COMPARE WITH REF. [6]
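The LBG-VQ codebook training used above (32 code vectors per speaker) can be sketched as binary splitting followed by k-means refinement. The splitting factor eps and the iteration count below are our own choices; the paper does not specify them in this excerpt:

```python
import numpy as np

def lbg_codebook(features, size=32, eps=0.01, n_iter=20):
    """Grow a codebook by binary splitting, refining each stage with k-means (LBG)."""
    codebook = features.mean(axis=0, keepdims=True)   # start from the global mean
    while len(codebook) < size:
        # Split every code vector into a perturbed pair, doubling the codebook.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):                       # Lloyd / k-means refinement
            d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)                # nearest code vector per frame
            for j in range(len(codebook)):
                cell = features[nearest == j]
                if len(cell):                         # keep old vector if cell is empty
                    codebook[j] = cell.mean(axis=0)
    return codebook

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 13))   # e.g. 12 MFCCs + centroid per frame
cb = lbg_codebook(feats, size=32)
print(cb.shape)                           # (32, 13)
```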
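The Euclidean/Chebyshev comparison of Table I can be illustrated with a hypothetical matching rule: score each test vector against its nearest code vector and pick the speaker whose codebook gives the smallest average distance. The paper's exact decision rule may differ:

```python
import numpy as np

def match_score(test_feats, codebook, metric="chebyshev"):
    """Average distance from each test vector to its nearest code vector."""
    diff = np.abs(test_feats[:, None, :] - codebook[None, :, :])
    if metric == "chebyshev":
        d = diff.max(axis=2)                     # L-infinity distance
    else:
        d = np.sqrt((diff ** 2).sum(axis=2))     # Euclidean distance
    return d.min(axis=1).mean()                  # nearest-codeword distance, averaged

def identify(test_feats, codebooks, metric="chebyshev"):
    """Return the index of the speaker whose codebook matches best."""
    scores = [match_score(test_feats, cb, metric) for cb in codebooks]
    return int(np.argmin(scores))

# Toy example: two speaker codebooks and test features near speaker 1.
codebooks = [np.zeros((2, 3)), np.full((2, 3), 5.0)]
test = np.full((4, 3), 5.1)
print(identify(test, codebooks))                 # 1
```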