
A Speaker Recognition System Based on VQ

ZHAO Yanling, ZHENG Xiaoshi, GAO Huixian, LI Na


Shandong Computer Science Center
Jinan, Shandong, 250014, China

Abstract - Text-independent speaker recognition is a research hotspot. After 12 MFCCs and the centroid of each frame are extracted as the audio features of speakers, codebooks of the speakers are obtained by the LBG-VQ algorithm. Finally, speaker identification is done according to the Chebyshev distance, rather than the Euclidean distance, between the codebooks of the testing sequence and each training sequence. The experiments show that the proposed system is simple and workable in speaker identification, with high speed.
I INTRODUCTION

Speaker recognition can identify a particular person or verify a person's claimed identity from the audio features of speakers, and it is an important part of speech signal processing. It is now one of the research hotspots in biometric recognition, since it has good prospects for applications in police work, military affairs, electronic banking, information services, and so on.

A speaker recognition system can be divided into two main tasks: feature extraction and speaker classification [1]. The selection of features determines the separability of speakers, and it also has a large influence on the classification step. Several analytical approaches have been applied to the task of speaker classification. Dynamic Time Warping (DTW), Vector Quantization (VQ) and Hidden Markov Models (HMM) are three of the most common approaches [2]. VQ is a coding technique applied to speech data to form a representative set of features. This set, or codebook, can be used to represent a given speaker. VQ is essentially time independent, giving identical results even if the time sequence of the testing features were randomly shuffled.

In this paper*, we propose a new speaker recognition system based on VQ, after extracting Mel Frequency Cepstral Coefficients (MFCCs) and centroids. It is shown that this system achieves very high recognition accuracy.

* Funded by the Natural Science Foundation of Shandong Province (No. Y2007G44).

II PREPROCESSING OF SPEECH SIGNAL

In order to improve system performance, preprocessing of the speech signal is needed, which usually includes pre-emphasis, framing and windowing.

A. Pre-emphasis

Pre-emphasis is traditionally used to compensate for the -6 dB/octave spectral slope of the speech signal by giving it a +6 dB/octave lift in the appropriate range, so that the measured spectrum has a similar dynamic range across the entire frequency band. This step consists in filtering the signal with the first-order high-pass filter described in (1). It weakens the influence of the fundamental voice frequency and strengthens the higher frequencies.

H(z) = 1 - k z^{-1},  k ∈ [0.9, 1]    (1)

Here k is usually set to 0.95 or 0.97. In this paper, the value of k is 0.95.
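For illustration, the filter in (1) reduces to the difference equation y(n) = x(n) - k·x(n-1). A minimal NumPy sketch (the function name is ours; k = 0.95 as above):

    import numpy as np

    def pre_emphasis(signal: np.ndarray, k: float = 0.95) -> np.ndarray:
        """First-order high-pass filter H(z) = 1 - k*z^{-1} of (1)."""
        # y[n] = x[n] - k * x[n-1]; the first sample is passed through unchanged.
        return np.append(signal[0], signal[1:] - k * signal[:-1])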

B. Framing

In this step, the whole speech signal is divided into shorter signal segments, called frames. The length of these frames is usually about 10-20 ms. The frames may overlap each other, which shortens the frame step from the 10-20 ms frame length to a smaller value (see Fig. 1).

Fig. 1 Framing

C. Windowing

Each frame has to be multiplied by a Hamming window in order to keep the continuity of the first and last points in the frame. If the signal in a frame is denoted by s(n), n = 0, ..., N-1, then the signal after Hamming windowing is s(n)*w(n), where w(n) is the Hamming window defined by:

w(n, α) = (1 - α) - α cos(2πn/(N-1)),  n ∈ [0, N-1]    (2)

Different values of α correspond to different curves of the Hamming window. In practice, α is set to 0.46.
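The two steps can be sketched together as follows, assuming the 16 kHz sampling rate and the 20 ms frame / 10 ms shift reported later in Section V (the vectorized frame indexing and the function name are ours):

    import numpy as np

    def frame_and_window(signal: np.ndarray, frame_len: int = 320,
                         step: int = 160, alpha: float = 0.46) -> np.ndarray:
        """Cut a signal into overlapping frames and apply the Hamming window of (2)."""
        # 320 samples = 20 ms and 160 samples = 10 ms at 16 kHz.
        n_frames = 1 + (len(signal) - frame_len) // step
        idx = np.arange(frame_len) + step * np.arange(n_frames)[:, None]
        frames = signal[idx]                                  # (n_frames, frame_len)
        n = np.arange(frame_len)
        w = (1 - alpha) - alpha * np.cos(2 * np.pi * n / (frame_len - 1))  # eq. (2)
        return frames * w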


III FEATURE EXTRACTION

Feature extraction is one of the most important issues in the field of speaker recognition, since the extracted features are what represent the speech. In this part, we extract Mel Frequency Cepstral Coefficients (MFCCs for short) and the centroid.

A. MFCC

For speaker recognition, the most commonly used acoustic feature is the MFCC. This feature takes human perceptual sensitivity with respect to frequency into consideration, which makes it well suited to speaker recognition.

Feature extraction based on MFCCs proceeds right after the input speech is pre-emphasized, framed and windowed. We explain the step-by-step computation of the MFCCs in this section, and select the first 12 MFCCs.

(a) Fast Fourier transform (FFT): Obtain the magnitude frequency response of each frame.

(b) Triangular band-pass filters: Multiply the magnitude frequency response by a set of 24 triangular band-pass filters to get the log energy of each filter. The positions of these filters are equally spaced along the mel frequency, described in (3):

mel(f) = 1125 ln(1 + f/700)    (3)

(c) Discrete cosine transform (DCT): Apply a DCT to the 24 log energies E_k obtained from the triangular band-pass filters to get L MFCCs:

C_m = Σ_{k=1}^{N} E_k cos(mπ(k - 0.5)/N),  m = 1, 2, ..., L    (4)

Note: in (4), N is the number of triangular band-pass filters and L is the number of MFCCs. Usually we set N = 24 and L = 12. The resulting values C_m are the wanted MFCCs.
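A compact sketch of steps (a)-(c) is given below. The placement of the filter edges through the inverse mel mapping and the small floor added before the logarithm are our assumptions; the paper does not spell out these implementation details:

    import numpy as np

    def mel(f):
        """Mel scale of (3): mel(f) = 1125 * ln(1 + f/700)."""
        return 1125.0 * np.log(1.0 + f / 700.0)

    def mel_inv(m):
        """Inverse of (3), used to place the triangular filter edges."""
        return 700.0 * (np.exp(m / 1125.0) - 1.0)

    def mfcc(frames: np.ndarray, fs: int = 16000, n_filters: int = 24,
             n_ceps: int = 12) -> np.ndarray:
        """Steps (a)-(c): FFT magnitude, triangular mel filters, log energy, DCT (4)."""
        n_fft = frames.shape[1]
        mag = np.abs(np.fft.rfft(frames, n_fft))            # (a) magnitude spectrum
        # (b) 24 triangular filters, equally spaced along the mel axis
        edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
        bins = np.floor((n_fft + 1) * edges / fs).astype(int)
        fbank = np.zeros((n_filters, mag.shape[1]))
        for i in range(n_filters):
            l, c, r = bins[i], bins[i + 1], bins[i + 2]
            fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
            fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
        E = np.log(mag @ fbank.T + 1e-10)                   # log energy per filter
        # (c) DCT of (4): C_m = sum_k E_k cos(m*pi*(k - 0.5)/N), m = 1..L
        m = np.arange(1, n_ceps + 1)[:, None]
        k = np.arange(1, n_filters + 1)[None, :]
        return E @ np.cos(m * np.pi * (k - 0.5) / n_filters).T  # (n_frames, 12)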
B. Centroid

MFCCs alone can be used as the features for speaker recognition. For better performance, we also take the centroid into consideration. The centroid in [3] is a feature of the main audio signal, and it reflects information about the fundamental frequency in each frame. First of all, we divide each frame into 32 sections. The equations to compute these features are:

RMS_{i,j} = sqrt( (1/K) Σ_{t=1}^{K} S_{i,j}(t)^2 ),   C_i = Σ_{j=1}^{32} j·RMS_{i,j} / Σ_{j=1}^{32} RMS_{i,j}    (5)

where K is the number of data points in each section, S_{i,j}(t) is the signal of section j in frame i, RMS_{i,j} is the root mean square of section j in frame i, and C_i is the centroid of frame i.
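The computation of (5) for one frame can be sketched directly (the function name and the guard against an all-zero frame are ours):

    import numpy as np

    def centroid(frame: np.ndarray, n_sections: int = 32) -> float:
        """Centroid of (5): RMS of 32 equal sections, then their weighted mean index."""
        K = len(frame) // n_sections                    # data points per section
        sections = frame[:K * n_sections].reshape(n_sections, K)
        rms = np.sqrt(np.mean(sections ** 2, axis=1))   # RMS_{i,j}, j = 1..32
        j = np.arange(1, n_sections + 1)
        return float(np.sum(j * rms) / (np.sum(rms) + 1e-10))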
IV SPEAKER CLASSIFICATION

A. LBG-VQ

The VQ (Vector Quantization) technique is widely used in text-dependent and text-independent speaker recognition systems. In 1980, Linde, Buzo, and Gray (LBG) proposed a VQ algorithm that generates a codebook from a training sequence [4]. This LBG-VQ algorithm requires an initial codebook C_1, which is set to the average of the entire training sequence. This code vector is then split into two; these two code vectors are split into four, and the process is repeated until the desired number of code vectors is obtained. The algorithm is summarized as follows.
(a) Calculate the initial codebook:

C_1 = (1/F) Σ_{f=1}^{F} X_f    (6)

where F is the number of frames and X_f is the character vector of the f-th frame. Set m = 1.

(b) Split the current codebook C_m into 2m code vectors, where m is the total number of code vectors in the current codebook, and ε > 0. In this paper ε = 0.01.

C_m^{+} = (1 + ε) C_m,   C_m^{-} = (1 - ε) C_m    (7)

(c) Classify all the character vectors according to the new codebook C in terms of the smallest Euclidean distance, and calculate the quantization distortion D_n and the relative distortion RD_n:

D_n = Σ_{f=1}^{F} min ‖X_f - C‖,   RD_n = (D_{n-1} - D_n)/D_n    (8)

If RD_n is below ε, stop the iterations; C is the wanted codebook containing 2m code vectors, so go to step (e). Otherwise, go to the next step.

(d) Update the code vectors according to the following equation, and go back to step (c):

C_j = (1/N) Σ_{X_i ∈ C_j} X_i    (9)

Note: N is the number of character vectors quantized to C_j.

(e) Repeat steps (b)-(d) until the desired number of code vectors is obtained.
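A sketch of steps (a)-(e) with the paper's settings (ε = 0.01 and, in Section V, 32 code vectors). The handling of empty cells in step (d) is our assumption, since the paper does not say what to do when a code vector attracts no character vectors:

    import numpy as np

    def lbg(X: np.ndarray, size: int = 32, eps: float = 0.01) -> np.ndarray:
        """Train a codebook on the (F, d) character-vector matrix X, steps (a)-(e)."""
        C = X.mean(axis=0, keepdims=True)                    # (a) eq. (6), m = 1
        while len(C) < size:
            C = np.vstack([(1 + eps) * C, (1 - eps) * C])    # (b) split, eq. (7)
            D_prev = np.inf
            while True:
                # (c) classify by smallest Euclidean distance; distortion of (8)
                d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
                nearest = d.argmin(axis=1)
                D = d.min(axis=1).sum()
                if (D_prev - D) / max(D, 1e-12) < eps:
                    break                                    # relative distortion below eps
                D_prev = D
                # (d) move each code vector to the mean of its cell, eq. (9)
                for j in range(len(C)):
                    if np.any(nearest == j):
                        C[j] = X[nearest == j].mean(axis=0)
        return C                                             # (e) stop at `size` vectors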
B. Speaker Recognition

First of all, obtain the codebooks of the testing sequence and of each training sequence with LBG-VQ. Suppose C_m^{i} is the m-th (m = 1, 2, ..., M) code vector in training codebook i (i = 1, 2, ..., N). Calculate the Mean Quantization Distortion (MQD for short) between the testing codebook and each training codebook; the index of the minimum MQD identifies the recognized person. In this paper the Quantization Distortion (QD) is obtained with the Chebyshev distance:

QD(X, Y) = max_l |X_l - Y_l|,   MQD(i) = (1/M) Σ_j min_{1≤m≤M} QD(C_j, C_m^{i})    (10)

where C_j is the j-th code vector of the testing codebook.

In most cases, when people speak of distance, they mean the Euclidean distance [5]. The Euclidean distance, or simply 'distance', is the root of the squared differences between the coordinates of a pair of objects.

The Chebyshev distance between two points is the maximum difference between the points in any single dimension. It may be appropriate when the difference between points is reflected more by differences in individual dimensions than by all the dimensions considered together. In this system, it performs better than the Euclidean distance.
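The decision rule of (10) can be sketched as follows, with each codebook stored as an (M, d) array of code vectors (array layout and names are ours):

    import numpy as np

    def mqd(test_cb: np.ndarray, train_cb: np.ndarray) -> float:
        """MQD of (10): mean, over the test code vectors, of the smallest Chebyshev QD."""
        # QD(X, Y) = max_l |X_l - Y_l| for every pair of code vectors
        qd = np.abs(test_cb[:, None, :] - train_cb[None, :, :]).max(axis=2)
        return float(qd.min(axis=1).mean())

    def identify(test_cb: np.ndarray, train_cbs: list) -> int:
        """The recognized speaker is the training codebook with the minimum MQD."""
        return int(np.argmin([mqd(test_cb, cb) for cb in train_cbs]))

With 13-dimensional feature vectors (12 MFCCs plus the centroid) and 32-vector codebooks as in Section V, identify returns the index of the matched training speaker.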

V SIMULATION RESULTS

The speaker identification experiments described in this section are 'closed-set' tests, in which all test speakers are included in the training database. Training data of 40 speakers were supplied in 16 kHz, 16-bit PCM format. The frame length is 20 ms and the frame shift is 10 ms. The length of each speech sequence is about 10 s. In the classification process, the number of code vectors in each codebook is 32.

A. Feature Extraction

Centroids can be used to improve recognition performance. In this system, we use the 12 MFCCs and the centroid as 13 features of speakers. In order to display the contribution of each MFCC and of the centroid, we first ensure that the inaccuracy rate of speaker identification is 0% when all 13 features are used. Then we obtain the recognition inaccuracy rate when one of the MFCCs is absent each time, both with and without the centroid. The simulation results can be seen in Fig. 2.

Fig. 2 Inaccuracy rate when the n-th MFCC is absent

From Fig. 2, the inaccuracy rate is reduced on the whole with the help of the centroid, since most of the difference values represented by '*' lie above or on the horizontal line.

B. Measurement

Over many experiments, the average accuracy rate reaches 95% when the Euclidean distance is used, while it can reach 100% when we use the Chebyshev distance. Thus the performance is better with the Chebyshev distance, as shown in Table I.

Table I ACCURACY RATE

                       Euclidean distance   Chebyshev distance
 12 MFCCs                   92.5%                97.5%
 12 MFCCs + Centroid        95%                  100%
C. Speaker Identification

The proposed algorithm based on VQ can be used in text-independent speaker recognition after the 12 MFCCs and the centroid are extracted. In order to show the performance of the proposed algorithm, we compare it with the algorithm in [6], which achieves high speed and accuracy based on LBG-VQ. The results can be seen in Table II.

Table II COMPARISON WITH REF. [6]

                      Features              Code Vectors   Measurement   Accuracy Rate
 Proposed algorithm   12 MFCCs + Centroid        32         Chebyshev        100%
 Ref. [6]             12 MFCCs + Energy          64         Euclidean        97.5%

VI CONCLUSION

In this paper, we proposed a text-independent speaker recognition system. Besides the 12 MFCCs, the centroid of each frame is extracted as a feature of speakers. In the training process, we apply the LBG-VQ algorithm to get the codebooks of the speakers. Finally, the Chebyshev distance is used as the measurement to identify different speakers. The experiments show that it is more efficient than the Euclidean distance. On the whole, the proposed system is simple in design, and it is workable in text-independent speaker recognition.

REFERENCES

[1] Eriksson, T., Kim, S., Kang, Hong-Goo, and Lee, Chungyong. "An information-theoretic perspective on feature selection in speaker recognition". IEEE Signal Processing Letters, 2005, vol. 12, pp. 500-503.
[2] Yu, K., Mason, J., and Oglesby, J. "Speaker recognition using hidden Markov models, dynamic time warping and vector quantization". IEE Proceedings - Vision, Image and Signal Processing, 1995, vol. 142, pp. 313-318.
[3] Mesaros, A., and Astola, J. "Inter-dependence of spectral measures for the singing voice". International Symposium on Signals, Circuits and Systems (ISSCS), 2005, vol. 1, pp. 307-310.
[4] Zhang Linghua, Yang Zhen, and Zheng Baoyu. "A new method to train VQ codebook for HMM based speaker identification". Proceedings of ICSP'04, 2004, vol. 1, pp. 651-654.
[5] Zhenchun Lei, Yingchun Yang, and Zhaohui Wu. "An UBM-based reference space for speaker recognition". 18th International Conference on Pattern Recognition (ICPR), 2006, vol. 4, pp. 318-321.
[6] Yang Shao, Bingzhe Liu, and Zongge Li. "A speaker recognition system using MFCC features and weighted vector quantization". Computer Engineering and Applications, 2002, vol. 5, pp. 127-128.
