Abstract—Speaker recognition is the process of identifying a speaker's identity by extracting the acoustic features in the speaker's audio. It is mainly the perception and simulation of the speaker's vocal tract information and the human ear's auditory information, and it has profound significance in human daily life and in military affairs. To this end, this paper proposes a speaker recognition algorithm based on Long Short-Term Memory networks (LSTM). The feature uses a method that combines Linear Predictive Coding (LPC) with the Log-Mel spectrum. The d-vector output by the LSTM network is classified using the Softmax loss function. The method is applied to the VCTK audio data set, and the experimental results show that its recognition rate reaches 94.9%.

Keywords—speaker recognition; LSTM; LPC; log-mel spectrum; softmax loss function

I. INTRODUCTION

Speaker recognition refers to the identification of people based on their voices. Because the voice contains a great deal of acoustic information, known as the voiceprint, and different people have unique voiceprints, it is possible to identify people through their voiceprints. Unlike traditional identification methods such as keys, passwords, etc., voiceprints are taken from people's own biological characteristics and form an identification method that can be carried anytime, anywhere and cannot be lost or leaked. The successful application of speaker recognition will make human daily life safer and more convenient. For example, people can use a mobile phone to remotely control their own items, such as cars and home appliances, without worrying about them being controlled by others; hearing aids can use it to identify the main dialogue partner and specifically amplify the target sound. In addition, speaker recognition can also be used in criminal investigation, the military and other fields.

Speaker recognition is usually divided into text-dependent and text-independent speaker recognition. Although the recognition accuracy of the former is higher than that of the latter, the application scope of text-independent speaker recognition is much larger, so this paper studies text-independent speaker recognition.

The acoustic features extracted by existing voiceprint recognition systems in the time domain are the short-term autocorrelation function, the short-term average zero-crossing rate [1], etc.; in the frequency domain there are the filter bank analysis method, the Fourier transform method, linear prediction analysis, etc.; features in the cepstrum domain are the Linear Prediction Cepstrum Coefficient (LPCC) [2] and the Mel Frequency Cepstrum Coefficient (MFCC) [3]. However, each of the above single voiceprint features is either a short-term or a long-term feature statistic. Short-term features cannot fully describe the audio scene, while long-term statistical features lose the local structure information of the voiceprint signal, which ultimately degrades the speaker recognition effect. Therefore, it is extremely important to find acoustic features that satisfy both the long-term and the short-term requirements. In [4], a feature fusion method combining MFCC and LPC is proposed, but this feature still does not reflect the characteristics of the audio spectrum. The Log-Mel feature meets the above requirements well; that is, it not only contains the features of the MFCC but also reflects the features of the audio spectrum. Therefore, this paper uses the feature fusion method combining the Log-Mel energy spectrum and LPC as the voiceprint feature.

In terms of the classification model, traditional models such as the GMM-UBM model [5], the support vector machine (SVM), the hidden Markov model (HMM) [6], the Joint Factor Analysis (JFA) model [7][8] and the i-vector model [9][10] have not achieved good results when classifying large datasets, so a deep neural network is a better choice. Currently there are Deep Neural Networks [11] and Convolutional Neural Networks (CNN) [12]. In this paper, a classification model that pays more attention to long-distance features, namely the LSTM, is used, and the d-vector is extracted from the output of the LSTM.

The research on speaker recognition in this paper mainly consists of two parts. The first part is acoustic feature extraction, including audio preprocessing, LPC feature extraction, Log-Mel feature extraction and the fusion of the two features. The second part is the establishment of the neural network for speaker recognition, including the construction of the LSTM neural network and the selection of the loss function. The Softmax loss function is very suitable for text-independent speaker recognition, and the LSTM neural network has an outstanding effect on speech classification, so we choose them to establish a neural network for voiceprint recognition.

Authorized licensed use limited to: University of Technology Sydney. Downloaded on May 23,2021 at 09:33:13 UTC from IEEE Xplore. Restrictions apply.

II. ACOUSTIC FEATURE EXTRACTION

Acoustic features play a crucial role in the training of a deep neural network for speaker recognition. Before extracting acoustic features, the audio in the datasets needs
to do some appropriate preprocessing to assist in feature extraction. In this section we introduce the audio preprocessing and acoustic feature extraction methods.

A. Audio Signal Preprocessing

The preprocessing of the speech signal consists of three operations: audio framing, windowing, and removing the unvoiced audio after framing.

A speech signal is non-stationary, but quantities such as the correlation matrix are only meaningful when the signal is stationary. However, since the process of speech production is closely related to the movement of the vocal organs in the human body, and the physical movement of the vocal organs is much slower than the sound vibration velocity, speech signals are stationary over short intervals: in a relatively short period of time (20~30 ms), the audio spectrum characteristics can be approximated as constant. It is therefore necessary to frame the audio, but the time variation of the audio signal must still be considered when framing, because the pitch may change between two adjacent frames. In that case the characteristic parameters may change considerably, so, in order to smooth the change of the characteristic parameters, we overlap adjacent frames. Considering the above points, the frame length selected in this paper is 25 ms, with a 15 ms overlap between adjacent frames.

After framing, when extracting LPC features, we compared the audio predicted by LPC with the target audio enrolled in the system and found that the prediction error at the unvoiced parts of the audio is large, while the prediction error at the voiced parts is smaller; this is expanded in detail in Section II.B. From this it can be concluded that the LPC parameters of an unvoiced segment cannot correctly model the vocal tract, so the LPC of those segments cannot be regarded as a good audio feature. Therefore, this paper chooses to delete those segments from the audio to increase the capability of the LPC feature to model the vocal tract. The audio in the figure shows the characteristics of a periodic signal in the voiced sections and the characteristics of random noise in the unvoiced sections.

The transfer function of the digital filter equivalent to the vocal tract is shown in (2):

    H(z) = G / (1 - Σ_{k=1}^{p} α_k · z^{-k})    (2)

To model the vocal tract, we assume that the speech acoustics of the nth speech sample S[n] can be considered as a combination of the p past speech samples. Therefore, the nth speech sample S[n] can be written as (3):

    S[n] = Σ_{k=1}^{p} α_k · S[n-k] + G · u[n]    (3)

where S[n-k] (k = 1, 2, 3, ..., p) are the p past speech samples, G is the gain coefficient, u[n] is the excitation corresponding to the nth speech sample, and α_k are the vocal tract filter coefficients. LPC is often called "inverse filtering" because its purpose is to determine the "all-zero filter" that is the inverse of the vocal tract model. The LPC model estimates the nth speech sample Ŝ[n] from the previous speech samples, as shown in (4):

    Ŝ[n] = Σ_{k=1}^{p} α̂_k · S[n-k]    (4)

where α̂_k are the LPC coefficients. The prediction error is given by (5):

    e[n] = S[n] - Ŝ[n]    (5)

The total squared prediction error is:

    E = Σ_{n=1}^{N} e[n]²    (6)
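As a concrete sketch of (2)–(6): the snippet below estimates the coefficients α̂_k of (4) from one frame by the autocorrelation method with the Levinson-Durbin recursion, then computes the prediction error e[n] of (5). This is an illustrative numpy implementation, not the paper's code; the order p and the test signal are assumptions.

```python
import numpy as np

def lpc_coefficients(s, p):
    """Estimate the LPC coefficients alpha_hat_1..alpha_hat_p of eq. (4)
    from frame s via the autocorrelation method (Levinson-Durbin)."""
    # Autocorrelation r[0..p] of the frame
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)   # a[1..p] hold the predictor coefficients
    err = r[0]            # running prediction-error energy, cf. eq. (6)
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a[i] = k
        a[1:i] = a[1:i] - k * a[i - 1:0:-1]
        err *= (1.0 - k * k)
    return a[1:]

def prediction_error(s, a):
    """e[n] = S[n] - sum_k alpha_hat_k S[n-k], eq. (5)."""
    p = len(a)
    s_hat = np.array([np.dot(a, s[n - p:n][::-1]) for n in range(p, len(s))])
    return s[p:] - s_hat
```

For a voiced-like (periodic) frame such as a sinusoid, even a second-order predictor is almost exact, while for a noise-like (unvoiced) frame the residual stays large — the motivation given above for discarding unvoiced segments.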
[Figure: comparison of an original audio signal x, its LPC prediction x_hat, and the prediction error err; amplitude axis from -0.50 to 1.00]

Through the extraction of Log-Mel spectral features [14], we obtain the outputs of 40 mel filters for each frame of the audio signal and finally take the base-10 logarithm of these 40 outputs, yielding the Log-Mel energy spectrum. Fig. 4 is a conversion diagram from a speaker's audio time-domain map to the Log-Mel spectrum.
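The 40-filter Log-Mel computation described above can be sketched as follows. The FFT size, Hamming window, and triangular-filter construction are common-practice assumptions (the paper only specifies 40 mel filters and a base-10 logarithm); the 25 ms frame with 10 ms hop follows from the 25 ms / 15 ms-overlap framing of Section II.A, and the fusion with a 40-dimensional LPC vector is shown as simple concatenation, consistent with the 80-dimensional fused feature reported later.

```python
import numpy as np

def frame_signal(x, frame_len=1200, hop=480):
    """Split a 48 kHz signal into 25 ms frames (1200 samples) with a
    15 ms overlap, i.e. a 10 ms hop (480 samples)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=40, n_fft=2048, sr=48000):
    """Triangular mel filters mapping an (n_fft//2 + 1)-bin power
    spectrum to n_filters mel-band energies."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):
            fb[i - 1, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):
            fb[i - 1, b] = (right - b) / max(right - center, 1)
    return fb

def log_mel_frame(frame, fb, n_fft=2048):
    """40 base-10 log mel energies for one frame."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    return np.log10(fb @ spec + 1e-10)  # small floor avoids log10(0)

# Fusion into the 80-dim voiceprint feature: 40-dim Log-Mel + 40-dim LPC.
# `lpc_vec` is a placeholder here; in the paper it comes from LPC analysis.
fb = mel_filter_bank()
frames = frame_signal(np.random.randn(48000))   # 1 s of dummy audio
lpc_vec = np.zeros(40)
fused = np.concatenate([log_mel_frame(frames[0], fb), lpc_vec])  # (80,)
```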
where σ(x) = 1 / (1 + e^{-x}) is the sigmoid function, S_{ji,j} and S_{ji,k} are similarity-matrix entries, and e_{ji} is the d-vector. Therefore, we choose the Softmax loss function to cooperate with the neural network. The core part of the loss function and the algorithm of the similarity matrix are described below.

[Figure: a speaker's waveform (amplitude over 0–2.0 s) passes through feature extraction and the LSTM network to yield a 20-dimensional feature map over about 200 frames]
Figure 6. Diagram of conversion from features to embedding vectors

During batch processing, N denotes the number of different speakers and M the number of audio sentences per speaker. The Log-Mel spectral features and LPC features are extracted and combined for each sentence, and the feature vector x_{ji} (1 ≤ j ≤ N, 1 ≤ i ≤ M) represents the feature extracted from the ith sentence of the jth speaker. Feeding x_{ji} into the neural network, we define the output of the entire network as f(x_{ji}; w), where w denotes all coefficients in the neural network (including the LSTM layer and the linear layer); the d-vector is then obtained by L2 normalization of the network output, as shown in (10):

    e_{ji} = f(x_{ji}; w) / ||f(x_{ji}; w)||_2    (10)

where e_{ji} represents the embedding vector of the ith sentence of the jth speaker. The center of the embedding vectors e_{j1}, ..., e_{jM} of the jth speaker is c_j:

    c_j = (1/M) Σ_{m=1}^{M} e_{jm}    (11)

The similarity matrix S_{ji,k} is defined as the weighted cosine similarity of each embedding vector e_{ji} to all centers c_k (1 ≤ j, k ≤ N, 1 ≤ i ≤ M):

    S_{ji,k} = w · cos(e_{ji}, c_k) + b,  w > 0    (12)

where w and b are learnable parameters. We want the weighted cosine similarity to be proportional to the similarity of the vectors, so we constrain the weight to be positive: w > 0.

Fig. 7 illustrates the process in terms of features, embedding vectors and cosine similarity matrices, where different speakers are represented in different colors. In the similarity matrix of Fig. 7, a colored label is the weighted cosine similarity value of one utterance of a speaker to that speaker's own embedding vector center, and is called a positive label. A colorless label is the weighted cosine similarity value of an utterance of a speaker to another speaker's embedding vector center, and is called a negative label. The purpose of the loss function is to make the positive labels as large as possible and the negative labels as small as possible during training, so as to achieve voiceprint recognition.

[Figure: the audio of three speakers (Spk 1–3) passes through feature extraction and the LSTM network to embeddings e11, e21, e31 and centers c1, c2, c3, yielding a similarity matrix with positive and negative labels]
Figure 7. The generation process of the similarity matrix

IV. DATASETS AND EXPERIMENTS

This section explains three things: first, the specific parameters of the VCTK data set used in this paper; second, the division of the training set and test set; third, the specific steps and experimental results of the experiments.

A. Datasets

In this experiment, the VCTK audio dataset was adopted, which contains 109 speakers. Each speaker contributed 300–500 recordings, for a total of 44,242 recordings for model training. The sampling rate of the monaural audio was 48 kHz, the sampling accuracy was 16 bits, and the length of the recordings varied from 5 to 10 seconds. Of the 109 speakers, the ratio of speakers used for training to speakers used for testing is approximately 8:2, that is, 90 speakers for training and 19 speakers for testing.

B. Experiments

During the experiment, we carry out voice activity detection on the audio after removing the unvoiced segments; the non-silent audio fragments are taken out as effective audio fragments.

After obtaining all effective audio clips, we frame them, then extract and combine the features according to Section III. Although the dimension axis of each audio clip feature is 80-dimensional, the length along the frame axis is not uniform, which makes training difficult. Therefore, the frame axis of each effective audio clip feature should be cropped: we extracted the first 180 frames and the last 180 frames for saving, as shown in Fig. 8.

Fig. 8 shows the audio clipping graph of four recordings of one speaker. Each recording is divided into a different number of effective audio segments through speech activity detection. Each effective audio segment is then clipped
into two 180-frame pieces for preservation, so as to achieve the goal of a uniform frame-axis length.

[Figure: four recordings (Audio1–Audio4) are divided by voice activity detection into effective segments, each split into 180-frame pieces; panels labeled "Audios before split" and "Audios after split"]
Figure 8. Segmentation of effective audio segments

When putting the audio features into the LSTM for training, we set N = 4 and M = 5 to form a batch. We apply Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01 to train the network, reducing the rate by half every 100k steps. For the two hyperparameters (w, b) in the loss function, a relatively good initial value found by experiment is (w, b) = (10, -5).

In the testing phase, 19 speakers were tested, each with an average of 3.5 sentence features for registration and an average of 3.5 sentences for evaluation. Table I shows the performance comparison of the algorithm under different features. The first column is the Equal Error Rate (EER) when the feature is only 40-dimensional LPC, the second column is the EER when the feature is only 40-dimensional Log-Mel, and the third column is the EER of the 80-dimensional feature obtained by fusing LPC and Log-Mel. As shown in Table I, the error rate of the fused feature is smaller than that of LPC or Log-Mel alone, and the error rate is reduced by more than 2%.

TABLE I. EER UNDER THREE FEATURES

                       Three Features
           LPC (40 dim)   Log-Mel (40 dim)   LPC + Log-Mel (80 dim)
EER (%)    9.14           7.89               5.10

V. CONCLUSION

In this paper, an LSTM-based speaker recognition experiment was performed on the VCTK dataset. The feature used was a fusion of 40-dimensional LPC and 40-dimensional Log-Mel features, and the EER was 5.10%. Compared with the 40-dimensional Log-Mel feature, the EER is reduced by 2.79%; compared with the 40-dimensional LPC feature, it is reduced by 4.04%, which significantly improves the accuracy of speaker recognition.

REFERENCES

[1] H. Sun, H. Long, Y. Shao and Q. Du, "Speech music classification algorithm based on zero-crossing rate and frequency spectrum," Journal of Yunnan University (Natural Science Edition), 2019, 41(5): 925-931.
[2] X. Huang, L. Zhang, L. Cao, F. Wang, S. Zhang, "Research on Intelligent Speech Recognition Method Based on LPCC in Different Frequency Bands," Electronic Design Engineering, Jan. 2020, 28(2): 22-25.
[3] Q. Guo, B. Tao, "SOPC design of voiceprint feature extraction based on multi-parameter improved MFCC," Qiqihar University, Jun. 2016: 5-8.
[4] A. Chowdhury, A. Ross, "Fusing MFCC and LPC Features using 1D Triplet CNN for Speaker Recognition in Severely Degraded Audio Signals," IEEE Transactions on Information Forensics and Security, Dec. 2020, 15(1): 1616-1629.
[5] S. Yuan, C. Sun, H. Yang, "Recognition of Aircraft Engine Sound Based on GMM-UBM Model," International Conference on Electronic Information Technology and Computer Engineering, 2017, 7(8): 781-787.
[6] J. Yang, "Research on Recognition Method of Fundamental Frequency of Musical Notes Based on HMM Model," Bulletin of Science and Technology, Nov. 2019, 35(11): 109-112.
[7] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, 2000, 10(1-3): 19-41.
[8] M. Hebert, "Text-dependent speaker recognition," Heidelberg: Springer, 2008: 743-762.
[9] A. Kanagasundaram, R. Vogt, D. Dean, "i-Vector based speaker recognition on short utterances," Interspeech 2011, Aug. 27-31, 2011: 2341-2344.
[10] Z. Wu, S. Pan, "CNN-based continuous voice speaker voiceprint recognition," Telecommunications Science, 2017, (3): 59-66.
[11] J. Rohdin, A. Silnova, M. Diez, O. Plchot, L. Burget, "End-to-end DNN based text-independent speaker recognition for long and short utterances," Computer Speech and Language, 2020, 59: 22-25.
[12] S. Vegad, H. Patel, H. Zhuang and M. Naik, "Audio-Visual Person Recognition Using Deep Convolutional Neural Networks," Journal of Biometrics & Biostatistics, 2017, 8(5): 1-5.
[13] J. S. Erkelens, "Autoregressive modelling for speech coding: estimation, interpolation and quantisation," 1996: 3-10.
[14] F. Sun, M. Wang, Q. Xu, X. Xuan, X. Zhang, "Acoustic Scene Recognition Based on Convolutional Neural Networks," International Conference on Signal and Image Processing (ICSIP), 2019: 2-3.