
2020 IEEE 5th International Conference on Signal and Image Processing

Speaker Recognition Based on Long Short-Term Memory Networks



Qihang Xu, Mingjiang Wang, Changlai Xu, Lu Xu


Department of Electronics and Information Engineering
Harbin Institute of Technology, Shenzhen
Shenzhen, China
e-mail: xqh447181905@163.com

Abstract—Speaker recognition is the process of identifying a speaker by extracting and recognizing the acoustic features in the speaker's audio. It is essentially the perception and simulation of the speaker's vocal tract information and of the human ear's auditory information, and it has profound significance in daily life and in military applications. To address this problem, this paper proposes a speaker recognition algorithm based on Long Short-Term Memory networks (LSTM). The feature uses a method that combines Linear Predictive Coding (LPC) with the Log-Mel spectrum. The d-vector output by the LSTM network is classified using the Softmax loss function. The method is applied to the VCTK audio dataset, and the experimental results show that its recognition rate reaches 94.9%.

Keywords-speaker recognition; LSTM; LPC; log-mel spectrum; softmax loss function
I. INTRODUCTION

Speaker recognition refers to the identification of people based on their voices. Because the voice contains a large amount of acoustic information, called the voiceprint, and different people have unique voiceprints, it is possible to identify a person through his or her voiceprint. Unlike traditional identification methods such as keys and passwords, voiceprints are taken from people's own biological characteristics: they can be carried anytime and anywhere and cannot be lost or leaked. The successful application of speaker recognition will make daily life safer and more convenient. For example, people can use a mobile phone to remotely control their own items, such as cars and home appliances, without worrying about them being controlled by others, and hearing aids can use it to identify the main dialogue partner and specifically amplify the target sound. In addition, speaker recognition can also be used in criminal investigation, military and other fields.

Speaker recognition is usually divided into text-dependent and text-independent speaker recognition. Although the recognition accuracy of the former is higher than that of the latter, the application scope of text-independent speaker recognition is much larger. Therefore, this paper studies text-independent speaker recognition.

The acoustic features used by existing voiceprint recognition systems include, in the time domain, the short-term autocorrelation function and the short-term average zero-crossing rate [1]; in the frequency domain, filter bank analysis, the Fourier transform and linear prediction analysis; and in the cepstrum domain, the Linear Prediction Cepstrum Coefficients (LPCC) [2] and the Mel Frequency Cepstrum Coefficients (MFCC) [3]. However, each of these single voiceprint features is either a short-term feature or a long-term feature statistic. Short-term features cannot fully describe the audio scene, while long-term statistical features lose the local structure information of the voiceprint signal, which ultimately degrades the speaker recognition performance. It is therefore extremely important to find acoustic features that satisfy both long-term and short-term requirements. In [4], a feature fusion method combining MFCC and LPC is proposed, but this feature still does not reflect the characteristics of the audio spectrum. The Log-Mel feature can meet the above requirements: it not only contains the information of the MFCC but also reflects the characteristics of the audio spectrum. Therefore, this paper uses the fusion of the Log-Mel energy spectrum and LPC as the voiceprint feature.

In terms of the classification model, traditional models such as the GMM-UBM model [5], the support vector machine (SVM), the hidden Markov model (HMM) [6], the Joint Factor Analysis (JFA) model [7][8] and the i-vector model [9], [10] have not achieved good results when classifying a large number of speakers, so a deep neural network is a better choice. Existing approaches include Deep Neural Networks [11] and Convolutional Neural Networks (CNN) [12]. In this paper, a classification model that pays more attention to long-range features, namely the LSTM, is used, and the d-vector is extracted from the output of the LSTM.

The research on speaker recognition in this paper mainly consists of two parts. The first part is acoustic feature extraction, including audio preprocessing, LPC feature extraction, Log-Mel feature extraction and the fusion of the two features. The second part is the establishment of the neural network for speaker recognition, including the construction of the LSTM neural network and the selection of the loss function. The Softmax loss function is very suitable for text-independent speaker recognition, and the LSTM neural network has an outstanding effect on speech classification, so we choose them to establish the neural network for voiceprint recognition.

II. ACOUSTIC FEATURE EXTRACTION


Acoustic features play a crucial role in the training of deep neural networks for speaker recognition. Before extracting acoustic features, the audio in the datasets needs some appropriate preprocessing to assist feature extraction. In this section, we introduce the audio signal preprocessing and the acoustic feature extraction methods.

A. Audio Signal Preprocessing

There are three steps in the preprocessing of the audio signal: framing, removing the unvoiced frames after framing, and windowing. A speech signal is non-stationary, but statistics such as the correlation matrix are only meaningful when the signal is stationary. However, since the process of speech production is closely related to the movement of the vocal organs, and the vocal organs move much more slowly than the sound vibrates, speech signals are short-term stationary: within a relatively short period of time (20-30 ms) the audio spectrum can be regarded as approximately constant. It is therefore necessary to frame the audio, while the time-varying nature of the signal must still be considered, because the pitch may change between two adjacent frames and the characteristic parameters may then change abruptly. To smooth the variation of the characteristic parameters, adjacent frames are overlapped. Considering the above points, the frame length selected in this paper is 25 ms, with a 15 ms overlap between consecutive frames.

After framing, when extracting LPC features, we compared the audio predicted by LPC with the target audio enrolled in the system and found that the prediction error is large in the unvoiced parts of the audio and smaller in the voiced parts, which is discussed in detail in Section II.B. From this it can be concluded that the LPC parameters of unvoiced segments cannot correctly model the vocal tract, so the LPC of those segments cannot be regarded as a good audio feature. Therefore, this paper deletes those segments from the audio to strengthen the ability of the LPC feature to model the vocal tract. The audio shows the characteristics of a periodic signal in voiced sections and the characteristics of random noise in unvoiced sections, and the short-time zero-crossing rate of random noise is higher than that of a periodic signal. Thus, unvoiced segments are eliminated by thresholding the short-time zero-crossing rate: we choose 0.06 as the threshold and delete every frame whose short-time zero-crossing rate is greater than this threshold.

After the unvoiced segments are removed, the next step is windowing. In this paper, a Hamming window is applied to the audio signal; its formula is shown in (1):

$$W[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)$$

After the windowing operation, the preprocessing of the audio is finished.
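As a concrete illustration of the preprocessing described above, the sketch below frames an audio signal into 25 ms frames with a 15 ms overlap (10 ms hop), discards frames whose short-time zero-crossing rate exceeds 0.06, and applies a Hamming window. This is a minimal NumPy sketch of our reading of the procedure, not the authors' code; the function and variable names, and the exact normalization of the zero-crossing rate, are our own assumptions.

```python
import numpy as np

def preprocess(signal, sr, frame_ms=25, hop_ms=10, zcr_threshold=0.06):
    """Frame, drop unvoiced frames by zero-crossing rate, and window the audio."""
    frame_len = int(sr * frame_ms / 1000)          # 25 ms frame length
    hop_len = int(sr * hop_ms / 1000)              # 10 ms hop -> 15 ms overlap
    # Assumes the signal is at least one frame long.
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])

    # Short-time zero-crossing rate: fraction of sign changes per frame.
    signs = np.sign(frames)
    zcr = 0.5 * np.mean(np.abs(np.diff(signs, axis=1)), axis=1)

    # Keep only the (voiced) frames whose ZCR does not exceed the threshold.
    voiced = frames[zcr <= zcr_threshold]

    # Apply the Hamming window of Eq. (1) to every remaining frame.
    window = np.hamming(frame_len)
    return voiced * window
```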
B. LPC Feature Extraction

The human vocal tract can be modeled by a time-varying digital filter and approximated by an all-pole filter. Therefore, the transfer function of the digital filter equivalent to the vocal tract is shown in (2):

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}} \qquad (2)$$

To model the vocal tract, we assume that the nth speech sample $S[n]$ can be regarded as a combination of the p past speech samples. Therefore, $S[n]$ can be written as (3):

$$S[n] = \sum_{k=1}^{p} \alpha_k S[n-k] + G\,u[n] \qquad (3)$$

where $S[n-k]$ $(k = 1, 2, 3, \dots, p)$ are the p past speech samples, $G$ is the gain coefficient, $u[n]$ is the excitation of the nth speech sample, and $\alpha_k$ are the vocal tract filter coefficients. LPC is often called "inverse filtering" because its purpose is to determine the "all-zero filter" that is the inverse of the vocal tract model. The LPC model estimates the nth speech sample $\hat{S}[n]$ from the previous speech samples, as shown in (4):

$$\hat{S}[n] = \sum_{k=1}^{p} \hat{\alpha}_k S[n-k] \qquad (4)$$

where $\hat{\alpha}_k$ are the LPC coefficients. The prediction error is given by (5):

$$e[n] = S[n] - \hat{S}[n] \qquad (5)$$

and the energy of the prediction error is:

$$E = \sum_{n=1}^{N} (e[n])^2 \qquad (6)$$

By using auto-regressive modelling [13] to minimize the energy $E$, we obtain the inverse vocal tract filter model, and its filter coefficients are defined as the LPC model parameters $\hat{\alpha}_k$. In this paper, the 40-dimensional LPC coefficients are extracted as the input feature.
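The 40-dimensional LPC coefficients of a frame can be obtained by minimizing (6) with the autocorrelation method and the Levinson-Durbin recursion. The sketch below shows one way to do this with librosa, whose `lpc` routine solves the same autoregressive problem; the order of 40 matches the feature dimension used in this paper. This is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
import librosa

def lpc_features(frames, order=40):
    """Compute LPC coefficients for each windowed frame.

    librosa.lpc returns [1, -a_1, ..., -a_order]; we drop the leading 1
    so that each frame yields a 40-dimensional feature vector.
    """
    feats = []
    for frame in frames:
        a = librosa.lpc(frame.astype(float), order=order)
        feats.append(-a[1:])           # predictor coefficients alpha_hat_k
    return np.stack(feats)             # shape: (n_frames, order)
```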


The audio of each frame is restored by LPC, and the entire audio is reconstructed by the overlap-add method. The blue curve in Fig. 2 shows the waveform of the target audio enrolled in the system, and the orange dotted line shows the predicted audio waveform. Correspondingly, the error obtained by subtracting the predicted audio signal from the target audio signal according to (5) is shown by the green curve in Fig. 1.

As shown in Fig. 1, there are parts of the prediction error whose amplitude is similar to, or even larger than, that of the target audio enrolled in the system. These parts correspond to the unvoiced segments mentioned in Section II.A.

Figure 1. Restoration of audio and error waveform (target audio x, predicted audio x_hat, prediction error err).

The target audio enrolled in the system, the predicted audio signal and the error signal obtained after the above LPC feature extraction and audio restoration are shown in Fig. 2:

Figure 2. Predicted waveform after removing unvoiced segments.

Through this operation, the LPC parameters can more effectively reflect the characteristics of the speaker's vocal tract, making the LPC features more effective for voiceprint recognition.

C. Log-Mel Feature Extraction

There is no linear relationship between the pitch perceived by the human ear and the actual frequency; the Mel frequency is more in line with the auditory characteristics of the human ear, that is, the scale is linear below 1000 Hz and grows logarithmically above 1000 Hz. The mapping between the Mel frequency and the frequency in Hz is shown in (7):

$$f_{mel} = 2595 \cdot \lg\left(1 + \frac{f}{700\,\mathrm{Hz}}\right) \qquad (7)$$

Log-Mel spectral features are used to model the human auditory perception system. Therefore, we also choose Log-Mel spectral features to represent one of the features of the perceived speech in a given audio sample. The process of extracting Log-Mel spectral features is shown in Fig. 3:

Figure 3. Log-Mel spectrum extraction process.

Through the extraction of Log-Mel spectral features [14], each frame of the audio signal yields the outputs of 40 Mel filters; taking the base-10 logarithm of these 40 outputs then gives the Log-Mel energy spectrum. Fig. 4 shows the conversion of a speaker's audio from the time-domain waveform to the Log-Mel spectrum.

Figure 4. Time domain map to Log-Mel spectrum.
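The Log-Mel extraction just described can be reproduced with a standard mel filter bank. The sketch below uses librosa to compute a 40-band mel power spectrogram with the 25 ms frames and 10 ms hop from Section II.A and then takes the base-10 logarithm; the parameter names and the small flooring constant are our own choices, not taken from the paper.

```python
import numpy as np
import librosa

def log_mel_features(signal, sr, n_mels=40, frame_ms=25, hop_ms=10):
    """40-band Log-Mel energy spectrum, one vector per frame."""
    n_fft = int(sr * frame_ms / 1000)     # 25 ms analysis window
    hop = int(sr * hop_ms / 1000)         # 10 ms hop (15 ms overlap)
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=n_fft, hop_length=hop,
        win_length=n_fft, n_mels=n_mels, power=2.0)
    # Base-10 logarithm of the 40 filter outputs (floored to avoid log(0)).
    return np.log10(np.maximum(mel, 1e-10)).T   # shape: (n_frames, 40)
```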
D. Feature Fusion of LPC and Log-Mel Spectrum

As mentioned above, LPC captures the characteristics of the speaker's vocal tract model, while the Log-Mel spectrum captures the auditory characteristics of the human ear, that is, the characteristics of voice perception. The voice features captured by the two are complementary in nature. Therefore, we combine the LPC and Log-Mel spectral features to uniquely represent the speech features of the speaker. In this paper, a feature-level fusion algorithm based on the LSTM is designed: the speech features in the LPC and Log-Mel spectral feature spaces are combined and projected into a d-dimensional joint feature space, where the value of d depends on the LSTM network architecture, as shown in Fig. 5.

Figure 5. The mapping principles between time domain and Log-Mel domain (frame-level correspondence of the 40-dimensional Log-Mel spectrum features and the 40-dimensional LPC features into an 80-dimensional mixed feature).

The speech features obtained by feature fusion are highly discriminative, which improves the performance of voiceprint recognition.
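At the feature level, the fusion amounts to concatenating, frame by frame, the 40-dimensional Log-Mel vector and the 40-dimensional LPC vector into an 80-dimensional input for the LSTM. A minimal sketch under that reading is shown below; it recomputes the Log-Mel bands directly on the preprocessed frames so the two streams stay frame-aligned, and the function names (including the hypothetical `preprocess` helper from the earlier example) are our own.

```python
import numpy as np
import librosa

def fused_features(frames, sr, order=40, n_mels=40):
    """Frame-wise fusion: 40-d Log-Mel + 40-d LPC -> 80-d vectors.

    `frames` are the preprocessed (voiced, Hamming-windowed) frames, so both
    feature streams are computed on exactly the same frames.
    """
    frame_len = frames.shape[1]
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)

    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # per-frame power spectrum
    logmel = np.log10(np.maximum(spec @ mel_fb.T, 1e-10))  # (n_frames, 40)

    lpc = np.stack([-librosa.lpc(f.astype(float), order=order)[1:] for f in frames])
    return np.concatenate([logmel, lpc], axis=1)           # (n_frames, 80)
```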
III. ESTABLISHMENT OF SPEAKER RECOGNITION NEURAL NETWORK

A. Construction of LSTM Neural Network

The neural network structure used in this paper is a three-layer LSTM. The dimension of the embedding vector (d-vector) is the same as the projection dimension of the LSTM. For the text-independent speaker recognition studied in this paper, we use 768 hidden nodes and a projection dimension of 256 to complete the conversion from audio features to d-vectors, as shown in Fig. 6.
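A PyTorch sketch of such a network is given below: a three-layer LSTM with 768 hidden units and a 256-dimensional projection, whose last-frame output is L2-normalized to form the d-vector. This reflects our reading of the description above using PyTorch's `proj_size`; a separate linear projection layer would be an equally valid reading of the "LSTM layer and linear layer" mentioned later, and the class and variable names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVectorLSTM(nn.Module):
    """3-layer LSTM speaker encoder: 80-d fused features -> 256-d d-vector."""
    def __init__(self, input_dim=80, hidden_dim=768, proj_dim=256, num_layers=3):
        super().__init__()
        # proj_size adds a learned projection of each hidden state to 256 dims.
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            proj_size=proj_dim, batch_first=True)

    def forward(self, x):                    # x: (batch, frames, 80)
        out, _ = self.lstm(x)                # out: (batch, frames, 256)
        d = out[:, -1, :]                    # take the last frame's output
        return F.normalize(d, p=2, dim=1)    # L2-normalized d-vector, Eq. (10)
```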
B. Selection of Loss Function

After experiments, we found that the Softmax loss function (8) works very well for the speaker recognition task in this paper:

$$L(e_{ji}) = -S_{ji,j} + \log \sum_{k=1}^{N} \exp(S_{ji,k}) \qquad (8)$$


where $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function, $S_{ji,j}$ and $S_{ji,k}$ are elements of the similarity matrix, and $e_{ji}$ is the d-vector. Therefore, we choose the Softmax loss function to work together with the neural network. The core part of the loss function and the algorithm of the similarity matrix are described below.

Figure 6. Diagram of conversion from features to embedding vectors.

This paper uses $N \times M$ utterances to form a batch during batch processing, where $N$ is the number of different speakers and $M$ is the number of audio sentences for each speaker. The Log-Mel spectral features and LPC features are extracted and combined for each sentence, and the feature vector $x_{ji}$ $(1 \le j \le N,\ 1 \le i \le M)$ represents the feature extracted from the ith sentence of the jth speaker. Inputting $x_{ji}$ into the neural network, we define the output of the entire network as $f(x_{ji}; w)$, where $w$ denotes all the coefficients of the neural network (including the LSTM layers and the linear layer); the d-vector is then obtained by L2 normalization of the network output, as shown in (10):

$$e_{ji} = \frac{f(x_{ji}; w)}{\lVert f(x_{ji}; w) \rVert_2} \qquad (10)$$

where $e_{ji}$ represents the embedding vector of the ith sentence of the jth speaker. The center of the embedding vectors $[e_{j1}, \dots, e_{jM}]$ of the jth speaker is $c_j$:

$$c_j = \frac{1}{M} \sum_{m=1}^{M} e_{jm} \qquad (11)$$

The similarity matrix $S_{ji,k}$ is defined as the weighted cosine similarity between each embedding vector $e_{ji}$ and all centers $c_k$ $(1 \le j, k \le N,\ 1 \le i \le M)$:

$$S_{ji,k} = w \cdot \cos(e_{ji}, c_k) + b, \quad w > 0 \qquad (12)$$

where $w$ and $b$ are learnable parameters. We want the cosine similarity to be proportional to the similarity of the vectors, so we constrain the weight to be positive, that is, $w > 0$.

Fig. 7 illustrates the process in terms of features, embedding vectors and the cosine similarity matrix, where different speakers are represented by different colors. In the similarity matrix of Fig. 7, a colored label is the weighted cosine similarity between one utterance of a speaker and that speaker's own embedding-vector center, and is called a positive label. A colorless label is the weighted cosine similarity between an utterance of a speaker and the embedding-vector center of another speaker, and is called a negative label. The purpose of the loss function is to make the positive labels as large as possible and the negative labels as small as possible during training, so as to achieve the purpose of voiceprint recognition.

Figure 7. The generation process of the similarity matrix.
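The computation in (10)-(12) and (8) corresponds to the softmax variant of a generalized end-to-end speaker loss. The sketch below implements a plain reading of those formulas for a batch of N x M d-vectors; it does not exclude an utterance from its own centroid (the paper does not mention doing so), and the tensor and function names are our own.

```python
import torch
import torch.nn.functional as F

def softmax_speaker_loss(d_vectors, w, b):
    """Softmax speaker loss for d_vectors of shape (N, M, D).

    w and b are the learnable scalars of Eq. (12); w is kept positive.
    """
    N, M, D = d_vectors.shape
    e = F.normalize(d_vectors, p=2, dim=2)             # Eq. (10)
    c = F.normalize(e.mean(dim=1), p=2, dim=1)         # unit centroids c_j, Eq. (11)

    # Cosine similarity of every e_ji to every centroid c_k -> (N, M, N)
    cos = torch.einsum('jid,kd->jik', e, c)
    S = torch.clamp(w, min=1e-6) * cos + b             # Eq. (12), w > 0

    # Eq. (8): -S_ji,j + log sum_k exp(S_ji,k), averaged over the batch
    target = torch.arange(N, device=e.device).view(N, 1).expand(N, M)
    loss = -S.gather(2, target.unsqueeze(2)).squeeze(2) + torch.logsumexp(S, dim=2)
    return loss.mean()
```

In training, w and b would be scalar nn.Parameter values initialized as described in Section IV.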
IV. DATASETS AND EXPERIMENTS

This section explains three things: first, the specific parameters of the VCTK dataset used in this paper; second, the division of the training set and the test set; third, the specific steps and experimental results of the experiments in this paper.

A. Datasets

In this experiment, the VCTK audio dataset was adopted, which contains 109 speakers. Each speaker contributed 300-500 recordings, for a total of 44,242 recordings used for model training. The sampling rate of the monaural audio is 48 kHz, the sampling accuracy is 16 bits, and the length of the recordings varies from 5 to 10 seconds. Of the 109 speakers, the ratio of speakers used for training to speakers used for testing is approximately 8:2, that is, 90 speakers are used for training and 19 speakers for testing.

B. Experiments

During the experiments, voice activity detection is applied to the audio after the unvoiced segments have been removed, and the non-silent audio fragments are taken out; these are the effective audio fragments.

After obtaining all effective audio clips, they are framed, and then the features are extracted and fused as described in Section II. Although the feature-axis dimension of each audio clip is 80, the length along the frame axis is not uniform, which makes training difficult. Therefore, the frame axis of each effective audio clip's features is cropped: we extract the first 180 frames and the last 180 frames and save them, as shown in Fig. 8.


Fig. 8 shows the clipping of four recordings from one speaker. Each recording is divided into a different number of effective audio segments by voice activity detection, and each effective audio segment is then clipped into two 180-frame segments for preservation, so as to achieve a uniform frame-axis length.

Figure 8. Segmentation of effective audio segments.
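A small sketch of this cropping step is given below: every effective segment whose fused feature matrix has at least 180 frames contributes its first 180 and its last 180 frames as two fixed-length training examples. The function name and the handling of shorter segments are our own assumptions.

```python
import numpy as np

def crop_segments(segment_features, n_frames=180):
    """Turn variable-length (T, 80) feature matrices into fixed 180-frame crops."""
    crops = []
    for feat in segment_features:          # each feat has shape (T, 80)
        if feat.shape[0] < n_frames:       # too short to crop; skip it
            continue
        crops.append(feat[:n_frames])      # first 180 frames
        crops.append(feat[-n_frames:])     # last 180 frames
    return np.stack(crops) if crops else np.empty((0, n_frames, 80))
```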
When putting the audio features into the LSTM for training, we set N = 4 and M = 5 to form a batch. Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01 is applied to train the network, and the learning rate is reduced by half every 100k steps. For the two hyperparameters (w, b) in the loss function, a relatively good initial value found by experiment is (w, b) = (10, -5).
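The training configuration described above could be wired up as in the sketch below, reusing the hypothetical DVectorLSTM model and softmax_speaker_loss from the earlier examples; the random batch stands in for a real sampler that draws N speakers with M feature crops each, and all names are ours.

```python
import torch
import torch.nn as nn

# Model and learnable loss scalars, initialized as described above.
model = DVectorLSTM()                                  # 3-layer LSTM, 768 hidden / 256 proj
w = nn.Parameter(torch.tensor(10.0))                   # similarity scale of Eq. (12)
b = nn.Parameter(torch.tensor(-5.0))                   # similarity bias of Eq. (12)

optimizer = torch.optim.SGD(list(model.parameters()) + [w, b], lr=0.01)
# Halve the learning rate every 100k steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.5)

N, M = 4, 5                                            # speakers x utterances per batch
for step in range(1000):                               # stand-in loop; a real sampler would
    batch = torch.randn(N, M, 180, 80)                 # draw N speakers with M crops each
    d_vectors = model(batch.view(N * M, 180, 80)).view(N, M, -1)
    loss = softmax_speaker_loss(d_vectors, w, b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```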
In the testing phase, 19 speakers were tested, each with an average of 3.5 sentence features for registration and an average of 3.5 sentences for evaluation. Table I shows the performance of the algorithm under different features: the first column is the Equal Error Rate (EER) when the feature is only the 40-dimensional LPC, the second column is the EER when the feature is only the 40-dimensional Log-Mel spectrum, and the third column is the EER for the 80-dimensional feature obtained by fusing LPC and Log-Mel. As shown in Table I, the error rate of the fused feature is smaller than that of LPC or Log-Mel alone, and the error rate is reduced by more than 2%.

TABLE I. EER UNDER THREE FEATURES

           LPC (40 dim)    Log-Mel (40 dim)    LPC + Log-Mel (80 dim)
  EER (%)      9.14             7.89                  5.10

V. CONCLUSION

In this paper, an LSTM-based speaker recognition experiment was performed on the VCTK dataset. The feature used was the fusion of the 40-dimensional LPC and the 40-dimensional Log-Mel features, and the EER was 5.10%. Compared with the 40-dimensional Log-Mel feature, the EER is reduced by 2.79%; compared with the 40-dimensional LPC feature, the EER is reduced by 4.04%, which significantly improves the accuracy of speaker recognition.

REFERENCES

[1] H. Sun, H. Long, Y. Shao, and Q. Du, "Speech music classification algorithm based on zero-crossing rate and frequency spectrum," Journal of Yunnan University (Natural Science Edition), vol. 41, no. 5, pp. 925-931, 2019.
[2] X. Huang, L. Zhang, L. Cao, F. Wang, and S. Zhang, "Research on intelligent speech recognition method based on LPCC in different frequency bands," Electronic Design Engineering, vol. 28, no. 2, pp. 22-25, Jan. 2020.
[3] Q. Guo and B. Tao, "SOPC design of voiceprint feature extraction based on multi-parameter improved MFCC," Qiqihar University, pp. 5-8, Jun. 2016.
[4] A. Chowdhury and A. Ross, "Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1616-1629, 2020.
[5] S. Yuan, C. Sun, and H. Yang, "Recognition of aircraft engine sound based on GMM-UBM model," International Conference on Electronic Information Technology and Computer Engineering, vol. 7, no. 8, pp. 781-787, 2017.
[6] J. Yang, "Research on recognition method of fundamental frequency of musical notes based on HMM model," Bulletin of Science and Technology, vol. 35, no. 11, pp. 109-112, Nov. 2019.
[7] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.
[8] M. Hebert, "Text-dependent speaker recognition," Heidelberg: Springer, 2008, pp. 743-762.
[9] A. Kanagasundaram, R. Vogt, and D. Dean, "i-vector based speaker recognition on short utterances," in Proc. Interspeech 2011, Aug. 2011, pp. 2341-2344.
[10] Z. Wu and S. Pan, "CNN-based continuous voice speaker voiceprint recognition," Telecommunications Science, no. 3, pp. 59-66, 2017.
[11] J. Rohdin, A. Silnova, M. Diez, O. Plchot, and L. Burget, "End-to-end DNN based text-independent speaker recognition for long and short utterances," Computer Speech and Language, vol. 59, pp. 22-25, 2020.
[12] S. Vegad, H. Patel, H. Zhuang, and M. Naik, "Audio-visual person recognition using deep convolutional neural networks," Journal of Biometrics & Biostatistics, vol. 8, no. 5, pp. 1-5, 2017.
[13] J. S. Erkelens, "Autoregressive modelling for speech coding: estimation, interpolation and quantisation," 1996, pp. 3-10.
[14] F. Sun, M. Wang, Q. Xu, X. Xuan, and X. Zhang, "Acoustic scene recognition based on convolutional neural networks," in International Conference on Signal and Image Processing (ICSIP), 2019, pp. 2-3.


