
Circuits, Systems, and Signal Processing

https://doi.org/10.1007/s00034-019-01092-3

A Speaker Verification Method Based on TDNN–LSTMP

Hui Liu1,2 · Longlian Zhao1,2

Received: 8 May 2018 / Revised: 4 March 2019 / Accepted: 12 March 2019


© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
In speaker recognition, a robust recognition method is essential. This paper proposes
a speaker verification method that is based on the time-delay neural network (TDNN)
and long short-term memory with recurrent projection layer (LSTMP) model for the
speaker modeling problem in speaker verification. In this work, we present the appli-
cation of the fusion of TDNN and LSTMP to the i-vector speaker recognition system
that is based on the Gaussian mixture model-universal background model. By using
a model that can establish long-term dependencies to create a universal background
model that contains a larger amount of speaker information, it is possible to extract
more feature parameters, which are speaker dependent, from the speech signal. We
conducted experiments with this method on four corpora: two in Chinese and two in
English. The equal error rate, minimum detection cost function and detection error
tradeoff curve are used as criteria for system performance evaluation. The exper-
imental results show that the TDNN–LSTMP/i-vector speaker recognition method
outperforms the baseline system on both Chinese and English corpora and has better
robustness.

Keywords Speaker verification · I-vector · TDNN · LSTM · Short utterances

1 Introduction

Speaker verification (SV) is a branch of speaker recognition (SR) that refers to authen-
ticating the claimed identity of a speaker based on a speech signal and an enrolled
speaker record [15]. The result of this task is a binary decision: “accept” or “reject”.

Longlian Zhao (corresponding author)
zsczll@cau.edu.cn

Hui Liu
liuh@cau.edu.cn

1 College of Information and Electrical Engineering, China Agricultural University, Beijing, China
2 Modern Precision Agricultural System Integration Research Key Laboratory, Ministry of Education,
Beijing, China

SV systems can broadly be categorized into text-dependent (TD) and text-independent (TI) types based on the lexical content of the spoken voice. TD–SV requires the
same lexical content for both enrollment and testing. In contrast, TI–SV has no restric-
tions on the text/phonetic content of speech. Voice is an inherent characteristic of each
speaker, and the language of the speaker is arbitrary. For practical applications, the TI–SV system is more widely used; therefore, the research object of this paper is the TI–SV system. General research on SV systems covers two aspects: extracting more effective speaker acoustic features at the front end and modeling the acoustic features at the back end to obtain a more effective speaker classifier.
The TI–SR system was originally developed from the Gaussian mixture model (GMM). It expresses the speech acoustic feature sequence through sufficient statistics. To address the sparseness of enrolled speaker data, the universal background model (UBM) was proposed; it is trained on the speech data of all speakers and, as a speaker-independent model, should in theory capture the commonalities of all speakers [17]. To improve robustness to channel variability, factor analysis was introduced, which led to the i-vector method [3, 11]. The i-vector is a compact vector extracted from the GMM mean supervector that captures both between-speaker and between-channel differences, and most current SV systems are based on this framework. The i-vector-based speaker recognition system mainly includes sufficient statistics extraction, i-vector mapping, and likelihood ratio score calculation. The method first uses the GMM–UBM to align the speech acoustic feature sequence and compute high-dimensional statistics. Then, it maps the statistics into an i-vector based on subspace factor analysis. Finally, a probabilistic linear discriminant analysis (PLDA) back end is used to compute the similarity between i-vectors, where PLDA significantly improves i-vector speaker recognition performance [4].
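For reference, the i-vector mapping step can be written as a closed-form point estimate from the centered Baum–Welch statistics. The sketch below shows the standard formula (not taken from this paper), assuming diagonal UBM covariances and purely illustrative dimensions.

```python
# Sketch of the standard i-vector point estimate (assumed, not from the paper):
# w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F_tilde, with diagonal UBM covariances.
import numpy as np

def extract_ivector(N, F_centered, T, sigma_diag):
    """N: (C,) zeroth-order stats; F_centered: (C, D) centered first-order stats;
    T: (C*D, R) total-variability matrix; sigma_diag: (C, D) UBM variances."""
    C, D = F_centered.shape
    prec = (1.0 / sigma_diag).reshape(-1)          # stacked diagonal precisions
    TtSinv = T.T * prec                            # T' Sigma^-1
    n_rep = np.repeat(N, D)                        # occupancy repeated per dimension
    L = np.eye(T.shape[1]) + (TtSinv * n_rep) @ T  # posterior precision of w
    return np.linalg.solve(L, TtSinv @ F_centered.reshape(-1))

# Toy usage: 8 UBM components, 20-dim features, 10-dim i-vector.
rng = np.random.default_rng(6)
C, D, R = 8, 20, 10
N = rng.integers(5, 50, size=C).astype(float)      # zeroth-order statistics
F = rng.standard_normal((C, D))                    # centered first-order statistics
T = 0.1 * rng.standard_normal((C * D, R))
w = extract_ivector(N, F, T, np.ones((C, D)))
```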
Deep neural network (DNN) is also introduced into SV systems because it can learn
complex and high-order features from original features and learn decision boundaries
under a variety of conditions. The combination of automatic speaker recognition with DNN acoustic models can improve the UBM’s ability to model speech content [2, 7, 19], and [19] proposes that a TDNN can effectively establish long-term time dependence in SV. Such a UBM improves system performance, but the TDNN uses a fixed window length and cannot fully utilize contextual information. Like the TDNN, recurrent neural networks (RNNs) can use internal state to process arbitrary input sequences and capture temporal dynamics. However, RNNs use a dynamically changing context window over the whole sequence history instead of a static fixed-size window, which makes them more suitable for sequence modeling tasks. Unfortunately, as the time interval increases, an RNN loses the ability to connect to distant information; that is, the gradient vanishes. An elegant RNN architecture, the LSTM, successfully addresses the vanishing and exploding gradient problems [1] that arise when training RNNs with back-propagation through time (BPTT), which enables the LSTM to effectively model long-range context.
Based on these observations, this article puts forward an SV system that uses a combined TDNN–LSTMP network as a feature extractor to create a UBM. It addresses the problem that the TDNN cannot make full use of context information because of its fixed window length: the combined network provides a dynamic context window over the whole sequence history, models long sequences, effectively mines the inter-frame prior information of the speech signal, and effectively learns long-range time dependencies in dynamic, time-varying signals.

2 TDNN–LSTMP Model Description

DNNs can model long time spans as well as high-dimensional and correlated features. In
addition to using contexts as input features for modeling temporal dynamics, neural
network architectures can capture long-term dependencies between sequential events.
The TDNN is such an architecture for processing sequential data. It is a feedforward network; however, its layer weights are associated with time delays on the input [14]. By adding a set of delays to the input, the
data can be represented at different points in time. This allows the TDNN to have a
limited dynamic response to the time series input data. The TDNN is similar to the
convolutional neural network (CNN), which is convolved only along the time axis.
The TDNN starts with a short context and expands the learned context as the number of hidden layers increases. Therefore, the TDNN can represent temporal context better than an ordinary DNN.
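As a minimal illustration of this idea (not code from the paper), the NumPy sketch below treats a TDNN layer as a convolution along the time axis: each output frame is a weighted combination of input frames at a few fixed offsets, and stacking two such layers widens the visible context from three to seven frames. All dimensions and the edge-clamping rule are illustrative assumptions.

```python
# Sketch (not the authors' code): a TDNN layer as a 1-D convolution along the
# time axis; the offsets give the temporal context spliced at each output frame.
import numpy as np

def tdnn_layer(x, weights, bias, offsets):
    """x: (T, D_in) frames; weights: (len(offsets), D_in, D_out); offsets: e.g. (-1, 0, 1)."""
    T_frames, _ = x.shape
    y = np.zeros((T_frames, weights.shape[2]))
    for t in range(T_frames):
        acc = bias.copy()
        for w, off in zip(weights, offsets):
            tc = min(max(t + off, 0), T_frames - 1)  # clamp at utterance edges (assumption)
            acc += x[tc] @ w
        y[t] = acc
    return np.tanh(y)                                 # tanh nonlinearity, as in Sect. 2

# Toy usage: stacking two [-1, 0, 1] layers widens the visible context to 7 frames.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 40))                # 100 frames of 40-dim features
h1 = tdnn_layer(feats, 0.1 * rng.standard_normal((3, 40, 64)), np.zeros(64), (-1, 0, 1))
h2 = tdnn_layer(h1, 0.1 * rng.standard_normal((3, 64, 64)), np.zeros(64), (-1, 0, 1))
```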
In a feedforward neural network, information can only flow from the bottom layer to the top layer, without considering the effect of the historical information computed by the same layer at previous time steps. The LSTM [20] model not only provides both long-term and short-term memory but also can simulate the selective forgetting of the human brain. In recent years, LSTM RNNs have been successfully applied to speech recognition and have significantly improved recognition performance. The LSTM hidden layer consists of a set of recurrently connected units called “memory blocks.” Each memory block contains one or more self-connected memory cells and three gates, namely the input, output, and forget gates, which control the flow of information; that is, they provide the memory cells with continuous write, read, and reset operations. Among the many variants of LSTM, we use the structure proposed in [18], called LSTMP, whose architecture is shown in Fig. 1. This structure adds a projection layer that reduces the number of neurons in the recurrent feedback; the reduced number of parameters makes the acoustic model more robust. A summary of the LSTMP formulas is as follows:

i_t = σ(W_ix x_t + W_ir r_{t−1} + W_ic c_{t−1} + b_i)   (1)

f_t = σ(W_fx x_t + W_fr r_{t−1} + W_fc c_{t−1} + b_f)   (2)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_cx x_t + W_cr r_{t−1} + b_c)   (3)

o_t = σ(W_ox x_t + W_or r_{t−1} + W_oc c_{t−1} + b_o)   (4)

m_t = o_t ⊙ tanh(c_t)   (5)

p_t = W_pm m_t   (6)

r_t = W_rm m_t   (7)

y_t = (p_t, r_t)   (8)

Fig. 1 LSTMP architecture. A single memory block is shown for clarity

where the W_** are weight matrices and the b_* are bias vectors, σ is the sigmoid function, and ⊙ denotes element-wise multiplication; i_t, f_t, o_t, c_t, m_t, r_t, and p_t are the input gate, forget gate, output gate, cell state, cell output, recurrent, and projection vectors, respectively. The cell input and cell output activation functions are generally tanh.
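To make the data flow through Eqs. (1)–(8) concrete, the following minimal NumPy sketch runs one LSTMP step. It is not the authors’ implementation: the weight dictionary, the tiny dimensions, and the random initialization are illustrative assumptions, and the output-gate peephole follows Eq. (4) as printed (on c_{t−1}).

```python
# Minimal NumPy sketch of one LSTMP step (Eqs. (1)-(8)); weights are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstmp_step(x_t, r_prev, c_prev, p):
    """p maps names (W_ix, W_ir, W_ic, b_i, ..., W_pm, W_rm) to arrays, as in Eqs. (1)-(8)."""
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ir"] @ r_prev + p["W_ic"] @ c_prev + p["b_i"])
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fr"] @ r_prev + p["W_fc"] @ c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_cx"] @ x_t + p["W_cr"] @ r_prev + p["b_c"])
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_or"] @ r_prev + p["W_oc"] @ c_prev + p["b_o"])  # Eq. (4) as printed
    m_t = o_t * np.tanh(c_t)                       # cell output, Eq. (5)
    p_t = p["W_pm"] @ m_t                          # projection output, Eq. (6)
    r_t = p["W_rm"] @ m_t                          # recurrent output fed back at t+1, Eq. (7)
    return np.concatenate([p_t, r_t]), r_t, c_t    # y_t = (p_t, r_t), Eq. (8)

# Toy dimensions: the projection/recurrent size is half the cell size, as in Sect. 2.
rng = np.random.default_rng(1)
nx, nc, nr = 8, 16, 8
shapes = {"W_ix": (nc, nx), "W_ir": (nc, nr), "W_ic": (nc, nc), "b_i": (nc,),
          "W_fx": (nc, nx), "W_fr": (nc, nr), "W_fc": (nc, nc), "b_f": (nc,),
          "W_cx": (nc, nx), "W_cr": (nc, nr), "b_c": (nc,),
          "W_ox": (nc, nx), "W_or": (nc, nr), "W_oc": (nc, nc), "b_o": (nc,),
          "W_pm": (nr, nc), "W_rm": (nr, nc)}
p = {k: 0.1 * rng.standard_normal(s) for k, s in shapes.items()}
r, c = np.zeros(nr), np.zeros(nc)
for x_t in rng.standard_normal((5, nx)):   # run five frames through the memory block
    y_t, r, c = lstmp_step(x_t, r, c, p)
```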
To improve the modeling capabilities of acoustic models, it has become a trend
to combine different layers of network structure [6, 10]. Therefore, combining the
advantages of the TDNN and LSTM network models, in this paper we use a TDNN–LSTMP network structure together with i-vectors to model the speaker space. In the process of extracting sufficient statistics, this method uses a deep neural network model based on tied triphone states from speech recognition to replace the UBM in the UBM/i-vector model and to extract the posterior probability of each frame for each category. The difference between the GMM and the TDNN–LSTMP network in speaker recognition is that, in the GMM, each mixture component represents one category obtained by unsupervised clustering; each class has no specific meaning and only represents a region of the feature space. In the neural network model based on tied triphone states, each output node represents a category; the classes are obtained by decision-tree clustering of triphone states in speech recognition and have a clear correspondence with the speech content. The DNN is used to calculate the posterior probability of each frame for each category and to extract hidden information in the speaker’s voice, so more information can be mined.
The proposed TDNN–LSTMP network structure is composed of TDNN and LSTMP hidden layers, and their combination is shown in Fig. 2. The figure shows a network structure with six hidden layers. In the TDNN layers, we use the sub-sampled TDNN to obtain sufficient speech feature information from the input time steps, which allows the inputs of each layer to be spaced apart in time. Behind the TDNN hidden layers is an LSTMP layer, followed by further TDNN layers, arranged sequentially.

Fig. 2 Diagram of TDNN–LSTMP configuration

Table 1 Model configuration for TDNN–LSTMP

Layer   Context        Layer type
1       [-1, 0, 1]     TDNN
2       [-1, 0, 1]     TDNN
3       [-1, 0, 1]     TDNN
4       [0]            LSTMP
5       [-3, 0, 3]     TDNN
6       [-3, 0, 3]     TDNN
7       [0]            LSTMP
8       [-3, 0, 3]     TDNN
9       [-3, 0, 3]     TDNN
10      [0]            LSTMP


The details of the TDNN–LSTMP structure, i.e., the layers, time delays, and splicing configuration, are shown in Table 1. In Table 1, a context of [−n, m] means that the input of the hidden layer is the splicing of the output of the previous layer at frames t − n and t + m, where t denotes the current frame. For example, [−1, 0, 1] indicates that the input to the hidden layer is a spliced version of the previous layer output at times t − 1, t, and t + 1. In the experiments, the dimension of the projection layer is set to half the number of memory cells, and the number of parameters of the LSTMP recurrent neural network is reduced to 1/4 of that of the corresponding LSTM network. The nonlinear activation function of the network is the tanh function.
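As a small worked example of the splicing scheme in Table 1 (a sketch, not the authors’ tooling), the snippet below accumulates the per-layer offsets to estimate how far the splicing alone lets each output frame look into the input; the LSTMP layers contribute no splicing but add recurrent memory on top of this.

```python
# Contexts transcribed from Table 1; [0] marks an LSTMP layer (no splicing).
layers = [(-1, 0, 1), (-1, 0, 1), (-1, 0, 1), (0,),
          (-3, 0, 3), (-3, 0, 3), (0,), (-3, 0, 3), (-3, 0, 3), (0,)]

left = right = 0
for ctx in layers:
    left += min(ctx)     # most negative offset extends the left context
    right += max(ctx)    # most positive offset extends the right context
print(f"splicing alone spans frames t{left:+d} .. t+{right}")
# -> splicing alone spans frames t-15 .. t+15
```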

Table 2 Corpus statistics

Corpus        Spks   Male   Female   Utts      Language
LibriSpeech   80     40     40       5253      English
Timit         630    438    192      6300      English
Aishell       400    214    186      141,600   Chinese
Thchs-30      40     9      31       13,388    Chinese

3 Experiments and Results

3.1 Datasets

In the experiments, we trained the UBM, the T-matrix, and the PLDA back end on the LibriSpeech clean-100 [13] dataset, which includes 100 h of clean speech comprising 28,539 English utterances from 251 speakers (125 female and 126 male). The test sets of this paper come from four corpora (two English and two Chinese), whose parameters are shown in Table 2. LibriSpeech is a corpus of read English speech derived from LibriVox audiobooks. Timit [5] is a read speech corpus whose speakers come from the 8 major dialect regions of the United States of America. Aishell [8] is a multi-channel Mandarin corpus recorded in a quiet office environment, covering 11 domains such as smart home and autonomous driving. Thchs-30 [21] is a Mandarin corpus recorded in a clean environment, with content selected from a large amount of news text. All datasets are sampled at 16 kHz.

3.2 Evaluation Criteria

The indicators that are used in this paper are equal error rate (EER), minimum detection
cost function (MinDCF) and detection error tradeoff (DET) curve according to the
2008 NIST speaker recognition evaluation (SRE) [12]. The EER is the error rate when
the false rejection rate is equal to the false acceptance rate. The detection cost function
is defined by the following expression:

C_det = C_Miss · P_Miss|Target · P_Target + C_FalseAlarm · P_FalseAlarm|NonTarget · (1 − P_Target)   (9)

where C_Miss and C_FalseAlarm are the costs of a missed detection and a false alarm, respectively; P_Miss|Target and P_FalseAlarm|NonTarget are the miss rate and the false alarm rate at a given threshold θ; and P_Target is the prior probability of the specified target speaker. In the NIST SRE task, the parameters of MinDCF are defined as C_Miss = 1, C_FalseAlarm = 10, and P_Target = 0.01. The DET curve gives a comprehensive performance evaluation of the speaker recognition system, and the MinDCF is the most important evaluation indicator of system performance.
The lower the EER and MinDCF are, the closer the DET curve is to the origin and
the better the recognition performance of the system is. The distance between DET
curves can effectively describe the differences between systems.
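For concreteness, the following sketch computes the EER and the (unnormalized) MinDCF of Eq. (9) from a set of target and non-target trial scores, using the NIST SRE’08 costs quoted above. The synthetic score distributions and the simple threshold sweep are assumptions for illustration only.

```python
# Sketch (assumed scoring interface): EER and MinDCF from trial scores.
import numpy as np

def eer_and_min_dcf(tar, non, c_miss=1.0, c_fa=10.0, p_target=0.01):
    scores = np.concatenate([tar, non])
    labels = np.concatenate([np.ones(len(tar)), np.zeros(len(non))])
    labels = labels[np.argsort(scores)]
    # Sweep the threshold over every score: misses accumulate from the left,
    # false alarms shrink from the right.
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / len(tar)])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / len(non)])
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = 0.5 * (p_miss[eer_idx] + p_fa[eer_idx])
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)   # Eq. (9)
    return eer, dcf.min()

rng = np.random.default_rng(2)
tar = rng.normal(2.0, 1.0, 1000)      # synthetic target-trial scores
non = rng.normal(0.0, 1.0, 10000)     # synthetic non-target-trial scores
eer, min_dcf = eer_and_min_dcf(tar, non)
print(f"EER = {eer:.2%}, MinDCF = {min_dcf:.4f}")
```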

Fig. 3 UBM/i-vector speaker recognition system

3.3 Experimental Setup

The experiments in this paper use the open-source Kaldi toolkit under the Ubuntu 16.04 Linux operating system. Kaldi is a widely used speech recognition toolkit written in C++. In the training of a neural network, a GPU can substantially increase the training speed, so a GPU is used to train the neural networks. For GPU training, the number of parallel training processes is set and a fixed number of samples is drawn at each parallel training step; once the parallel training is complete, the mean training time of the network is calculated, and many such iterations of the network structure are required. We adopt mel-frequency cepstral coefficients (MFCCs) as the acoustic feature representation for training the neural networks [9]. The experimental configuration of the system is described below.
The speaker recognition model based on the UBM/i-vector method [22] is shown in Fig. 3, where the UBM provides the sufficient statistics for the extraction of the i-vector. Based on the MFCC features, a UBM is first trained with the expectation-maximization (EM) algorithm. The T-matrix is a low-rank total variability matrix, and the T-matrix and the i-vectors are estimated using the zeroth-order and first-order Baum–Welch statistics computed against the UBM. The PLDA back end calculates the similarity scores between i-vectors. For two given speech segments with corresponding i-vectors w1 and w2, the PLDA model calculates their similarity with the following formula:

score = log [ p((w1, w2) | θ_tar) / p((w1, w2) | θ_nontar) ]   (10)

where θ_tar represents the hypothesis that w1 and w2 are from the same speaker, and θ_nontar represents the hypothesis that w1 and w2 are from different speakers. The higher the score, the greater the likelihood that the two utterances come from the same speaker.
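The following sketch illustrates Eq. (10) with a simplified two-covariance PLDA model (an assumption for illustration; Kaldi’s PLDA implementation differs in detail): an i-vector is modeled as a speaker component with between-speaker covariance Σ_b plus a residual with within-speaker covariance Σ_w, and the two hypotheses differ only in whether the two i-vectors share the speaker component.

```python
# Sketch of Eq. (10) under a two-covariance PLDA model (illustrative assumption).
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(w1, w2, sigma_b, sigma_w):
    """Log-likelihood ratio for the pair (w1, w2)."""
    d = len(w1)
    tot = sigma_b + sigma_w
    pair = np.concatenate([w1, w2])
    # Same speaker: the i-vectors share the speaker mean, so they are
    # correlated through sigma_b in the joint covariance.
    cov_tar = np.block([[tot, sigma_b], [sigma_b, tot]])
    # Different speakers: independent draws.
    cov_non = np.block([[tot, np.zeros((d, d))], [np.zeros((d, d)), tot]])
    return (multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=cov_tar)
            - multivariate_normal.logpdf(pair, mean=np.zeros(2 * d), cov=cov_non))

# Toy usage with 100-dimensional i-vectors, as in Sect. 3.3.1.
rng = np.random.default_rng(3)
d = 100
sigma_b, sigma_w = 0.5 * np.eye(d), 1.0 * np.eye(d)
spk = rng.multivariate_normal(np.zeros(d), sigma_b)
w1 = spk + rng.multivariate_normal(np.zeros(d), sigma_w)
w2 = spk + rng.multivariate_normal(np.zeros(d), sigma_w)
print(plda_llr(w1, w2, sigma_b, sigma_w))   # larger score favours the same-speaker hypothesis
```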
In this paper, TDNN–LSTMP is used to create the UBM: combined with the speaker features, the GMM parameters of the UBM under this framework are estimated from the DNN posteriors, with the following update formulas:
γ_k^{(i)} = P(k | y_i, Θ)   (11)

λ_k = Σ_{i=1}^{N} γ_k^{(i)}   (12)

μ_k = (1 / λ_k) Σ_{i=1}^{N} γ_k^{(i)} x_i   (13)

Σ_k = (1 / λ_k) Σ_{i=1}^{N} γ_k^{(i)} (x_i − μ_k)(x_i − μ_k)^T   (14)

Here, x_i are the given speaker recognition features, and λ_k, μ_k, and Σ_k are, respectively, the weight, mean, and covariance of the kth Gaussian. The DNN parameters are denoted by Θ, and P(k | y_i, Θ), i.e., γ_k^{(i)}, is the posterior probability of triphone state k at frame i given the DNN features y_i.
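A minimal sketch of Eqs. (11)–(14) follows: given a matrix of DNN posteriors and the corresponding speaker features, it accumulates the zeroth- and first-order statistics and the per-class covariances. The normalization of the weights and the variance flooring are added assumptions, and the toy sizes are illustrative only.

```python
# Sketch of Eqs. (11)-(14): per-class GMM parameters from DNN posteriors gamma
# (frames x classes) and speaker features x (frames x dim). Illustrative only.
import numpy as np

def gmm_from_posteriors(gamma, x, var_floor=1e-3):
    occ = gamma.sum(axis=0)                          # lambda_k, Eq. (12)
    weights = occ / occ.sum()                        # normalized weights (assumption)
    means = (gamma.T @ x) / occ[:, None]             # mu_k, Eq. (13)
    covs = []
    for k in range(gamma.shape[1]):
        diff = x - means[k]
        cov = (gamma[:, k, None] * diff).T @ diff / occ[k]    # Sigma_k, Eq. (14)
        covs.append(cov + var_floor * np.eye(x.shape[1]))     # flooring (assumption)
    return weights, means, np.stack(covs)

rng = np.random.default_rng(4)
frames, dim, classes = 2000, 20, 8                   # toy sizes
gamma = rng.dirichlet(np.ones(classes), size=frames) # stand-in for DNN posteriors
x = rng.standard_normal((frames, dim))               # stand-in for speaker features
w, mu, sigma = gmm_from_posteriors(gamma, x)
```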

3.3.1 UBM/i-Vector Baseline

The baseline is a standard i-vector system based on GMM–UBM, following the Kaldi SRE10 V1 recipe [16]. The front end consists of 20 MFCCs with a 25-ms frame length, and the features are mean-normalized over a 3-s sliding window. All features are 60-dimensional, including the first-order and second-order difference coefficients. Non-speech segments are removed with energy-based voice activity detection. The UBM is a 1425-component full-covariance GMM; it is initialized with diagonal covariance matrices for 4 iterations of EM and then trained with full covariance matrices for 4 iterations. A 100-dimensional i-vector extractor is trained with 5 iterations of EM. The back end is PLDA scoring.
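As an illustration of this front-end post-processing (a sketch under assumptions, not the Kaldi implementation), the snippet below applies mean normalization over a sliding window and appends simple first- and second-order differences; the 300-frame window assumes a 10-ms frame shift, which the paper does not state.

```python
# Sketch: 3-s sliding-window mean normalization and delta features (assumptions:
# 10 ms frame shift, simple np.gradient deltas rather than Kaldi's regression deltas).
import numpy as np

def sliding_cmn(feats, window=300):
    """Subtract from each frame the mean over a centered window of frames."""
    out = np.empty_like(feats)
    half = window // 2
    for t in range(len(feats)):
        lo, hi = max(0, t - half), min(len(feats), t + half + 1)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out

def add_deltas(feats):
    """Append first- and second-order differences (60 dims from 20 statics)."""
    d1 = np.gradient(feats, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([feats, d1, d2])

rng = np.random.default_rng(5)
mfcc = rng.standard_normal((500, 20))        # 5 s of 20-dim MFCCs (toy data)
features = add_deltas(sliding_cmn(mfcc))     # shape (500, 60)
```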

3.3.2 TDNN–LSTMP/i-Vector

TDNN–LSTMP is used to create a UBM that models phonetic content; this differs from the baseline in the UBM training step. The GMM is created from the DNN posteriors and the speaker features. For i-vector extraction, both systems
follow Fig. 3. We use the TDNN–LSTMP architecture that is described in Sect. 2. The
output dimension of the TDNN layer is 1024, and the LSTM layer has a cell dimension
of 1024. The input features are 40 MFCCs with a 25-ms frame length.

3.4 Results

To evaluate the effectiveness of the proposed method, three sets of experiments were
performed based on the four corpora mentioned above. Part 1 compares the perfor-
mances of the UBM/i-vector and TDNN–LSTMP/i-vector systems on the LibriSpeech,

Table 3 EER (%) values for various corpora

Corpus        UBM/i-vector   TDNN–LSTMP/i-vector   Improvement (%)
LibriSpeech   3.279          2.186                 33
Timit         4.291          1.866                 57
Aishell       7.250          2.750                 62
Thchs-30      8.333          6.667                 20

Table 4 MinDCF values for various corpora

Corpus        UBM/i-vector   TDNN–LSTMP/i-vector   Improvement (%)
LibriSpeech   0.0175         0.0113                35
Timit         0.0264         0.0138                48
Aishell       0.0425         0.0153                64
Thchs-30      0.0566         0.0487                14

Timit, Aishell, and Thchs-30 corpora. Part 2 compares the performance on the four corpora with various numbers of enrollment utterances. Part 3 tests the performance of the systems under various speech duration conditions. Because the corpora differ in recording conditions and methods, the four corpora represent different experimental environments.

3.4.1 Performance on the Test Set

Tables 3 and 4 show the performance of UBM/i-vector and TDNN–LSTMP/i-vector on the four corpora in terms of EER and MinDCF, together with the relative improvement of TDNN–LSTMP/i-vector. According to the tables, for all four corpora, both Chinese and English, the TDNN–LSTMP/i-vector system outperforms UBM/i-vector. For the Chinese Aishell corpus, the TDNN–LSTMP/i-vector system achieves relative improvements of 62% in EER and 64% in MinDCF over the UBM/i-vector baseline. For the other three corpora, the TDNN–LSTMP/i-vector system also achieves performance improvements.
In Fig. 4, we show the DET curves of the two systems for the four corpora. For all four corpora, the TDNN–LSTMP/i-vector curves are closer to the origin than the UBM/i-vector curves, which indicates that the TDNN–LSTMP/i-vector system consistently outperforms the UBM/i-vector system. For the Timit and Aishell corpora, the gap between the DET curves of the two systems is the largest, which indicates that the performance improvements on these corpora are particularly prominent.

3.4.2 Performance in Terms of Number of Utterances Enrolled

Experiment 2 studies the influence of the number of enrollment utterances per speaker on the two systems. Tables 5 and 6 present comparisons

Fig. 4 DET curves for various corpora (miss probability vs. false alarm probability, in %, for LibriSpeech, Timit, Aishell, and Thchs-30; each panel compares UBM and TDNN–LSTMP)

of the EER and MinDCF of the two systems, respectively, as the number of enrolled utterances is increased from 1 to 20. For a more intuitive comparison, the corresponding line graphs are shown in Figs. 5 and 6. They show that the performance of both systems improves as the number of enrolled utterances increases. Once the number of enrolled utterances reaches approximately eight, the performance stabilizes. On LibriSpeech, whose test set comes from the same corpus as the training data, the performance is already relatively good with as few as one or two enrolled utterances, and the number of enrolled utterances has little impact.

3.4.3 Performance Against Shorter Duration

This experiment tests the performance of the systems on the Aishell and LibriSpeech corpora under various speech duration conditions; the results are shown in Table 7. We use the two systems to test the

Table 5 SV performance on the test set, with 1, 2, 3, 4, 8, 12, and 20 utterances for enrollment, in terms of EER (%)

            UBM/i-vector                              TDNN–LSTMP/i-vector
Utts   LibriSpeech   Timit   Aishell   Thchs-30   LibriSpeech   Timit   Aishell   Thchs-30
1      5.464         9.142   6.750     10.000     2.732         8.022   5.500     8.333
2      3.825         5.784   6.500     6.667      2.186         4.478   4.250     6.667
3      3.279         4.851   7.250     8.333      2.186         2.985   3.500     5.000
4      3.825         5.224   7.250     8.333      2.186         3.358   3.750     8.333
8      3.825         4.291   6.750     8.333      2.186         1.866   3.000     6.667
12     3.279         4.291   7.250     8.333      2.186         1.866   3.000     6.667
20     3.825         4.291   7.250     8.333      1.639         1.866   2.750     6.667

Table 6 SV performance on the test set, with 1, 2, 3, 4, 8, 12, and 20 utterances for enrollment, in terms of minDCF

            UBM/i-vector                              TDNN–LSTMP/i-vector
Utts   LibriSpeech   Timit    Aishell   Thchs-30   LibriSpeech   Timit    Aishell   Thchs-30
1      0.0235        0.0547   0.0446    0.0559     0.0141        0.0396   0.0225    0.0415
2      0.0202        0.0406   0.0424    0.0513     0.0120        0.0242   0.0190    0.0348
3      0.0184        0.0342   0.0396    0.0544     0.0114        0.0192   0.0185    0.0449
4      0.0190        0.0347   0.0384    0.0557     0.0113        0.0178   0.0174    0.0379
8      0.0173        0.0264   0.0397    0.0530     0.0123        0.0138   0.0158    0.0416
12     0.0154        0.0264   0.0403    0.0524     0.0103        0.0138   0.0148    0.0485
20     0.0156        0.0264   0.0402    0.0527     0.0108        0.0138   0.0153    0.0482

system performance under durations of 4 s, 3 s, 2 s, and 1 s. Fig. 7 also shows the corresponding DET curves. In Table 7, a decrease in system performance can be observed when the duration is reduced from 4 s to 3 s, 2 s, or 1 s. In terms of EER, durations of 4 s and 3 s achieve satisfactory performance on the two corpora, much better than durations of 1 s and 2 s. When the test speech is not segmented, that is, under the full condition, the original utterances of the test set are used; since their average duration is slightly longer, the performance under the full condition is better than under the 1 s to 4 s conditions. At the same time, a comparison of the two systems shows that TDNN–LSTMP/i-vector outperforms the UBM/i-vector baseline under all duration conditions.
According to the DET curves, when the test speech duration is 1 s, the performance of both systems is poor; when the duration is increased to 2 s, the improvement is substantial. For the TDNN–LSTMP/i-vector system on the Aishell corpus, the relative improvement exceeds 60%. Correspondingly, the longer the test speech duration, the better the system performance.

Fig. 5 UBM/i-vector SV performance (EER, %) on the test set versus the number of enrolled utterances (1, 2, 3, 4, 8, 12, and 20) for LibriSpeech, Timit, Aishell, and Thchs-30

Fig. 6 TDNN–LSTMP/i-vector SV performance (EER, %) on the test set versus the number of enrolled utterances (1, 2, 3, 4, 8, 12, and 20) for LibriSpeech, Timit, Aishell, and Thchs-30

From the above experiments, we can see that the speaker verification system based on TDNN–LSTMP significantly outperforms the UBM/i-vector-based system on both the Chinese and English data sets. The experiments also show that our system consistently outperforms the traditional i-vector system on short-duration speaker recognition tasks. In addition, we conclude that, once the number of enrolled utterances reaches about eight, the recognition performance of the speaker verification system has essentially reached its optimum. This experimental

Table 7 SV performance on the test set under various duration conditions, in terms of EER (%)

Corpus        System                 1 s      2 s      3 s     4 s     Full
Aishell       UBM/i-vector           19.250   10.750   8.250   8.000   7.250
              TDNN–LSTMP/i-vector    15.750   5.500    3.250   3.000   2.750
LibriSpeech   UBM/i-vector           11.480   12.020   4.918   4.372   3.279
              TDNN–LSTMP/i-vector    5.464    3.825    2.186   2.732   2.186

Fig. 7 DET plots for different duration conditions

observation can provide guidance for the practical application design of many speaker
recognition systems in the future, especially for systems that can use only limited
resources.

4 Conclusion

This paper studies the application of the TDNN–LSTMP network to the speaker recognition task, using a TDNN–LSTMP fusion network to create a UBM and to extract additional speaker-related information from the speech signal. The mainstream UBM/i-vector SV method is compared with the proposed TDNN–LSTMP/i-vector method on four Chinese and English corpora. The experimental results show that TDNN–LSTMP/i-vector improves the speech modeling capability and the recognition performance; it outperforms the baseline system on both Chinese and English corpora and has better robustness.
Our future work will study speaker recognition in noisy environments and the fusion of multiple systems with various features.

Acknowledgements The research reported here was supported by the China National Natural Science Funds (No. 31772064).

References
1. Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult.
IEEE Trans. Neural Netw. 5(2), 157 (1994). https://doi.org/10.1109/72.279181
2. G.E. Dahl, D. Yu, L. Deng, A. Acero, Context-dependent pre-trained deep neural networks for large-
vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20(1), 30 (2011). https://
doi.org/10.1109/TASL.2011.2134090
3. N. Dehak, P.J. Kenny, R. Dehak, Front-end factor analysis for speaker verification. IEEE Trans. Audio
Speech Lang. Process. 19(4), 788 (2011). https://doi.org/10.1109/TASL.2010.2064307
4. D. Garcia-Romero, X. Zhou, C.Y. Espy-Wilson, Multicondition training of Gaussian PLDA models in
i-vector space for noise and reverberation robust speaker recognition, in IEEE International Conference
on Acoustics, Speech and Signal Processing (2012), pp. 4257–4260
5. J.S. Garofolo, L.F. Lamel, W.M. Fisher, Darpa timit acoustic-phonetic continuous speech corpus cdrom.
NIST speech disc 1-1.1. NASA STI/Recon Technical Report N 93 (1993)
6. K.J. Han, S. Hahm, B.H. Kim, Deep learning-based telephony speech recognition in the wild, in
INTERSPEECH (2017), pp. 1323–1327. https://doi.org/10.21437/Interspeech.2017-1695
7. G. Hinton, L. Deng, D. Yu, Deep neural networks for acoustic modeling in speech recognition: the
shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82 (2012). https://doi.org/10.
1109/MSP.2012.2205597
8. H. Bu, J. Du, X. Na, B. Wu, H. Zheng, Aishell-1: an open-source Mandarin speech corpus and a speech recognition baseline, in Oriental COCOSDA (2017)
9. V. Joshi, N.V. Prasad, S. Umesh, Modified mean and variance normalization: transforming to utterance-
specific estimates. Circuits Syst. Signal Process. 35(5), 1593 (2016). https://doi.org/10.1007/s00034-
015-0129-y
10. S. Karita, A. Ogawa, M. Delcroix, Forward-backward convolutional lstm for acoustic modeling, in
INTERSPEECH (2017), pp. 1601–1605. https://doi.org/10.21437/Interspeech.2017-554
11. P. Kenny, Bayesian speaker verification with heavy tailed priors, in Proceedings of the Odyssey Speaker
and Language Recognition Workshop, Brno, Czech Republic (2010)
12. NIST, The NIST year 2008 speaker recognition evaluation plan. https://www.nist.gov/sites/default/
files/documents/2017/09/26/sre08_evalplan_release4.pdf. Accessed 3 Apr 2008
13. V. Panayotov, G. Chen, D. Povey, Librispeech: an ASR corpus based on public domain audio books, in
IEEE International Conference on Acoustics, Speech and Signal Processing (2015), pp. 5206–5210.
https://doi.org/10.1109/ICASSP.2015.7178964
14. V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecture for efficient modeling
of long temporal contexts, in INTERSPEECH (2015), pp. 3214–3218

15. A. Poddar, M. Sahidullah, G. Saha, Speaker verification with short utterances: a review of challenges,
trends and opportunities. IET Biom. 7(2), 91 (2018). https://doi.org/10.1049/iet-bmt.2017.0065
16. D. Povey, A. Ghoshal, G. Boulianne, The kaldi speech recognition toolkit, in IEEE 2011 Workshop on
Automatic Speech Recognition and Understanding. (IEEE Signal Processing Society, 2011)
17. D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker Verification Using Adapted Gaussian Mixture Models
(Academic Press Inc., London, 2000)
18. H. Sak, A. Senior, F. Beaufays, Long short-term memory based recurrent neural network architectures
for large vocabulary speech recognition, in INTERSPEECH (2014), pp. 338–342. https://arxiv.org/abs/1402.1128
19. D. Snyder, D. Garcia-Romero, D. Povey, Time delay deep neural network-based universal background
models for speaker recognition. Autom. Speech Recognit. Underst. (2016). https://doi.org/10.1109/
ASRU.2015.7404779
20. L.M. Surhone, M.T. Tennoe, S.F. Henssonow, Long Short Term Memory (Betascript Publishing, Riga,
2010)
21. D. Wang, X. Zhang, Thchs-30: a free Chinese speech corpus. Comput. Sci. (2015). https://arxiv.org/
abs/1512.01882
22. Y. Xu, I. Mcloughlin, Y. Song, Improved i-vector representation for speaker diarization. Circuits Syst.
Signal Process. 35(9), 3393 (2016). https://doi.org/10.1007/s00034-015-0206-2

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
