Professional Documents
Culture Documents
1
Figure 1: Diagram of the Deep Speaker architecture
2
Table 1: Architecture of ResCNN. “Average” denotes the temporal Table 2: Architecture of the GRU model. “Average” denotes the
pooling layer and “ln” denotes the length normalization layer. A temporal pooling layer, and “ln” denotes the length normalization
bracket describes the structure of a ResBlock as shown in Fig. 2 layer.
layer name structure stride dim # params layer name struct stride dim param
conv64-s 5 × 5, 64 2×2 2048 6K conv64-s 5 × 5, 64 2×2 2048 6K
3 × 3, 64 GRU 1024 cells 1 1024 9.4M
res64 ×3 1×1 2048 41K×6
3 × 3, 64 GRU 1024 cells 1 1024 6.3M
conv128-s 5 × 5, 128 2×2 2048 209K GRU 1024 cells 1 1024 6.3M
3 × 3, 128 average - - 1024 0
res128 ×3 1×1 2048 151K×6
3 × 3, 128 affine 1024 × 512 - 512 500K
conv256-s 5 × 5, 256 2×2 2048 823K ln - - 512 0
3 × 3, 256 triplet - - 512 0
res256 ×3 1×1 2048 594K×6
3 × 3, 256 total 23M
conv512-s 5 × 5, 512 2×2 2048 3.3M
3 × 3, 512
res512 ×3 1×1 2048 2.4M×6
3 × 3, 512
average - - 2048 0
affine 2048 × 512 - 512 1M the ResCNN architecture), reduces dimensionality in both the time
ln - - 512 0 and frequency domains, allowing for faster GRU layer computation.
triplet - - 512 0 Following the convolutional layer are three forward-only GRU lay-
total 24M ers with 1024 units, recurrent in the time dimension. After the GRU
layers, we apply the same average, affine, and length normalization
layers as in the ResCNN model. We also use sequence-wise batch
normalization and clipped-ReLU activation in the whole model.
layer inputs, as described in Figure 2. The ResBlock is defined as The ResCNN and GRU architectures have a similar number of
parameters, 23M-24M, allowing us to better compare their perfor-
h = z(x, Wi ) + x, (1) mances.
where x and h are the input and output of the layers considered,
and z is the stacked nonlinear layer’s mapping function. Note that 3.2 Speaker Embedding
identity shortcut connections of x do not add extra parameters and Frame-level activations are fed into a temporal average layer. Unlike
computational complexity. in the pooling layer in [7], we do not use standard deviation of frame-
Table 1 shows the details of the proposed ResCNN architecture. level outputs. The layer activation h is computed as follows:
As described in Figure 2, the ResBlock contains two convolutional
layers with 3 × 3 filters and 1 × 1 stride. Each block has an iden- T −1
1 X
tical structure, and the skip connection is the identity mapping of x. h= x(t) (3)
Three ResBlocks are stacked in our architecture. When the number T t=0
of channels increases, we use a single convolutional layer with filter where, T is the number of frames in the utterance. An affine
size of 5 × 5 and a stride of 2 × 2. Thus, the the frequency dimen- layer then projects the utterance-level representation into a 512-
sion is kept constant in all convolution layers. We find that speaker dimensional embedding. We normalize embeddings to have unit
recognition performance is not sensitive to the stride in the time di- norm and use cosine similarity between pairs in the objective func-
mension, contrary to [15]’s findings. Notably, when the number of tion:
channels increases, projection shortcuts are not used as in [16], be-
cause they increased the number of parameters without yielding sig- cos(xi , xj ) = xT
i xj (4)
nificant improvement. We adopt sequence-wise batch normalization
(BN) between the convolution and the nonlinearity, following [18]. where, xi and xj are two embeddings.
We use the clipped rectified linear (ReLU) function [18],
3.3 Triplet Loss and Selection
σ(x) = min {max {x, 0} , 20} (2)
We model the probability of embeddings xi and xj belonging to the
as our nonlinearity for all of the network layers. same speaker by their cosine similarity in Equation (4), allowing us
to use the triplet loss function like in FaceNet [11].
3.1.2 GRU Network As shown in Figure 3, triplet loss takes in as input three samples,
an anchor (an utterance from a specific speaker), a positive example
We also experiment with recurrent networks for frame-level feature
(another utterance from the same speaker), and a negative example
extraction because they have worked well for speech recognition
(an utterance from another speaker). We seek to make updates such
[19]. [23] showed that a GRU is comparable to an LSTM with a
that the cosine similarity between the anchor and the positive exam-
properly initialized forget gate bias, and their best variants are com-
ple is larger than the cosine similarity between the anchor and the
petitive with each other. We decided to use GRUs because previous
negative example [11]. Formally,
speech recognition experiments [18] on smaller data sets showed the
GRU and LSTM reach similar accuracy for the same number of pa-
sap an
i − α > si (5)
rameters, but the GRUs were faster to train and less likely to diverge.
The details of the proposed GRU architecture are shown in Ta- where, sap
i is the cosine similarity between the anchor a and the pos-
ble 2. A 5 × 5 filter size, 2 × 2 stride convolution layer (like in itive example p for triplet i, and san
i is the cosine similarity between
3
(a) ACC
Figure 3: The Triplet loss in cosine similarity.
#GPUs 1 4 8 16
Prob(hard) 29.06% 43.29% 45.54% 50.99%
Rel. time cost - +5.47% +6.09% +15.28%
(b) EER
4
Table 4: Statistics of datasets used in the experiments. UIDs and Xi- 4.2 Baseline DNN i-vector System
aoDu consist of Mandarin speech from mobile queries, while MTurk
is a English dataset from Amazon Mechanical Turk. UIDs and Mturk The baseline DNN i-vector model is built based on [4]. Raw audio is
are text independent, while XiaoDu is a text-dependent dataset con- converted to 40-dimensional log mel-filter bank (Fbank) coefficients
sisting of wake-words. and 20-dimensional MFCC with a frame-length of 25ms. Delta and
acceleration are appended to the input, and a frame-level energy-
based VAD selects features corresponding to speech frames.
#spkr #utt #utt/spkr dur/utt A seven-layer DNN that contains 600 input nodes, 1024 nodes
Train250k 249,312 12,202,181 48.94 3.72s in each hidden layer, and 4,682 output nodes is trained with cross
Train50k 46,835 2,236,379 47.75 3.66s entropy using the alignments from a HMM-GMM model. The input
Eva200 200 3,800 19 4.25s layer of the DNN is composed of 15 frames (7 frames on each side
UIDs dataset. of the frame for which predictions are made) where each frame cor-
responds to 40 dimension Fbank coefficients. The DNN is used to
#spkr #utt #utt/spkr dur/utt provide the posterior probability in the proposed framework for the
train 11,558 89,227 7.72 1.56s 4,682 senones defined by a decision tree.
test 844 10,014 11.86 1.48s A 1024 diagonal component UBM is trained in a gender-
independent fashion, along with a 400 dimensional i-vector extrac-
XiaoDu dataset
tor, followed by length normalization and PLDA.
.
#spkr #utt #utt/spkr dur/utt
train 2,174 543,840 250.16 4.16s 4.3 Training Methodology
test 200 4,000 20 4.31s Deep Speaker models are trained using the SpeechDL [18] dis-
MTurk dataset. tributed machine learning system with 16 K40 GPUs. Audio is
converted to 64-dimensional Fbank coefficients, normalized to have
zero mean and unit variance. The same VAD processing as DNN
i-vector baseline system is used here.
As described in Section 3.4, Deep Speaker models are trained
recorded by mobile phone apps, while the third is English speech in two stages: softmax pre-training and triplet loss fine-tuning. In
collected on Amazon Mechanical Turk. UIDs and MTurk are text- both stages, we use synchronous SGD with 0.99 momentum [24],
independent datasets, while XiaoDu is text-dependent. The details with a linear decreasing learning rate schedule from 0.05 to 0.005.
of the three datasets are given as follows. The model is pre-trained for 10 epochs with a minibatch size of 64
• UIDs is collected from anonymized voice search queries. Ta- and fine-tuned with triplet loss for 15 epochs using a minibatch size
ble 4 lists statistics for the utterances, the durations of which of 128. Training pairs are re-shuffled in each epoch. The margin α
mostly range from 3 to 5 seconds. The full training partition, is set to 0.1 in 5 A development dataset is used for hyper-parameter
Train250k, comprises 249,312 speakers, 12,202,181 utterances, tuning and early stopping.
and more than 12,600 hours of speech. The subset Train50k
comprises 46,835 speakers, 2,236,379 utterances, and more than
2,270 hours of speech. The evaluation partition, Eva200, con- 5 Experimental Results
sists of 200 speakers that do not overlap with the training speak-
ers. 380,000 speaker recognition trials were run. 5.1 Speaker-independent Experiments on
• XiaoDu contains Baidu wake-word utterances, “Xiaodu, xi- UIDs
aodu”. The full training dataset comprises 11,558 speakers, First, we compare the DNN-based i-vector system and Deep Speaker
89,227 utterances, and more than 38 hours of speech. The eval- on the UIDs dataset, with the Train50k partition as the training set
uation dataset consists of 844 speakers that do not overlap with and Eva200 partition as the test set. Deep Speaker models are trained
the training speakers. 1,001,400 speaker recognition trials were with different neural network architectures (GRU or ResCNN), and
run. different strategies (softmax, triplet loss, or softmax + triplet loss,
• MTurk contains scripted English utterances collected on Ama- which is softmax pretraining followed by triplet loss fine-tuning).
zon Mechanical Turk. The median utterance length is around The results are listed in Table 5. All Deep Speaker models achieve
4 seconds, and the 25th and 75th percentile lengths are ∼3 notable improvements over the baseline, roughly 50 - 80% relative
seconds and ∼5 seconds. The full training dataset comprises reduction on EER, and 60 - 70% relative improvement on ACC.
2,174 speakers, 543,840 utterances, and more than 620 hours of
speech. The evaluation dataset consists of 200 speakers that do 5.1.1 Softmax Pre-training
not overlap with the training speakers. 400,000 speaker recogni-
tion trials were run. Training using “softmax + triplet” loss achieves the best perfor-
mance, followed by triplet loss-only and softmax-only training, in
Speaker verification and identification trials were constructed by decreasing performance order. In particular, the ResCNN with soft-
randomly picking one anchor positive sample (AP) and 99 anchor max + triplet loss achieves 63.62% and 17.10% relative reduction
negative samples (AN) for each anchor utterance. Then, we com- on EER and 47.53% and 31.33% on error (1−ACC) compared to
puted the cosine similarity between the anchor sample and each of softmax-only and triplet loss-only models. The pre-trained GRU ar-
the non-anchor samples. EER and ACC are used for speaker veri- chitecture achieves 48.89% and 14.24% relative reduction on EER
fication and identification, respectively. Since gender labels are not and 38.05% and 33.59% on identification error compared to the other
available in all our datasets, the evaluation here is not split by gender,
unlike the NIST speaker recognition evaluations (SREs) in [3].
5
Table 5: The performance of Mandarin text-independent speaker Table 6: The system fusion performance of text-independent speaker
recognition task using Train50k as train set and Eva200 as test recognition task using Train50k as train set, Eva200 as test set.
set. Results from different neural network architectures and train-
ing methodologies are reported.
system EER[%] ACC[%]
ResCNN 2.23 90.53
system EER[%] ACC[%] GRU 2.77 89.50
DNN i-vector baseline 13.79 51.72 embedding fusion 2.17 90.95
ResCNN, softmax 6.13 81.95 score fusion 2.07 91.83
ResCNN, triplet 2.69 86.21
ResCNN, softmax (pre-train) + triplet 2.23 90.53
GRU, softmax 5.42 83.05 Table 7: The performance of text-independent speaker recognition
GRU, triplet 3.23 84.19 task using both Train250k and Train50k as training sets and Eva200
GRU, softmax (pre-train) + triplet 2.77 89.50 as the test set. All deep speaker models are trained with softmax
pre-training.
two GRU systems. The results confirm the advantages of end-to-end system EER[%] ACC[%]
training and softmax pre-training. ResCNN on Train50k 2.23 90.53
ResCNN on Train250k 1.83 92.58
5.1.2 Network Architecture GRU on Train50k 2.77 89.50
GRU on Train250k 2.35 90.77
While the GRU architecture outperforms ResCNN with softmax-
only training, ResCNN outperforms GRU layers after triplet loss
fine-tuning. As shown in Table 5, GRU achieves a 11.58% lower
EER and a 6.09% lower error rate compared to ResCNN after soft- It is clear that using tens of millions of samples results in a perfor-
max training. After triplet loss training, ResCNNs had a 19.49% mance boost. Compared to using only around 1/5th the data, the
lower EER and a 10.88% lower error rate than GRUs. As “softmax using the full dataset reduces the identification error and EER by
+ triplet loss” training strategy achieves best performance for both 17.94% and 21.65% for ResCNN and 15.16% and 13.88% for GRU.
GRU and ResCNN, we will use it in the following experiments and
also omit this label for brevity’s sake.
5.1.5 Utterances Number for Enrollment
Time cost is another important consideration for choosing net-
work architectures. We measure the training speed of a model as To investigate the effect of the enrollment utterance count on recog-
the number of minibatches that can be processed per second. In nition tasks, we choose 1 to 5 utterances for each person’s enroll-
this experiment, ResCNN can processes 0.23 minibatches per sec- ment. The speaker embeddings are produced by averaging the en-
ond, while GRU processes 0.44 minibatches per second. The time rollment utterance embeddings. As before, speaker verification and
cost gap could be partially caused by using Deep Speech 2 HPC identification trials were constructed by randomly picking one AP
techniques for the GRU network [18] and not spending comparable and 99 AN speaker embeddings for each anchor utterance. In total,
effort speeding up the ResCNN. 280,000 trials were conducted.
Table 8 shows that the EER decreases and ACC increases as the
5.1.3 System Combination enrollment utterance count increases, though with diminishing re-
turns. These results have implications for production speech recog-
Individual Deep Speaker models perform well separately, but we nition system design choices. In particular, using too many enroll-
anticipate that the ResCNN and GRU systems can benefit from fu- ment utterances would provide minimal performance gains while in-
sion because of their contrasting architectures. To fuse ResCNN and creasing inference time, making new user enrollment more cumber-
GRU, we investigate two methods: embedding fusion and score fu- some, and increasing memory usage.
sion. In the first method, we add the speaker embedding from both
models together, followed by length normalization and cosine score.
In the second method, we first normalize the scores using mean and 5.2 Text-dependent Experiments on XiaoDu
variance calculated from all scores and add them together. Table 6 Table 9 shows the performance of both ResCNN and GRU models
indicates that relative to the best single system (ResCNN), both fused for text-dependent speaker recognition on the XiaoDu dataset. The
system improve the single-system baselines. Especially, the score flag “on Train50k” means the Deep Speaker models are only trained
fusion method gets the best performance, with 7.17% and 13.37% on Train50k, while the flag “finetuned” indicates that we first trained
reductions in EER and error, respectively. the model on Train50k, then used the XiaoDu dataset to fine-tune
with triplet loss for Deep Speaker systems, and i-vector extraction
5.1.4 Amount of Training Data for the DNN i-vector system.
Interestingly, the DNN i-vector baseline system achieves the
Table 7 shows the impact of training dataset size on speaker recog-
best performance when only using XiaoDu to train the models.
nition performance. We do not experiment with the i-vector baseline
There are two possible reasons here. Firstly, the XiaoDu dataset is
here, as it’s too time consuming and computationally expensive. 1
too small to train complex deep models like Deep Speaker, and we
1 For example, the total variance matrix T in the i-vector model is too hard may be overfitting. Secondly, the text-dependent speaker verification
to compute on a big dataset. In practice, people usually train the i-vectors
using subsets of a large dataset. Indeed, we tried training i-vector systems on larger datasets and got no obvious improvements.
6
Table 8: Effect of enrollment utterance count. The performance col- Table 10: The text-independent speaker recognition performance of
umn is formatted as EER/ACC in each cell. All models are trained different systems on MTurk dataset.
on Train50k dataset.
finetuned” models, reducing the EER by 25% and the error rate by
35%.
task constrains the lexicon and phonetic variability, so the i-vector These results indicate that Deep Speaker systems can work well
extractor based on factor analysis can cover the speaker variability not only on Mandarin, but also across other languages. In addition,
in the small dataset better. the representations learned by Deep Speaker transfer well across dif-
To our surprise, we find that Deep Speaker models only trained ferent languages.
on Train50k achieve slightly better performance than models only
trained on XiaoDu. That is to say, models trained using text-
independent datasets can work well in text-dependent tasks. We
5.4 Time Span Experiments on UIDs
believe that the superior performance is a result of the amount of Speaker recognition systems usually struggle with robustness to time
training data. between enrollment and test time. People’s voiceprints change over
Fine-tuning the traditional DNN i-vector system does not signif- time, just like their appearances. We test the robustness of our model
icantly improve performance, while fine-tuned ResCNN and GRU across a wide range of time spans using the Eva200 dataset. In Table
networks outperform the DNN i-vector system, by 16.76% and 11, the first column divides the different time spans, “1 week” means
18.24% relative reduction on EER, and similar ACC. This shows the time span of registration and verification is less than 1 week, “1
that pre-training on large text-independent datasets can aid in data- month” means less than 1 month but longer than 1 week, and “3
constrained text-dependent scenarios. We speculate that the large months” means less than 3 months but longer than 1 month.
datasets can cover a greater diversity of samples and encourage The performance of all systems decrease as the time span be-
model generalization. tween enrollment and test increases, but ResCNN can still achieve
the best performance with the same time span.
5.3 Text-independent Experiments on MTurk
The MTurk experimental results in Table 10 showcase that Deep 6 Conclusion
Speaker works across languages. The flag “on Train50k” means the
Deep Speaker models are only trained on Train50k, while the flag In this paper we present a novel end-to-end speaker embedding
“finetuned” means models are first trained on Train50k, and then scheme, called Deep Speaker. The proposed system directly learns
fine-tuned on the MTurk dataset using triplet loss. Note that this is a mapping from speaker utterances to a hypersphere where cosine
a non-trivial task, since Mandarin and English speech sounds dis- similarities directly correspond to a measure of speaker similar-
parate. We don’t report a fine-tuned result for DNN i-vector baseline ity. We experiment with two different neural network architectures
here. Because Mandarin and English have different phone sets, the (ResCNN and GRU) to extract the frame-level acoustic features. A
ASR-DNN model is difficult to adapt. triplet loss layer based on cosine similarities is proposed for metric
By comparing the different systems trained only on MTurk, the learning, along with a batch-global negative selection across GPUs.
ResCNN and GRU system reduce EER by 12.11% and 9.79% and Softmax pre-training is used for achieving better performance.
error by 15.02% and 19.38% compared to the DNN i-vector sys- The experiments show that the Deep Speaker algorithm signifi-
tem. Interestingly, models trained solely on the Mandarin Train50k cantly improves the text-independent speaker recognition system as
dataset perform fairly well on English speaker classification, even compared to the traditional DNN-based i-vector approach. In the
without fine-tuning. The “finetuned” models outperform “non- Mandarin dataset UIDs, the EER decreases roughly 50% relatively,
7
and error decreases by 60%. In the English dataset MTurk, the equal [11] Schroff, F., Philbin, J. FaceNet: A Unified Embedding for Face
error rate decreases by 30% relatively, and error decreases by 50%. Recognition and Clustering. The IEEE Conference on Com-
Another strength of Deep Speaker is that it can take full advantage of puter Vision and Pattern Recognition (CVPR), 2015, pp. 815-
transfer learning to solve the speaker recognition problems on small 823
data sets, for both text-independent and text-dependent tasks. [12] Zhang, Y., Pezeshki, M., Bengio, Y., Courville, A., Hc, Q. C. .
Future work will focus on better understanding the error cases, Towards End-to-End Speech Recognition with Deep Convolu-
reducing model size, and reducing CPU requirements. We will also tional Neural Networks, Interspeech 2016.
look into ways of improving the long training times.
[13] Yu, Dong, Xiong, Wayne, Droppo, Jasha, Stolcke, Andreas,
Ye, Guoli, Li, Jinyu, and Zweig, Geoffrey. Deep convolutional
7 Acknowledgments neural networks with layer-wise context expansion and atten-
tion. Interspeech 2016
We would like to thank Liang Gao and Yuanqing Lin for their sup- [14] Bi, M., Qian, Y., Yu, K. Very Deep Convolutional Neural Net-
ports and great insights on speaker recognition. We would also like works for LVCSR, Interspeech, 2015.
to thank Sanjeev Satheesh, Adam Coates, and Andrew Ng for useful
discussions and thoughts. Also our work would not have been pos- [15] Wang, Y., Deng, X., Pu, S., Huang, Z. Residual Convolutional
sible without the data support of Hongyun Zeng, Yue Pan, Jingwen CTC Networks for Automatic Speech Recognition. Retrieved
Cao, and Limei Han. from http://arxiv.org/abs/1702.07793
[16] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian.
Deep residual learning for image recognition. In Proceed-
References ings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770 778, 2016.
[1] W.M. Campbell, D.E. Sturim, D.A. Reynolds, and A.
[17] K. C ho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F.
Solomonoff, SVM Based Speaker Verification using a GMM
Bougares, H. Schwenk, and Y. Bengio. Learning phrase rep-
Supervector Kernel and NAP Variability Compensation, in
resentations using rnn encoder-decoder for statistical machine
IEEE International Conference on Acoustics, Speech, and Sig-
translation. In Conference on Empirical Methods in Natural
nal Processing, Toulouse, 2006, vol. 1, pp. 97100
Language Processing (EMNLP), 2014
[2] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, Joint
[18] Amodei, D., Anubhai, R., Battenberg, E., Carl, C.,
Factor Analysis versus Eigenchannels in Speaker Recognition,
Casper, J., Catanzaro, B., Zhu, Z. Deep-speech 2:
IEEE Transaction on Audio Speech and Language Processing,
End-to-end speech recognition in English and Mandarin.
vol. 15, no. 4, pp. 14351447, May 2007
https://arxiv.org/abs/1512.02595
[3] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet,
[19] Sainath, Tara N, Weiss, Ron J, Senior, Andrew, Wilson, Kev-
Front-end factor analysis for speaker verification, IEEE Trans.
inW, and Vinyals, Oriol. Learning the speech front-end with
ASLP, vol. 19, pp. 788798, May 2010.
raw waveform cldnns. Interspeech 2015.
[4] Lei, Y., Scheffer, N., Ferrer, L., McLaren, M. A novel scheme
[20] S. J. D. Prince, Probabilistic linear discriminant analysis for
for speaker recognition using a phonetically-aware deep neu-
inferences about identity, in Proc. International Conference on
ral network. in IEEE International Conference on Acoustics,
Computer Vision (ICCV), Rio de Janeiro, Brazil, 2007.
Speech, and Signal Processing, 2014
[21] Castaldo, F., Alam, M. J., Cernocky, J. H. Full-Covariance
[5] Cumani, S., Laface, P., Torino, P. Probabilistic Linear Discrim-
UBM and Heavy-tailed PLDA in I-vector Speaker Verification,
inant Analysis Of Ivector Posterior Distributions. in IEEE In-
in IEEE International Conference on Acoustics, Speech, and
ternational Conference on Acoustics, Speech, and Signal Pro-
Signal Processing, 2011.
cessing, 2013
[22] S. Ghalehjegh and R. Rose, Deep bottleneck features for i-
[6] Heigold, G., Moreno, I., Bengio, S., Shazeer, N. End-to-end
vector based text-independent speaker verification, in IEEE
text-dependent speaker verification. in IEEE International Con-
Workshop on Automatic Speech Recognition and Understand-
ference on Acoustics, Speech, and Signal Processing, 2016
ing (ASRU). IEEE, 2015, pp. 555 C560.
[7] David S., Pegah G., Daniel P., Daniel G.R., Yishay C, San-
[23] R. Jozefowicz,W. Zaremba, and I. Sutskever. An empirical ex-
jeev K. Neural Network-Based Speaker Embeddings for End-
ploration of recurrent network architectures. In International
To-End Speaker Verification. in IEEE Spoken Language Tech-
Conference on Machine Learning (ICML), 2015
nology Workshop (SLT), 2016
[24] D. R. Wilson and T. R. Martinez. The general inefficiency of
[8] E. Variani, X. Lei, E. McDermott, I. Moreno, and Javier
batch training for gradient descent learning. Neural Networks,
Gonzalez-Dominguez, Deep neural networks for small foot-
16(10):14291451, 2003. 4
print text-dependent speaker verification, in IEEE Interna-
tional Conference on Acoustics, Speech, and Signal Process-
ing, 2014.
[9] S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural Computation, 9(8):17351780, 1997
[10] P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur,
Acoustic modelling from the signal domain using cnns, Inter-
speech 2016.