Abstract

In this paper, a novel architecture for speaker recognition is proposed by cascading speech enhancement and speaker processing. Its aim is to improve speaker recognition performance when speech signals are corrupted by noise. Instead of individ- [...]

[...] the speaker verification module was pre-trained and fixed when training the speech enhancement. It means that the two modules are not optimised jointly although a single loss function is used. In addition to joint optimisation, attention mechanisms have also been widely used for speaker identification and verification [11, 12, 13, 14]. This is because a neural attention mechanism [...]
arXiv:2001.05031v1 [cs.CL] 14 Jan 2020
[...] finally conclusions are drawn in Section 5.

2. Model Architecture

2.1. Speech Enhancement Module

Figure 1 shows the proposed model architecture consisting of a speech enhancement module and a speaker recognition module. Suppose the input data to the enhancement module is $X \in \mathbb{R}^{T \times F \times C}$, where $T$, $F$, and $C$ represent the temporal dimension, frequency dimension, and channel dimension, respectively. The value of $C$ is a variable. When $X$ is the input spectrogram, $C$ is set to one. Otherwise, the value of $C$ equals the number of kernels set in the convolutional neural networks.

The speech enhancement module consists of multiple CONV-MS blocks, each of which contains a dilated convolution layer followed by a multi-stage attention block. In the attention block, an attention mechanism is applied in the time, frequency and channel domains, respectively.

The output feature map of the dilated convolutional layer is denoted as $H_k \in \mathbb{R}^{T_k \times F_k \times C_k}$, where $k$ means the kth CONV-MS block. The output $H'''_k$ denotes the refined feature map of the kth CONV-MS block, and its dimension is the same as that of $H_k$.

The output of the enhancement module is used as a ratio mask whose values are viewed as the weights corresponding to each frequency bin and time frame of the spectrogram. The ratio mask is multiplied with the speech spectrogram in order to reduce the interference caused by unrelated noise signals.

2.2. Speaker Recognition Module

The speaker recognition module consists of multiple residual convolutional blocks in which a multi-stage attention block is inserted. The input of the kth residual block is denoted as $H_k \in \mathbb{R}^{T_k \times F_k \times C_k}$, and the final refined feature map of the kth residual block is $H'''_k$. The last residual block is followed by fully-connected layers, by which the predictions of speaker identities are finally computed using a softmax function.

2.3. Multi-Stage Attention (MS)

Figure 2 shows the structure of a multi-stage attention block, which runs channel attention, frequency attention, and time attention sequentially. Its mathematical representation is given in Equation 1:

$$
\begin{aligned}
H'_k &= \alpha_{C,k} H_k \\
H''_k &= \alpha_{F,k} H'_k \\
H'''_k &= \alpha_{T,k} H''_k
\end{aligned}
\tag{1}
$$

where $\alpha_{C,k}$, $\alpha_{F,k}$, and $\alpha_{T,k}$ represent the implementation of channel attention, frequency attention and time attention in the kth attention block.

2.3.1. Channel Attention

Following the principle of channel attention used in [24, 23], the working flow of channel attention is shown in Figure 2 (a) and Equation 2.

$$
\begin{aligned}
H^{C}_{k,\mathrm{max}} &= \operatorname{max}^{T_k \times F_k \times 1}(H_k) \\
H^{C}_{k,\mathrm{avg}} &= \operatorname{avg}^{T_k \times F_k \times 1}(H_k) \\
S_{\mathrm{max}} &= \mathrm{Relu}\big(H^{C}_{k,\mathrm{max}} W_0 + b_0\big) W_1 \\
S_{\mathrm{avg}} &= \mathrm{Relu}\big(H^{C}_{k,\mathrm{avg}} W_0 + b_0\big) W_1 \\
\alpha_{C,k} &= \mathrm{Sigmoid}(S_{\mathrm{avg}} + S_{\mathrm{max}})
\end{aligned}
\tag{2}
$$
Figure 2: Illustration of the multi-stage attention (MS) block and the computation of (a) channel attention, (b) frequency attention, and (c) time attention. The input feature map $H_k$ of the kth CONV-MS or RES-MS block is processed by channel attention, frequency attention and time attention in cascade.
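where $W_0 \in \mathbb{R}^{C_k \times 100}$, $b_0 \in \mathbb{R}^{1 \times 100}$ and $W_1 \in \mathbb{R}^{100 \times C_k}$ are the parameters of the kth channel attention block.

In the implementation of channel attention, max pooling and average pooling are first applied over both the time and frequency dimensions of $H_k$. Their outputs $H^{C}_{k,\mathrm{avg}} \in \mathbb{R}^{1 \times 1 \times C_k}$ and $H^{C}_{k,\mathrm{max}} \in \mathbb{R}^{1 \times 1 \times C_k}$ are then each passed through the same two fully connected layers (the parameters are shared between the two branches), with a Relu activation after the first layer. The channel attention map $\alpha_{C,k}$ is finally obtained after a Sigmoid activation is applied to the sum of $S_{\mathrm{avg}}$ and $S_{\mathrm{max}}$. After broadcasting $\alpha_{C,k}$ to the same size as $H_k$, the attention map is multiplied by the original feature map $H_k$ to generate the refined feature map $H'_k$.

2.3.2. Frequency and Time Attention

The frequency and time attention blocks have a similar working structure when processing their three-dimensional input. They differ only in the dimension to which the attention mechanism is applied: the frequency dimension or the time dimension.

$$
\begin{aligned}
H^{C'}_{k,\mathrm{max}} &= \operatorname{max}^{1 \times 1 \times C_k}(H'_k) \\
H^{C'}_{k,\mathrm{avg}} &= \operatorname{avg}^{1 \times 1 \times C_k}(H'_k) \\
H^{C'}_{k,\mathrm{pool}} &= \mathrm{Concat}\big[H^{C'}_{k,\mathrm{avg}};\, H^{C'}_{k,\mathrm{max}}\big] \\
H^{T'}_{k,\mathrm{max}} &= \operatorname{max}^{T_k \times 1 \times 1}\big(H^{C'}_{k,\mathrm{pool}}\big) \\
H^{T'}_{k,\mathrm{avg}} &= \operatorname{avg}^{T_k \times 1 \times 1}\big(H^{C'}_{k,\mathrm{pool}}\big) \\
H'_{k,\mathrm{pool}} &= \mathrm{Concat}\big[H^{T'}_{k,\mathrm{avg}};\, H^{T'}_{k,\mathrm{max}}\big] \\
\alpha_{F,k} &= \mathrm{Sigmoid}\big(f^{2 \times 7}(H'_{k,\mathrm{pool}})\big)
\end{aligned}
\tag{3}
$$

Figure 2 (b) shows the working flow of the frequency attention block and Equation 3 shows its implementation in mathematical form. In the kth frequency attention block, max pooling and average pooling are first applied over the channel dimension of the input data $H'_k$, and the corresponding outputs are $H^{C'}_{k,\mathrm{max}} \in \mathbb{R}^{T_k \times F_k \times 1}$ and $H^{C'}_{k,\mathrm{avg}} \in \mathbb{R}^{T_k \times F_k \times 1}$, respectively. $H^{C'}_{k,\mathrm{pool}} \in \mathbb{R}^{T_k \times F_k \times 2}$ is obtained by concatenating the two pooled outputs. On the time dimension, the same max pooling and average pooling steps are applied to $H^{C'}_{k,\mathrm{pool}} \in \mathbb{R}^{T_k \times F_k \times 2}$, and the corresponding outputs are $H^{T'}_{k,\mathrm{avg}} \in \mathbb{R}^{1 \times F_k \times 2}$ and $H^{T'}_{k,\mathrm{max}} \in \mathbb{R}^{1 \times F_k \times 2}$. Again, the output after concatenating them on the time dimension is $H'_{k,\mathrm{pool}} \in \mathbb{R}^{2 \times F_k \times 2}$. The frequency attention map $\alpha_{F,k}$ is computed using a convolution operation with a 2-by-7-by-2 kernel followed by a sigmoid activation. The stride value is 1 on the frequency dimension during convolution. The size of $\alpha_{F,k}$ is then expanded to that of $H'_k$ by broadcasting. The frequency refined feature map $H''_k$ is finally obtained as the product of $\alpha_{F,k}$ and $H'_k$.

The computation of time attention is similar to that of frequency attention; Equation 4 and Figure 2 (c) show the computation process. The final time refined feature map is obtained by multiplying the previous frequency refined feature map by the time attention weights $\alpha_{T,k}$.

$$
\begin{aligned}
H^{C''}_{k,\mathrm{max}} &= \operatorname{max}^{1 \times 1 \times C_k}(H''_k) \\
H^{C''}_{k,\mathrm{avg}} &= \operatorname{avg}^{1 \times 1 \times C_k}(H''_k) \\
H^{C''}_{k,\mathrm{pool}} &= \mathrm{Concat}\big[H^{C''}_{k,\mathrm{avg}};\, H^{C''}_{k,\mathrm{max}}\big] \\
H^{F''}_{k,\mathrm{max}} &= \operatorname{max}^{1 \times F_k \times 1}\big(H^{C''}_{k,\mathrm{pool}}\big) \\
H^{F''}_{k,\mathrm{avg}} &= \operatorname{avg}^{1 \times F_k \times 1}\big(H^{C''}_{k,\mathrm{pool}}\big) \\
H''_{k,\mathrm{pool}} &= \mathrm{Concat}\big[H^{F''}_{k,\mathrm{avg}};\, H^{F''}_{k,\mathrm{max}}\big] \\
\alpha_{T,k} &= \mathrm{Sigmoid}\big(f^{7 \times 2}(H''_{k,\mathrm{pool}})\big)
\end{aligned}
\tag{4}
$$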
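The three attention stages of Equations 1 to 4 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the zero "same" padding of the convolution, the random parameters and the toy tensor sizes are all hypothetical. Only the pooling order, the hidden size of 100, and the 2-by-7-by-2 kernel come from the text; the 7-by-2-by-2 time kernel is inferred by analogy from $f^{7 \times 2}$ in Equation 4.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(h, w0, b0, w1):
    """Equation 2: map h of shape (T, F, C) to a (1, 1, C) attention map."""
    h_max = h.max(axis=(0, 1))   # global max pooling over time and frequency
    h_avg = h.mean(axis=(0, 1))  # global average pooling over time and frequency

    def shared_mlp(v):
        # Two fully connected layers shared by both branches, ReLU in between.
        return np.maximum(v @ w0 + b0, 0.0) @ w1

    return sigmoid(shared_mlp(h_avg) + shared_mlp(h_max)).reshape(1, 1, -1)


def pool_pair(h, axis):
    """Concatenate average and max pooling of h along `axis`."""
    avg = h.mean(axis=axis, keepdims=True)
    mx = h.max(axis=axis, keepdims=True)
    return np.concatenate([avg, mx], axis=axis)


def slide_kernel(stats, kernel, axis):
    """Stride-1 cross-correlation along `axis`; the kernel spans the other axes.

    Zero 'same' padding is an assumption; the text does not specify padding.
    """
    stats = np.moveaxis(stats, axis, 0)
    kernel = np.moveaxis(kernel, axis, 0)
    width = kernel.shape[0]
    pad = width // 2
    padded = np.pad(stats, ((pad, pad), (0, 0), (0, 0)))
    return np.array([np.sum(padded[i:i + width] * kernel)
                     for i in range(stats.shape[0])])


def frequency_attention(h, kernel):
    """Equation 3: pool over channels, then time, then convolve along frequency."""
    c_pool = pool_pair(h, axis=2)        # (T, F, 2)
    t_pool = pool_pair(c_pool, axis=0)   # (2, F, 2)
    return sigmoid(slide_kernel(t_pool, kernel, axis=1)).reshape(1, -1, 1)


def time_attention(h, kernel):
    """Equation 4: pool over channels, then frequency, then convolve along time."""
    c_pool = pool_pair(h, axis=2)        # (T, F, 2)
    f_pool = pool_pair(c_pool, axis=1)   # (T, 2, 2)
    return sigmoid(slide_kernel(f_pool, kernel, axis=0)).reshape(-1, 1, 1)


def ms_attention(h, params):
    """Equation 1: channel, frequency and time attention in cascade."""
    h1 = channel_attention(h, *params["channel"]) * h        # broadcast over T, F
    h2 = frequency_attention(h1, params["freq_kernel"]) * h1  # broadcast over T, C
    h3 = time_attention(h2, params["time_kernel"]) * h2       # broadcast over F, C
    return h3


# Toy example with hypothetical sizes and random (untrained) parameters.
rng = np.random.default_rng(0)
T, F, C = 6, 9, 4
h = rng.standard_normal((T, F, C))
params = {
    "channel": (0.1 * rng.standard_normal((C, 100)),
                np.zeros((1, 100)),
                0.1 * rng.standard_normal((100, C))),
    "freq_kernel": 0.1 * rng.standard_normal((2, 7, 2)),
    "time_kernel": 0.1 * rng.standard_normal((7, 2, 2)),
}
refined = ms_attention(h, params)
print(refined.shape)
```

Because every attention map passes through a sigmoid, each stage can only scale entries of the feature map by a factor between 0 and 1, so the refined map keeps the input shape while attenuating less relevant channels, frequency bins and time frames.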
Layer Name         Structure      Dilation
CONV-MS Block1     7x1x48, MS     1x1
CONV-MS Block2     1x7x48, MS     1x1
CONV-MS Block3     5x5x48, MS     1x1
CONV-MS Block4     5x5x48, MS     1x2
CONV-MS Block5     5x5x48, MS     1x4
CONV-MS Block6     5x5x48, MS     1x8
CONV-MS Block7     5x5x48, MS     1x1
CONV-MS Block8     5x5x48, MS     2x2
CONV-MS Block9     5x5x48, MS     4x4
CONV-MS Block10    5x5x48, MS     8x8
CONV-MS Block11    1x1x1, MS      1x1

Table 1: The architecture of the SE-Net. SE-Net consists of 11 dilated convolution and multi-stage attention (CONV-MS) blocks. Within each block, a dilated convolution is applied first, followed by a multi-stage attention (MS) block.

Block Name       Structure                             Output
RES-MS Block1    3x3x64, 3x3x64, 3x3x64, MS-ATT        150x129
RES-MS Block2    3x3x128, 3x3x128, 3x3x128, MS-ATT     75x65
RES-MS Block3    3x3x128, 3x3x128, MS-ATT              75x65
RES-MS Block4    3x3x256, 3x3x256, 3x3x128, MS-ATT     38x33
RES-MS Block5    3x3x256, 3x3x256, MS-ATT              38x33
RES-MS Block6    3x3x256, 3x3x256, MS-ATT              38x33
RES-MS Block7    3x3x256, 3x3x256, MS-ATT              38x33
RES-MS Block8    3x3x512, 3x3x512, 3x3x128, MS-ATT     19x17
Pool             19x1                                  1x17x512
FC                                                     512

Table 3: The architecture of SID-Net. SID-Net consists of 8 residual convolution and multi-stage attention (RES-MS) blocks. Within each block, multiple convolutional layers are first applied to the input. Their output is then passed through a multi-stage attention (MS) block before the residual connection.

3. Experiments
Model              Description
Voice loss [10]    baseline done by cascading speech enhancement and speaker recognition modules
SID                speaker identification baseline using the speaker recognition module (SID-Net)
SE+SID             joint optimisation of the speech enhancement (SE-Net) and speaker recognition (SID-Net) modules
SE-MS+SID          proposed model using joint optimisation and multi-stage attention (MS) models in the speech enhancement module (SE-Net)
SE+SID-MS          proposed model using joint optimisation and multi-stage attention models in the speaker recognition module (SID-Net)

Table 2: Descriptions of five models: three baselines (Voice loss, SID, and SE+SID) and two proposed approaches (SE-MS+SID and SE+SID-MS).

3.1. Data

In this work, the VoxCeleb1 [25] dataset is used. VoxCeleb1 data are extracted from YouTube videos and contain 1251 speakers with more than 150 thousand utterances collected "in the wild". The average length of the audio in the dataset is 7.8 seconds.

Spectrograms are used as the input feature. Spectrograms are extracted with a sliding window of 25 ms length and 10 ms hop size. A 512-point FFT is taken, resulting in a 257-dimensional spectrogram (a DC component is concatenated). No normalization techniques are used on the input spectrogram. The time length of the spectrogram is fixed at 300 frames (3.015 seconds, that is, 299 hops of 10 ms plus one 25 ms window).

In order to test the robustness of the proposed model, additional noise from the MUSAN dataset is added. The MUSAN dataset contains three categories of noise: general noise, music and babble [26]. The general noise type contains 6 hours of audio, including DTMF tones, dialtones, fax machine noises, etc. The music type contains 42 hours of music recordings from different categories. The babble type contains 60 hours of speech, including read speech from the public domain, hearings, committees and debates, etc.

3.2. Speaker Identification

In the VoxCeleb1 dataset, both the training and test sets contain the same number of speakers (1251 speakers) [25]. The training set contains 145265 utterances and the test set contains 8251 utterances. In order to reduce possible bias, the MUSAN dataset is also split into two parts for training and test. This ensures that the noise signals used for training are not reused for test. Each training utterance is mixed with one type of noise at one of five SNR levels. For the test set, the same data configuration is used. To evaluate the recognition performance, Top-1 and Top-5 accuracy are employed [27].

3.3. Speaker Verification

There are 148642 utterances (1211 speakers) in the VoxCeleb1 development dataset, and 4874 utterances (40 speakers) in the test dataset [25]. For the speaker verification task, there are 37720 test pairs in total. The same configuration on the data
for the speaker recognition task is also used for speaker verification. To compare with the baseline introduced in [10], the same loss function and similarity measurement (cosine) are used. Equal Error Rate (EER) [28] and Detection Cost Function (DCF) [29] are used as evaluation metrics. DCF is computed as the average of two minimum DCF scores (DCF0.01 and DCF0.001) [29, 30].

3.4. Experiment Setup

To evaluate our proposed approaches, five models, including three baselines and two proposed approaches, are tested on the data described in Sections 3.2 and 3.3. As listed in Table 2, SID represents the baseline using the SID-Net, and Voice loss [10] represents the baseline done by cascading speech enhancement and speaker recognition. SE+SID represents the cascading structure with a joint optimisation of SE-Net and SID-Net. SE-MS+SID and SE+SID-MS are the two proposed approaches, which use multi-stage attention models in either the speech enhancement module (SE-Net) or the speaker recognition module (SID-Net) in addition to the joint optimisation used in SE+SID.

                        SID               SE+SID            SE-MS+SID         SE+SID-MS
Noise Type   SNR (dB)   Top1 (%) Top5 (%) Top1 (%) Top5 (%) Top1 (%) Top5 (%) Top1 (%) Top5 (%)
Noise        0          74.1     86.9     76.3     88.9     78.5     90.0     77.7     89.2
Noise        5          79.2     90.0     81.1     91.8     83.4     92.1     81.9     91.8
Noise        10         83.2     93.2     86.0     94.7     87.3     95.6     86.7     95.1
Noise        15         84.9     94.6     87.3     95.8     89.5     96.7     88.8     96.0
Noise        20         87.9     95.4     89.1     96.6     90.9     97.5     90.2     97.0
Music        0          65.8     82.0     67.7     83.7     70.3     84.1     69.5     83.5
Music        5          76.9     89.1     80.0     91.0     81.6     91.5     80.6     90.8
Music        10         83.8     93.5     85.2     94.7     86.3     95.3     85.8     94.7
Music        15         86.1     93.9     88.4     95.6     89.1     96.7     88.2     95.4
Music        20         87.4     94.7     89.1     96.0     90.2     97.1     89.5     96.6
Babble       0          62.4     80.2     65.7     81.5     67.5     83.0     66.6     81.9
Babble       5          76.2     87.3     78.6     88.9     80.6     89.9     79.3     89.6
Babble       10         81.4     92.2     84.6     93.6     86.6     94.5     85.3     83.2
Babble       15         84.0     92.6     86.8     93.9     88.3     94.7     87.6     94.0
Babble       20         85.8     92.9     87.1     94.6     89.0     95.5     88.8     95.2
Original                88.5     95.9     89.8     96.5     91.9     97.6     90.8     97.3

Table 4: Speaker identification results for different noise types (Noise, Music and Babble) at different SNRs (0-20 dB), as well as on the original VoxCeleb1 test set. Four scenarios are tested: SID-Net only (SID); the joint SE-Net and SID-Net system without multi-stage attention (SE+SID); the joint system with multi-stage attention only in SE-Net (SE-MS+SID); and the joint system with multi-stage attention only in SID-Net (SE+SID-MS).
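The experiments mix each utterance with noise at SNR levels between 0 and 20 dB. The text does not describe the mixing procedure, so the following is only a sketch of one common way to do it: scale the noise so that the ratio of speech power to noise power matches the target SNR, then add it to the speech. The function names, the 16 kHz sampling rate and the synthetic sine-wave "speech" are assumptions made for illustration.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Target noise power is p_speech / 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB of `noisy` relative to the clean `speech` it was built from."""
    p_speech = sum(s * s for s in speech)
    p_residual = sum((y - s) ** 2 for s, y in zip(speech, noisy))
    return 10 * math.log10(p_speech / p_residual)


# Synthetic stand-ins: one second of a 220 Hz tone and white Gaussian noise.
random.seed(0)
sample_rate = 16000  # assumed sampling rate, for illustration only
speech = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [random.gauss(0.0, 1.0) for _ in range(sample_rate)]

for snr in (0, 5, 10, 15, 20):
    noisy = mix_at_snr(speech, noise, snr)
    print(snr, round(measured_snr_db(speech, noisy), 2))
```

Measuring the SNR of the mixture against the clean signal, as the loop does, is a quick sanity check that the gain was computed correctly.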
Table 5: Speaker verification results for different noise types (Noise, Music and Babble) at different SNRs (0-20 dB), as well as on the original VoxCeleb1 test set. Four scenarios are tested: SID-Net only (SID); the joint SE-Net and SID-Net system without multi-stage attention (SE+SID); the joint system with multi-stage attention only in SE-Net (SE-MS+SID); and the joint system with multi-stage attention only in SID-Net (SE+SID-MS). The results of VoiceID Loss [10] are also listed.
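Section 3.3 scores verification trials with cosine similarity and reports the Equal Error Rate. As a rough illustration of how these two pieces fit together, the sketch below scores embedding pairs and finds the threshold at which the false acceptance and false rejection rates are closest. It is a simple threshold sweep, not the estimation method of [28]. The two-dimensional embeddings and trial labels are made up, and the DCF is not covered.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping the decision threshold over observed scores.

    labels: 1 for a same-speaker (target) trial, 0 for a different-speaker trial.
    A trial is accepted when its score is greater than or equal to the threshold.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        false_accepts = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold)
        false_rejects = sum(1 for s, l in zip(scores, labels) if l == 1 and s < threshold)
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Hypothetical trials: (embedding A, embedding B, same-speaker label).
trials = [
    ([1.0, 0.0], [0.9, 0.1], 1),
    ([1.0, 0.0], [0.6, 0.8], 1),
    ([0.0, 1.0], [0.1, 0.9], 1),
    ([1.0, 0.0], [0.0, 1.0], 0),
    ([1.0, 0.0], [0.8, 0.6], 0),
    ([0.0, 1.0], [1.0, 0.2], 0),
]
scores = [cosine_similarity(a, b) for a, b, _ in trials]
labels = [label for _, _, label in trials]
eer = equal_error_rate(scores, labels)
print(f"EER = {100 * eer:.2f}%")
```

With only a handful of trials the sweep is coarse; on a full trial list such as the 37720 pairs used here, the set of candidate thresholds is dense enough for the two error rates to meet closely.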
[...] and Top-5 accuracy in comparison with the baseline in all noise conditions. Even compared with SE+SID, there are still slight improvements. This is probably because the use of the attention mechanism can highlight speaker-related information and reduce the interference caused by irrelevant noise signals.

For the task of speaker verification, Table 5 shows similar results when implementing all five models on the test data. It is clear that SE+SID, which uses a joint optimisation, performs better than Voice loss, which uses only a pre-trained speaker identification model instead of a joint optimisation. In comparison with the speaker identification task, the verification improvements obtained using SE-MS+SID and SE+SID-MS are relatively smaller. This is probably because, for speaker verification, the similarity between speaker embeddings learned from the speaker model is computed using the cosine function instead of being computed directly by the trained speaker recognition model. In addition, for speaker verification, the use of the attention model in the speech enhancement module yields slightly better performance in almost all conditions, except when speech is corrupted by Babble noise at 0 dB and 5 dB SNR levels. For this case, a possible reason is that the "Babble" noise signals are relatively complicated due to their speaker/speech-like characteristics. The use of an attention model in the speaker recognition module (SID-Net) might be more suitable for extracting speaker-relevant information than an attention model in the speech enhancement module when the acoustic environment is poor.

Figure 3 shows the linear combination of speaker identification results (Top-1 accuracy) of SE-MS+SID and SE+SID-MS. In the combination results, α denotes the combination parameter of SE-MS+SID and 1 − α denotes that of SE+SID-MS. The two baselines, when α equals zero and one, are also shown. When α equals one, the linear combination is the same as using only SE-MS+SID; when α equals zero, it is the same as using only SE+SID-MS. It is clear from the figure that as α becomes larger, the final results become better. This phenomenon shows that the contribution of SE-MS+SID to the final accuracy is bigger than that of SE+SID-MS. The MS module added in the SE module obtains better recognition results.

Table 6 shows the speaker identification and verification results of using the MS block in both the SE module and the SID module (SE-MS+SID-MS). The performance of SE-MS+SID-MS for both speaker identification and speaker verification outperforms that of using MS in only one of the modules.

Noise Type   SNR (dB)   Top-1 (%)   Top-5 (%)   EER (%)   DCF
Noise        0          79.2        91.0        15.62     0.899
Noise        5          84.2        92.7        11.61     0.798
Noise        10         88.2        96.2        9.05      0.707
Noise        15         90.1        96.9        8.01      0.621
Noise        20         91.4        97.5        6.98      0.604
Music        0          71.7        86.0        15.35     0.881
Music        5          82.3        92.2        10.84     0.780
Music        10         87.2        96.1        8.77      0.709
Music        15         90.0        97.2        7.55      0.615
Music        20         90.8        97.6        7.09      0.601
Babble       0          68.6        85.1        37.44     0.998
Babble       5          81.8        90.4        26.28     0.996
Babble       10         87.4        95.2        16.25     0.899
Babble       15         89.1        95.5        10.74     0.784
Babble       20         89.7        96.1        8.23      0.662
Original                92.3        98.0        6.12      0.511

Table 6: Speaker identification and speaker verification results of SE-MS+SID-MS.
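The linear combination discussed for Figure 3 weights SE-MS+SID by α and SE+SID-MS by 1 − α. The text does not say which quantities are combined, so the sketch below assumes that the per-speaker posterior scores of the two systems are combined before taking the Top-k decision. The function names and all numbers are hypothetical toy values and do not reproduce any result from the paper.

```python
def fuse(posteriors_a, posteriors_b, alpha):
    """Per-speaker linear combination: alpha * A + (1 - alpha) * B."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(posteriors_a, posteriors_b)]


def top_k_accuracy(posteriors, labels, k):
    """Fraction of utterances whose true speaker is among the k highest scores."""
    hits = 0
    for scores, label in zip(posteriors, labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy posteriors for 3 utterances over 3 speakers (made-up values).
se_ms_sid = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.5, 0.3]]
se_sid_ms = [[0.3, 0.5, 0.2], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]]
labels = [0, 1, 2]

for alpha in (0.0, 0.5, 1.0):
    fused = [fuse(a, b, alpha) for a, b in zip(se_ms_sid, se_sid_ms)]
    print(alpha, round(top_k_accuracy(fused, labels, k=1), 3))
```

Setting `alpha` to 1.0 or 0.0 recovers the individual systems, which matches the two end points described in the text.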
5. Conclusion and Future Work

In this paper, a joint optimisation by cascading the speech enhancement network and speaker recognition network was implemented in order to improve speaker identification and verification performance in different poor acoustic environments. Furthermore, a multi-stage attention model was also used in either the speech enhancement or speaker recognition module to highlight speaker-relevant information. It is clear that the use of speech enhancement can yield better performance than using the speaker identification model alone. Moreover, joint optimisation and the use of an attention model can further increase the robustness of our system against the interference caused by different types of noise.

In future work, different orders of the channel, frequency and time attentions in the MS module will be investigated. More advanced speech enhancement technologies and training strategies such as adversarial training will be investigated. Post-processing techniques for speaker embeddings such as a PLDA back-end will also be taken into account.

6. References

[1] Arnab Poddar, Md Sahidullah, and Goutam Saha, “Speaker verification with short utterances: a review of challenges, trends and opportunities,” IET Biometrics, 2017.
[2] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in ICASSP. IEEE, 2014.
[3] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in ICASSP. IEEE, 2018.
[4] Simon Leglaive, Umut Şimşekli, Antoine Liutkus, Laurent Girin, and Radu Horaud, “Speech enhancement with variational autoencoders and alpha-stable distributions,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 541–545.
[5] Mostafa Sadeghi, Simon Leglaive, Xavier Alameda-Pineda, Laurent Girin, and Radu Horaud, “Audio-visual speech enhancement using conditional variational auto-encoder,” arXiv preprint arXiv:1908.02590, 2019.
[6] Inseon Jang, ChungHyun Ahn, Jeongil Seo, and Younseon Jang, “Enhanced feature extraction for speech detection in media audio,” in INTERSPEECH, 2017, pp. 479–483.
[7] Gholamreza Farahani, Seyed Mohammad Ahadi, and Mohammad Mehdi Homayounpour, “Robust feature extraction of speech via noise reduction in autocorrelation domain,” in International Workshop on Multimedia Content Representation, Classification and Security. Springer, 2006, pp. 466–473.
[8] Lara Nahma, Pei Chee Yong, Hai Huyen Dam, and Sven Nordholm, “An adaptive a priori snr estimator for perceptual speech enhancement,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2019, no. 1, pp. 7, 2019.
[9] Rui Yao, ZeQing Zeng, and Ping Zhu, “A priori snr estimation and noise estimation for speech enhancement,” EURASIP journal on advances in signal processing, vol. 2016, no. 1, pp. 101, 2016.
[10] Suwon Shon, Hao Tang, and James Glass, “Voiceid loss: Speech enhancement for speaker verification,” arXiv preprint arXiv:1904.03601, 2019.
[11] FA Rezaur rahman Chowdhury, Quan Wang, Ignacio Lopez Moreno, and Li Wan, “Attention-based models for text-dependent speaker verification,” in ICASSP. IEEE, 2018.
[12] Yingke Zhu, Tom Ko, David Snyder, Brian Mak, and Daniel Povey, “Self-attentive speaker embeddings for text-independent speaker verification,” in Interspeech, 2018.
[13] Miquel India, Pooyan Safari, and Javier Hernando, “Self multi-head attention for speaker recognition,” arXiv preprint arXiv:1906.09890, 2019.
[14] Nguyen Nang An, Nguyen Quang Thanh, and Yanbing Liu, “Deep cnns with self-attention for speaker identification,” IEEE Access, 2019.
[15] Qiongqiong Wang, Koji Okabe, Kong Aik Lee, Hitoshi Yamamoto, and Takafumi Koshinaka, “Attention mechanism in speaker recognition: What does it learn in deep speaker embedding?,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018.
[16] Niko Moritz, Takaaki Hori, and Jonathan Le Roux, “Triggered attention for end-to-end speech recognition,” in ICASSP. IEEE, 2019.
[17] Seyedmahdad Mirsamadi, Emad Barsoum, and Cha Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in ICASSP. IEEE, 2017.
[18] Yuanyuan Zhang, Jun Du, Zirui Wang, Jianshu Zhang, and Yanhui Tu, “Attention based fully convolutional network for speech emotion recognition,” in APSIPA ASC. IEEE, 2018.
[19] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015.
[20] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv:1409.0473, 2014.
[21] Minh-Thang Luong, Hieu Pham, and Christopher D Manning, “Effective approaches to attention-based neural machine translation,” arXiv:1508.04025, 2015.
[22] Jianpeng Cheng, Li Dong, and Mirella Lapata, “Long short-term memory-networks for machine reading,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 551–561.
[23] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, “Cbam: Convolutional block attention module,” in ECCV, 2018.
[24] Jie Hu, Li Shen, and Gang Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[25] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
[26] David Snyder, Guoguo Chen, and Daniel Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[27] Mahdi Hajibabaei and Dengxin Dai, “Unified hypersphere embedding for speaker recognition,” arXiv preprint arXiv:1807.08312, 2018.
[28] Jyh-Min Cheng and Hsiao-Chuan Wang, “A method of estimating the equal error rate for automatic speaker verification,” in 2004 International Symposium on Chinese Spoken Language Processing. IEEE, 2004, pp. 285–288.
[29] David A Van Leeuwen and Niko Brümmer, “An introduction to application-independent evaluation of speaker recognition systems,” in Speaker classification I, pp. 330–353. Springer, 2007.
[30] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5791–5795.
[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[32] Diederik P Kingma and Jimmy Lei Ba, “Adam: A method for stochastic optimization,” .