arXiv:2107.09392v1 [eess.AS] 20 Jul 2021

Abstract—Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention. In this paper, we propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between

are widely used in many speech processing tasks, such as speaker verification (SV) [4], [5], VC [6], [7], speech synthesis [8], [9], and speech enhancement [10], we believe that the raw waveform contains the most complete information for
To address the above issues, several neural evaluation metrics have been proposed. For speech enhancement tasks, Quality-Net [26], DNSMOS [27], and STOI-Net [28] were proposed to evaluate speech quality and intelligibility. For VC, MOSNet [3] and MBNet [29] were proposed to measure the naturalness of converted speech. Mittag and Möller [30] proposed a model for the text-to-speech synthesis task on the TU Berlin / Kiel University database. To the best of our knowledge, no one has previously established neural evaluation metrics for the similarity assessment of VC tasks, which is the main focus of this study.

[Figure: (a) the SVSNet architecture, in which encoders E map the input waveforms XT and XR to frame-wise representations RT, R̂T, RR, and R̂R, and distance modules Dis compute the between-utterance distances DT,R and DR,T; (b) the encoder, composed of SincNet, stacked convolutional blocks (Conv, DConv, GTU, Conv, MaxPooling, RConv; ×4/×2), and a BLSTM.]

B. Similarity Prediction

The similarity prediction task resembles an SV task, which
C. Distance Calculation and Prediction Modules

We extend the attentive pooling used in SV [47] to our work. We average the representations of an utterance over time to obtain the utterance embedding and compute the 1-norm distance of each dimension of the two means:

    DT,R = ||Mean(RT) − Mean(R̂R)||1,
    DR,T = ||Mean(RR) − Mean(R̂T)||1.          (2)

Then, the two distances are fed to the prediction module to obtain the similarity score:

    ŜT = σ(flin2(ρReLU(flin1(DT,R)))),
    ŜR = σ(flin2(ρReLU(flin1(DR,T)))),          (3)

where σ(·) denotes an activation function, flin1(·) and flin2(·) denote two linear layers, and ρReLU(·) denotes a rectified linear unit (ReLU) activation function. The number of nodes of the final linear output layer is 1 for the regression model and 2 or 4 for the classification model, i.e., ŜT, ŜR ∈ R^1 for regression, and ∈ R^2 or R^4 for classification. The activation function is an identity function for the regression model and a softmax function for the classification model. Finally, the prediction module obtains the final score by Ŝ = (ŜT + ŜR)/2.

TABLE I: Results of 2-class prediction via regression.

                 utterance-level              system-level
  Method         ACC    LCC    SRCC   MSE     LCC    SRCC   MSE
  SVSNet(R)      0.771  0.513  0.488  0.162   0.943  0.926  0.007
  MOSNet(R)      0.696  0.453  0.455  0.197   0.913  0.871  0.032

TABLE II: Results of 2-class prediction via classification.

                 utterance-level              system-level
  Method         ACC    LCC    SRCC   MSE     LCC    SRCC   MSE
  SVSNet(C)      0.767  0.470  0.470  0.233   0.942  0.945  0.011
  MOSNet(C)      0.670  0.329  0.329  0.336   0.918  0.879  0.012

TABLE III: Results of 4-class prediction via regression.

                 utterance-level              system-level
  Method         ACC    LCC    SRCC   MSE     LCC    SRCC   MSE
  SVSNet(R)      0.438  0.575  0.572  0.860   0.966  0.910  0.008
  SVSNet(R)conv  0.403  0.567  0.564  0.862   0.938  0.895  0.012

TABLE IV: Results of 4-class prediction via classification.

                 utterance-level              system-level
  Method         ACC    LCC    SRCC   MSE     LCC    SRCC   MSE
  SVSNet(C)      0.482  0.558  0.559  1.153   0.941  0.871  0.022
  SVSNet(C)conv  0.467  0.548  0.550  1.191   0.910  0.845  0.035
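The distance calculation of Eq. (2), the prediction module of Eq. (3), and the final score averaging can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the embedding dimension, the hidden size (128, as in the experimental setup), and the randomly initialized weights W1, b1, W2, b2 are stand-ins for the trained linear layers flin1 and flin2, and the frame-wise representations are random placeholders for the encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(frames):
    """Average frame-wise representations over time -> utterance embedding."""
    return frames.mean(axis=0)

def distance(r_a, r_b_hat):
    """Eq. (2): per-dimension 1-norm distance between the two mean embeddings."""
    return np.abs(mean_pool(r_a) - mean_pool(r_b_hat))

# Stand-in parameters for the two trained linear layers in Eq. (3).
DIM, HIDDEN = 256, 128                       # hidden size 128 as in the paper
W1 = rng.standard_normal((HIDDEN, DIM)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((1, HIDDEN)) * 0.01  # 1 output node for SVSNet(R)
b2 = np.zeros(1)

def predict(d):
    """Eq. (3) for the regression model: sigma is the identity function."""
    h = np.maximum(W1 @ d + b1, 0.0)          # rho_ReLU
    return (W2 @ h + b2)[0]                   # identity activation

# Placeholder frame-wise representations for test (T) and reference (R) utterances.
R_T, R_Rhat = rng.standard_normal((120, DIM)), rng.standard_normal((95, DIM))
R_R, R_That = rng.standard_normal((95, DIM)), rng.standard_normal((120, DIM))

S_T = predict(distance(R_T, R_Rhat))          # from D_{T,R}
S_R = predict(distance(R_R, R_That))          # from D_{R,T}
S = (S_T + S_R) / 2                           # final similarity score
print(S)
```

For the classification variant, W2 would map to 2 or 4 output nodes and the identity activation would be replaced by a softmax.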
D. Model Training

SVSNet is trained on a set of reference-test utterance pairs with corresponding human-labeled similarity scores. We implemented two versions of SVSNet by using two types of prediction modules: regression and classification. The corresponding SVSNet models are termed SVSNet(R) and SVSNet(C), respectively. Given the ground-truth similarity score S and the predicted similarity score Ŝ, the mean squared error (MSE) loss is used to train SVSNet(R), and the cross-entropy (CE) loss is used to train SVSNet(C).

IV. EXPERIMENTS

A. Experimental Setup

Since 2016, the VC challenge (VCC) has been held three times. The task is to modify an audio waveform so that it sounds as if it were from a specific target speaker other than the source speaker. In each challenge, a large-scale crowdsourced human perception evaluation was conducted to test the quality and similarity of the converted utterances. In VCC2018, there were 23 VC systems. A similarity evaluation was conducted on 20,996 converted-natural utterance pairs, and 30,864 speaker similarity assessments were obtained. Each pair was evaluated by 1 to 7 subjects, with a score ranging from 1 (same speaker) to 4 (different speakers). A detailed description of the corpus, listeners, and evaluation methods can be found in [14]. In this study, the dataset was divided into 24,864 pairs for training and 6,000 pairs for testing.

We used MOSNet [3] as the baseline. MOSNet was originally proposed for quality assessment, but a modified version was used here for similarity assessment. As with SVSNet, the MOSNet models trained via regression and classification are termed MOSNet(R) and MOSNet(C), respectively. Performance was evaluated in terms of accuracy (ACC), linear correlation coefficient (LCC) [48], Spearman's rank correlation coefficient (SRCC) [49], and MSE at both the utterance and system levels. The utterance-level evaluation was calculated from the predicted score and the corresponding ground-truth score for each pair of utterances. The system-level evaluation was calculated on the basis of the average predicted score and the corresponding average ground-truth score for each system. When treating similarity prediction as a classification problem, we considered two designs: 2-class classification and 4-class classification. For 4-class classification, the original labels were used as the ground truth. For 2-class classification, the ground-truth scores 1 and 2 were merged into label 1 (same speaker), and the scores 3 and 4 were merged into label 2 (different speakers). When treating similarity prediction as a regression task and evaluating performance on the basis of ACC, the outputs of SVSNet(R) and MOSNet(R) were rounded and clipped to the nearest integer (i.e., 1 or 2).

Since two different sampling rates (22,050 and 16,000 Hz) were used in the VCC2018 dataset, we downsampled all utterances to 16,000 Hz. For the encoder, the number of output channels of SincNet, the output size of the WaveNet convolutional layers, and the hidden size of the BLSTM were 64, 64, and 256, respectively. The hidden size of the linear layers in the distance module was 128, and the output size was 1, 2, or 4 for the scalar output, 2-class output, and 4-class output, respectively. We used the Adam optimizer to train the model. The learning rate, β1, and β2 were 1e-4, 0.5, and 0.999, respectively. The batch size was set to 5. The model parameters were initialized with Xavier uniform initialization.

B. Experimental Results

First, we compare SVSNet with the baseline MOSNet. Tables I and II report the results of 2-class similarity prediction via regression and classification, respectively. From the tables, we can see that SVSNet consistently outperforms MOSNet in