arXiv:2107.09392v1 [eess.AS] 20 Jul 2021

Abstract—Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention. In this paper, we propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between

are widely used in many speech processing tasks, such as speaker verification (SV) [4], [5], VC [6], [7], speech synthesis [8], [9], and speech enhancement [10], we believe that the raw waveform contains the most complete information for
To address the above issues, several neural evaluation metrics have been proposed. For speech enhancement tasks, Quality-Net [26], DNSMOS [27], and STOI-Net [28] were proposed to evaluate speech quality and intelligibility. For VC, MOSNet [3] and MBNet [29] were proposed to measure the naturalness of converted speech. Mittag and Möller [30] proposed a model for the text-to-speech synthesis task on the TU Berlin / Kiel University database. To the best of our knowledge, no one has previously established neural evaluation metrics for the similarity assessment of VC tasks, which is the main focus of this study.

[Figure: (a) the SVSNet architecture, in which encoders E map the input waveforms XT and XR to frame-wise representations RT, R̂T, RR, and R̂R, and distance modules Dis compute the between-utterance distances DT,R and DR,T; (b) the encoder, composed of SincNet, stacked convolutional blocks (Conv, DConv, GTU, Conv, MaxPooling, RConv; ×4/×2), and a BLSTM.]

B. Similarity Prediction

The similarity prediction task resembles an SV task, which
C. Distance Calculation and Prediction Modules

We extend the attentive pooling used in SV [47] to our work. We average the representations of an utterance over time to obtain the utterance embedding and compute the 1-norm distance of each dimension of the two means:

    DT,R = ||Mean(RT) − Mean(R̂R)||1,
    DR,T = ||Mean(RR) − Mean(R̂T)||1.          (2)

Then, the two distances are fed to the prediction module to obtain the similarity score:

    ŜT = σ(flin2(ρReLU(flin1(DT,R)))),
    ŜR = σ(flin2(ρReLU(flin1(DR,T)))),          (3)

where σ(·) denotes an activation function, flin1(·) and flin2(·) denote two linear layers, and ρReLU(·) denotes a rectified linear unit (ReLU) activation function. The number of nodes of the final linear output layer is 1 for the regression model and 2 or 4 for the classification model, i.e., ŜT, ŜR ∈ R^1 for regression, and ∈ R^2 or R^4 for classification. The activation function is an identity function for the regression model and a softmax function for the classification model. Finally, the prediction module obtains the final score by Ŝ = (ŜT + ŜR)/2.

TABLE I: Results of 2-class prediction via regression.

                 utterance-level              system-level
  Method         ACC    LCC    SRCC   MSE     LCC    SRCC   MSE
  SVSNet(R)      0.771  0.513  0.488  0.162   0.943  0.926  0.007
  MOSNet(R)      0.696  0.453  0.455  0.197   0.913  0.871  0.032

TABLE II: Results of 2-class prediction via classification.

                 utterance-level              system-level
  Method         ACC    LCC    SRCC   MSE     LCC    SRCC   MSE
  SVSNet(C)      0.767  0.470  0.470  0.233   0.942  0.945  0.011
  MOSNet(C)      0.670  0.329  0.329  0.336   0.918  0.879  0.012

TABLE III: Results of 4-class prediction via regression.

                 utterance-level              system-level
  Method         ACC    LCC    SRCC   MSE     LCC    SRCC   MSE
  SVSNet(R)      0.438  0.575  0.572  0.860   0.966  0.910  0.008
  SVSNet(R)conv  0.403  0.567  0.564  0.862   0.938  0.895  0.012

TABLE IV: Results of 4-class prediction via classification.

                 utterance-level              system-level
  Method         ACC    LCC    SRCC   MSE     LCC    SRCC   MSE
  SVSNet(C)      0.482  0.558  0.559  1.153   0.941  0.871  0.022
  SVSNet(C)conv  0.467  0.548  0.550  1.191   0.910  0.845  0.035
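The distance calculation of Eq. (2), the prediction module of Eq. (3), and the final score averaging can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the embedding dimension, the hidden size (128, as in the experimental setup), and the randomly initialized weights W1, b1, W2, b2 are stand-ins for the trained linear layers flin1 and flin2, and the frame-wise representations are random placeholders for the encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(frames):
    """Average frame-wise representations over time -> utterance embedding."""
    return frames.mean(axis=0)

def distance(r_a, r_b_hat):
    """Eq. (2): per-dimension 1-norm distance between the two mean embeddings."""
    return np.abs(mean_pool(r_a) - mean_pool(r_b_hat))

# Stand-in parameters for the two trained linear layers in Eq. (3).
DIM, HIDDEN = 256, 128                       # hidden size 128 as in the paper
W1 = rng.standard_normal((HIDDEN, DIM)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((1, HIDDEN)) * 0.01  # 1 output node for SVSNet(R)
b2 = np.zeros(1)

def predict(d):
    """Eq. (3) for the regression model: sigma is the identity function."""
    h = np.maximum(W1 @ d + b1, 0.0)          # rho_ReLU
    return (W2 @ h + b2)[0]                   # identity activation

# Placeholder frame-wise representations for test (T) and reference (R) utterances.
R_T, R_Rhat = rng.standard_normal((120, DIM)), rng.standard_normal((95, DIM))
R_R, R_That = rng.standard_normal((95, DIM)), rng.standard_normal((120, DIM))

S_T = predict(distance(R_T, R_Rhat))          # from D_{T,R}
S_R = predict(distance(R_R, R_That))          # from D_{R,T}
S = (S_T + S_R) / 2                           # final similarity score
print(S)
```

For the classification variant, W2 would map to 2 or 4 output nodes and the identity activation would be replaced by a softmax.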
D. Model Training

SVSNet is trained on a set of reference-test utterance pairs with corresponding human-labeled similarity scores. We implemented two versions of SVSNet by using two types of prediction modules: regression and classification. The corresponding SVSNet models are termed SVSNet(R) and SVSNet(C), respectively. Given the ground-truth similarity score S and the predicted similarity score Ŝ, the mean squared error (MSE) loss is used to train SVSNet(R), and the cross-entropy (CE) loss is used to train SVSNet(C).

IV. EXPERIMENTS

A. Experimental Setup

Since 2016, the VC challenge (VCC) has been held three times. The task is to modify an audio waveform so that it sounds as if it were from a specific target speaker other than the source speaker. In each challenge, a large-scale crowdsourced human perception evaluation was conducted to test the quality and similarity of the converted utterances. In VCC2018, there were 23 VC systems. A similarity evaluation was conducted on 20,996 converted-natural utterance pairs, and 30,864 speaker similarity assessments were obtained. Each pair was evaluated by 1 to 7 subjects, with a score ranging from 1 (same speaker) to 4 (different speakers). A detailed description of the corpus, listeners, and evaluation methods can be found in [14]. In this study, the dataset was divided into 24,864 pairs for training and 6,000 pairs for testing.

We used MOSNet [3] as the baseline. MOSNet was originally proposed for quality assessment, but a modified version was used here for similarity assessment. As with SVSNet, the MOSNet models trained via regression and classification are termed MOSNet(R) and MOSNet(C), respectively. Performance was evaluated in terms of accuracy (ACC), linear correlation coefficient (LCC) [48], Spearman's rank correlation coefficient (SRCC) [49], and MSE at both the utterance and system levels. The utterance-level evaluation was calculated from the predicted score and the corresponding ground-truth score for each pair of utterances. The system-level evaluation was calculated on the basis of the average predicted score and the corresponding average ground-truth score for each system. When treating similarity prediction as a classification problem, we considered two designs: 2-class classification and 4-class classification. For 4-class classification, the original labels were used as the ground truth. For 2-class classification, the ground-truth scores 1 and 2 were merged into label 1 (same speaker), and the scores 3 and 4 were merged into label 2 (different speakers). When treating similarity prediction as a regression task and evaluating performance on the basis of ACC, the outputs of SVSNet(R) and MOSNet(R) were rounded and clipped to the nearest integer (i.e., 1 or 2).

Since two different sampling rates (22,050 and 16,000 Hz) were used in the VCC2018 dataset, we downsampled all utterances to 16,000 Hz. For the encoder, the number of output channels of SincNet, the output size of the WaveNet convolutional layers, and the hidden size of the BLSTM were 64, 64, and 256, respectively. The hidden size of the linear layers in the distance module was 128, and the output size was 1, 2, or 4 for the scalar output, 2-class output, and 4-class output, respectively. We used the Adam optimizer to train the model. The learning rate, β1, and β2 were 1e-4, 0.5, and 0.999, respectively. The batch size was set to 5. The model parameters were initialized with Xavier uniform initialization.

B. Experimental Results

First, we compare SVSNet with the baseline MOSNet. Tables I and II report the results of 2-class similarity prediction via regression and classification, respectively. From the tables, we can see that SVSNet consistently outperforms MOSNet in