
MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

Pengcheng Li1,2†, Jianzong Wang1†, Xulong Zhang1, Yong Zhang1, Jing Xiao1, Ning Cheng1
1 Ping An Technology (Shenzhen) Co., Ltd.
2 University of Science and Technology of China
pechola.lee@outlook.com, jzwang@188.com, zhangxulong@ieee.org,
yzhang3272-c@my.cityu.edu.hk, {xiaojing661, chengning211}@pingan.com.cn
† Equal contribution. ✉ Corresponding author.

arXiv:2405.00930v1 [cs.SD] 2 May 2024

Abstract—One-shot voice conversion aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing methods face difficulties in satisfactory speech representation disentanglement and suffer from sizable networks, as some of them leverage numerous complex modules for disentanglement. In this paper, we propose a model named MAIN-VC to disentangle effectively via a concise neural network. The proposed model utilizes Siamese encoders to learn clean representations, further enhanced by the designed mutual information estimator. The Siamese structure and the newly designed convolution module contribute to the lightweight of our model while ensuring performance in diverse voice conversion tasks. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.

Index Terms—voice conversion, speech representation learning

I. INTRODUCTION

Voice conversion (VC) is the style transfer task in the field of speech. It modifies the source speech's style to the target speaker's speaking style. Voice conversion can be applied in smart devices, the entertainment industry, and even the privacy and security area [1], [2].

Early voice conversion relied heavily on parallel training data [3], while parallel data collection and frame-level alignment procedures consume much time. Researchers have therefore turned their attention to non-parallel voice conversion, enabling the possibility of many-to-many and any-to-any conversion tasks. Inspired by image style transfer in computer vision, a series of methods based on generative adversarial networks (GAN) have been applied to voice conversion [4]–[7]. These GAN-based VC methods treat each speaker as a different domain and consider voice conversion as migration across speaker domains. But GAN training is unstable and poses difficulties in convergence.

Recently, disentanglement-based VC [8]–[10] has been an area of active research for its flexibility in any-to-any voice conversion. The disentanglement-based VC method is built on the assumption that speech contains both speaker-dependent information (e.g. timbre and speaking style) and speaker-independent information (e.g. content). Generally, speaker-dependent information is time-invariant, remaining constant within one utterance, whereas speaker-independent information is time-variant. This allows neural networks to separately learn representations of content and speaker information from the speech. Disentanglement-based VC models typically adopt an autoencoder structure, integrating approaches like bottleneck features [11], instance normalization [12], vector quantization [13], etc. into encoders to extract the desired speech representations. Training usually involves reconstruction of the Mel-spectrogram: the same Mel-spectrogram segments are fed into the decoupling encoders to obtain the various speech representations, then the decoder integrates the extracted representations and reconstructs the Mel-spectrogram. After training is completed, the content information from the source speech and the speaker information from the target speech are decoupled respectively and then fused to synthesize the converted voice, achieving the transfer of speaker style.

However, the disentanglement of speech representation is complicated [14]: how to effectively eliminate unnecessary information while preventing damage to the wanted information that needs to be disentangled, and how to extract clean representations without including other information, are challenging aspects to master. Most previous voice conversion models suffer from unsatisfactory disentangling performance. Besides the dependence of some models on a finely tuned bottleneck [11], [15], the reasons for deficient disentanglement include that different encoders disentangle independently during the representation learning process, lacking communication with the others, which results in redundant information in the disentangled representations. Moreover, most of these VC models are sizable, as some of them improve disentangling capability by stacking various network modules, making it difficult to deploy the models on low-computing mobile devices.

In this paper, we propose a model based on speech representation disentanglement, named Mutual information enhanced Adaptive Instance Normalization VC (MAIN-VC). To enhance the disentangling ability of the encoders, we integrate the Siamese structure as well as data augmentation into the speaker representation learning module.
An easy-to-train constrained mutual information estimator is introduced to build communication between the learning processes of content and speaker information, which reduces the overlap between unrelated representations. With the assistance of these modules, the disentangling ability of the encoders no longer relies on complicated networks, which makes model lightweighting and deployment on low-computing devices possible. The contributions of this paper can be summarized as follows:
• We introduce the speaker information learning module, which reduces the interference of time-varying information to extract clean speaker embeddings.
• We propose the constrained mutual information estimator restricted by upper and lower bounds and leverage it to enhance disentangling capability, contributing to improving the quality of one-shot VC.
• We design the proposed model to be lightweight while ensuring the quality of voice conversion. A convolution block called APC, with a low parameter count and large receptive field, is also introduced to support the lightweight design.

II. RELATED WORKS

A. Speech Representation Disentanglement

Speech representation disentanglement (SRD) decomposes speech components into diverse representations. It has been used in speaker recognition [16], speaker diarization [17], speech emotion recognition [18], speech enhancement [19], automatic speech synthesis [20], etc.

SRD also conveniently provides an opportunity for voice conversion, altering the speaker identity information in speech to achieve conversion. Voice conversion methods based on SRD have gained widespread attention in recent years. AUTOVC [11] leverages an autoencoder with a carefully tuned bottleneck to remove speaker information, and a pre-trained speaker encoder is utilized to provide speaker identity information. SPEECHSPLIT [15] and SPEECHSPLIT 2.0 [21] design multiple encoders with signal processing to decouple finer-grained speech components such as rhythm, content and pitch. Another series of works [12], [22] removes time-invariant speaker information through instance normalization to get the content representation, then the extracted speaker representation is fused into the content representation in an adaptive manner. Vector quantization (VQ) is also leveraged to complete speech representation disentanglement. In vector quantization-based methods [23]–[26], the latent representation is quantized into discrete codes, which are highly related to phonemes, and the differences between the discrete codes and the vectors before quantization are considered as the speaker information.

The SRD-based VC method shows its flexibility in a one-shot voice conversion scenario but places high demands on the model's disentangling ability.

B. Neural Mutual Information Estimation

Mutual information (MI) measures the degree of correlation or inclusion between two variables. The calculation of MI generally requires the joint distribution and the marginal distributions of the variables. However, high-dimensional random variables from neural networks often have high complexity and complicated distributions: only samples can be obtained while the distributions are difficult to derive, so the mutual information between them is difficult to obtain accurately. Therefore, neural MI estimation has been proposed to address this issue.

MINE [27] treats MI as the Kullback-Leibler divergence (KL divergence) between the joint and marginal distributions, and sets a function to fit this KL divergence; the fitting is completed via a trainable neural network. InfoNCE [28] maximizes the MI between a learned representation and the data and estimates the lower bound of MI based on noise contrastive estimation [29]. Representative works on mutual information upper bound estimation include L1Out [30] and CLUB [31]. The former leverages Monte Carlo approximation to approximate the marginal distribution and then derives the upper bound of MI. CLUB [31] fits the conditional probability between two variables: it designs positive sample pairs and negative sample pairs to construct a contrastive log-ratio between two conditional probabilities. When the approximate conditional probability is close enough to the truth (i.e. the KL divergence is small enough), the fitted mutual information gradually approaches the true MI from above, thus obtaining the upper bound of MI.

In practical applications, the estimated lower bound of the MI is maximized when it is necessary to strengthen the correlation between two types of variables, while the estimated upper bound of the MI is minimized when reducing the correlation between variables.

III. METHODOLOGY

A. Overall Architecture of MAIN-VC

The architecture of MAIN-VC is depicted in Fig. 1. MAIN-VC adopts an instance normalization (IN)-based method for speech style transfer. A speaker information learning module consisting of Siamese encoders with a data augmentation module is built to obtain a clean speaker representation, while the constrained mutual information estimator between the speaker representation and the content representation is leveraged to enhance the disentanglement. The pre-trained vocoder only appears during the inference stage. The constrained mutual information estimator and the Siamese encoder both take part in the training stage to assist in the training of the content encoder and the speaker information learning module. They are not involved in the conversion process during the inference stage after the training is completed.
Fig. 1. Architecture of MAIN-VC and the training objectives. (Recoverable figure labels: the target mel (n_mel, Ts) passes through the time shuffle unit TS and the speaker encoder of SILM to produce the speaker emb. (Ls,); the source mel (n_mel, Tc) passes through the content encoder to produce the content emb. (Lc,); the decoder fuses them via AdaIN and the vocoder generates speech; the CMI estimator and the training losses link the two embeddings.)
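To make the data flow of Fig. 1 concrete, the following is a minimal PyTorch sketch of the conversion pipeline. The toy module classes, channel sizes and layer choices are illustrative assumptions and not the authors' implementation (see the official repository for that).

```python
# Minimal sketch of the MAIN-VC conversion pipeline described in Sec. III-A.
# All modules below are illustrative stand-ins, not the authors' networks.
import torch
import torch.nn as nn

class ToyContentEncoder(nn.Module):
    """Conv stack followed by instance norm without affine params (IN_woAT)."""
    def __init__(self, n_mel=80, ch=64):
        super().__init__()
        self.conv = nn.Conv1d(n_mel, ch, kernel_size=5, padding=2)
        self.inorm = nn.InstanceNorm1d(ch, affine=False)   # strips channel-wise mean/std
    def forward(self, mel):                                 # mel: (B, n_mel, T)
        return self.inorm(torch.relu(self.conv(mel)))       # content rep: (B, ch, T)

class ToySpeakerEncoder(nn.Module):
    """Produces the global (alpha, beta) tuple used by AdaIN in the decoder."""
    def __init__(self, n_mel=80, ch=64):
        super().__init__()
        self.conv = nn.Conv1d(n_mel, ch, kernel_size=5, padding=2)
        self.to_alpha = nn.Linear(ch, ch)
        self.to_beta = nn.Linear(ch, ch)
    def forward(self, mel):                                 # mel: (B, n_mel, T)
        h = torch.relu(self.conv(mel)).mean(dim=2)           # time-invariant summary: (B, ch)
        return self.to_alpha(h), self.to_beta(h)

class ToyDecoder(nn.Module):
    """Fuses content and speaker representations via AdaIN, then predicts mel."""
    def __init__(self, n_mel=80, ch=64):
        super().__init__()
        self.out = nn.Conv1d(ch, n_mel, kernel_size=5, padding=2)
    def forward(self, content, alpha, beta):
        mu = content.mean(dim=2, keepdim=True)
        sigma = content.std(dim=2, keepdim=True) + 1e-5
        stylized = beta.unsqueeze(2) * (content - mu) / sigma + alpha.unsqueeze(2)  # AdaIN
        return self.out(stylized)                            # converted mel: (B, n_mel, T)

enc_c, enc_s, dec = ToyContentEncoder(), ToySpeakerEncoder(), ToyDecoder()
mel_src, mel_tgt = torch.randn(1, 80, 128), torch.randn(1, 80, 96)
z_c = enc_c(mel_src)                                         # content from the source utterance
alpha, beta = enc_s(mel_tgt)                                 # speaker style from the target utterance
mel_converted = dec(z_c, alpha, beta)                        # fed to a pretrained vocoder in practice
print(mel_converted.shape)                                   # torch.Size([1, 80, 128])
```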

B. IN-based Speech Style Transfer

When given the frame-level linguistic feature Z, the content encoder and the speaker representation learning module extract the content representation zC and the speaker representation zS respectively. To mitigate the impact of time-invariant information like timbre on the extraction of the content representation, the content encoder removes speaker information from the mid feature map M (processed from the linguistic feature Z) through instance normalization without affine transformation (denoted as INwoAT) as Eq. 1 shows:

\mu(M_{[c]}) = \frac{1}{W}\sum_{w=1}^{W} M_{[c]}^{[w]}, \quad
\sigma(M_{[c]}) = \sqrt{\frac{1}{W}\sum_{w=1}^{W}\bigl(M_{[c]}^{[w]} - \mu(M_{[c]})\bigr)^{2}}, \quad
\mathrm{IN_{woAT}}(M_{[c]})^{[w]} = \frac{M_{[c]}^{[w]} - \mu(M_{[c]})}{\sigma(M_{[c]})}    (1)

where M_{[c]}^{[w]} represents the w-th element of the c-th channel of the feature map M. The content representation zC can be extracted from INwoAT(Msrc), where Msrc is the feature map obtained from the linguistic feature Zsrc of the source speech, while the global speaker representation zS is the tuple (α, β) obtained from the linguistic feature Ztgt of the target speech via the speaker encoder. Once we have collected the content representation zC and the speaker representation zS, the decoder integrates the speaker information via its internal adaptive instance normalization (AdaIN) [32] layers:

\mathrm{AdaIN}(M_{C[c]}, z_{S[c]})^{[w]} = \beta_{[c]} \cdot \frac{M_{C[c]}^{[w]} - \mu(M_{C[c]})}{\sigma(M_{C[c]})} + \alpha_{[c]}    (2)

where MC is the feature map obtained from the content representation zC via the internal blocks of the decoder, and zS[c] (composed of α[c] and β[c]) denotes the c-th channel of the speaker representation. Essentially, the AdaIN layers adaptively adjust the affine parameters of IN according to the speaker information, aligning the channel-wise statistics of the content with those of the speaker style to achieve style transfer. Extracting continuous speaker representations from arbitrary target utterances makes the model applicable in any-to-any VC scenarios.
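The two normalization operations above translate directly into a few lines of tensor code. The sketch below assumes feature maps shaped (batch, channels, time); the function names in_woat and adain are ours, not identifiers from the MAIN-VC codebase.

```python
# Direct tensor implementations of Eq. (1) and Eq. (2); a sketch, assuming
# feature maps shaped (batch, channels, time) as in typical 1-D conv encoders.
import torch

def in_woat(m: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Instance normalization without affine transformation, Eq. (1)."""
    mu = m.mean(dim=2, keepdim=True)                         # per-channel mean over time (W)
    sigma = ((m - mu) ** 2).mean(dim=2, keepdim=True).sqrt() # population std as in Eq. (1)
    return (m - mu) / (sigma + eps)

def adain(m_c: torch.Tensor, alpha: torch.Tensor, beta: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization, Eq. (2): re-stylize the normalized content
    statistics with the channel-wise (alpha, beta) speaker representation."""
    return beta.unsqueeze(2) * in_woat(m_c, eps) + alpha.unsqueeze(2)

m = torch.randn(2, 64, 128)                                  # mid feature map M
alpha, beta = torch.randn(2, 64), torch.rand(2, 64)          # z_S = (alpha, beta)
stylized = adain(m, alpha, beta)
print(in_woat(m).mean(dim=2).abs().max())                    # ~0: channel stats removed
```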
C. Speaker Information Learning Module

Although SRD-based VC models do not require a parallel corpus for training, they typically use multi-speaker datasets in which the same speaker has multiple utterances. To fully utilize this characteristic of multi-speaker datasets, we generalize the time-invariance within an utterance to utterance-invariance across utterances from the same speaker.

In our proposed method, the speaker information learning module (SILM) based on utterance-invariance consists of Siamese encoders (denoted as EncS and Enc′S) and a time shuffle unit (TS). TS processes the input data (i.e. the target speech) before it enters EncS: it rearranges the order of frame-level segments along the time dimension to disrupt time-variant information (i.e. content information) while preserving time-invariant information (i.e. speaker information). During the training stage, the same linguistic feature Z is fed into both EncS of SILM and the content encoder EncC, then the decoder Dec performs reconstruction with the outputs of the encoders as shown in Eq. 3. Different from previous works, the parallel time-variant content information is completely disrupted in the proposed method.

\hat{Z} = \mathrm{Dec}\bigl(\mathrm{Enc}_C(Z), \mathrm{Enc}_S(\mathrm{TS}(Z))\bigr)    (3)

The linguistic feature Z′ of another utterance from the same speaker is fed into Enc′S of SILM after reconstruction. Based on the utterance-invariance of speaker information, we use different utterances from the same speaker in consideration of the fact that speech inflection and recording conditions affect the extraction of the speaker representation. The Siamese loss shown in Eq. 4 helps to improve the cosine similarity between these two speaker representations:

L_{siamese} = 1 - \cos\bigl(\mathrm{Enc}_S(\mathrm{TS}(Z)), \mathrm{Enc}'_S(Z')\bigr)    (4)

In the inference stage, the Siamese structure is disregarded, since only one utterance from the target speaker is required.
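A minimal sketch of the time shuffle unit and the Siamese loss of Eq. 4 is given below; the segment length and tensor shapes are assumptions chosen for illustration.

```python
# Sketch of the time-shuffle (TS) augmentation and the Siamese loss of Eq. (4).
import torch
import torch.nn.functional as F

def time_shuffle(mel: torch.Tensor, seg_len: int = 8) -> torch.Tensor:
    """Randomly reorder frame-level segments along time, destroying content order
    while keeping time-invariant (speaker) statistics intact."""
    b, d, t = mel.shape
    t_trim = (t // seg_len) * seg_len                        # drop the ragged tail
    segs = mel[:, :, :t_trim].reshape(b, d, -1, seg_len)     # (B, D, n_seg, seg_len)
    perm = torch.randperm(segs.shape[2])
    return segs[:, :, perm, :].reshape(b, d, t_trim)

def siamese_loss(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Eq. (4): 1 - cos(Enc_S(TS(Z)), Enc'_S(Z'))."""
    return (1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1)).mean()

mel = torch.randn(4, 80, 128)
shuffled = time_shuffle(mel)                                 # fed to Enc_S during training
emb_a, emb_b = torch.randn(4, 128), torch.randn(4, 128)      # embeddings of two utterances
print(shuffled.shape, siamese_loss(emb_a, emb_b).item())
```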
D. Mutual Information Enhanced Disentanglement

We design the constrained mutual information (CMI) estimator with upper and lower bounds to estimate mutual information. Estimating MI between complex random variables is challenging and often unstable to train in practice; simultaneously estimating both the upper and lower bounds and constraining them mutually helps to enhance the MI estimator's robustness and prediction accuracy. The estimation of the lower bound can serve as guidance and a baseline for the upper bound estimation, resulting in a more trainable mutual information estimator, CMI. In MAIN-VC, the upper bound of the MI between the content representation zC and the speaker representation zS is derived from the variational contrastive log-ratio upper bound (vCLUB) [31] as:

I_{upper}(z_C; z_S) = \mathbb{E}_{P(z_C, z_S)}\bigl[\log Q_\theta(z_C \mid z_S)\bigr]
- \mathbb{E}_{P(z_C)}\mathbb{E}_{P(z_S)}\bigl[\log Q_\theta(z_C \mid z_S)\bigr]    (5)

where P(zC|zS) is the actual conditional distribution, while Qθ(zC|zS) represents its variational estimation obtained via a trainable network. The lower bound of the MI can be estimated through MINE [27], which employs a function Tθ with trainable parameters θ to approximate the KL divergence between two distributions, i.e. the joint distribution P(zC, zS) and the product of the two marginal distributions P(zC) ⊗ P(zS), in order to predict the lower bound of I(zC; zS), as:

I_{lower}(z_C; z_S) = \mathbb{E}_{P(z_C, z_S)}\bigl[T_\theta(z_C, z_S)\bigr]
- \log\Bigl(\mathbb{E}_{P(z_C)\otimes P(z_S)}\bigl[e^{T_\theta(z_C, z_S)}\bigr]\Bigr)    (6)

CMI is trained using the learning losses of the upper and lower bound estimation networks, as well as the gap between their estimated MI values. We minimize the estimated upper bound for further disentanglement of the speech representations. When N utterances are given and the length of the content representation extracted from each cropped utterance is L, CMI estimates the MI loss as:

L_{MI} = \frac{1}{N^2 L} \sum_{m=1}^{N} \sum_{n=1}^{N} \sum_{l=1}^{L}
\log\frac{Q_\theta\bigl(z_C^{m,l} \mid z_S^{n}\bigr)}{Q_\theta\bigl(z_C^{n,l} \mid z_S^{n}\bigr)}    (7)

It is worth noting that SILM's enhancement of speaker representation disentangling also implicitly facilitates CMI's effect on content disentangling: the cleaner the speaker representation obtained from SILM, the more accurately CMI can identify and eliminate content-irrelevant information during content disentangling.
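The two bounds can be realized with small trainable networks, as in the sketch below. The Gaussian parameterization of Qθ, the hidden sizes and the shuffled-batch negatives are common choices borrowed from the CLUB and MINE papers, not necessarily the exact networks used in MAIN-VC.

```python
# Compact sketch of a CLUB-style upper bound (Eq. 5) and a MINE-style lower
# bound (Eq. 6) between content and speaker representations.
import torch
import torch.nn as nn

class ClubUpper(nn.Module):
    """Variational network Q_theta(z_C | z_S) modeled as a diagonal Gaussian."""
    def __init__(self, dim_s: int, dim_c: int, hidden: int = 128):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim_s, hidden), nn.ReLU(), nn.Linear(hidden, dim_c))
        self.logvar = nn.Sequential(nn.Linear(dim_s, hidden), nn.ReLU(), nn.Linear(hidden, dim_c))

    def log_q(self, z_c, z_s):
        mu, logvar = self.mu(z_s), self.logvar(z_s)
        return (-((z_c - mu) ** 2) / logvar.exp() - logvar).sum(dim=-1)

    def upper_bound(self, z_c, z_s):
        # Eq. (5): positive pairs minus shuffled (negative) pairs.
        positive = self.log_q(z_c, z_s)
        negative = self.log_q(z_c[torch.randperm(z_c.shape[0])], z_s)
        return (positive - negative).mean()

    def learning_loss(self, z_c, z_s):
        # Q_theta itself is fitted by maximizing log-likelihood on positive pairs.
        return -self.log_q(z_c, z_s).mean()

class MineLower(nn.Module):
    """Statistic network T_theta for the MINE lower bound, Eq. (6)."""
    def __init__(self, dim_s: int, dim_c: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_s + dim_c, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def lower_bound(self, z_c, z_s):
        joint = self.net(torch.cat([z_c, z_s], dim=-1)).mean()
        shuffled = z_s[torch.randperm(z_s.shape[0])]
        marginal = self.net(torch.cat([z_c, shuffled], dim=-1)).exp().mean().log()
        return joint - marginal

z_c, z_s = torch.randn(16, 64), torch.randn(16, 32)
club, mine = ClubUpper(32, 64), MineLower(32, 64)
print(club.upper_bound(z_c, z_s).item(), mine.lower_bound(z_c, z_s).item())
```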
E. Model Lightweighting

Inspired by atrous spatial pyramid pooling (ASPP) [33], we design a convolution block named atrous pyramid convolution (APC), as shown in Fig. 2. APC contains a series of small kernels (3 × 3 in MAIN-VC) with different dilation factors [34] and a residual connection concatenating the convolution results, replacing the sizable convolution block in previous works. The previous convolution block consists of a set of kernels of different sizes (from 1 × 1 to 8 × 8 in the baseline method [12]), which include large kernels. APC not only reduces the parameter count significantly but also expands the receptive field.

Fig. 2. The structure of the encoders and the details of APC.

What is more, parameter sharing in the Siamese structure of SILM also prevents the network from growing. CMI only appears during training and consumes no time during inference. We adjust the layers and channels of the encoders to strike a balance between performance and model lightweight.
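A sketch of an APC-style block is shown below. The dilation set, channel counts and the exact wiring of the residual concatenation are assumptions; only the overall idea (parallel small kernels with growing dilation plus a cheap 1×1 mixing convolution) follows the description above.

```python
# Sketch of an atrous pyramid convolution (APC) block in the spirit of Sec. III-E.
import torch
import torch.nn as nn

class APC(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4, 8), kernel_size: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      padding=d * (kernel_size - 1) // 2, dilation=d)  # keeps sequence length
            for d in dilations
        ])
        # A 1x1 conv mixes the concatenated pyramid (+ residual input) back down.
        self.mix = nn.Conv1d(channels * (len(dilations) + 1), channels, kernel_size=1)
        self.norm = nn.InstanceNorm1d(channels, affine=False)

    def forward(self, x):                                    # x: (B, C, T)
        pyramid = [branch(x) for branch in self.branches]
        out = self.mix(torch.cat(pyramid + [x], dim=1))      # residual concatenation
        return self.norm(torch.relu(out))

block = APC(channels=64)
x = torch.randn(2, 64, 128)
print(block(x).shape)                                        # torch.Size([2, 64, 128])
print(sum(p.numel() for p in block.parameters()))            # small parameter count
```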
F. Loss Function

The loss function of the proposed model contains the reconstruction loss, the KL divergence loss, the MI loss and the Siamese loss. The reconstruction loss is the L1 loss between the source log-Mel-spectrogram Z ∈ R^{M×T} and the output of the decoder Ẑ ∈ R^{M×T} (as shown in Eq. 3 and Eq. 8). The KL divergence between the posterior distribution P(z_C|z) and the prior distribution P(z) serves as the KL loss, shown in Eq. 9.

L_{recon} = \|Z - \hat{Z}\|_1    (8)

L_{KL} = \mathbb{E}_{Z \sim P(Z),\, z_C \sim P(z_C|Z)}\bigl[\|\mathrm{Enc}_C(Z)\|_2^2\bigr]    (9)

The Siamese loss and the MI loss are already defined in Eq. 4 and Eq. 7 respectively. Finally, the total loss combines the aforementioned losses with weights λ1, λ2 and λ3, as:

L = L_{recon} + \lambda_1 L_{KL} + \lambda_2 L_{siamese} + \lambda_3 L_{MI}    (10)

IV. EXPERIMENTS AND RESULTS

A. Experimental Setup

To evaluate MAIN-VC, a comparative experiment and an ablation study are conducted on the VCTK [35] corpus. 19 speakers are randomly selected as unseen speakers for one-shot conversion tasks, and their speech is excluded during training. All utterances are resampled to 16,000 Hz in the experiment, and the Mel-spectrograms are computed through the short-time Fourier transform for training. During training-set sampling, each utterance is randomly cropped into several 128-frame segments. Two segments from different utterances spoken by the same speaker are then paired to form a training data sample, where one is used for reconstruction and the other is fed into the Siamese encoder of SILM.
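This pairing scheme can be sketched as follows; the data layout (a dict from speaker id to pre-computed Mel-spectrograms) is an assumption for illustration.

```python
# Sketch of the training-pair sampling: two 128-frame crops from different
# utterances of the same speaker, one for reconstruction, one for the Siamese branch.
import random
import numpy as np

SEG_LEN = 128

def random_crop(mel: np.ndarray, seg_len: int = SEG_LEN) -> np.ndarray:
    """mel: (n_mel, T) log-Mel-spectrogram with T >= seg_len."""
    start = random.randint(0, mel.shape[1] - seg_len)
    return mel[:, start:start + seg_len]

def sample_training_pair(speaker_to_mels: dict) -> tuple:
    """speaker_to_mels: {speaker_id: [mel_utt1, mel_utt2, ...]} with >= 2 utterances."""
    speaker = random.choice(list(speaker_to_mels))
    utt_a, utt_b = random.sample(speaker_to_mels[speaker], 2)   # different utterances
    return random_crop(utt_a), random_crop(utt_b)               # (recon target, Siamese input)

fake_corpus = {"p225": [np.random.randn(80, 400), np.random.randn(80, 350)]}
seg_recon, seg_siamese = sample_training_pair(fake_corpus)
print(seg_recon.shape, seg_siamese.shape)                        # (80, 128) (80, 128)
```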
TABLE I
COMPARISON OF DIFFERENT METHODS FOR MANY-TO-MANY AND ONE-SHOT VC (WITH 95% CONFIDENCE INTERVAL).

VC Method | Many-to-Many: MCD↓ / MOS↑ / VSS↑ | One-Shot: MCD↓ / MOS↑ / VSS↑
AdaIN-VC [12] | 7.64 ± 0.24 / 3.11 ± 0.13 / 2.82 ± 0.19 | 7.38 ± 0.14 / 3.04 ± 0.21 / 2.45 ± 0.16
AUTOVC [11] | 6.68 ± 0.21 / 2.76 ± 0.16 / 2.45 ± 0.23 | 8.08 ± 0.15 / 2.68 ± 0.17 / 2.39 ± 0.14
VQMIVC [24] | 6.24 ± 0.13 / 3.20 ± 0.14 / 3.32 ± 0.12 | 5.59 ± 0.10 / 3.16 ± 0.18 / 3.02 ± 0.18
LIMI-VC [38] | 7.27 ± 0.17 / 3.12 ± 0.13 / 3.01 ± 0.26 | 7.88 ± 0.18 / 2.97 ± 0.22 / 2.74 ± 0.24
MAIN-VClarge | 5.17 ± 0.14 / 3.41 ± 0.21 / 3.35 ± 0.12 | 5.31 ± 0.09 / 3.33 ± 0.17 / 3.23 ± 0.17
MAIN-VC | 5.28 ± 0.11 / 3.44 ± 0.12 / 3.25 ± 0.16 | 5.42 ± 0.13 / 3.24 ± 0.18 / 3.29 ± 0.11

It is worth noting that MAIN-VC requires at least two forward and backward passes in each training iteration, as the CMI module needs extra training. In other words, the training process involves the following two steps in each iteration:
1) Training the integrated CMI network, where the gap between the upper and lower bounds is combined with the learning losses. This step can be repeated several times in one iteration (five times in our experiment).
2) Computing the four loss components to obtain the total loss for training the entire model (one forward and backward propagation, excluding the pre-trained vocoder).
MAIN-VC is trained with the Adam optimizer [36] with β1 = 0.9, β2 = 0.99, ϵ = 10^{-6} and lr = 1 × 10^{-4}. Another Adam optimizer with lr = 2 × 10^{-4} is built to train the CMI network. During the inference stage, a pre-trained WaveRNN [37] is used as the vocoder to generate the waveform from the Mel-spectrogram. We design the weights in the total loss function (Eq. 10) based on the numerical characteristics of each component: λ1 and λ3 gradually increase to 1 as the number of iterations progresses, while λ2 is set to 1. We compare our proposed models with AdaIN-VC [12], AUTOVC [11], VQMIVC [24] and LIMI-VC [38]. Our code is available at https://github.com/PecholaL/MAIN-VC.
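The two-step schedule with its two Adam optimizers can be sketched as below. The stand-in modules and toy losses only illustrate the alternating updates and the ramped loss weights; they are not the MAIN-VC networks.

```python
# Sketch of the alternating optimization: the CMI estimator is updated several
# times per iteration with its own Adam optimizer, then the main model is
# updated once with the weighted total loss of Eq. (10).
import torch
import torch.nn as nn

enc_c = nn.Linear(80, 64)          # stand-in content encoder
enc_s = nn.Linear(80, 64)          # stand-in speaker encoder
dec = nn.Linear(64, 80)            # stand-in decoder
cmi_q = nn.Linear(64, 64)          # stand-in variational network of the CMI estimator

opt_model = torch.optim.Adam(
    list(enc_c.parameters()) + list(enc_s.parameters()) + list(dec.parameters()),
    lr=1e-4, betas=(0.9, 0.99), eps=1e-6)
opt_cmi = torch.optim.Adam(cmi_q.parameters(), lr=2e-4)

def cmi_upper(z_c, z_s):
    # Toy CLUB-style contrast: positive pairs vs. shuffled pairs.
    pos = -((z_c - cmi_q(z_s)) ** 2).sum(dim=-1)
    neg = -((z_c[torch.randperm(len(z_c))] - cmi_q(z_s)) ** 2).sum(dim=-1)
    return (pos - neg).mean()

for step in range(3):                                    # training iterations
    mel = torch.randn(8, 80)                             # toy batch
    with torch.no_grad():                                # frozen representations for CMI training
        z_c, z_s = enc_c(mel), enc_s(mel)
    for _ in range(5):                                   # step 1: update CMI several times
        cmi_loss = ((z_c - cmi_q(z_s)) ** 2).mean()      # Q_theta learning loss (toy)
        opt_cmi.zero_grad(); cmi_loss.backward(); opt_cmi.step()

    z_c, z_s = enc_c(mel), enc_s(mel)                    # step 2: update the whole model once
    recon = dec(z_c + z_s)
    l_recon = (recon - mel).abs().mean()                 # Eq. (8)
    l_kl = (z_c ** 2).mean()                             # Eq. (9)
    l_siamese = torch.zeros(())                          # placeholder for Eq. (4)
    l_mi = cmi_upper(z_c, z_s)                           # minimized upper bound, cf. Eq. (7)
    lam1 = lam3 = min(1.0, (step + 1) / 3)               # weights ramp up toward 1
    total = l_recon + lam1 * l_kl + 1.0 * l_siamese + lam3 * l_mi   # Eq. (10)
    opt_model.zero_grad(); total.backward(); opt_model.step()
    print(f"iter {step}: total={total.item():.3f}")
```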
B. Metrics and Measurement

The following subjective and objective metrics are used to evaluate the performance of the different voice conversion models. The mean opinion score (MOS) is a common subjective metric that assesses the naturalness of speech, with scores ranging from 1 to 5, where higher scores indicate higher speech quality. The voice similarity score (VSS), a subjective metric, evaluates the similarity between the converted speech and the ground-truth speech from the target speaker; higher scores indicate higher similarity, representing better VC performance. Mel-cepstral distortion (MCD) is an objective metric that measures the difference between the converted speech and the target ground-truth speech in terms of Mel-cepstra.

Both many-to-many and one-shot (any-to-any) VC are conducted while refraining from using parallel data during the inference stage. 4 pairs of source/target speech are input into each VC model for evaluation in the two scenarios respectively, and 8 participants are invited to score the speech samples. For the measurement of lightweight, we choose the parameter count and the inference time cost as metrics. The measurements for both do not include the vocoders, as the vocoder in each model can be freely replaced with an appropriate one. The inference time is measured and averaged using the same utterance pairs with each model on the same device (1× Tesla V100).
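For reference, a common way to compute MCD between time-aligned Mel-cepstral sequences is sketched below. The 10·√2/ln10 scaling is the widely used convention; MCEP extraction and frame alignment (e.g., DTW) are assumed to be done beforehand, and the exact recipe used in the paper may differ.

```python
# Sketch of Mel-cepstral distortion between aligned MCEP sequences.
import numpy as np

def mel_cepstral_distortion(mcep_conv: np.ndarray, mcep_ref: np.ndarray) -> float:
    """mcep_*: (frames, order) Mel-cepstra, typically excluding the 0th (energy) coefficient."""
    assert mcep_conv.shape == mcep_ref.shape
    diff = mcep_conv - mcep_ref
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))      # Euclidean term per frame
    return float(10.0 / np.log(10.0) * per_frame.mean())      # average distortion in dB

a, b = np.random.randn(200, 24), np.random.randn(200, 24)     # stand-in MCEP sequences
print(f"MCD: {mel_cepstral_distortion(a, b):.2f} dB")
```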
C. Experimental Results

Table I showcases the comparison of MAIN-VC and the baseline methods in different voice conversion scenarios. The proposed non-lightweight model (denoted as MAIN-VClarge) is also taken into consideration. For many-to-many VC, MAIN-VC achieves comparable performance with the baseline methods. MAIN-VC outperforms the other methods in one-shot VC, especially on the objective metric. As for the lightweight of each model, the comparison of the parameter count and inference time cost (excluding the vocoder) is presented in Table II. MAIN-VC achieves a 59% reduction in parameter count and a 33% reduction in inference cost compared to the lightest baseline methods, respectively. The proposed model has an obvious advantage in parameter count thanks to the proposed APC module and the parameter-sharing strategy, and it also incurs a lower inference cost due to its lightweight network. Overall, MAIN-VC strikes a balance between lightweight design and performance, obtaining a light network structure while ensuring the quality of voice conversion.

Fig. 3 showcases the Mel-spectrograms of utterances in a one-shot VC task, including the source speech, the ground-truth speech (i.e. the same content as the source speech spoken by the target speaker, not the target speech used in the inference), and the speech converted by MAIN-VC. The details of the spectrograms show that the source speaker and the target speaker have different acoustic characteristics (as in Fig. 3(a) and Fig. 3(b), the distributions and shapes of the harmonics from the different speakers are quite different). MAIN-VC can extract the characteristics of the target speaker and construct a converted Mel-spectrogram that is close to the ground-truth. MAIN-VC is also capable of performing cross-lingual voice conversion (XVC), another one-shot VC scenario, where the source and target utterances are in different languages. We conduct XVC on AISHELL [39] (a Mandarin speech corpus) and VCTK [35] (an English speech corpus). More audio samples are available at https://largeaudiomodel.com/main-vc/.
Fig. 3. The Mel-spectrogram samples of the one-shot VC task "Please call Stella": (a) source speech, (b) ground-truth speech, (d) converted speech via MAIN-VC.

TABLE II
COMPARISON OF PARAMETER COUNT AND INFERENCE TIME COST (W/O VOCODER).

VC Method | Param. Count↓ | Infer. Cost↓
AdaIN-VC [12] | 9.04M | 100%
AUTOVC [11] | 28.27M | 50.94%
VQMIVC [24] | 29.20M | 145.28%
LIMI-VC [38] | 3.19M | 49.06%
MAIN-VC | 1.31M | 22.64%
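Both lightweight metrics are straightforward to measure for any PyTorch model. The snippet below uses a stand-in module and a CPU timer purely to illustrate the procedure; the numbers in Table II were obtained on a Tesla V100.

```python
# Sketch of measuring parameter count and average inference time (vocoder excluded).
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv1d(80, 64, 5, padding=2), nn.ReLU(),
                      nn.Conv1d(64, 80, 5, padding=2))        # stand-in VC network

param_count = sum(p.numel() for p in model.parameters())
print(f"parameter count: {param_count / 1e6:.2f}M")

mel = torch.randn(1, 80, 128)
with torch.no_grad():
    model(mel)                                                 # warm-up pass
    start = time.perf_counter()
    for _ in range(100):                                       # average over repeated runs
        model(mel)
    print(f"mean inference time: {(time.perf_counter() - start) / 100 * 1e3:.2f} ms")
```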

D. Ablation Study

To validate the effects of SILM and CMI on disentanglement, we conduct ablation experiments. We set up the following models:
a) MAIN-VC: the proposed model.
b) M1 (w/o CMI): remove the constrained mutual information estimator from the proposed method.
c) M2 (w/o CMIlower): only use the estimator provided by CLUB [31] to assess the upper bound of mutual information.
d) M3 (w/o Siamese encoder): remove the Siamese structure and the time shuffle unit in SILM.

Resemblyzer is employed for fake speech detection to assess the conversion quality. Resemblyzer simultaneously scores the fake (i.e. the outputs of the VC models in the experiment) and real utterances of the target speaker after learning the characteristics of the target speaker from another 6 bonafide utterances. A higher score indicates a more similar timbre and better speech quality. The results are presented in Table III. CMI and SILM contribute to MAIN-VC's synthesis of converted speech that is more confusing for fake detection, and the lower bound of CMI also shows its beneficial effect on the converted speech.

TABLE III
COMPARISON FOR ABLATION STUDY.

Method | Detection Score↑
MAIN-VC | 0.74 ± 0.01
M1 (w/o CMI) | 0.62 ± 0.04
M2 (w/o CMIlower) | 0.69 ± 0.03
M3 (w/o Siamese Enc.) | 0.70 ± 0.03
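A sketch of this scoring procedure with the Resemblyzer package is given below. The file paths are placeholders, and the exact protocol of Resemblyzer's fake-speech-detection demo may differ in detail from this cosine-similarity sketch.

```python
# Sketch of Resemblyzer-based scoring: build a reference speaker embedding from
# bonafide utterances, then score converted ("fake") and real utterances.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

reference_wavs = [preprocess_wav(p) for p in ["ref1.wav", "ref2.wav", "ref3.wav",
                                              "ref4.wav", "ref5.wav", "ref6.wav"]]
speaker_profile = encoder.embed_speaker(reference_wavs)        # averaged, L2-normalized

def detection_score(utterance_path: str) -> float:
    emb = encoder.embed_utterance(preprocess_wav(utterance_path))
    return float(np.dot(emb, speaker_profile))                  # embeddings are unit-norm

print("converted:", detection_score("converted_by_mainvc.wav"))
print("bonafide :", detection_score("real_target_utterance.wav"))
```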
To visually assess the disentanglement capability of each model, the scatter diagrams of speaker representations obtained via t-SNE [40] are shown in Fig. 4. The inclusion of CMI and the Siamese structure in SILM results in more clustered representations of the same speaker, while the removal of the lower bound in CMI leads to looser clustering and inaccurate speaker representations (erroneous clustering in the detail plots), indicating a deterioration in disentangling performance. So SILM and CMI facilitate MAIN-VC in achieving better disentanglement capability and conversion performance.

Fig. 4. The visualization of speaker representations extracted from 8 unseen speakers' utterances (4 males and 4 females, 100 utterances from each speaker): (a) MAIN-VC, (b) w/o CMI, (c) w/o CMIlower, (d) w/o Siamese Enc.
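A sketch of producing such a scatter plot with scikit-learn's t-SNE is shown below; the embeddings here are random stand-ins, whereas in the study they come from the speaker encoder of each ablated model.

```python
# Sketch of a t-SNE visualization of speaker embeddings (stand-in data).
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

n_speakers, utts_per_speaker, dim = 8, 100, 128
embeddings = np.concatenate([np.random.randn(utts_per_speaker, dim) + 3.0 * np.random.randn(dim)
                             for _ in range(n_speakers)])      # stand-in speaker clusters
labels = np.repeat(np.arange(n_speakers), utts_per_speaker)

points = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=5)
plt.title("Speaker representations (t-SNE)")
plt.savefig("speaker_tsne.png", dpi=200)
```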
V. CONCLUSION

In this paper, we propose a lightweight voice conversion model to address the challenges of VC. The designed speaker information learning module and constrained mutual information estimator effectively enhance the disentanglement capability and improve the conversion quality. From another aspect, with the facilitation of the introduced modules, the disentangling capability of the encoders no longer relies on a heavy network structure, so we implement model lightweighting while ensuring the model's performance. The experimental results show that MAIN-VC can complete the one-shot voice conversion task and achieves comparable or even better results in various VC scenarios compared to existing methods, while maintaining a lightweight structure.

VI. ACKNOWLEDGEMENT

Supported by the Key Research and Development Program of Guangdong Province (grant No. 2021B0101400003). The corresponding author is Xulong Zhang (zhangxulong@ieee.org).
REFERENCES

[1] R. Yuan, Y. Wu, J. Li, and J. Kim, "Deid-vc: Speaker de-identification via zero-shot pseudo voice conversion," in the 23rd Annual Conference of the International Speech Communication Association, 2022, pp. 2593–2597.
[2] B. M. L. Srivastava, N. Vauquier, M. Sahidullah, A. Bellet, M. Tommasi, and E. Vincent, "Evaluating voice conversion-based privacy protection against informed attackers," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 2802–2806.
[3] B. Sisman, J. Yamagishi, S. King, and H. Li, "An overview of voice conversion and its challenges: From statistical modeling to deep learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132–157, 2021.
[4] T. Kaneko and H. Kameoka, "Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks," in the 26th European Signal Processing Conference, 2018, pp. 2100–2104.
[5] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "Stargan-vc2: Rethinking conditional methods for stargan-based voice conversion," in the 20th Annual Conference of the International Speech Communication Association, 2019, pp. 679–683.
[6] Y. A. Li, A. Zare, and N. Mesgarani, "Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion," in the 22nd Annual Conference of the International Speech Communication Association, 2021, pp. 1349–1353.
[7] M. Chen, Y. Zhou, H. Huang, and T. Hain, "Efficient non-autoregressive gan voice conversion using vqwav2vec features and dynamic convolution," arXiv preprint arXiv:2203.17172, 2022.
[8] M. Luong and V. Tran, "Many-to-many voice conversion based feature disentanglement using variational autoencoder," in the 22nd Annual Conference of the International Speech Communication Association, 2021, pp. 851–855.
[9] R. Xiao, H. Zhang, and Y. Lin, "Dgc-vector: A new speaker embedding for zero-shot voice conversion," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 6547–6551.
[10] Z. Liu, S. Wang, and N. Chen, "Automatic speech disentanglement for voice conversion using rank module and speech augmentation," in the 24th Annual Conference of the International Speech Communication Association, 2023, pp. 2298–2302.
[11] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "Autovc: Zero-shot voice style transfer with only autoencoder loss," in the 36th International Conference on Machine Learning, vol. 97, 2019, pp. 5210–5219.
[12] J. Chou and H. Lee, "One-shot voice conversion by separating speaker and content representations with instance normalization," in the 20th Annual Conference of the International Speech Communication Association, 2019, pp. 664–668.
[13] D. Wu and H. Lee, "One-shot voice conversion by vector quantization," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 7734–7738.
[14] S. Yuan, P. Cheng, R. Zhang, W. Hao, Z. Gan, and L. Carin, "Improving zero-shot voice style transfer via disentangled representation learning," in the 9th International Conference on Learning Representations, 2021.
[15] K. Qian, Y. Zhang, S. Chang, M. Hasegawa-Johnson, and D. D. Cox, "Unsupervised speech decomposition via triple information bottleneck," in the 37th International Conference on Machine Learning, vol. 119, 2020, pp. 7836–7846.
[16] T. Liu, K. A. Lee, Q. Wang, and H. Li, "Disentangling voice and content with self-supervision for speaker recognition," in the 37th Conference on Neural Information Processing Systems, 2023.
[17] S. H. Mun, M. H. Han, C. Moon, and N. S. Kim, "Eend-demux: End-to-end neural speaker diarization via demultiplexed speaker embeddings," arXiv preprint arXiv:2312.06065, 2023.
[18] S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Qadir, and B. Schuller, "Survey of deep representation learning for speech emotion recognition," IEEE Transactions on Affective Computing, vol. 14, no. 2, pp. 1634–1654, 2021.
[19] N. Hou, C. Xu, E. S. Chng, and H. Li, "Learning disentangled feature representations for speech enhancement via adversarial training," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 666–670.
[20] W. Wang, Y. Song, and S. Jha, "Generalizable zero-shot speaker adaptive speech synthesis with disentangled representations," in the 24th Annual Conference of the International Speech Communication Association, 2023.
[21] C. H. Chan, K. Qian, Y. Zhang, and M. Hasegawa-Johnson, "Speechsplit2.0: Unsupervised speech disentanglement for voice conversion without tuning autoencoder bottlenecks," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 6332–6336.
[22] Y. Chen, D. Wu, T. Wu, and H. Lee, "Again-vc: A one-shot voice conversion using activation guidance and adaptive instance normalization," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 5954–5958.
[23] D. Wu, Y. Chen, and H. Lee, "Vqvc+: One-shot voice conversion by vector quantization and u-net architecture," in the 21st Annual Conference of the International Speech Communication Association, 2020, pp. 4691–4695.
[24] D. Wang, L. Deng, Y. T. Yeung, X. Chen, X. Liu, and H. Meng, "Vqmivc: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion," in the 22nd Annual Conference of the International Speech Communication Association, 2021, pp. 1344–1348.
[25] H. Yang, L. Deng, Y. T. Yeung, N. Zheng, and Y. Xu, "Streamable speech representation disentanglement and multi-level prosody modeling for live one-shot voice conversion," in the 23rd Annual Conference of the International Speech Communication Association, 2022, pp. 2578–2582.
[26] H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, "Avqvc: One-shot voice conversion by vector quantization with applying contrastive learning," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 4613–4617.
[27] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, "Mutual information neural estimation," in the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, 2018, pp. 531–540.
[28] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[29] M. Gutmann and A. Hyvärinen, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models," in the 13th International Conference on Artificial Intelligence and Statistics, 2010, pp. 297–304.
[30] B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker, "On variational bounds of mutual information," in the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97, 2019, pp. 5171–5180.
[31] P. Cheng, W. Hao, S. Dai, J. Liu, Z. Gan, and L. Carin, "Club: A contrastive log-ratio upper bound of mutual information," in the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119, 2020, pp. 1779–1788.
[32] X. Huang and S. J. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in IEEE International Conference on Computer Vision, 2017, pp. 1510–1519.
[33] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
[34] F. Yu, V. Koltun, and T. A. Funkhouser, "Dilated residual networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 636–644.
[35] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2016.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in the 3rd International Conference on Learning Representations, 2015.
[37] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, 2018, pp. 2415–2424.
[38] L. Huang, T. Yuan, Y. Liang, Z. Chen, C. Wen, Y. Xie, J. Zhang, and D. Ke, "Limi-vc: A light weight voice conversion model with mutual information disentanglement," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2023, pp. 1–5.
[39] J. Du, X. Na, X. Liu, and H. Bu, "Aishell-2: Transforming mandarin asr research into industrial scale," CoRR, vol. abs/1808.10583, 2018.
[40] L. van der Maaten and G. Hinton, "Visualizing data using t-sne," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
