MAIN-VC: Lightweight Speech Representation Disentanglement For One-Shot Voice Conversion
arXiv:2405.00930v1 [cs.SD] 2 May 2024

† Equal contribution.
B Corresponding author.

Abstract—One-shot voice conversion aims to change the timbre of any source speech to match that of an unseen target speaker given only one speech sample. Existing methods face difficulties in achieving satisfactory speech representation disentanglement and suffer from sizable networks, as some of them leverage numerous complex modules for disentanglement. In this paper, we propose a model named MAIN-VC that disentangles effectively via a concise neural network. The proposed model utilizes Siamese encoders to learn clean representations, further enhanced by the designed mutual information estimator. The Siamese structure and the newly designed convolution module keep our model lightweight while ensuring performance on diverse voice conversion tasks. Experimental results show that the proposed model achieves comparable subjective scores and improves on objective metrics compared with existing methods in the one-shot voice conversion scenario.

Index Terms—voice conversion, speech representation learning

I. INTRODUCTION

Voice conversion (VC) is the style transfer task in the field of speech. It modifies the source speech's style to the target speaker's speaking style. Voice conversion can be applied to smart devices, the entertainment industry, and even privacy and security [1], [2].

Early voice conversion relied heavily on parallel training data [3], but collecting parallel data and performing frame-level alignment are time-consuming. Researchers have therefore turned to non-parallel voice conversion, enabling many-to-many and any-to-any conversion tasks. Inspired by image style transfer in computer vision, a series of methods based on generative adversarial networks (GANs) has been applied to voice conversion [4]–[7]. These GAN-based VC methods treat each speaker as a distinct domain and cast voice conversion as migration across speaker domains. GAN training, however, is unstable and difficult to converge.

Recently, disentanglement-based VC [8]–[10] has been an area of active research for its flexibility in any-to-any voice conversion. The disentanglement-based VC method is built on the assumption that speech contains both speaker-dependent information (e.g. timbre and speaking style) and speaker-independent information (e.g. content). Generally, speaker-dependent information is time-invariant, remaining constant within one utterance, whereas speaker-independent information is time-variant. This allows neural networks to separately learn representations of content and speaker information from the speech. Disentanglement-based VC models typically adopt an autoencoder structure, integrating approaches such as bottleneck features [11], instance normalization [12], and vector quantization [13] into the encoders to extract the desired speech representations. Training usually involves reconstruction of the Mel-spectrogram: the same Mel-spectrogram segments are fed into the decoupling encoders to obtain the various speech representations, and the decoder then integrates the extracted representations and reconstructs the Mel-spectrogram. After training is completed, the content information from the source speech and the speaker information from the target speech are decoupled respectively and then fused to synthesize the converted voice, achieving the transfer of speaker style.

However, disentangling speech representations is complicated [14]: how to eliminate unnecessary information without damaging the wanted information that needs to be disentangled, and how to extract clean representations that exclude other information, are challenging aspects to master. Most previous voice conversion models suffer from unsatisfactory disentangling performance. Besides the dependence of some models on a finely tuned bottleneck [11], [15], deficient disentanglement also arises because the different encoders disentangle independently during representation learning, lacking communication with one another, which leaves redundant information in the disentangled representations. Moreover, most of these VC models are sizable, as some improve disentangling capability by stacking various network modules, making them difficult to deploy on low-computing mobile devices.

In this paper, we propose a model based on speech representation disentanglement, named Mutual information enhanced Adaptive Instance Normalization VC (MAIN-VC). To enhance the disentangling ability of the encoders, we integrate a Siamese structure and data augmentation into the speaker representation learning module. An easy-to-train constrained mutual information estimator is introduced to build communication between the learning processes of
content and speaker information, which reduces the overlap between unrelated representations. With the assistance of these modules, the disentangling ability of the encoders no longer relies on complicated networks, which makes a lightweight model and deployment on low-computing devices possible. The contributions of this paper can be summarized as follows:

• We introduce a speaker information learning module that reduces the interference of time-varying information to extract clean speaker embeddings.
• We propose a constrained mutual information estimator, restricted by upper and lower bounds, and leverage it to enhance disentangling capability, improving the quality of one-shot VC.
• We design the proposed model to be lightweight while preserving the quality of voice conversion. A convolution block called APC, with a low parameter count and a large receptive field, is introduced to this end.

II. RELATED WORKS

A. Speech Representation Disentanglement

Speech representation disentanglement (SRD) decomposes speech components into diverse representations. It has been used in speaker recognition [16], speaker diarization [17], speech emotion recognition [18], speech enhancement [19], automatic speech synthesis [20], etc.

SRD also provides a convenient route to voice conversion: altering the speaker identity information in speech achieves the conversion. Voice conversion methods based on SRD have gained widespread attention in recent years. AUTOVC [11] leverages an autoencoder with a carefully tuned bottleneck to remove speaker information; a pre-trained speaker encoder provides the speaker identity information. SPEECHSPLIT [15] and SPEECHSPLIT 2.0 [21] design multiple encoders with signal processing to decouple finer-grained speech components such as rhythm, content, and pitch. Another series of works [12], [22] removes time-invariant speaker information through instance normalization to obtain the content representation, then fuses an extracted speaker representation into the content representation in an adaptive manner. Vector quantization (VQ) is also leveraged for speech representation disentanglement. In VQ-based methods [23]–[26], the latent representation is quantized into discrete codes, which are highly related to phonemes, and the differences between the discrete codes and the pre-quantization vectors are taken as the speaker information.

SRD-based VC shows its flexibility in the one-shot voice conversion scenario but places high demands on the model's disentangling ability.

B. Neural Mutual Information Estimation

Mutual information (MI) measures the degree of correlation or inclusion between two variables. Computing MI generally requires the joint and marginal distributions of the variables. However, high-dimensional random variables produced by neural networks often have complicated distributions: only samples can be obtained, while the distributions themselves are difficult to derive, so the mutual information between such variables is hard to obtain exactly. Neural MI estimation has been proposed to address this issue.

MINE [27] treats MI as the Kullback-Leibler (KL) divergence between the joint and marginal distributions and fits this KL divergence with a trainable neural network. InfoNCE [28] maximizes the MI between a learned representation and the data, estimating a lower bound of MI based on noise contrastive estimation [29]. Representative works on MI upper-bound estimation include L1Out [30] and CLUB [31]. The former uses a Monte Carlo approximation of the marginal distribution to derive an upper bound of MI. CLUB [31] fits the conditional probability between two variables: it constructs positive and negative sample pairs to form a contrastive log-ratio between two conditional distributions. When the approximate conditional probability is close enough to the truth (i.e. the KL divergence is small enough), the fitted mutual information approaches the true MI from above, yielding an upper bound of MI.

In practical applications, the lower-bound estimate of MI is maximized when the correlation between two types of variables needs to be strengthened, while the upper bound of MI is minimized to reduce the correlation between variables.

III. METHODOLOGY

A. Overall Architecture of MAIN-VC

The architecture of MAIN-VC is depicted in Fig. 1. MAIN-VC adopts an instance normalization (IN)-based method for speech style transfer. A speaker information learning module, consisting of Siamese encoders with a data augmentation module, is built to obtain a clean speaker representation, while a constrained mutual information estimator between the speaker representation and the content representation is leveraged to enhance the disentanglement. The pre-trained vocoder appears only during the inference stage. The constrained mutual information estimator and the Siamese encoder both take part in the training stage to assist the training of the content encoder and the speaker information learning module; they are not involved in the conversion process at inference after training is completed.

B. IN-based Speech Style Transfer

Given a frame-level linguistic feature Z, the content encoder and the speaker representation learning module extract the content representation z_C and the speaker representation z_S, respectively. To mitigate the impact of time-invariant information such as timbre on the extraction of the content representation, the content encoder removes speaker information from the intermediate feature map M (computed from the linguistic feature Z) through
[Fig. 1. The overall architecture of MAIN-VC: the speaker encoder preceded by the time shuffle (TS) unit inside the SILM, the content encoder, the CMI estimator, the AdaIN-based decoder, and the vocoder; Mel-spectrograms of shape (n_mel, T) enter the encoders, and speaker/content embeddings of lengths Ls and Lc flow between the modules.]
instance normalization without affine transformation (denoted as INwoAT), as Eq. 1 shows:

\mu(M_{[c]}) = \frac{1}{W} \sum_{w=1}^{W} M_{[c]}^{[w]}

\sigma(M_{[c]}) = \sqrt{ \frac{1}{W} \sum_{w=1}^{W} \left( M_{[c]}^{[w]} - \mu(M_{[c]}) \right)^2 }

\mathrm{INwoAT}(M_{[c]})^{[w]} = \frac{ M_{[c]}^{[w]} - \mu(M_{[c]}) }{ \sigma(M_{[c]}) }    (1)

where M_{[c]}^{[w]} denotes the w-th element of the c-th channel of the feature map M. The content representation z_C is extracted from INwoAT(M_src), where M_src is the feature map obtained from the linguistic feature Z_src of the source speech, while the global speaker representation z_S is the tuple (α, β) obtained from the linguistic feature Z_tgt of the target speech via the speaker encoder. Once the content representation z_C and the speaker representation z_S are collected, the decoder integrates the speaker information via its internal adaptive instance normalization (AdaIN) [32] layers:

\mathrm{AdaIN}(M_{C[c]}, z_{S[c]})^{[w]} = \beta_{[c]} \cdot \frac{ M_{C[c]}^{[w]} - \mu(M_{C[c]}) }{ \sigma(M_{C[c]}) } + \alpha_{[c]}    (2)

where M_C is the feature map obtained from the content representation z_C via the internal blocks of the decoder, and z_{S[c]} (composed of α_{[c]} and β_{[c]}) denotes the c-th channel of the speaker representation. Essentially, the AdaIN layers adaptively adjust the affine parameters of IN according to the speaker information, aligning the channel-wise statistics of the content with those of the speaker style to achieve style transfer. Extracting continuous speaker representations from arbitrary target utterances makes the model applicable to any-to-any VC scenarios.

C. Speaker Information Learning Module

Although SRD-based VC models do not require a parallel corpus for training, they typically use multi-speaker datasets in which the same speaker has multiple utterances. To fully exploit this characteristic, we generalize the time-invariance within an utterance to utterance-invariance across utterances from the same speaker.

In our proposed method, the speaker information learning module (SILM), based on utterance-invariance, consists of Siamese encoders (denoted Enc_S and Enc'_S) and a time shuffle unit (TS). TS processes the input data (i.e. the target speech) before it enters Enc_S: it rearranges the order of frame-level segments along the time dimension to disrupt time-variant information (i.e. content information) while preserving time-invariant information (i.e. speaker information). During the training stage, the same linguistic feature Z is fed into both Enc_S of the SILM and the content encoder Enc_C, then the decoder Dec performs reconstruction with the outputs of the encoders, as shown in Eq. 3. Unlike in previous works, the parallel time-variant content information is completely disrupted in the proposed method.

\hat{Z} = \mathrm{Dec}(\mathrm{Enc}_C(Z), \mathrm{Enc}_S(\mathrm{TS}(Z)))    (3)

The linguistic feature Z' of another utterance from the same speaker is fed into Enc'_S of the SILM after reconstruction. Based on the utterance-invariance of speaker information, we use different utterances from the same speaker in consideration of the fact that speech inflection and recording conditions affect the extraction of the speaker representation. The Siamese loss shown in Eq. 4 raises the cosine similarity between these two speaker representations:

L_{siamese} = 1 - \cos(\mathrm{Enc}_S(\mathrm{TS}(Z)), \mathrm{Enc}'_S(Z'))    (4)

In the inference stage, the Siamese structure is disregarded, since only one utterance from the target speaker is required.
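To make the SILM's mechanics concrete, the time shuffle unit and the Siamese cosine loss of Eq. 4 can be sketched as follows. This is a minimal NumPy sketch: the segment length and array shapes are illustrative assumptions, not hyperparameters stated in the paper.

```python
import numpy as np

def time_shuffle(mel, seg_len=4, rng=None):
    """Time shuffle (TS): permute fixed-length segments along the time
    axis, destroying the temporal (content) order while keeping the
    per-channel frame statistics, i.e. time-invariant speaker cues."""
    if rng is None:
        rng = np.random.default_rng()
    n_mel, t = mel.shape
    n_seg = t // seg_len
    order = rng.permutation(n_seg)
    segs = [mel[:, i * seg_len:(i + 1) * seg_len] for i in order]
    tail = mel[:, n_seg * seg_len:]  # frames that do not fill a segment
    return np.concatenate(segs + [tail], axis=1)

def siamese_loss(emb_a, emb_b, eps=1e-8):
    """Eq. 4: L_siamese = 1 - cos(Enc_S(TS(Z)), Enc'_S(Z'))."""
    cos = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + eps)
    return 1.0 - float(cos)
```

Because TS only permutes whole segments, the multiset of frames per Mel channel is unchanged, which is exactly why time-invariant statistics survive the augmentation.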
D. Mutual Information Enhanced Disentanglement

We design a constrained mutual information (CMI) estimator with upper and lower bounds to estimate mutual information. Estimating MI between complex random variables is challenging and often unstable to train in practice; simultaneously estimating both the upper and lower bounds and constraining them against each other enhances the MI estimator's robustness and prediction accuracy. The lower-bound estimate serves as guidance and a baseline for the upper-bound estimate, resulting in a more trainable mutual information estimator, CMI. In MAIN-VC, the upper bound of the MI between the content representation z_C and the speaker representation z_S is derived from the variational contrastive log-ratio upper bound (vCLUB) [31] as:

I_{upper}(z_C; z_S) = \mathbb{E}_{P(z_C, z_S)}[\log Q_\theta(z_C|z_S)] - \mathbb{E}_{P(z_C)} \mathbb{E}_{P(z_S)}[\log Q_\theta(z_C|z_S)]    (5)

where P(z_C|z_S) is the actual conditional distribution, while Q_\theta(z_C|z_S) represents its variational estimate obtained via a trainable network. The lower bound of the MI can be estimated through MINE [27], which employs a function T_\theta with trainable parameters \theta to approximate the KL divergence between two distributions, i.e. the joint distribution P(z_C, z_S) and the product of the two marginal distributions P(z_C) \otimes P(z_S), in order to predict the lower bound of I(z_C; z_S):

I_{lower}(z_C; z_S) = \mathbb{E}_{P(z_C, z_S)}[T_\theta(z_C, z_S)] - \log\left( \mathbb{E}_{P(z_C) \otimes P(z_S)}[e^{T_\theta(z_C, z_S)}] \right)    (6)

CMI is trained using the learning losses of the upper- and lower-bound estimation networks, as well as the gap between their estimated MI values. We minimize the estimated upper bound for further disentanglement of the speech representations. When N utterances are given and the length of the content representation extracted from each cropped utterance is L, CMI estimates the MI loss as:

L_{MI} = \frac{1}{N^2 L} \sum_{m=1}^{N} \sum_{n=1}^{N} \sum_{l=1}^{L} \log \frac{ Q_\theta(z_C^{m,l} \mid z_S^{n}) }{ Q_\theta(z_C^{n,l} \mid z_S^{n}) }    (7)

It is worth noting that the SILM's enhancement of speaker representation disentangling also implicitly facilitates CMI's effect on content disentangling: the cleaner the speaker representation obtained from the SILM, the more accurate the content-irrelevant information that CMI needs to eliminate during content disentangling.

E. Model Lightweighting

Inspired by atrous spatial pyramid pooling (ASPP) [33], we design a convolution block named atrous pyramid convolution (APC), as shown in Fig. 2. APC contains a series of small kernels (3 × 3 in MAIN-VC) with different dilation factors [34] and a residual connection concatenating the convolution results, replacing the sizable convolution blocks of previous works, which consist of a set of kernels of different sizes (from 1 × 1 up to 8 × 8 in the baseline method [12]), including large kernels. APC not only reduces the parameter count significantly but also expands the receptive field.

[Fig. 2. The structure of the encoders and the details of APC: stacked Conv1d blocks with instance normalization and ReLU, a DenseBlock, and atrous pyramid convolutions of kernel size 3 with dilations 1, 2, ..., n whose outputs are concatenated.]

Moreover, parameter sharing in the Siamese structure of the SILM also keeps the network from growing, and CMI appears only during training, adding no time cost at inference. We adjust the layers and channels of the encoders to strike a balance between performance and model lightweightness.

F. Loss Function

The loss function of the proposed model contains a reconstruction loss, a KL divergence loss, an MI loss, and a Siamese loss. The reconstruction loss is the L1 loss between the source log-Mel-spectrogram Z \in \mathbb{R}^{M \times T} and the decoder output \hat{Z} \in \mathbb{R}^{M \times T} (as shown in Eq. 3 and Eq. 8). The KL divergence between the posterior distribution P(z_C|Z) and the prior distribution P(z) serves as the KL loss, shown in Eq. 9.

L_{recon} = \| Z - \hat{Z} \|_1    (8)

L_{KL} = \mathbb{E}_{Z \sim P(Z),\, z_C \sim P(z_C|Z)} \left[ \| \mathrm{Enc}_C(Z) \|_2^2 \right]    (9)

The Siamese loss and MI loss are defined in Eq. 4 and Eq. 7, respectively. Finally, the total loss combines the aforementioned losses with weights λ1, λ2, and λ3:

L = L_{recon} + \lambda_1 L_{KL} + \lambda_2 L_{siamese} + \lambda_3 L_{MI}    (10)

IV. EXPERIMENTS AND RESULTS

A. Experimental Setup

To evaluate MAIN-VC, a comparative experiment and an ablation study are conducted on the VCTK corpus [35]. 19 speakers are randomly selected as unseen speakers for the one-shot conversion tasks, and their speech is excluded during training. All utterances are resampled to 16,000 Hz, and the Mel-spectrograms used for training are computed via the short-time Fourier transform. During training-set sampling, each utterance is randomly cropped into several 128-frame segments. Two segments from different utterances spoken by the same speaker are then paired to form a training sample: one is used for reconstruction, and the other is fed into the Siamese encoder of the SILM.
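The segment-pairing step of this setup can be sketched as follows. This is illustrative NumPy code under stated assumptions: the function name and the 80-bin Mel dimension in the usage note are hypothetical, not specifics from the paper, and padding of utterances shorter than 128 frames is omitted.

```python
import numpy as np

def sample_training_pair(utt_a, utt_b, seg_frames=128, rng=None):
    """Crop one random 128-frame segment from each of two utterances
    spoken by the same speaker: the first feeds the reconstruction
    path, the second the Siamese encoder Enc'_S of the SILM."""
    if rng is None:
        rng = np.random.default_rng()

    def crop(mel):
        # mel has shape (n_mel, T) with T >= seg_frames
        start = int(rng.integers(0, mel.shape[1] - seg_frames + 1))
        return mel[:, start:start + seg_frames]

    return crop(utt_a), crop(utt_b)
```

For example, with two (80, T) Mel-spectrograms of the same speaker, each call yields one (80, 128) pair forming a single training sample.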
TABLE I
COMPARISON OF DIFFERENT METHODS FOR MANY-TO-MANY AND ONE-SHOT VC (WITH 95% CONFIDENCE INTERVALS).

                |             Many-to-Many                |               One-Shot
VC Method       | MCD↓        | MOS↑        | VSS↑        | MCD↓        | MOS↑        | VSS↑
AdaIN-VC [12]   | 7.64 ± 0.24 | 3.11 ± 0.13 | 2.82 ± 0.19 | 7.38 ± 0.14 | 3.04 ± 0.21 | 2.45 ± 0.16
AUTOVC [11]     | 6.68 ± 0.21 | 2.76 ± 0.16 | 2.45 ± 0.23 | 8.08 ± 0.15 | 2.68 ± 0.17 | 2.39 ± 0.14
VQMIVC [24]     | 6.24 ± 0.13 | 3.20 ± 0.14 | 3.32 ± 0.12 | 5.59 ± 0.10 | 3.16 ± 0.18 | 3.02 ± 0.18
LIMI-VC [38]    | 7.27 ± 0.17 | 3.12 ± 0.13 | 3.01 ± 0.26 | 7.88 ± 0.18 | 2.97 ± 0.22 | 2.74 ± 0.24
MAIN-VC_large   | 5.17 ± 0.14 | 3.41 ± 0.21 | 3.35 ± 0.12 | 5.31 ± 0.09 | 3.33 ± 0.17 | 3.23 ± 0.17
MAIN-VC         | 5.28 ± 0.11 | 3.44 ± 0.12 | 3.25 ± 0.16 | 5.42 ± 0.13 | 3.24 ± 0.18 | 3.29 ± 0.11
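The MCD column of Table I measures the spectral distance between converted and ground-truth speech in the Mel-cepstral domain. A simplified per-frame form of the metric can be sketched as follows, assuming the two Mel-cepstral sequences are already time-aligned (practical evaluations usually align them with DTW and exclude the 0th, energy, coefficient beforehand); this is a generic MCD sketch, not the paper's exact evaluation script.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    """Mean MCD in dB over aligned frames.
    mc_ref, mc_conv: arrays of shape (T, D) holding Mel-cepstral
    coefficients, with the energy coefficient assumed already removed."""
    diff = np.asarray(mc_ref) - np.asarray(mc_conv)
    # Standard MCD constant: 10 / ln(10) * sqrt(2) applied per frame
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

Lower values indicate converted spectra closer to the target ground truth, which is why MCD is reported with a downward arrow in Table I.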
It’s worth noting that MAIN-VC requires at least two each model can be freely replaced with an appropriate one.
forward and backward passes in each training iteration, as the The inference time is measured and averaged using the same
CMI module needs extra training. In other words, the training utterance pairs with each model on the same device (1× Tesla
process involves the following two steps in each iteration: V100).
1) Training the integrated CMI network, the gap between
C. Experimental Results
upper and down bounds is combined with the learning
losses. This step can be repeated several times in one Table I showcases the comparison of MAIN-VC and
iteration (five times in our experiment). baseline methods in different voice conversion scenarios. The
2) Computing the four loss components to obtain the to- proposed non-lightweight model (denoted as MAIN-VClarge )
tal for the training of the entire model (one forward is also taken into consideration. For many-to-many VC,
and backward propagation, excluding the pre-trained MAIN-VC achieves comparable performance with baseline
vocoder). methods. MAIN-VC outperforms other methods in one-shot
MAIN-VC is trained with the Adam optimizer [36] with VC, especially on the objective metric. As for the lightweight
β1 = 0.9, β2 = 0.99, ϵ = 10−6 , lr = 1 × 10−4 . Another of each model, the comparison of the parameter count and
Adam optimizer with lr = 2 × 10−4 is built to train the CMI inference time cost (exclude vocoder) is presented in Table
network. During the inference stage, a pre-trained WaveRNN II. MAIN-VC reaches 59% of reduction in the parameter
[37] is used as the vocoder to generate waveform from count and 33% of reduction in the inference cost, compared
Mel-spectrogram. We design the weights in the total loss to the lightest baseline methods respectively. The proposed
function Eq. 10 based on the numerical characteristics of each model has an obvious advantage in parameter count thanks to
component, λ1 and λ3 gradually increase to 1 as the number the proposed APC module and the parameter-sharing strategy,
of iterations progresses, while λ2 is set to 1. We compare also inferences less costly due to its lightweight network.
our proposed models with AdaIN-VC [12], AUTOVC [11], Overall, MAIN-VC strikes a balance between lightweight
VQMIVC [24] and LIMI-VC [38]. Our code is available at design and performance, obtaining a light network structure
https://github.com/PecholaL/MAIN-VC. while ensuring the quality of voice conversion.
Fig. 3 showcases Mel-spectrograms of utterances in a one-
B. Metrics and Measurement shot VC task, including the source speech, ground-truth speech
The following subjective and objective metrics are used to (i.e. the same content to source speech spoken by the target
evaluate the performance of different voice conversion models. speaker, not the target speech in the inference), converted
Mean opinion score (MOS) is a common subjective metric speeches via MAIN-VC. The details of the spectrograms
which assesses the naturalness of speech, with scores ranging show that the source speaker and target speaker own dif-
from 1 to 5, where higher scores indicate higher speech quality. ferent acoustic characteristics (as in Fig. 3(a) and Fig. 3(b),
Voice similarity score (VSS), a subjective metric, evaluates the distributions and shapes of the harmonic from different
the similarity between the converted speech and the ground- speakers are quite different). MAIN-VC can extract the
truth speech from the target speaker. Higher scores indicate feature of the target speaker and construct the converted Mel-
higher similarity, representing better VC performance. spectrogram that is close to the ground-truth. MAIN-VC
Mel-cepstral distortion (MCD) is an objective metric to is also capable of performing cross-lingual voice conversion
measure the difference between the converted speech and the (XVC), another one-shot VC scenario, where the source and
target ground-truth speech in the aspect of Mel-cepstral. target utterances are in different languages. We conduct XVC
Both many-to-many and one-shot (any-to-any) VC are con- on AISHELL [39] (Mandarin speech corpus) and VCTK [35]
ducted while refraining from using parallel data during the (English speech corpus). More audio samples are available at
inference stage. 4 pairs of source/target speech are input into https://largeaudiomodel.com/main-vc/.
each VC model for evaluation in two scenarios respectively.
Then 8 participants are invited to score the speech samples. D. Ablation Study
For the measurement of lightweight, we choose parameter To validate the effects of SILM and CMI on disentan-
count and inference time cost as the metrics. The measure- glement, we conduct ablation experiments. We set up the
ments for both do not include the vocoders, as the vocoder in following models:
[Fig. 3. The Mel-spectrogram samples of the one-shot VC task "Please call Stella": (a) source speech, (b) ground-truth speech, (d) converted speech via MAIN-VC.]

TABLE II
COMPARISON OF PARAMETER COUNT AND INFERENCE TIME COST (W/O VOCODER).