Icassp40776 2020 9053737

ARTIFICIAL BANDWIDTH EXTENSION USING
CONDITIONAL VARIATIONAL AUTO-ENCODERS AND ADVERSARIAL LEARNING
Pramod Bachhav, Massimiliano Todisco and Nicholas Evans
EURECOM, Sophia Antipolis, France

{bachhav,todisco,evans}@eurecom.fr
ABSTRACT Probabilistic deep generative models such as variational auto-

encoders (VAEs) and their conditional variant (CVAEs) are capable
Artificial bandwidth extension (ABE) algorithms have been devel- of modeling complex data distributions. In contrast to bottleneck
oped to estimate missing highband frequency components (4-8kHz) features learned by stacked auto-encoders (SAEs), the latent rep-
to improve quality of narrowband (0-4kHz) telephone calls. Most resentation is probabilistic and can be used to generate new data.
ABE solutions employ deep neural networks (DNNs) due to their Inspired by their successful use in image processing [15, 16], they
well-known ability to model highly complex, non-linear relation- have become increasingly popular in numerous fields of speech pro-
ship between narrowband and highband features. Generative mod- cessing (e.g., speech modelling and transformation [17], voice con-
els such as conditional variational auto-encoders (CVAEs) are capa- version [18]) and neural machine translation (NMT) [19, 20]. The
ble of modelling complex data distributions via latent representation performance of CVAEs can further be improved by their combina-
learning. This paper reports their application to ABE. CVAEs, form tion with generative adversarial networks (GANs) [21]. Despite their
of directed, graphical models, are exploited to model the probabil- capability of modelling data distributions, CVAEs have not been in-
ity distribution of highband features conditioned on narrowband fea- vestigated in regression tasks such as ABE.
tures. While CVAEs are trained with the standard mean square crite- Inspired by the approaches presented in [21,22] for image gener-
rion (MSE), their combination with adversarial learning give further ation task, the work reported in this paper aims to explore the use of
improvements. When compared to results obtained with the base- generative modelling techniques to further improve performance of
line approach, the wideband PESQ is improved significantly by 0.21 a baseline DNN. In particular, we exploit CVAEs to model distribu-
points. The performance is also compared on an automatic speech tion of HB features where the conditioning variable of the CVAE is
recognition (ASR) task on the TIMIT dataset where word error rate derived from NB features via an auxillary neural network. The per-
(WER) is decreased by an absolute value of 0.3%. formance of the proposed CVAE architecture is further improved via
Index Terms— variational auto-encoder, generative adversarial its combination with a GAN. The novel contributions of this work
network, latent variable, artificial bandwidth extension, speech qual- are; (i) the application of CVAEs to ABE for estimation of miss-
ity ing HB features from available NB features; (ii) the combination of
CVAE with a probabilistic encoder in the form of an auxillary neu-
ral network and their joint optimisation; (iii) adversarial training of
1. INTRODUCTION the proposed CVAE architecture to further improve the ABE perfor-
mance.
Legacy narrowband (NB) networks and devices typically support The remainder of this paper is organised as follows. Section 2
bandwidths of 0-4kHz. Today’s wideband (WB) networks support describes a baseline ABE algorithm. Section 3 explains the proposed
bandwidths of 50Hz-8kHz and thus provide improved speech qual- CVAE scheme and its combination with GAN for ABE. Experimen-
ity. While the transition from NB to WB networks will require tal setup and results are described in Section 4 and conclusions are
significant investments and time [1], artificial bandwidth extension presented in Section 5.
(ABE) algorithms have been developed to improve speech qual-
ity when WB devices are used with NB devices or infrastructure.
ABE methods estimate missing highband (HB) frequency compo- 2. BASELINE ABE ALGORITHM
nents above 4kHz from available NB components, typically using a
regression model learned from WB training data. Fig. 1 illustrates the baseline ABE system. It is identical to the
ABE algorithms use either a classical source-filter model [2, source-filter model based approach presented in [23]. The algorithm
3] or operate directly on complex short-term spectral estimates [4, is described in brief in two blocks: estimation and resynthesis.
5]. Estimation is usually performed via a regression approach using During estimation, a NB speech frame sNB of 30 ms duration
conventional Gaussian mixture [2, 6, 7] and hidden Markov mod- with a sampling rate of 16kHz is processed using a 512-point FFT in
els [8, 9]. Several other approaches exploit superiority of deep neu- order to extract 128-dimensional NB log power spectrum (LPSNB )
ral networks (DNNs) to model non-linear relationship between NB coefficients xNB . Mean and variance normalisation (mvnx ) is then
and HB components using Gaussian Bernoulli restricted boltzmann applied to obtain xNB
mvn . After concatenation with the coefficients ob-
machines (GBRBMs), deep recurrent-neural networks (RNNs) with tained from 2 neighbouring frames, the resulting 640-dimensional
Long short-term memory (LSTM) cells [10, 11], recurrent temporal concatenated vector xNB conc 2 is then fed to a DNN to estimate 10-
HB
restricted Boltzmann machines (RTRBMs) [12]. Some approaches dimensional normalised HB features ŷmvn consisting of first 10 linear
perform ABE via direct modelling and generation of time-domain prediction cepstral coefficients (LPCCs). Inverse mean and variance
waveforms [13, 14]. normalisation (mvn−1 HB
y ) is then applied, giving HB features ŷ . The
978-1-5090-6631-5/20/$31.00 ©2020 IEEE 6924 ICASSP 2020

Estimation concatenation The variational lower bound on the conditional likelihood
sNB DNN
pθ (y|x) is then given by:
mvn−1 LPCCs
LPSNB mvnx y
mapping to LPCs
HB
x NB NB
xmvn NB
xconc_2
HB
ymvn y log pθ (y|x) ≥ L(θ, φ; x, y)
𝑔HB , aHB
= −DKL [qφ (z|y)||pθ (z)] + Eqφ (z|y) [log pθ (y|x, z)] (1)
SLPNB PSNB PSWB IFFT Levinson Durbin PSHB
1
𝑔NB , aNB 𝑔WB , aWB where DKL (·) acts as a regulariser which can be computed analyti-
cos(2πfM k) cally. In practice, the prior p(z) is assumed to be a centred isotropic
Analysis uHB Synthesis multivariate Gaussian N (z; 0, I) with no free parameters. The sec-
filter filter
uWB s WB ond term is approximated via sampling by L1 L (l)
P
2
uNB D 3 l=1 log pθ (y|x, z )
Resynthesis using L samples drawn from the recognition network qφ (z|y). Sam-
pling is performed using a differentiable deterministic mapping such
Fig. 1. A block diagram of the baseline ABE system. Diagram that z(l) = gφ (y, (l) ) = µz + (l) σz where (l) ∼ N (0, I).
adapted from [23]. µz = µ(y; φ) and σz = σ(y; φ) are outputs of the recognition
network qφ (z|y). This is called the reparameterization trick.
The output distribution pθ (y|x, z) in Eq. 1 is chosen to be Gaus-
HB LP coefficients ĝ HB , âHB are then calculated from the estimated sian with mean µx and covariance matrix σ 2 ∗ I, i.e., pθ (y|x, z) =
HB LPCCs (yHB ) via recursion. N (µx , σ 2 ∗ I) where µx is output of the decoder DNN µ(x, z; θ).
Resynthesis is performed in three steps. First (box in Fig. 1), Therefore second term in Eq. 1 can be re-written as:
LP parameters aNB , g NB are obtained from speech frame sNB via
L
selective linear prediction (SLPNB ) to get the NB power spectrum 1 X ky − µ(x, z(l) ; θ)k2
PSNB . This is then concatenated with the HB power spectrum PSHB Eqφ (z|y) [log pθ (y|x, z)] = C − (2)
L α
(obtained from estimated HB LP parameters ĝ HB , âHB ), giving the l=1
WB power spectrum PSWB , and hence estimated WB LP parameters where C is a constant that can be ignored during optimisation. The
ĝ WB , âWB . Second (box 2), the HB excitation ûHB is estimated from scalar α = 2σ 2 can be seen as a weighting factor between the KL-
the spectral translation of the NB excitation uNB with fM = 8 kHz divergence and the reconstruction term. In practice, L = 1 samples
(which corresponds to spectral folding around 4kHz). NB and HB are used per datapoint [24]. The lower bound L(·) forms the objec-
excitation components are then combined to obtain the extended WB tive function which can be optimized with respect to parameters θ
excitation ûWB . Finally (box 3), ûWB is filtered using a synthesis and φ using a stochastic gradient descent algorithm.
filter defined by ĝ WB and âWB in order to resynthesise speech frame
ŝWB . A conventional overlap and add (OLA) technique is used to
produce extended WB speech. 3.2. Generative adversarial networks
In regression framework, a GAN consists of two adversarial net-
3. APPLICATION OF CVAE AND GAN FOR ABE works, a generator G and a discriminator D. Generator maps an
input sample x to an output sample y. D takes form of a binary
classifier which predicts the probability D(x) that a given sample x
In this section we describe how CVAEs can be used for estimation
belongs to the training distribution and not the distribution modelled
of HB features from input NB features in an ABE task. The CVAEs
by G. D is thus trained to maximise the probability of assigning
are trained in an adversarial fashion to deliver improvements in ABE
a correct label to both training samples and samples from G [25].
performance.
This adversarial learning process is formulated as a minimax game
between G and D given according to the following objective func-
3.1. Conditional variational auto-encoders tion:
A conditional variational auto-encoder (CVAE) is a conditional, gen- min max V (G, D) = Ey [log(D(x))] + Ex [log(1 − D(G(x)))]
G D
erative model of the form pθ (y, z|x) = pθ (z)pθ (y|x, z). For a (3)
given input observation x, a latent variable z is drawn from a prior During training, G tries to fool D by generating samples close to the
distribution pθ (x) from which the posterior distribution pθ (y|x, z) training data so that D classifies G’s output as real. D then updates
generates the output y [15, 16]. its parameters in order to classify samples generated by G as fake.
CVAE maximises the conditional likelihood pθ (y|x, z) via the Both G and D are trained iteratively until the GAN converges to a
use of recognition/inference model qφ (z|y)1 (also referred to as good estimator of the true data distribution [25].
probabilistic encoder) which estimates the parameters φ of the pos-
terior distribution over all possible values of the latent variables z 3.3. Application to ABE
that may have generated the given datapoint y. The probabilistic
decoder pθ (y|x, z) then produces a distribution with parameters This section describes the proposed scheme for optimisation of
θ over all possible values of y for given z and x. For simplicity, CVAEs specifically tailored to ABE. The scheme is illustrated in
it is assumed that the approximate (qφ (z|y)) and true posteriors Fig. 2(a). Parallel training data consisting of NB and WB utterances
(pθ (z|y)) are diagonal multivariate Gaussian distributions whose is processed in frames of 30ms duration with 15ms overlap. Input
respective parameters φ and θ are computed using two different data x = xNBconc 2 consists of NB LPS coefficients with memory (as
DNNs. described in Section 2). The output data y = yHB mvn consists of first
10 LPCCs extracted from parallel HB data via SLP.
1 In our formulation, we assume that the latent variable z is dependent The CVAE is then trained to model the distribution of the out-
only on the output variable y i.e., qφ (z|x, y) = qφ (z|y) put y conditioned on the input x as follows. The HB data y is fed
6925
𝝁 𝒛y 𝐲 = 𝝁𝐲 y 𝐺 y = 𝜇y 𝐷 fake
𝐳𝐲
y 𝐷 real
x
𝐲
Fig. 3. An illustration of an adversarial training where G network is
formed using CVAE architecture shown in Fig. 2(a).
𝑞∅y (zy |y) 𝝈𝒛 y 𝜖 ~ 𝑁(𝟎, 𝐈) 𝑝𝜃y (y|𝑧x , 𝑧y ) for estimation of y where zy is sampled from the prior distribution
𝐳𝐱 pθy (zy ) = N (0, I) during testing (or estimation phase). It can
𝜖 ~ 𝑁(𝟎, 𝐈)
be noted that there exists a discrepancy during reconstruction and
estimation phases of CVAEs since y is not available during estima-
𝝈𝐳 𝐱 𝝁𝐳x tion [20].
𝑞∅x (zx |x) 3.4. Combining CVAE with GAN

NB
x = xconc_2
HB
In ABE framework, the reconstruction error term in the objective
y = ymvn x function given in Eq. 4 can be replaced by GAN based objective
loss function (described in Section 3.2) [1, 26]. In this work, this
(a) is achieved by forming a generator network G using the proposed
𝐳𝐱 CVAE scheme (which is described in Section 3.3 and shown in
𝜇z x 𝑝𝜃y (zx , zy ) y = 𝜇y Fig. 2(a)). As shown in Fig. 3, G then trained to reconstruct ŷ from
x 𝑞∅x (zx |x) the inputs x and y. D is then optimised to classify the generated (ŷ)
𝜎𝑧x and real (y) samples correctly. Both G and D are trained simulta-
𝐳𝐲 ∼ 𝑝𝜃y zy = 𝑁(𝟎, 𝐈) neously. The objective function of the CVAE-GAN scheme is thus
𝜖 ~ 𝑁(𝟎, 𝐈) given according to:
(b) L(θy , φy , φx ; zx , y) = −DKL [qφy (zy |y)||pθy (zy )]

− min max V (G, D) + λL1 (G) (5)
Fig. 2. (a) The proposed CVAE-DNN scheme during training (or G D
reconstruction) phase and (b) testing (or prediction) phase. The error term L1 (G) = Ey ky − ŷk1 is added in order to minimise
the distance between generated (ŷ) and real (y) samples, the trick
that helps to stabilise GAN training [27]. After adversarial train-
to the encoder qφy (zy |y) (top-left network in Fig. 2(a)) in order ing of the CVAE, a DNN (denoted as CVAE-GAN) is formed from
to predict the mean µzy and log-variance log(σz2y ) of the approxi- G network in a similar fashion as described in 3.3 and shown in
mate posterior distribution qφy (zy |y). The predicted parameters are Fig. 2(b)).
then used to obtain the latent representation zy ∼ qφy (zy |y) of the
output variable y via the reparameterization trick (see Section 3.1). 4. EXPERIMENTAL SETUP AND RESULTS
The encoder qφx (zx |x) (bottom of Fig. 2(a)) is fed with input data
x in order to predict the mean µzx and log-variance log(σz2x ) that This section describes the databases used for ABE experiments,
represent the posterior distribution qφx (zx |x). The latent variable baseline algorithm, configuration details of CVAE and GAN ar-
zx ∼ qφx (zx |x) is then used as the CVAE conditioning variable. chitectures, and results. Experiments are designed to compare the
After concatenation, zx and zy are fed to the decoder pθy (y|zx , zy ) performance of ABE systems that use CVAE-DNN and CVAE-GAN
(top-right network) in order to predict the mean µy = µ(zx , zy ; θ) with that uses a DNN trained with conventional MSE criterion.
of the output variable y. Finally, the entire network is trained to
learn parameters φx , φy and θy jointly. From Eqs. 1 and 2, the 4.1. Database
equivalent variational lower bound under optimisation is given by:
The dataset by Valentini et al. [28] was used for training and vali-
log pθy (y|zx ) ≥ L(θy , φy , φx ; zx , y) = dation. The training set consists of 11572 sentences spoken by 28
English speakers at a sampling rate of 48kHz. Parallel NB and WB
L
1 X ky − µ(zx , zy (l) ; θy )k2 speech signals were created by downsampling the original files to
− DKL [qφy (zy |y)||pθy (zy )]−
L α 8 and 16kHz respectively. While 80% of feature vectors extracted
l=1
(4) from the training set were used for training DNN models, remaining
20% samples were used for validation. The acoustically-different
It is expected that, during optimisation of Eq. 4, parameters φx , φy TIMIT core test subset [29] was used for testing.
and θy are jointly updated so that the framework learns to recon-
struct HB data y from the input data x. 4.2. CVAE configuration and training
Finally after training (i.e. reconstruction phase), the encoder
The CVAE architecture2 is implemented using the Keras toolkit [30].
qφx (zx |x) and the decoder pθy (y|zx , zy ) (signified by the red com-
Encoders qφx (zx |x) and qφy (zy |y) consist of one hidden layer with
ponents in Fig. 2(a)) networks are used to form a DNN (denoted
as CVAE-DNN) with two stochastic layers zx and zy . The pro- 2 Implementation is available at https://github.com/
posed CVAE-DNN scheme is illustrated in Fig. 2(b). It can be used bachhavpramod/bandwidth_extension
6926
Table 1. Objective assessment results. Lower values of RMS-LSD Table 2. ASR WER performance in % on the TIMIT core test subset.
(in dB) reflect better performance whereas higher values of segSNR Regression
(in dB) and MOS-LQOWB values indicate better performance. mono tri1 tri2 tri3
method
Regression method RMS-LSD segSNR WB-PESQ up-NB 39.4 35.7 31.9 27.2
DNN 8.90 17.09 3.35 DNN 35.4 29.3 26.2 24.4
CVAE-DNN 9.11 17.54 3.41 CVAE-DNN 35.1 29.2 25.9 24.3
CVAE-GAN 10.21 17.81 3.56 CVAE-GAN 34.5 28.8 26.1 24.1
WB 32.2 25.5 23.7 21.3
128 units, and 640 and 10 units for input layers respectively. Their
outputs are Gaussian-distributed latent variable layers zx and zy
consisting of 64 units for the means µzx , µzy and log-variances However, obtaining reliable quality estimates via subjective tests is
σzx , σzy . The decoder pθy (y|zx , zy ) consist of one hidden layer time-consuming and expensive. Thus, we evaluated merit of the pro-
with 128 units and an output layer with 10 units respectively. All posed methods in terms of WER performance for an ASR task.
hidden layers have tanh activation units whereas Gaussian parameter Four GMM-HMM based ASR systems were trained (using the
layers have linear activation units. The modelling of log-variances Kaldi toolkit [36]) on WB speech signals obtained from the TIMIT
avoids the estimation of negative variances. training set using monophone (mono) and triphone (tri1, tri2 and
Training is performed in order to minimise the negative con- tri3) modelling schemes. Note that the aim of this experimental setup
ditional log-likelihood in Eq. 4 using the Adam stochastic optimi- is to investigate performance of different ABE algorithms using sim-
sation technique [31] with an initial learning rate of 10−3 and hy- ple ASR systems. The performance of the ASR systems was tested
perparameters β1 = 0.9, β2 = 0.999 and = 10−8 . The hy- using the core TIMIT test subset. First, all WB test signals were pro-
perparameter α (in Eq. 4) is set to 10 (i.e., dimension of HB fea- cessed to produce upsampled NB (up-NB) signals at a sampling rate
ture vectors y) according to our previous investigations reported of 16kHz. The up-NB signals were then bandwidth-extended using
in [32]. Networks are initialised according to the approach described DNN, CVAE-DNN and CVAE-GAN based ABE systems. ASR was
in [33] so as to improve the rate of convergence. To discourage over- then performed for these test signals using a model that was trained
fitting, batch-normalisation [34] is applied before every activation on original WB signals.
layer. The learning rate is reduced by half when the validation loss The WER results are shown in Table 2. Highest WER (indi-
increases between 5 consecutive epochs. The CVAE is trained for a cates lower performance) is obtained for up-NB signals when tested
50 epochs using input x and output y data with a batch size of 512. with the ASR model trained on WB data, this is obvious because
The model giving the lowest validation loss is used for subsequent HB frequency content is missing in up-NB signals. While all ABE
processing. approaches achieve lower WERs than up-NB signals, the proposed
During adversarial training, the generator G network comprises CVAE based approaches consistently improve ASR performance
the same CVAE architecture described above. The discriminator D over the DNN based baseline method. The CVAE-DNN outper-
is a binary classifier with sigmoid activation unit and it consists of forms DNN by WER improvement of 0.3% for tri2 ASR system.
three hidden layers with 512 tanh activation units. The hyperparam- CVAE-GAN achieves lower WER by 0.9% (mono), 0.5% (tri1) and
eter λ (in Eq. 5) is set to 1. The GAN is trained for 30000 iter- 0.3% for remaining ASR systems. The WER assessment thus indi-
ations with a batch size of 512 using same optimization technique cates that the proposed ABE systems produce bandwidth-extended
and learning rate. signals closer to original WB signals. Few speech samples are
The performance of the proposed schemes is compared with a available at http://audio.eurecom.fr/content/media.
baseline DNN of same architecture that is trained using a MSE cri-
terion. All networks have a common structure of (128,128,128) hid-
den units in order to maintain low complexity given that ABE is a 5. CONCLUSIONS
real-time application.
Conditional variational auto-encoders (CVAEs) are directed graph-
4.3. Objective assessment ical models that are used for generative modelling. This paper re-
ports their application to ABE for modelling distribution of high-
Objective assessment metrics include root mean square log-spectral band features. The conditioning variable of the CVAE is derived
distortion (RMS-LSD) (calculated for a frequency range of 4-8kHz), from narrowband features via an auxillary neural network. CVAEs
segmental signal-to-noise ratio (segSNR) and a WB extension to the are also combined with GAN framework via adversarial training in
perceptual evaluation of speech quality algorithm (WB-PESQ) [35]. order to seek further improvements. The proposed approaches pro-
The latter gives objective estimates of mean opinion scores. duce of speech of substantially better quality which is confirmed
Results are presented in Table 1. The proposed approaches via improvements in WB-PESQ, segSNR estimates. The merit of
(CVAE-DNN and CVAE-GAN) outperform baseline DNN in terms the proposed approach is further assessed via improvements in ASR
segSNR and WB-PESQ. Surprisingly, the baseline DNN achieves performance. Crucially the improvements are achieved without aug-
better (lower) RMS-LSD value than the proposed schemes. menting complexity of the baseline regression model. Future work
should investigate why spectral distance measure did not show cor-
4.4. Word error rate (WER) assessments on ASR task relation with the other performance metrics. Better CVAE training
strategies in order to reduce the discrepancy during reconstruction
Objective metrics may not provide accurate estimates, subjective lis- and estimation phases should bring further improvements to ABE
tening tests are thus necessary in order to assess ABE performance. performance.
6927
6. REFERENCES [18] C.-C. Hsu et al., “Voice conversion from non-parallel corpora
using variational auto-encoder,” in Proc. of IEEE APSIPA,
[1] S. Li, S. Villette, P. Ramadas, and D. Sinder, “Speech band- 2016, pp. 1–6.
width extension using generative adversarial networks,” in
[19] B. Zhang, D. Xiong, J. Su, H. Duan, and M. Zhang,
Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Pro-
“Variational neural machine translation,” arXiv preprint
cessing (ICASSP), 2018, pp. 5029–5033.
arXiv:1605.07869, 2016.
[2] K.-Y. Park and H. Kim, “Narrowband to wideband conver-
[20] A. Pagnoni, K. Liu, and S. Li, “Conditional variational au-
sion of speech using GMM based transformation,” in Proc. of
toencoder for neural machine translation,” arXiv preprint
IEEE Int. Conf. on Acoustics, Speech, and Signal Processing
arXiv:1812.04405, 2018.
(ICASSP), 2000, pp. 1843–1846.
[21] A. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther,
[3] P. Jax and P. Vary, “On artificial bandwidth extension of tele-
“Autoencoding beyond pixels using a learned similarity met-
phone speech,” Signal Processing, vol. 83, no. 8, pp. 1707–
ric,” arXiv preprint arXiv:1512.09300, 2015.
1719, 2003.
[22] J. Bao et al., “CVAE-GAN: fine-grained image generation
[4] K. Li and C.-H. Lee, “A deep neural network approach to
through asymmetric training,” in Proc. of the IEEE Int. Conf.
speech bandwidth expansion,” in Proc. of IEEE Int. Conf. on
on Computer Vision, 2017, pp. 2745–2754.
Acoustics, Speech and Signal Processing (ICASSP), 2015, pp.
4395–4399. [23] P. Bachhav, M. Todisco, and N. Evans, “Exploiting explicit
[5] P. Bachhav, M. Todisco, M. Mossi, C. Beaugeant, and N. memory inclusion for artificial bandwidth extension,” in Proc.
Evans, “Artificial bandwidth extension using the constant Q of IEEE Int. Conf. on Acoustics, Speech and Signal Processing
transform,” in Proc. of IEEE Int. Conf. on Acoustics, Speech (ICASSP), 2018, pp. 5459–5463.
and Signal Processing (ICASSP), 2017, pp. 5550–5554. [24] D. Kingma and M. Welling, “Auto-encoding variational
[6] A. Nour-Eldin and P. Kabal, “Mel-frequency cep- bayes,” arXiv preprint arXiv:1312.6114, 2013.
stral coefficient-based bandwidth extension of narrowband [25] I. Goodfellow et al., “Generative adversarial nets,” in Ad-
speech,,” in Proc. of INTERSPEECH, 2008, pp. 53–56. vances in neural information processing systems, 2014, pp.
[7] P. Bachhav et al., “Artificial bandwidth extension with mem- 2672–2680.
ory inclusion using semi-supervised stacked auto-encoders,” in [26] J. Sautter et al., “Artificial bandwidth extension using a condi-
Proc. of INTERSPEECH, 2018, pp. 1185–1189. tional generative adversarial network with discriminative train-
[8] I. Katsir, D. Malah, and I. Cohen, “Evaluation of a speech ing,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal
bandwidth extension algorithm based on vocal tract shape es- Processing (ICASSP), 2019, pp. 7005–7009.
timation,” in Proc. of Int. Workshop on Acoustic Signal En- [27] S. Pascual et al., “SEGAN: Speech enhancement generative
hancement (IWAENC). VDE, 2012, pp. 1–4. adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
[9] J. Abel, M. Strake, and T. Fingscheidt, “Artificial bandwidth [28] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi,
extension using deep neural networks for spectral envelope es- “Investigating RNN-based speech enhancement methods for
timation,” in Proc. of Int. Workshop on Acoustic Signal En- noise-robust text-to-speech.,” in SSW, 2016, pp. 146–152.
hancement (IWAENC). IEEE, 2016, pp. 1–5.
[29] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, and D. Pallett,
[10] Y. Gu, Z.-H. Ling, and L.-R. Dai, “Speech bandwidth exten- “DARPA TIMIT acoustic-phonetic continous speech corpus
sion using bottleneck features and deep recurrent neural net- CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon tech-
works.,” in Proc. of INTERSPEECH, 2016, pp. 297–301. nical report N, vol. 93, 1993.
[11] K. Schmidt and B. Edler, “Blind bandwidth extension based [30] F. Chollet et al., “Keras,” https://github.com/
on convolutional and recurrent deep neural networks,” in in keras-team/keras, 2015.
Proc. of Int. Conf. on Acoustics, Speech and Signal Processing
[31] D. Kingma and J. Ba, “Adam: A method for stochastic opti-
(ICASSP), 2018, pp. 5444–5448.
mization,” arXiv preprint arXiv:1412.6980, 2014.
[12] Y. Wang et al., “Speech bandwidth extension using recurrent
[32] P. Bachhav et al., “Latent representation learning for artifi-
temporal restricted boltzmann machines,” IEEE Signal Pro-
cial bandwidth extension using a conditional variational auto-
cessing Letters, pp. 1877–1881, 2016.
encoder,” in Proc. of IEEE Int. Conf. on Acoustics, Speech and
[13] V. Kuleshov et al., “Audio super resolution using neural net- Signal Processing (ICASSP), 2019, pp. 7010–7014.
works,” arXiv preprint arXiv:1708.00853, 2017.
[33] K. He et al., “Delving deep into rectifiers: Surpassing human-
[14] T. Lim, R. Yeh, Y. Xu, M. Do, and M. Hasegawa-Johnson, level performance on imagenet classification,” in Proc. of the
“Time-frequency networks for audio super-resolution,” in IEEE int. conf. on computer vision, 2015, pp. 1026–1034.
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Pro-
cessing (ICASSP), 2018, pp. 646–650. [34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
deep network training by reducing internal covariate shift,” in
[15] K. Sohn et al., “Learning structured output representation us- Int. conf. on machine learning, 2015, pp. 448–456.
ing deep conditional generative models,” in Advances in Neu-
ral Information Processing Systems, 2015, pp. 3483–3491. [35] “ITU-T Recommendation P.862.2 : Wideband extension to
Recommendation P.862 for the assessment of wideband tele-
[16] X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Con- phone networks and speech codecs,” ITU, 2005.
ditional image generation from visual attributes,” in European
Conference on Computer Vision. Springer, 2016, pp. 776–791. [36] D. Povey et al., “The kaldi speech recognition toolkit,” in IEEE
Workshop on automatic speech recognition and understanding.
[17] M. Blaauw and J. Bonada, “Modeling and transforming speech
IEEE Signal Processing Society, 2011.
using variational autoencoders.,” in INTERSPEECH, 2016, pp.
1770–1774. 6928

Icassp40776 2020 9053737

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Icassp40776 2020 9053737

Uploaded by

Copyright:

Available Formats

ARTIFICIAL BANDWIDTH EXTENSION USING

CONDITIONAL VARIATIONAL AUTO-ENCODERS AND ADVERSARIAL LEARNING

Pramod Bachhav, Massimiliano Todisco and Nicholas Evans

EURECOM, Sophia Antipolis, France

ABSTRACT Probabilistic deep generative models such as variational auto-

978-1-5090-6631-5/20/$31.00 ©2020 IEEE 6924 ICASSP 2020

𝑞∅x (zx |x) 3.4. Combining CVAE with GAN

You might also like