Professional Documents
Culture Documents
WB power spectrum PSWB , and hence estimated WB LP parameters where C is a constant that can be ignored during optimisation. The
ĝ WB , âWB . Second (box 2), the HB excitation ûHB is estimated from scalar α = 2σ 2 can be seen as a weighting factor between the KL-
the spectral translation of the NB excitation uNB with fM = 8 kHz divergence and the reconstruction term. In practice, L = 1 samples
(which corresponds to spectral folding around 4kHz). NB and HB are used per datapoint [24]. The lower bound L(·) forms the objec-
excitation components are then combined to obtain the extended WB tive function which can be optimized with respect to parameters θ
excitation ûWB . Finally (box 3), ûWB is filtered using a synthesis and φ using a stochastic gradient descent algorithm.
filter defined by ĝ WB and âWB in order to resynthesise speech frame
ŝWB . A conventional overlap and add (OLA) technique is used to
produce extended WB speech. 3.2. Generative adversarial networks
In regression framework, a GAN consists of two adversarial net-
3. APPLICATION OF CVAE AND GAN FOR ABE works, a generator G and a discriminator D. Generator maps an
input sample x to an output sample y. D takes form of a binary
classifier which predicts the probability D(x) that a given sample x
In this section we describe how CVAEs can be used for estimation
belongs to the training distribution and not the distribution modelled
of HB features from input NB features in an ABE task. The CVAEs
by G. D is thus trained to maximise the probability of assigning
are trained in an adversarial fashion to deliver improvements in ABE
a correct label to both training samples and samples from G [25].
performance.
This adversarial learning process is formulated as a minimax game
between G and D given according to the following objective func-
3.1. Conditional variational auto-encoders tion:
A conditional variational auto-encoder (CVAE) is a conditional, gen- min max V (G, D) = Ey [log(D(x))] + Ex [log(1 − D(G(x)))]
G D
erative model of the form pθ (y, z|x) = pθ (z)pθ (y|x, z). For a (3)
given input observation x, a latent variable z is drawn from a prior During training, G tries to fool D by generating samples close to the
distribution pθ (x) from which the posterior distribution pθ (y|x, z) training data so that D classifies G’s output as real. D then updates
generates the output y [15, 16]. its parameters in order to classify samples generated by G as fake.
CVAE maximises the conditional likelihood pθ (y|x, z) via the Both G and D are trained iteratively until the GAN converges to a
use of recognition/inference model qφ (z|y)1 (also referred to as good estimator of the true data distribution [25].
probabilistic encoder) which estimates the parameters φ of the pos-
terior distribution over all possible values of the latent variables z 3.3. Application to ABE
that may have generated the given datapoint y. The probabilistic
decoder pθ (y|x, z) then produces a distribution with parameters This section describes the proposed scheme for optimisation of
θ over all possible values of y for given z and x. For simplicity, CVAEs specifically tailored to ABE. The scheme is illustrated in
it is assumed that the approximate (qφ (z|y)) and true posteriors Fig. 2(a). Parallel training data consisting of NB and WB utterances
(pθ (z|y)) are diagonal multivariate Gaussian distributions whose is processed in frames of 30ms duration with 15ms overlap. Input
respective parameters φ and θ are computed using two different data x = xNBconc 2 consists of NB LPS coefficients with memory (as
DNNs. described in Section 2). The output data y = yHB mvn consists of first
10 LPCCs extracted from parallel HB data via SLP.
1 In our formulation, we assume that the latent variable z is dependent The CVAE is then trained to model the distribution of the out-
only on the output variable y i.e., qφ (z|x, y) = qφ (z|y) put y conditioned on the input x as follows. The HB data y is fed
6925
𝝁 𝒛y 𝐲 = 𝝁𝐲 y 𝐺 y = 𝜇y 𝐷 fake
𝐳𝐲
y 𝐷 real
x
𝐲
Fig. 3. An illustration of an adversarial training where G network is
formed using CVAE architecture shown in Fig. 2(a).
𝑞∅y (zy |y) 𝝈𝒛 y 𝜖 ~ 𝑁(𝟎, 𝐈) 𝑝𝜃y (y|𝑧x , 𝑧y ) for estimation of y where zy is sampled from the prior distribution
𝐳𝐱 pθy (zy ) = N (0, I) during testing (or estimation phase). It can
𝜖 ~ 𝑁(𝟎, 𝐈)
be noted that there exists a discrepancy during reconstruction and
estimation phases of CVAEs since y is not available during estima-
𝝈𝐳 𝐱 𝝁𝐳x tion [20].
reconstruction) phase and (b) testing (or prediction) phase. The error term L1 (G) = Ey ky − ŷk1 is added in order to minimise
the distance between generated (ŷ) and real (y) samples, the trick
that helps to stabilise GAN training [27]. After adversarial train-
to the encoder qφy (zy |y) (top-left network in Fig. 2(a)) in order ing of the CVAE, a DNN (denoted as CVAE-GAN) is formed from
to predict the mean µzy and log-variance log(σz2y ) of the approxi- G network in a similar fashion as described in 3.3 and shown in
mate posterior distribution qφy (zy |y). The predicted parameters are Fig. 2(b)).
then used to obtain the latent representation zy ∼ qφy (zy |y) of the
output variable y via the reparameterization trick (see Section 3.1). 4. EXPERIMENTAL SETUP AND RESULTS
The encoder qφx (zx |x) (bottom of Fig. 2(a)) is fed with input data
x in order to predict the mean µzx and log-variance log(σz2x ) that This section describes the databases used for ABE experiments,
represent the posterior distribution qφx (zx |x). The latent variable baseline algorithm, configuration details of CVAE and GAN ar-
zx ∼ qφx (zx |x) is then used as the CVAE conditioning variable. chitectures, and results. Experiments are designed to compare the
After concatenation, zx and zy are fed to the decoder pθy (y|zx , zy ) performance of ABE systems that use CVAE-DNN and CVAE-GAN
(top-right network) in order to predict the mean µy = µ(zx , zy ; θ) with that uses a DNN trained with conventional MSE criterion.
of the output variable y. Finally, the entire network is trained to
learn parameters φx , φy and θy jointly. From Eqs. 1 and 2, the 4.1. Database
equivalent variational lower bound under optimisation is given by:
The dataset by Valentini et al. [28] was used for training and vali-
log pθy (y|zx ) ≥ L(θy , φy , φx ; zx , y) = dation. The training set consists of 11572 sentences spoken by 28
English speakers at a sampling rate of 48kHz. Parallel NB and WB
L
1 X ky − µ(zx , zy (l) ; θy )k2 speech signals were created by downsampling the original files to
− DKL [qφy (zy |y)||pθy (zy )]−
L α 8 and 16kHz respectively. While 80% of feature vectors extracted
l=1
(4) from the training set were used for training DNN models, remaining
20% samples were used for validation. The acoustically-different
It is expected that, during optimisation of Eq. 4, parameters φx , φy TIMIT core test subset [29] was used for testing.
and θy are jointly updated so that the framework learns to recon-
struct HB data y from the input data x. 4.2. CVAE configuration and training
Finally after training (i.e. reconstruction phase), the encoder
The CVAE architecture2 is implemented using the Keras toolkit [30].
qφx (zx |x) and the decoder pθy (y|zx , zy ) (signified by the red com-
Encoders qφx (zx |x) and qφy (zy |y) consist of one hidden layer with
ponents in Fig. 2(a)) networks are used to form a DNN (denoted
as CVAE-DNN) with two stochastic layers zx and zy . The pro- 2 Implementation is available at https://github.com/
posed CVAE-DNN scheme is illustrated in Fig. 2(b). It can be used bachhavpramod/bandwidth_extension
6926
Table 1. Objective assessment results. Lower values of RMS-LSD Table 2. ASR WER performance in % on the TIMIT core test subset.
(in dB) reflect better performance whereas higher values of segSNR Regression
(in dB) and MOS-LQOWB values indicate better performance. mono tri1 tri2 tri3
method
Regression method RMS-LSD segSNR WB-PESQ up-NB 39.4 35.7 31.9 27.2
DNN 8.90 17.09 3.35 DNN 35.4 29.3 26.2 24.4
CVAE-DNN 9.11 17.54 3.41 CVAE-DNN 35.1 29.2 25.9 24.3
CVAE-GAN 10.21 17.81 3.56 CVAE-GAN 34.5 28.8 26.1 24.1
WB 32.2 25.5 23.7 21.3
128 units, and 640 and 10 units for input layers respectively. Their
outputs are Gaussian-distributed latent variable layers zx and zy
consisting of 64 units for the means µzx , µzy and log-variances However, obtaining reliable quality estimates via subjective tests is
σzx , σzy . The decoder pθy (y|zx , zy ) consist of one hidden layer time-consuming and expensive. Thus, we evaluated merit of the pro-
with 128 units and an output layer with 10 units respectively. All posed methods in terms of WER performance for an ASR task.
hidden layers have tanh activation units whereas Gaussian parameter Four GMM-HMM based ASR systems were trained (using the
layers have linear activation units. The modelling of log-variances Kaldi toolkit [36]) on WB speech signals obtained from the TIMIT
avoids the estimation of negative variances. training set using monophone (mono) and triphone (tri1, tri2 and
Training is performed in order to minimise the negative con- tri3) modelling schemes. Note that the aim of this experimental setup
ditional log-likelihood in Eq. 4 using the Adam stochastic optimi- is to investigate performance of different ABE algorithms using sim-
sation technique [31] with an initial learning rate of 10−3 and hy- ple ASR systems. The performance of the ASR systems was tested
perparameters β1 = 0.9, β2 = 0.999 and = 10−8 . The hy- using the core TIMIT test subset. First, all WB test signals were pro-
perparameter α (in Eq. 4) is set to 10 (i.e., dimension of HB fea- cessed to produce upsampled NB (up-NB) signals at a sampling rate
ture vectors y) according to our previous investigations reported of 16kHz. The up-NB signals were then bandwidth-extended using
in [32]. Networks are initialised according to the approach described DNN, CVAE-DNN and CVAE-GAN based ABE systems. ASR was
in [33] so as to improve the rate of convergence. To discourage over- then performed for these test signals using a model that was trained
fitting, batch-normalisation [34] is applied before every activation on original WB signals.
layer. The learning rate is reduced by half when the validation loss The WER results are shown in Table 2. Highest WER (indi-
increases between 5 consecutive epochs. The CVAE is trained for a cates lower performance) is obtained for up-NB signals when tested
50 epochs using input x and output y data with a batch size of 512. with the ASR model trained on WB data, this is obvious because
The model giving the lowest validation loss is used for subsequent HB frequency content is missing in up-NB signals. While all ABE
processing. approaches achieve lower WERs than up-NB signals, the proposed
During adversarial training, the generator G network comprises CVAE based approaches consistently improve ASR performance
the same CVAE architecture described above. The discriminator D over the DNN based baseline method. The CVAE-DNN outper-
is a binary classifier with sigmoid activation unit and it consists of forms DNN by WER improvement of 0.3% for tri2 ASR system.
three hidden layers with 512 tanh activation units. The hyperparam- CVAE-GAN achieves lower WER by 0.9% (mono), 0.5% (tri1) and
eter λ (in Eq. 5) is set to 1. The GAN is trained for 30000 iter- 0.3% for remaining ASR systems. The WER assessment thus indi-
ations with a batch size of 512 using same optimization technique cates that the proposed ABE systems produce bandwidth-extended
and learning rate. signals closer to original WB signals. Few speech samples are
The performance of the proposed schemes is compared with a available at http://audio.eurecom.fr/content/media.
baseline DNN of same architecture that is trained using a MSE cri-
terion. All networks have a common structure of (128,128,128) hid-
den units in order to maintain low complexity given that ABE is a 5. CONCLUSIONS
real-time application.
Conditional variational auto-encoders (CVAEs) are directed graph-
4.3. Objective assessment ical models that are used for generative modelling. This paper re-
ports their application to ABE for modelling distribution of high-
Objective assessment metrics include root mean square log-spectral band features. The conditioning variable of the CVAE is derived
distortion (RMS-LSD) (calculated for a frequency range of 4-8kHz), from narrowband features via an auxillary neural network. CVAEs
segmental signal-to-noise ratio (segSNR) and a WB extension to the are also combined with GAN framework via adversarial training in
perceptual evaluation of speech quality algorithm (WB-PESQ) [35]. order to seek further improvements. The proposed approaches pro-
The latter gives objective estimates of mean opinion scores. duce of speech of substantially better quality which is confirmed
Results are presented in Table 1. The proposed approaches via improvements in WB-PESQ, segSNR estimates. The merit of
(CVAE-DNN and CVAE-GAN) outperform baseline DNN in terms the proposed approach is further assessed via improvements in ASR
segSNR and WB-PESQ. Surprisingly, the baseline DNN achieves performance. Crucially the improvements are achieved without aug-
better (lower) RMS-LSD value than the proposed schemes. menting complexity of the baseline regression model. Future work
should investigate why spectral distance measure did not show cor-
4.4. Word error rate (WER) assessments on ASR task relation with the other performance metrics. Better CVAE training
strategies in order to reduce the discrepancy during reconstruction
Objective metrics may not provide accurate estimates, subjective lis- and estimation phases should bring further improvements to ABE
tening tests are thus necessary in order to assess ABE performance. performance.
6927
6. REFERENCES [18] C.-C. Hsu et al., “Voice conversion from non-parallel corpora
using variational auto-encoder,” in Proc. of IEEE APSIPA,
[1] S. Li, S. Villette, P. Ramadas, and D. Sinder, “Speech band- 2016, pp. 1–6.
width extension using generative adversarial networks,” in
[19] B. Zhang, D. Xiong, J. Su, H. Duan, and M. Zhang,
Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Pro-
“Variational neural machine translation,” arXiv preprint
cessing (ICASSP), 2018, pp. 5029–5033.
arXiv:1605.07869, 2016.
[2] K.-Y. Park and H. Kim, “Narrowband to wideband conver-
[20] A. Pagnoni, K. Liu, and S. Li, “Conditional variational au-
sion of speech using GMM based transformation,” in Proc. of
toencoder for neural machine translation,” arXiv preprint
IEEE Int. Conf. on Acoustics, Speech, and Signal Processing
arXiv:1812.04405, 2018.
(ICASSP), 2000, pp. 1843–1846.
[21] A. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther,
[3] P. Jax and P. Vary, “On artificial bandwidth extension of tele-
“Autoencoding beyond pixels using a learned similarity met-
phone speech,” Signal Processing, vol. 83, no. 8, pp. 1707–
ric,” arXiv preprint arXiv:1512.09300, 2015.
1719, 2003.
[22] J. Bao et al., “CVAE-GAN: fine-grained image generation
[4] K. Li and C.-H. Lee, “A deep neural network approach to
through asymmetric training,” in Proc. of the IEEE Int. Conf.
speech bandwidth expansion,” in Proc. of IEEE Int. Conf. on
on Computer Vision, 2017, pp. 2745–2754.
Acoustics, Speech and Signal Processing (ICASSP), 2015, pp.
4395–4399. [23] P. Bachhav, M. Todisco, and N. Evans, “Exploiting explicit
[5] P. Bachhav, M. Todisco, M. Mossi, C. Beaugeant, and N. memory inclusion for artificial bandwidth extension,” in Proc.
Evans, “Artificial bandwidth extension using the constant Q of IEEE Int. Conf. on Acoustics, Speech and Signal Processing
transform,” in Proc. of IEEE Int. Conf. on Acoustics, Speech (ICASSP), 2018, pp. 5459–5463.
and Signal Processing (ICASSP), 2017, pp. 5550–5554. [24] D. Kingma and M. Welling, “Auto-encoding variational
[6] A. Nour-Eldin and P. Kabal, “Mel-frequency cep- bayes,” arXiv preprint arXiv:1312.6114, 2013.
stral coefficient-based bandwidth extension of narrowband [25] I. Goodfellow et al., “Generative adversarial nets,” in Ad-
speech,,” in Proc. of INTERSPEECH, 2008, pp. 53–56. vances in neural information processing systems, 2014, pp.
[7] P. Bachhav et al., “Artificial bandwidth extension with mem- 2672–2680.
ory inclusion using semi-supervised stacked auto-encoders,” in [26] J. Sautter et al., “Artificial bandwidth extension using a condi-
Proc. of INTERSPEECH, 2018, pp. 1185–1189. tional generative adversarial network with discriminative train-
[8] I. Katsir, D. Malah, and I. Cohen, “Evaluation of a speech ing,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal
bandwidth extension algorithm based on vocal tract shape es- Processing (ICASSP), 2019, pp. 7005–7009.
timation,” in Proc. of Int. Workshop on Acoustic Signal En- [27] S. Pascual et al., “SEGAN: Speech enhancement generative
hancement (IWAENC). VDE, 2012, pp. 1–4. adversarial network,” arXiv preprint arXiv:1703.09452, 2017.
[9] J. Abel, M. Strake, and T. Fingscheidt, “Artificial bandwidth [28] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi,
extension using deep neural networks for spectral envelope es- “Investigating RNN-based speech enhancement methods for
timation,” in Proc. of Int. Workshop on Acoustic Signal En- noise-robust text-to-speech.,” in SSW, 2016, pp. 146–152.
hancement (IWAENC). IEEE, 2016, pp. 1–5.
[29] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, and D. Pallett,
[10] Y. Gu, Z.-H. Ling, and L.-R. Dai, “Speech bandwidth exten- “DARPA TIMIT acoustic-phonetic continous speech corpus
sion using bottleneck features and deep recurrent neural net- CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon tech-
works.,” in Proc. of INTERSPEECH, 2016, pp. 297–301. nical report N, vol. 93, 1993.
[11] K. Schmidt and B. Edler, “Blind bandwidth extension based [30] F. Chollet et al., “Keras,” https://github.com/
on convolutional and recurrent deep neural networks,” in in keras-team/keras, 2015.
Proc. of Int. Conf. on Acoustics, Speech and Signal Processing
[31] D. Kingma and J. Ba, “Adam: A method for stochastic opti-
(ICASSP), 2018, pp. 5444–5448.
mization,” arXiv preprint arXiv:1412.6980, 2014.
[12] Y. Wang et al., “Speech bandwidth extension using recurrent
[32] P. Bachhav et al., “Latent representation learning for artifi-
temporal restricted boltzmann machines,” IEEE Signal Pro-
cial bandwidth extension using a conditional variational auto-
cessing Letters, pp. 1877–1881, 2016.
encoder,” in Proc. of IEEE Int. Conf. on Acoustics, Speech and
[13] V. Kuleshov et al., “Audio super resolution using neural net- Signal Processing (ICASSP), 2019, pp. 7010–7014.
works,” arXiv preprint arXiv:1708.00853, 2017.
[33] K. He et al., “Delving deep into rectifiers: Surpassing human-
[14] T. Lim, R. Yeh, Y. Xu, M. Do, and M. Hasegawa-Johnson, level performance on imagenet classification,” in Proc. of the
“Time-frequency networks for audio super-resolution,” in IEEE int. conf. on computer vision, 2015, pp. 1026–1034.
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Pro-
cessing (ICASSP), 2018, pp. 646–650. [34] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating
deep network training by reducing internal covariate shift,” in
[15] K. Sohn et al., “Learning structured output representation us- Int. conf. on machine learning, 2015, pp. 448–456.
ing deep conditional generative models,” in Advances in Neu-
ral Information Processing Systems, 2015, pp. 3483–3491. [35] “ITU-T Recommendation P.862.2 : Wideband extension to
Recommendation P.862 for the assessment of wideband tele-
[16] X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Con- phone networks and speech codecs,” ITU, 2005.
ditional image generation from visual attributes,” in European
Conference on Computer Vision. Springer, 2016, pp. 776–791. [36] D. Povey et al., “The kaldi speech recognition toolkit,” in IEEE
Workshop on automatic speech recognition and understanding.
[17] M. Blaauw and J. Bonada, “Modeling and transforming speech
IEEE Signal Processing Society, 2011.
using variational autoencoders.,” in INTERSPEECH, 2016, pp.
1770–1774. 6928