
2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE 2021)

Codec Network for Speech Bandwidth Extension


Chun-Dong Xu
School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi, China
e-mail: xuchundong@jxust.edu.cn

Xian-Peng Ling
School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi, China
e-mail: lxp_never@foxmain.com

Dong-Wen Ying*
School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi, China
e-mail: building2009@163.com

Abstract—To further improve the performance of speech bandwidth extension, this paper proposes an end-to-end codec network model for speech bandwidth extension. The model operates on the time-domain waveform and consists of an encoder network, a long short-term memory (LSTM) network, and a decoder network. The encoder network performs feature extraction and dimensionality reduction on the input data, the LSTM extracts the context-dependent information of the speech signal, and the decoder network reconstructs wideband speech from the latent features output by the LSTM. In addition, this paper proposes a time-frequency perception loss function that guides model training to generate more accurate time-domain waveforms and more realistic frequency-domain spectra. Experimental results show that the wideband speech reconstructed by the model achieves better results in both subjective and objective evaluation.

Keywords: codec; time-frequency perception loss; deep neural network; speech bandwidth extension

I. INTRODUCTION

In the current public switched telephone network and some wireless communication systems, the bandwidth of the transmitted speech signal is limited by the channel bandwidth, the speech acquisition equipment, and the codec methods, so only the low-frequency part from 0 to 4000 Hz is retained during communication. Although narrowband speech keeps the basic semantic information, speech that has lost its high-frequency part lacks a sense of presence, its auditory quality is poor, and detailed information such as timbre and emotion is lost. Speech bandwidth extension (BWE) aims to restore the high-frequency part of narrowband speech and improve speech quality and naturalness.

Earlier BWE methods mostly followed the source-filter model: based on the speech production process, they first generate the wideband excitation and spectral envelope and then synthesize wideband speech. Researchers subsequently proposed BWE methods based on codebook mapping [1], hidden Markov models [2, 3], and Gaussian mixture models [4, 5]. Although these methods achieve better results than the source-filter model, they still suffer from insufficient modeling ability and over-smoothing of the recovered high-frequency spectrum. In recent years, deep learning has been widely and successfully applied to speech signal processing. Kuleshov et al. [6] moved BWE from the frequency domain to the time domain and proposed an end-to-end BWE model based on CNNs. The model operates on the time-domain waveform: the input is the narrowband speech waveform and the output is the wideband speech waveform. Although this method produces better reconstructed wideband speech, the model is complex, difficult to run in real time, and fits different types of speech data poorly. Ling et al. [7] proposed a BWE model with a fully RNN-based structure, also modeled at the time-domain waveform level. Thanks to the good fitting ability of RNNs on time-series data, the reconstructed wideband speech has good quality, but an RNN must finish processing one time step before it can start processing the next, so parallel computation is impossible and the efficiency of this fully stacked RNN structure is very low. Lim et al. [8] and Dong et al. [9] moved time-domain BWE to the time-frequency domain and proposed time-frequency networks (TF-Net), which use two identical neural networks, one fitting the time-domain waveform and the other fitting the frequency-domain spectrum. Although the time-frequency network generates higher-quality wideband speech, its modeling is complicated and it places a heavy computational load on terminal devices. Wang et al. [10] proposed a time-frequency loss function that guides the model to learn both time-domain and frequency-domain features, obtaining results similar to TF-Net while avoiding an overly complex model.

This paper proposes a time-domain end-to-end codec network model for BWE. After multi-layer dimensionality reduction of the high-dimensional input data, an LSTM is used in the bottleneck layer to extract the context-dependent information of the temporal features. In addition, we design a time-frequency perception loss function that encourages the model to learn a more accurate mapping between narrowband and wideband speech in the time domain, frequency domain, and perceptual domain. The remainder of this paper is organized as follows: Section II describes the proposed model, Section III introduces the time-frequency perception loss function, Section IV presents the subjective and objective evaluation experiments, and Section V concludes.

II. CODEC NETWORK MODEL

This paper proposes a BWE model based on a codec network; its structure is shown in Fig. 1. The model is composed of an encoder, an LSTM layer, and a decoder. To prevent the model from becoming insensitive to the data as the number of network layers grows, additive skip connections are used to encourage the model to learn differentiated features; that is, the input of decoder block Decoder_block_(i) in the decoder is
the output of the previous layer added to the output of Encoder_block_(L-i+1) in the encoder.

Figure 1. BWE model based on the codec network. L represents the number of layers of the codec, N is the number of output channels, H = 48, K is the kernel size (K = 8), S is the stride (S = 2), C_in represents the number of input channels, and C_out represents the number of output channels.

The encoder consists of five encoder blocks, which are responsible for reducing the data length and increasing the number of channels. The first layer of each encoder block is a strided one-dimensional convolution followed by a LeakyReLU activation function. The second layer is a one-dimensional convolution with a stride of 1 and a kernel size of 1; notably, it is followed by a Gated Linear Unit (GLU) [11], which alleviates problems such as gradient vanishing and speeds up model convergence.

The bottleneck features are the deep speech features extracted by the encoder from the narrowband speech. Since neighboring sampling points of speech data are highly correlated, the bottleneck layer of the model uses two LSTM layers, which are good at learning the context dependence of time-series data. The sequence length seen by the LSTM has already been reduced by the encoder to 1/S^L of the input length, and the number of hidden units is small, which greatly shortens the time the LSTM needs to process the data.

The decoder consists of five decoder blocks, which are responsible for increasing the data length and reducing the number of channels. The first layer of each decoder block is a one-dimensional convolution with a kernel size of 1 and a stride of 1, followed by a GLU activation function. The second layer is a one-dimensional transposed convolution.
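A minimal PyTorch sketch of this codec structure is given below, under stated assumptions: the layer count and convolution settings follow Fig. 1 (L = 5, H = 48, K = 8, S = 2, mono waveforms in and out), while the per-level channel doubling, the LeakyReLU slope, the LSTM width, and the length trimming inside the skip connections are our own illustrative choices rather than the authors' implementation.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, c_in, c_out, kernel=8, stride=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel, stride),       # strided conv: roughly halves the length
            nn.LeakyReLU(0.1),
            nn.Conv1d(c_out, 2 * c_out, kernel_size=1),   # 1x1 conv, doubled channels for the GLU split
            nn.GLU(dim=1),                                # GLU halves the channels back to c_out
        )

    def forward(self, x):
        return self.net(x)

class DecoderBlock(nn.Module):
    def __init__(self, c_in, c_out, kernel=8, stride=2, last=False):
        super().__init__()
        layers = [
            nn.Conv1d(c_in, 2 * c_in, kernel_size=1),
            nn.GLU(dim=1),
            nn.ConvTranspose1d(c_in, c_out, kernel, stride),  # transposed conv: roughly doubles the length
        ]
        if not last:
            layers.append(nn.LeakyReLU(0.1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class CodecBWE(nn.Module):
    def __init__(self, n_layers=5, hidden=48):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        c_in = 1
        for i in range(n_layers):
            c_out = hidden * 2 ** i
            self.encoder.append(EncoderBlock(c_in, c_out))
            # decoder is built in reverse so that decoder[i] mirrors encoder[n_layers - 1 - i]
            self.decoder.insert(0, DecoderBlock(c_out, c_in, last=(i == 0)))
            c_in = c_out
        self.lstm = nn.LSTM(c_in, c_in, num_layers=2, batch_first=True)

    def forward(self, x):                                     # x: (batch, 1, samples), narrowband waveform
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        x = self.lstm(x.transpose(1, 2))[0].transpose(1, 2)   # bottleneck LSTM over the shortened sequence
        for dec in self.decoder:
            skip = skips.pop()
            n = min(x.shape[-1], skip.shape[-1])
            x = dec(x[..., :n] + skip[..., :n])               # additive skip connection
        return x                                              # reconstructed wideband waveform

# Example: a two-second, 16 kHz narrowband waveform
wideband = CodecBWE()(torch.randn(1, 1, 32000))

In practice the input would be padded so that the strided and transposed convolutions invert each other exactly; in this sketch small length mismatches between encoder and decoder levels are simply trimmed.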
III. TIME-FREQUENCY PERCEPTION LOSS FUNCTION

The loss function in deep learning guides the training of the model. We introduce a time-frequency perception loss function into BWE in order to push the model to generate more realistic time-domain waveforms and frequency-domain spectra. The time-frequency perception loss is defined as a linear combination of a time-domain sub-loss and a frequency-domain perception sub-loss.

The time-domain sub-loss is defined as the root mean square error (RMSE) between the time-domain wideband speech waveform y and the reconstructed wideband speech waveform \hat{y} output by the model:

    \mathrm{Loss}_T(y, \hat{y}) = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\bigl(\hat{y}(n) - y(n)\bigr)^{2}}    (1)

where n is the index of the current frame and N is the number of frames.

The frequency-domain sub-loss is defined as the RMSE between the log Mel-spectrogram (LMS) features of the time-domain wideband speech y and of the reconstructed wideband speech \hat{y} output by the model. Because the LMS is based on the Mel scale, which models how the human ear perceives sound frequency, it characterizes both the frequency-domain and the perceptual-domain properties of speech. The frequency-domain perception loss is defined as:

    \mathrm{Loss}_F(y, \hat{y}) = \sqrt{\frac{1}{MK}\sum_{m=1}^{M}\sum_{k=1}^{K}\bigl(\mathrm{LMS}(y)_{m,k} - \mathrm{LMS}(\hat{y})_{m,k}\bigr)^{2}}    (2)

where M and K represent the number of Mel filter banks and the frame length, respectively, and m and k index the Mel filter banks and the samples of a speech frame. We choose RMSE because of the behavior of the squared error: when the error is greater than 1, squaring makes it larger and larger and the model becomes extremely sensitive, so a single abnormal sample forces the model to adjust its parameters to fit that outlier at the expense of many normal samples; when the error is less than 1, squaring makes it smaller and smaller and the model becomes extraordinarily insensitive and prone to under-fitting. Taking the square root avoids both effects, so we use RMSE.

In summary, the time-frequency perception loss is defined as:

    \mathrm{Loss}_{total} = \mathrm{Loss}_T + 0.001 \cdot \mathrm{Loss}_F    (3)

In formula (3), 0.001 is the frequency-domain loss weight obtained by grid search.
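A sketch of this loss under stated assumptions is shown below, using PyTorch and torchaudio; the STFT settings and the number of Mel bands (n_fft = 512, hop length 128, 128 bands) are illustrative choices, not values reported in the paper.

import torch
import torchaudio

class TimeFrequencyPerceptionLoss(torch.nn.Module):
    def __init__(self, sample_rate=16000, n_mels=128, alpha=0.001, eps=1e-8):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=512, hop_length=128, n_mels=n_mels)
        self.alpha = alpha        # frequency-domain weight of Eq. (3), found by grid search
        self.eps = eps

    def lms(self, wav):
        # log Mel-spectrogram (LMS) feature
        return torch.log(self.mel(wav) + self.eps)

    def forward(self, y_hat, y):
        # Eq. (1): RMSE between the reconstructed and target waveforms
        loss_t = torch.sqrt(torch.mean((y_hat - y) ** 2))
        # Eq. (2): RMSE between the LMS features of the two signals
        loss_f = torch.sqrt(torch.mean((self.lms(y_hat) - self.lms(y)) ** 2))
        # Eq. (3): linear combination of the two sub-losses
        return loss_t + self.alpha * loss_f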

IV. EXPERIMENT

To verify the speech quality of the wideband speech reconstructed by the proposed method, we compared the spectrograms of wideband speech, narrowband speech, and reconstructed wideband speech, re-implemented the CNN-based BWE method [6], and carried out subjective and objective evaluation experiments. The subjective evaluation uses the mean opinion score (MOS), and the objective evaluation uses metrics such as signal-to-noise ratio (SNR), log-spectral distance (LSD), and short-time objective intelligibility (STOI).

A. Dataset and Experimental Setup

The experimental design follows the previous research [6]: a single-speaker experiment and a multi-speaker experiment are carried out, with the purpose of verifying the model's ability to fit speech data with different acoustic characteristics. The single-speaker experiment uses the VCTK P225 dataset [12], and the multi-speaker experiment uses the TIMIT dataset [13]. Both VCTK and TIMIT are recorded by native English speakers. The sampling rate of VCTK is 48 kHz and that of TIMIT is 16 kHz. Each dataset is divided into a training set and a test set at a ratio of 7:3; the training set is used to train the model and adjust its parameters, and the test set is used to verify the performance of the model.

We use the Python toolkit librosa [14] to preprocess the data. First, the dataset is uniformly down-sampled to 16 kHz and labeled as wideband speech; speech originally sampled at 16 kHz needs no processing. To obtain narrowband speech, the data are further down-sampled to 8 kHz and then resampled back to 16 kHz without recovering the high-frequency components; the result is labeled as narrowband speech. According to the Nyquist sampling theorem, the frequency band of wideband speech ranges from 0 to 8 kHz and that of narrowband speech ranges from 0 to 4 kHz.
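A minimal preprocessing sketch with librosa, following the procedure above, might look as follows; the file path is a placeholder, and the 16 kHz to 8 kHz to 16 kHz round trip removes the 4-8 kHz band without restoring it.

import librosa

def make_pair(path, target_sr=16000):
    # wideband target: uniformly down-sampled to 16 kHz
    wideband, _ = librosa.load(path, sr=target_sr)
    # narrowband input: down-sample to 8 kHz, then resample back to 16 kHz
    narrow_8k = librosa.resample(wideband, orig_sr=target_sr, target_sr=target_sr // 2)
    narrowband = librosa.resample(narrow_8k, orig_sr=target_sr // 2, target_sr=target_sr)
    return narrowband[:len(wideband)], wideband

The resulting pair can then be cut into fixed-length segments for mini-batch training.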
When training the model, the batch size is set to 32, the Adam optimizer is used with a learning rate of 3e-4, all neural network weight parameters of the model are initialized with xavier_uniform, and the bias terms are initialized to 0.
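The stated configuration could be set up roughly as follows; this sketch reuses the CodecBWE and TimeFrequencyPerceptionLoss classes sketched above, and for brevity only the convolutional and linear layers are re-initialized (the LSTM keeps PyTorch's default initialization).

import torch
import torch.nn as nn

def init_weights(module):
    # xavier_uniform weights and zero biases, as stated above
    if isinstance(module, (nn.Conv1d, nn.ConvTranspose1d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = CodecBWE()
model.apply(init_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = TimeFrequencyPerceptionLoss()

def train_step(narrowband, wideband):
    # one optimisation step over a (32, 1, samples) batch
    optimizer.zero_grad()
    loss = criterion(model(narrowband), wideband)
    loss.backward()
    optimizer.step()
    return loss.item()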
B. Spectrogram

The comparison of the spectrograms is shown in Fig. 2. The spectrogram of the reconstructed wideband speech is very similar to that of the wideband speech, which shows that the spectrum of the reconstructed wideband speech has been largely restored. In the 0 to 4 kHz range, the spectral structure, energy, and texture are essentially the same. In the 4 to 8 kHz range, the restored spectral details become more blurred as the frequency increases, but the spectral energy and structure are roughly recovered.

Figure 2. Comparison of the spectrograms of wideband, narrowband, and reconstructed wideband speech (horizontal axis: time in seconds).

C. Objective Evaluation

The SNR is a commonly used metric in signal processing, so it is used to evaluate the audio quality of the reconstructed wideband speech:

    \mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=1}^{N} s(n)^{2}}{\sum_{n=1}^{N} \bigl(\hat{s}(n) - s(n)\bigr)^{2}}    (4)

where s(n) represents the wideband speech, \hat{s}(n) represents the reconstructed wideband speech, N represents the number of frames, and n is the frame index. The larger the SNR value, the clearer the reconstructed wideband speech.

The LSD is used to measure the speech quality of the reconstructed wideband speech in the frequency domain, and is defined as:

    S(l, m) = 10 \log_{10} \lvert s(l, m) \rvert^{2}    (5)

    \hat{S}(l, m) = 10 \log_{10} \lvert \hat{s}(l, m) \rvert^{2}    (6)

    \mathrm{LSD} = \frac{1}{L} \sum_{l=1}^{L} \sqrt{\frac{1}{M} \sum_{m=1}^{M} \bigl(S(l, m) - \hat{S}(l, m)\bigr)^{2}}    (7)

In formula (7), s(l, m) and \hat{s}(l, m) are the frequency spectra of the wideband speech and the reconstructed wideband speech, respectively, where l indexes the frames and m indexes the frequency bins. The smaller the LSD, the more accurate the recovered frequency spectrum.
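The SNR and LSD computations can be sketched as below with NumPy and librosa; the STFT settings are illustrative assumptions, and STOI can be obtained from a third-party package such as pystoi.

import numpy as np
import librosa

def snr(ref, est):
    # Eq. (4): reference energy over reconstruction-error energy, in dB
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((est - ref) ** 2))

def lsd(ref, est, n_fft=512, hop=128):
    # Eqs. (5)-(7): log power spectra, RMS distance per frame, averaged over frames
    S = 10 * np.log10(np.abs(librosa.stft(ref, n_fft=n_fft, hop_length=hop)) ** 2 + 1e-10)
    S_hat = 10 * np.log10(np.abs(librosa.stft(est, n_fft=n_fft, hop_length=hop)) ** 2 + 1e-10)
    return np.mean(np.sqrt(np.mean((S - S_hat) ** 2, axis=0)))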
It can be seen from Table I and Table II that the reconstructed wideband speech generated by the proposed method achieves significant improvements in the SNR, LSD, and STOI metrics.

Table I. Objective evaluation results on the VCTK p225 dataset

Model      SNR (dB)   LSD (dB)   STOI
Spline     20.981     3.522      0.9801
CNN        21.825     1.883      0.9841
Proposed   22.291     1.130      0.9858

Table II. Objective evaluation results on the TIMIT dataset

Model      SNR (dB)   LSD (dB)   STOI
Spline     15.866     3.674      0.9853
CNN        17.443     1.761      0.9955
Proposed   17.456     0.812      0.9968
D. Subjective Evaluation

The subjective evaluation uses the MOS. Twenty testers are asked to listen to the wideband speech reconstructed by the different methods and to score each speech segment between 1 and 5 according to its intelligibility and naturalness, without bias; the higher the score, the better the speech quality.

Table III. Subjective evaluation results on the VCTK p225 dataset

Model      MOS
Spline     3.036
CNN        3.355
Proposed   3.648

Table IV. Subjective evaluation results on the TIMIT dataset

Model      MOS
Spline     2.271
CNN        3.180
Proposed   3.532

Table III and Table IV show the results of the subjective evaluation experiments on the two datasets. The proposed model achieves the best scores on both the VCTK p225 dataset and the TIMIT dataset.

V. CONCLUSION

This paper proposes a BWE method based on a codec structure. The bottleneck feature layer in the middle of the codec uses an LSTM to learn the acoustic characteristics across speech contexts. In addition, a time-frequency perception loss function is proposed that guides the model to learn a more accurate mapping between narrowband and wideband speech in the time domain, frequency domain, and perceptual domain. Subjective and objective evaluations show that the wideband speech reconstructed by the proposed method has higher speech quality.

Acknowledgment

This research was jointly supported by the National Natural Science Foundation of China under Grants 11864016 and 61671442, the Humanities and Social Sciences Key Research Base Project of Jiangxi Province under Grant JD19042, and the Postgraduate Innovation Special Fund Project of Jiangxi University of Science and Technology under Grant ZS2019-S080.

REFERENCES

[1] Y. Qian and P. Kabal, "Wideband speech recovery from narrowband speech using classified codebook mapping," Proc. Australian International Conference on Speech Science and Technology, Melbourne, Australia, 2002, pp. 106-111, doi: 10.1121/1.429571.
[2] P. Jax and P. Vary, "Wideband extension of telephone speech using a hidden Markov model," Proc. IEEE Workshop on Speech Coding, 2000, pp. 133-135, doi: 10.1109/SCFT.2000.878427.
[3] P. Jax and P. Vary, "Artificial bandwidth extension of speech signals using MMSE estimation based on a hidden Markov model," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China, 2003, vol. 1, pp. 1764-1771, doi: 10.1109/ICASSP.2003.1198872.
[4] D. M. Mohan, D. B. Karpur, M. Narayan, and J. Kishore, "Artificial bandwidth extension of narrowband speech using Gaussian mixture model," Proc. International Conference on Communications and Signal Processing, Kerala, India, 2011, pp. 410-412, doi: 10.1109/ICCSP.2011.5739348.
[5] A. H. Nour-Eldin and P. Kabal, "Memory-based approximation of the Gaussian mixture model framework for bandwidth extension of narrowband speech," Proc. 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 2011, pp. 1185-1188, doi: 10.1109/icassp.2011.5947504.
[6] V. Kuleshov, S. Z. Enam, and S. Ermon, "Audio super resolution using neural networks," presented at the International Conference on Learning Representations, Palais des Congrès Neptune, Toulon, France, 2017.

extension," IEEE/ACM Transactions on Audio, Speech, and Language [11] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling
Processing, vol. 26, no. 5, pp. 883-894, 2018. with gated convolutional networks," Proc. International conference on
[8] T. Y. Lim, R. A. Yeh, Y. Xu, M. N. Do, and M. Hasegawa-Johnson, machine learning, 2017, pp. 933-941
"Time-frequency networks for audio super-resolution," Proc. IEEE Int. [12] C. Veaux, J. Yamagishi, and K. MacDonald, "Superseded-cstr vctk corpus:
Conf. on Acoustics, Speech and Signal Processing, 2018, pp. 646-650, doi: English multi-speaker corpus for cstr voice cloning toolkit," 2016.
10.1109/icassp.2018.8462049 [13] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett,
[9] Y. Dong et al., "A Time-Frequency Network with Channel Attention and "DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM.
Non-Local Modules for Artificial Bandwidth Extension," Proc. IEEE Int. NIST speech disc 1-1.1," NASA STI/Recon Tech. Rep. LDC93S1, vol. 93,
Conf. on Acoustics, Speech and Signal Processing, 2020, pp. 6954-6958 p . 27403, 1993.
[10] H. Wang and D. Wang, "Time-Frequency Loss for CNN Based Speech [14] B. McFee et al., "librosa: Audio and music signal analysis in python,"
Super-Resolution," Proc. IEEE International Conference on Acoustics, Proc. 14th Python in Science Conference, Austin, USA, 2015, vol. 8, pp.
Speech and Signal Processing, 2020, pp. 861-865, doi: 18-25, doi: 10.25080/majora-7b98e3ed-003
10.1109/icassp40776.2020.9053712.
