Enhancement
Author: Roy, Sujan Kumar; Nicolson, Aaron; Paliwal, Kuldip K.
Published: 2020
Conference Title: Interspeech 2020
Version: Version of Record (VoR)
DOI: https://doi.org/10.21437/interspeech.2020-1551
Copyright Statement: © 2020 ISCA and the Author(s). The attached file is reproduced here in accordance with the copyright policy of the publisher. For information about this conference please refer to the conference’s website or contact the author(s).
Downloaded from: http://hdl.handle.net/10072/399640
For a noisy speech frame, the error covariances (Ψ(n|n−1) and Ψ(n|n), corresponding to ŝ(n|n−1) and ŝ(n|n)) and the Kalman gain K(n) are continually updated on a sample-wise basis, while σv² and ({aᵢ}, σw²) remain constant. At sample n, c⊤ŝ(n|n) gives the estimated speech, ŝ(n|n), as in [20]:

ŝ(n|n) = [1 − K0(n)]ŝ(n|n−1) + K0(n)y(n),  (12)

where K0(n) is the 1st component of K(n), given by [20]:

K0(n) = (α²(n) + σw²) / (α²(n) + σw² + σv²),  (13)

where α²(n) = c⊤ΦΨ(n−1|n−1)Φ⊤c is the transmission of a posteriori mean squared error from the previous sample, n−1, to the total a priori mean prediction squared error [20]. Eq. (12) implies that K0(n) has a significant impact on ŝ(n|n), which is the output of the KF. In practice, poor estimates of σv² and ({aᵢ}, σw²) introduce bias in K0(n), which affects ŝ(n|n). In the proposed SEA, DeepMMSE is used to accurately estimate σv² and ({aᵢ}, σw²), leading to a more accurate ŝ(n|n).

Firstly, the frame-wise noise PSD, λ̂v(l, m), is estimated using DeepMMSE [18]. DeepMMSE is described in the following subsection. An estimate of the noise, v̂(n, l), is given by taking the IDFT of √λ̂v(l, m) exp[j∠Y(l, m)]. The noise variance, σv², is then computed from v̂(n, l) frame-wise as:

σv² = (1/M) Σ_{n=0}^{M−1} v̂²(n, l).  (15)

The LPC parameters, ({aᵢ}, σw²), are sensitive to noise. We compute ({aᵢ}, σw²) (p = 10) frame-wise from pre-whitened speech, y_w(n, l), using the autocorrelation method [19]. y_w(n, l) is used to reduce the bias in ({aᵢ}, σw²). y_w(n, l) is obtained by applying a whitening filter, H_w(z), to y(n, l). H_w(z) is found as in [19, Section 8.1.7]:

H_w(z) = 1 + Σ_{k=1}^{q} b_k z^{−k},  (16)

where the whitening filter coefficients, ({b_k}; q = 40), are computed from v̂(n, l) using the autocorrelation method [19].
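As an illustration of the sample-wise recursion in Eqs. (12)-(13), the following is a minimal numerical sketch of a scalar-observation Kalman filter for an AR(p) speech model in companion form. It assumes the LPC coefficients {aᵢ}, excitation variance σw², and noise variance σv² are already given (in the paper they come from the autocorrelation method and DeepMMSE); the function name `kalman_enhance` and all variable names are illustrative, not the authors' code.

```python
import numpy as np

def kalman_enhance(y, a, sigma_w2, sigma_v2):
    """Sample-wise Kalman-filter recursion for an AR(p) speech model.

    A minimal sketch of Eqs. (12)-(13), not the authors' implementation.
    y        : noisy samples of one frame
    a        : LPC coefficients {a_i}, i.e. s(n) = sum_i a_i s(n-i) + w(n)
    sigma_w2 : excitation variance sigma_w^2 (assumed given)
    sigma_v2 : noise variance sigma_v^2 (assumed given, e.g. via DeepMMSE)
    """
    p = len(a)
    # Companion-form state transition matrix Phi for the AR(p) model.
    Phi = np.zeros((p, p))
    Phi[:-1, 1:] = np.eye(p - 1)
    Phi[-1, :] = a[::-1]            # last row holds the LPC coefficients
    c = np.zeros(p)
    c[-1] = 1.0                     # observation vector picks out s(n)
    x = np.zeros(p)                 # state estimate
    Psi = np.eye(p)                 # error covariance Psi(n|n)
    s_hat = np.empty(len(y))
    for n, yn in enumerate(y):
        # A priori (prediction) step.
        x = Phi @ x
        Psi = Phi @ Psi @ Phi.T + sigma_w2 * np.outer(c, c)
        # Kalman gain K(n); here c @ Psi @ c = alpha^2(n) + sigma_w^2,
        # so the last component of K is K0(n) of Eq. (13).
        K = Psi @ c / (c @ Psi @ c + sigma_v2)
        # A posteriori (update) step; for the last state component this is
        # s_hat(n|n) = [1 - K0(n)] s_hat(n|n-1) + K0(n) y(n), Eq. (12).
        x = x + K * (yn - c @ x)
        Psi = (np.eye(p) - np.outer(K, c)) @ Psi
        s_hat[n] = x[-1]
    return s_hat
```

Note that the companion form makes the identity c⊤Ψ(n|n−1)c = α²(n) + σw² explicit, so Eq. (13) falls out of the standard gain computation.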
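The frame-wise noise variance of Eq. (15) and the FIR whitening filter of Eq. (16) can likewise be sketched directly. In this sketch the whitening coefficients {b_k} are taken as given rather than computed from v̂(n, l) via the autocorrelation method as in the paper, and the function names are illustrative.

```python
import numpy as np

def frame_noise_variance(v_hat):
    """Eq. (15): sigma_v^2 = (1/M) * sum_n v_hat^2(n, l) for one frame."""
    v_hat = np.asarray(v_hat, dtype=float)
    return np.mean(v_hat ** 2)

def whiten(y, b):
    """Apply the FIR whitening filter H_w(z) = 1 + sum_{k=1}^{q} b_k z^{-k}
    of Eq. (16) to a noisy frame y(n, l), keeping the frame length.
    The coefficients {b_k} are assumed to be given."""
    h = np.concatenate(([1.0], np.asarray(b, dtype=float)))
    return np.convolve(y, h)[: len(y)]
```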
3.2. DeepMMSE

… the speech or noise, and produces a noise PSD estimate with negligible bias, unlike other MMSE-based noise PSD estimators. …

2. The a posteriori SNR estimate is computed using the a priori SNR [23]: γ̂(l, m) = ξ̂(l, m) + 1.

3. Using ξ̂ and γ̂, the noise periodogram estimate is found using the MMSE estimator [21, 22]:

|V̂(l, m)|² = [1/(1 + ξ(l, m))² + ξ(l, m)/((1 + ξ(l, m))γ(l, m))] |Y(l, m)|².

4. The final noise PSD estimate, λ̂v(l, m), is found by applying a first-order temporal recursive smoothing operation: λ̂v(l, m) = αλ̂v(l − 1, m) + (1 − α)|V̂(l, m)|², where α is the smoothing factor. In this work, α = 0 was used, i.e. the instantaneous noise power spectrum estimate from DeepMMSE was used.

[Figure 2: (a) Deep Xi-ResNet and (b) example of the contextual field of Deep Xi-ResNet with D = 4, E = 6, and r = 3.]

3.3. Deep Xi-ResNet for ξ(l, m) estimation

Deep Xi-ResNet is used to estimate ξ(l, m) for DeepMMSE (available at https://github.com/anicolson/DeepXi). Deep Xi is a deep learning approach to a priori SNR estimation [24]. During training, the clean speech and noise of the noisy speech are available. This allows the instantaneous case of the a priori SNR to be used as the training target. To compute the instantaneous a priori SNR, λs(l, m) and λv(l, m) are replaced with the squared magnitudes of the clean-speech and noise spectral components, respectively.

The observation and target of a training example for a deep neural network (DNN) in the Deep Xi framework are |Y_l| and the mapped a priori SNR, ξ̄_l, respectively. The mapped a priori SNR is a mapped version of the instantaneous a priori SNR. The instantaneous a priori SNR is mapped to the interval [0, 1] in order to improve the rate of convergence of the stochastic gradient descent algorithm used. The cumulative distribution function (CDF) of ξ_dB(l, m) = 10 log₁₀(ξ(l, m)) is used as the map. As shown in [24, Fig. 2 (top)], the distribution of ξ_dB(l, m) for the kth frequency component follows a normal distribution. It is thus assumed that ξ_dB(l, m) is distributed normally with mean μ_k and variance σ_k²: ξ_dB(l, m) ∼ N(μ_k, σ_k²). Thus, the mapped a priori SNR is found by applying the normal CDF to ξ_dB(l, m):

ξ̄(l, m) = (1/2)[1 + erf((ξ_dB(l, m) − μ_k)/(σ_k √2))],  (17)

where the μ_k and σ_k² found in [24] are used in this work. During inference, the a priori SNR estimate, ξ̂(l, m), is found from ξ̄̂(l, m) using ξ̂(l, m) = 10^((σ_k √2 erf⁻¹(2ξ̄̂(l, m) − 1) + μ_k)/10).

Deep Xi-ResNet utilises a residual network consisting of 1-D causal dilated convolutional units within the Deep Xi framework [18], as shown in Fig. 2 (a). It consists of E = 40 bottleneck residual blocks, where e ∈ {1, 2, …, E} is the block index. Each block contains three convolutional units (CUs) [25], where each CU is pre-activated by layer normalisation [26] followed by the ReLU activation function [27]. The 1st and 3rd CUs have a kernel size of r = 1, while the 2nd CU has a kernel size of r = 3. The 2nd CU employs a dilation rate (DR) of d, providing a contextual field over previous time steps. As in [18], d is cycled as the block index e increases: d = 2^((e−1) mod (log₂(D)+1)), where mod is the modulo operation and D is the maximum DR. An example of how the DR is cycled is shown in Fig. 2 (b), with D = 4 and E = 6. It can be seen that the DR is reset after block three. This also demonstrates the contextual field gained by the use of causal dilated CUs. For Deep Xi-ResNet, D is set to 16. The 1st and 2nd CUs have an output size of d_f = 64, while the 3rd CU has an output size of d_model = 256. FC is a fully-connected layer with an output size of d_model, where layer normalisation is applied to the output of FC, followed by the ReLU activation function. The output layer O is a fully-connected layer with sigmoidal units.

4. Speech enhancement experiment

4.1. Training set

For training the ResNet, a total of 74250 clean speech recordings belonging to the train-clean-100 set from the Librispeech corpus [28] (28539), the CSTR VCTK corpus [29] (42015), and the si* and sx* training sets from the TIMIT corpus [30] (3696) are used. 5% of the clean speech recordings are randomly selected and used as a validation set. Thus, 70537 clean speech recordings are used in the training set and 3713 in the validation set. The 2382 noise recordings adopted in [24] are used as the noise training set. All clean speech and noise recordings are single-channel, with a sampling frequency of 16 kHz.

4.2. Training strategy

The ResNet is trained using cross-entropy as the loss function and the Adam algorithm [31] with default hyper-parameters. The gradients are also clipped between [−1, 1]. The selection order of the clean speech recordings is randomised for each epoch. 175 epochs are used to train the ResNet, with a mini-batch size of 10 noisy speech signals. The noisy signals are created as follows: each clean speech recording selected for the mini-batch is mixed with a random section of a randomly
selected noise recording at a randomly selected SNR level (-10 to 20 dB, in 1 dB increments).
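The mini-batch noisy-signal construction just described (mixing each clean recording with a random section of a noise recording at a target SNR) can be sketched as follows. This is an illustrative sketch, not the authors' implementation, and `mix_at_snr` is a hypothetical name.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=None):
    """Mix a clean recording with a random section of a noise recording so
    that the mixture has the target SNR (dB). A sketch of the mini-batch
    construction described above; not the authors' implementation."""
    if rng is None:
        rng = np.random.default_rng()
    s = np.asarray(clean, dtype=float)
    # Random section of the noise recording, same length as the clean speech.
    start = rng.integers(0, len(noise) - len(s) + 1)
    v = np.asarray(noise[start:start + len(s)], dtype=float).copy()
    # Scale the noise so that 10*log10(P_s / P_v) equals snr_db.
    p_s, p_v = np.mean(s ** 2), np.mean(v ** 2)
    v *= np.sqrt(p_s / (p_v * 10.0 ** (snr_db / 10.0)))
    return s + v

# An SNR level from -10 to 20 dB in 1 dB increments could be drawn with,
# e.g., np.random.default_rng().integers(-10, 21).
```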
4.3. Test set

For objective experiments, 30 utterances belonging to six speakers are taken from the NOIZEUS corpus and are sampled at 16 kHz [9, Chapter 12]. We generate a noisy data set that has been corrupted by the passing car and café babble noise recordings that were adopted in [24], at SNR levels from -5 dB to 15 dB in 5 dB increments. Note that these clean speech and noise recordings are not used during training.

4.4. Evaluation metrics

The objective quality and intelligibility evaluation was carried out using the perceptual evaluation of speech quality (PESQ) [32] and quasi-stationary speech transmission index (QSTI) [33] measures. We also analyse the enhanced speech spectrograms of the SEAs. The subjective evaluation was carried out through blind AB listening tests [34, Section 3.3.4]. Five English-speaking listeners participated in the tests, where the utterance sp05 (“Wipe the grease off his dirty face”) was corrupted with 5 dB passing car noise and used as the stimulus.

[Figure 4 (spectrograms, frequency in kHz versus time in s): (a) clean speech, (b) noisy speech (sp05 corrupted with 5 dB passing car noise), and the enhanced speech spectrograms produced by the (c) RWF-FCN, (d) DNN-KF, (e) IAM+IFD, (f) proposed, and (g) KF-Ideal methods.]
The proposed method is compared with the benchmark methods, namely the raw waveform processing using FCNN (RWF-FCN) method [15], the phase-aware DNN (IAM+IFD) method [16], the deep learning KF (DNN-KF) method [17], the KF-Ideal method (where ({aᵢ}, σw²) and σv² are computed from the clean speech and noise signals), and Noisy (noise-corrupted speech).
[Figure 5: Mean preference (%) for the Noisy, RWF-FCN, IAM+IFD, DNN-KF, Proposed, and KF-Ideal methods.]
[Figure 3 (panels (c) and (d)): QSTI versus input SNR (dB).]

Fig. 5 shows that the enhanced speech produced by the proposed method is preferred by the listeners (76.33%) over the benchmark methods, apart from the KF-Ideal method (83.22%) and the clean speech. The IAM+IFD method [16] is found to be the most preferred (66.67%) amongst the benchmark methods.
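The a priori SNR mapping of Eq. (17) in Section 3.3, together with the inverse map applied at inference, can be sketched with the Python standard library. A hedged sketch: the function names are illustrative, and `NormalDist.inv_cdf` is used as an exact equivalent of the σ_k√2·erf⁻¹(2ξ̄ − 1) + μ_k expression in the text.

```python
import math
from statistics import NormalDist

def map_xi_db(xi_db, mu_k, sigma_k):
    """Eq. (17): map the instantaneous a priori SNR in dB to [0, 1] by the
    normal CDF with per-frequency statistics (mu_k, sigma_k)."""
    return 0.5 * (1.0 + math.erf((xi_db - mu_k) / (sigma_k * math.sqrt(2.0))))

def unmap_xi(xi_bar, mu_k, sigma_k):
    """Inference-time inverse: recover xi(l, m) on the linear scale from a
    network output xi_bar in (0, 1). NormalDist.inv_cdf(xi_bar) equals
    sigma_k * sqrt(2) * erfinv(2 * xi_bar - 1) + mu_k."""
    xi_db = NormalDist(mu_k, sigma_k).inv_cdf(xi_bar)
    return 10.0 ** (xi_db / 10.0)
```

Mapping through the per-frequency normal CDF gives a target in [0, 1], which suits the sigmoidal output units of Deep Xi-ResNet.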
7. References

[1] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113–120, April 1979.
[2] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 208–211, April 1979.
[3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, December 1984.
[4] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, April 1985.
[5] P. Scalart and J. V. Filho, “Speech enhancement based on a priori signal to noise estimation,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 629–632, May 1996.
[6] C. Plapous, C. Marro, L. Mauuary, and P. Scalart, “A two-step noise reduction technique,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 289–292, May 2004.
[7] K. Paliwal and A. Basu, “A speech enhancement method based on Kalman filtering,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 12, pp. 177–180, April 1987.
[8] N. Upadhyay and A. Karmakar, “Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study,” Procedia Computer Science, vol. 54, pp. 574–584, 2015.
[9] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, Inc., 2013.
[10] S. K. Roy, W. P. Zhu, and B. Champagne, “Single channel speech enhancement using subband iterative Kalman filter,” IEEE International Symposium on Circuits and Systems, pp. 762–765, May 2016.
[11] Y. Xu, J. Du, L. Dai, and C. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[12] A. Nicolson and K. K. Paliwal, “Bidirectional long-short term memory network-based estimation of reliable spectral component locations,” in Proc. Interspeech 2018, 2018, pp. 1606–1610. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-1134
[13] Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
[14] S. R. Park and J. Lee, “A fully convolutional neural network for speech enhancement,” Proceedings of Interspeech, pp. 1993–1997, 2017.
[15] S. Fu, T. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[16] N. Zheng and X. Zhang, “Phase-aware speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 63–76, 2019.
[17] H. Yu, Z. Ouyang, W. Zhu, B. Champagne, and Y. Ji, “A deep neural network based Kalman filter for time domain speech enhancement,” IEEE International Symposium on Circuits and Systems, pp. 1–5, May 2019.
[18] Q. Zhang, A. M. Nicolson, M. Wang, K. Paliwal, and C. Wang, “DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
[19] S. V. Vaseghi, “Linear prediction models,” in Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, 2009, ch. 8, pp. 227–262.
[20] S. So, A. E. W. George, R. Ghosh, and K. K. Paliwal, “Kalman filter with sensitivity tuning for improved noise reduction in speech,” Circuits, Systems, and Signal Processing, vol. 36, no. 4, pp. 1476–1492, April 2017.
[21] R. C. Hendriks, R. Heusdens, and J. Jensen, “MMSE based noise PSD tracking with low complexity,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4266–4269, March 2010.
[22] T. Gerkmann and R. C. Hendriks, “Unbiased MMSE-based noise power estimation with low complexity and low tracking delay,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1383–1393, December 2012.
[23] P. Scalart and J. V. Filho, “Speech enhancement based on a priori signal to noise estimation,” in 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 2, 1996, pp. 629–632.
[24] A. Nicolson and K. K. Paliwal, “Deep learning for minimum mean-square error approaches to speech enhancement,” Speech Communication, vol. 111, pp. 44–55, August 2019.
[25] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” ArXiv, vol. abs/1803.01271, 2018. [Online]. Available: http://arxiv.org/abs/1803.01271
[26] J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” ArXiv, vol. abs/1607.06450, 2016.
[27] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” 27th International Conference on Machine Learning, pp. 807–814, June 2010.
[28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5206–5210, April 2015.
[29] C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.
[30] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, Feb. 1993.
[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ArXiv, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[32] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs,” IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 749–752, May 2001.
[33] B. Schwerin and K. K. Paliwal, “An improved speech transmission index for intelligibility prediction,” Speech Communication, vol. 65, pp. 9–19, December 2014.
[34] K. K. Paliwal, K. Wójcicki, and B. Schwerin, “Single-channel speech enhancement using spectral subtraction in the short-time modulation domain,” Speech Communication, vol. 52, no. 5, pp. 450–475, May 2010.