modeling auditory perception of the human ear. Thus, the aim of the first stage is to enhance the speech envelope given its coarse frequency resolution. The second stage operates in the complex domain utilizing deep filtering [7, 6] and tries to reconstruct the periodicity of speech. [2] showed that deep filtering (DF) generally outperforms traditional complex ratio masks (CRMs), especially in very noisy conditions.

The combined SE procedure can be formulated as follows.
An encoder $F_{enc}$ encodes both ERB and complex features into one embedding $E$.
$E(k) = F_{enc}(X_{erb}(k, b), X_{df}(k, f_{df}))$ (3)

Next, the first stage predicts real-valued gains $G$ and enhances the speech envelope, resulting in the short-time spectrum $Y_G$.

$G_{erb}(k, b) = F_{erb,dec}(E(k))$
$G(k, f) = \mathrm{interp}(G_{erb}(k, b))$ (4)
$Y_G(k, f) = X(k, f) \cdot G(k, f)$

Finally, in the second stage, $F_{df,dec}$ predicts DF coefficients $C_{df}^{N}$ of order $N$, which are then linearly applied to $Y_G$:

$C_{df}^{N}(k, i, f_{df}) = F_{df,dec}(E(k))$
$Y(k, f') = \sum_{i=0}^{N} C(k, i, f') \cdot X(k - i + l, f')$ (5)

where $l$ is the DF look-ahead. As stated before, the second stage only operates on the lower part of the spectrogram, up to a frequency $f_{df} = 5$ kHz. The DeepFilterNet2 framework is visualized in Fig. 1.

[Fig. 1. DeepFilterNet2 framework: an encoder (Conv/PConv, GLinear, GRU layers) embeds the complex and ERB features; an ERB decoder (GRU, GLinear, TConv/PConv layers) predicts the ERB gains, and a DF decoder (GRU, GLinear) predicts the DF coefficients.]
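To make the DF operation concrete, below is a minimal PyTorch sketch of Eq. (5). It is not the DeepFilterNet2 implementation (the real-time processing loop is written in Rust); the tensor layout, the function name, and the look-ahead default are assumptions for illustration.

    import torch

    def deep_filter(spec: torch.Tensor, coefs: torch.Tensor, lookahead: int = 1) -> torch.Tensor:
        """Sketch of Eq. (5): Y(k, f') = sum_{i=0}^{N} C(k, i, f') * X(k - i + l, f').

        spec:  complex spectrogram [T, F], e.g. the first-stage output Y_G
        coefs: complex DF coefficients [T, N + 1, F_df]
        """
        T, order, f_df = coefs.shape                  # order = N + 1
        x = spec[:, :f_df]                            # DF only acts below f_df (5 kHz)
        # Frame k needs X(k + l), X(k + l - 1), ..., X(k + l - N):
        # zero-pad (N - l) past frames and l future (look-ahead) frames.
        past = torch.zeros(order - 1 - lookahead, f_df, dtype=x.dtype, device=x.device)
        future = torch.zeros(lookahead, f_df, dtype=x.dtype, device=x.device)
        frames = torch.cat([past, x, future]).unfold(0, order, 1)  # [T, F_df, order]
        # Window position j holds X(k - (N - j) + l), i.e. tap i = N - j,
        # so flip the coefficient axis before the complex multiply-accumulate.
        y_low = torch.einsum("tif,tfi->tf", coefs.flip(1), frames)
        out = spec.clone()
        out[:, :f_df] = y_low                         # upper band keeps only the ERB gains
        return out

This complex multiply-accumulate over $N + 1$ neighboring frames is what allows DF to recover periodic structure, in contrast to a purely per-frame mask.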
2.2. Training Procedure

In DeepFilterNet [2], we used an exponential learning rate schedule and a fixed weight decay. In this work, we additionally use a learning rate warmup of 3 epochs followed by a cosine decay. Most importantly, we update the learning rate at every iteration instead of after each epoch. Similarly, we schedule the weight decay with an increasing cosine schedule, resulting in a larger regularization for the later stages of training. Finally, to achieve faster convergence, especially at the beginning of training, we use batch scheduling [10], starting with a batch size of 8 and gradually increasing it to 96. The scheduling scheme can be observed in Fig. 2.

[Fig. 2. Learning rate, weight decay and batch size scheduling used for training (x-axis: epochs, 0-95).]
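The three schedules are simple enough to state in code. The following is a hedged sketch: the 3-epoch warmup, the per-iteration updates, the cosine shapes, and the batch size ramp from 8 to 96 follow the text; the peak values (lr around 6e-4, weight decay around 0.03, roughly 100 epochs) are read off Fig. 2 and should be treated as approximate.

    import math

    def schedules(step, steps_per_epoch, total_epochs=100, warmup_epochs=3,
                  lr_max=6e-4, wd_min=1e-3, wd_max=3e-2, bs_min=8, bs_max=96):
        """Return (learning rate, weight decay, batch size) for one training step."""
        e = step / steps_per_epoch                       # fractional epoch
        if e < warmup_epochs:                            # linear warmup ...
            lr = lr_max * e / warmup_epochs
        else:                                            # ... then cosine decay
            t = (e - warmup_epochs) / (total_epochs - warmup_epochs)
            lr = lr_max * 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))
        t = min(e / total_epochs, 1.0)
        # increasing cosine: weak regularization early, strong late
        wd = wd_min + (wd_max - wd_min) * 0.5 * (1.0 - math.cos(math.pi * t))
        bs = int(round(bs_min + (bs_max - bs_min) * t))  # batch scheduling [10]
        return lr, wd, bs

Since the learning rate and weight decay change every iteration, they would be written into the optimizer's param_groups before each optimizer step rather than via a per-epoch scheduler.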
2.3. Multi-Target Loss

We adopt the spectrogram loss $L_{spec}$ from [2]. Additionally, we use a multi-resolution (MR) spectrogram loss, where the enhanced spectrogram $Y(k, f)$ is first transformed into the time domain before computing multiple STFTs with windows from 5 ms to 40 ms [11]. To propagate the gradient for this loss, we use the PyTorch STFT/ISTFT, which is numerically sufficiently close to the original DeepFilterNet processing loop implemented in Rust.

$L_{MR} = \sum_i \| |Y_i'|^c - |S_i'|^c \|^2 + \| |Y_i'|^c e^{j\varphi_Y} - |S_i'|^c e^{j\varphi_S} \|^2$ (6)

where $Y_i' = \mathrm{STFT}_i(y)$ is the $i$-th STFT with window sizes in $\{5, 10, 20, 40\}$ ms of the predicted TD signal $y$, and $c = 0.3$ is a compression parameter [1]. Compared to DeepFilterNet [2], we drop the $\alpha$ loss term, since the employed heuristic is only a poor approximation of the local speech periodicity. Also, DF may enhance speech in non-voiced sections and can disable its effect by setting the real part of the coefficient at $t_0$ to 1 and the remaining coefficients to 0. The combined multi-target loss is given by:

$L = \lambda_{spec} L_{spec} + \lambda_{MR} L_{MR}$ (7)
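Eq. (6) maps directly onto the PyTorch STFT mentioned above. A minimal sketch follows; the window sizes and $c = 0.3$ are from the text, while the Hann window, the 50% hop, and the sum reduction are assumptions.

    import torch

    def mr_spec_loss(y, s, sr=48000, win_ms=(5, 10, 20, 40), c=0.3, eps=1e-8):
        """Multi-resolution spectrogram loss of Eq. (6).

        y, s: predicted and clean time-domain signals of shape [samples]
        """
        loss = torch.zeros((), dtype=torch.float32, device=y.device)
        for ms in win_ms:
            n_fft = int(sr * ms / 1000)
            win = torch.hann_window(n_fft, device=y.device)
            Y = torch.stft(y, n_fft, hop_length=n_fft // 2, window=win, return_complex=True)
            S = torch.stft(s, n_fft, hop_length=n_fft // 2, window=win, return_complex=True)
            Yc = Y.abs().clamp_min(eps) ** c          # magnitude compression, c = 0.3
            Sc = S.abs().clamp_min(eps) ** c
            loss = loss + ((Yc - Sc) ** 2).sum()      # compressed magnitude term
            # phase-aware term: compressed magnitudes combined with original phases
            loss = loss + ((Yc * torch.exp(1j * Y.angle())
                            - Sc * torch.exp(1j * S.angle())).abs() ** 2).sum()
        return loss

Penalizing the time-domain output at several STFT resolutions catches artifacts that a single-resolution spectrogram loss would miss.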
2.4. Data and Augmentation

While DeepFilterNet was trained on the deep noise suppression (DNS) 3 challenge dataset [12], we train DeepFilterNet2 on the English part of DNS4 [9], which contains more fullband noise and speech samples.
In speech enhancement, usually only background noise and, in some cases, reverberation are reduced [1, 11, 2]. In this work, we further extended the SE concept to declipping. Therefore, we distinguish between augmentations and distortions in the on-the-fly data pre-processing pipeline. Augmentations are applied to speech and noise samples with the aim of further extending the data distributions the network observes during training. Distortions, on the other hand, are only applied to speech samples for noisy mixture creation. The clean speech target is not affected by a distortion transform. Thus, the DNN learns to reconstruct the original, undistorted speech signal (a sketch of this pipeline follows the list below). Currently, the DeepFilterNet framework supports the following randomized augmentations:
• Random 2nd order filtering [13]
• Gain changes
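A minimal sketch of this augmentation/distortion split, assuming hypothetical transform objects with an application probability p and a hypothetical mix_at_snr helper; only the asymmetry between mixture and target follows from the text.

    import random

    def make_training_pair(speech, noise, augmentations, distortions, snr_db):
        """Build (noisy, target): augmentations affect both the mixture and the
        target, distortions (e.g. clipping) corrupt only the mixture input."""
        for aug in augmentations:                  # e.g. 2nd order filter, gain
            if random.random() < aug.p:
                speech = aug(speech)
        for aug in augmentations:
            if random.random() < aug.p:
                noise = aug(noise)
        target = speech                            # clean target keeps augmentations
        for dist in distortions:                   # e.g. clipping
            if random.random() < dist.p:
                speech = dist(speech)              # target stays undistorted
        noisy = mix_at_snr(speech, noise, snr_db)  # hypothetical mixing helper
        return noisy, target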
3.2. Results

We evaluate the speech enhancement performance of DeepFilterNet2 using the Valentini Voicebank+Demand test set [8]. Therefore, we chose WB-PESQ [19], STOI [20] and the composite metrics CSIG, CBAK, COVL [21]. Table 1 shows DeepFilterNet2 results in comparison with other state-of-the-art (SOTA) methods. One can find that DeepFilterNet2 achieves SOTA-level results while requiring a minimal amount of multiply-accumulate operations per second (MACS). The number of parameters has slightly increased over DeepFilterNet (Sec. 2.5), but the network is able to run more than twice as fast and achieves a 0.27 higher PESQ score. GaGNet [5] achieves a similar RTF while having good SE performance. However, it only runs fast when provided with the whole audio and requires large temporal buffers due to its usage of big temporal convolution kernels. FRCRN [3] is able to obtain the best results in most metrics, but has a high computational complexity that is not feasible for embedded devices.

Table 1 footnotes:
a) Metrics and RTF measured with source code and weights provided at https://github.com/xiph/rnnoise/
b) Note that RNNoise runs single-threaded.
c) RTF measured with source code provided at https://github.com/huyanxin/DeepComplexCRN
d) Composite and STOI metrics provided by the same authors in [16].
e) Metrics and RTF measured with source code and weights provided at https://github.com/hit-thusz-RookieCJ/FullSubNet-plus
f) RTF measured with source code provided at https://github.com/Andong-Li-speech/GaGNet/
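For reference, WB-PESQ and STOI can be computed with the open-source pesq and pystoi Python packages. The following is a minimal sketch, assuming the clean/enhanced pair has been resampled to 16 kHz (WB-PESQ is defined for 16 kHz input); the file names are placeholders. The composite metrics CSIG, CBAK and COVL are typically computed with the routines from [21].

    import soundfile as sf
    from pesq import pesq          # pip install pesq
    from pystoi import stoi        # pip install pystoi

    ref, sr = sf.read("clean.wav")       # reference clean utterance
    deg, _ = sf.read("enhanced.wav")     # DeepFilterNet2 output
    assert sr == 16000                   # WB-PESQ expects 16 kHz signals
    print("WB-PESQ:", pesq(sr, ref, deg, "wb"))
    print("STOI:", stoi(ref, deg, sr, extended=False))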
Table 2. DNSMOS results on the DNS4 blind test set.

Model              SIGMOS  BAKMOS  OVLMOS
Noisy               4.14    2.94    3.29
RNNoise [13]        3.88    3.69    3.38
NSNet2 [14]         3.87    4.21    3.59
FullSubNet+ [18]    4.22    4.12    3.75
DeepFilterNet [2]   4.14    4.18    3.75
DeepFilterNet2      4.20    4.43    3.88
+ Post-Filter       4.19    4.47    3.90

Table 2 shows DNSMOS P.835 [22] results on the DNS4 blind test set. While DeepFilterNet [2] was not able to enhance the speech quality mean opinion score (SIGMOS), with DeepFilterNet2 we also obtain good results for the background and overall MOS values. Moreover, DeepFilterNet2 comes relatively close to the minimum DNSMOS values that were used to select clean speech samples for training the DNS4 baseline NSNet2 (SIG=4.2, BAK=4.5, OVL=4.0) [9], further emphasizing its good SE performance.

4. CONCLUSION

In this work, we presented DeepFilterNet2, a low-complexity speech enhancement framework. Taking advantage of DeepFilterNet's perceptual approach, we were able to apply several further optimizations, resulting in SOTA SE performance. Due to its lightweight architecture, it can be run on a Raspberry Pi 4 with a real-time factor of 0.42. In future work, we plan to extend the idea of speech enhancement to other enhancements, like correcting lowpass characteristics due to the current room environment.
5. REFERENCES

[1] Jean-Marc Valin, Umut Isik, Neerad Phansalkar, Ritwik Giri, Karim Helwani, and Arvindh Krishnaswamy, "A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech," in INTERSPEECH, 2020.
[2] Hendrik Schröter, Alberto N. Escalante-B., Tobias Rosenkranz, and Andreas Maier, "DeepFilterNet: A low complexity speech enhancement framework for full-band audio based on deep filtering," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[3] Shengkui Zhao, Bin Ma, Karn N. Watcharasupat, and Woon-Seng Gan, "FRCRN: Boosting feature representation using frequency recurrence for monaural speech enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[4] Guochen Yu, Yuansheng Guan, Weixin Meng, Chengshi Zheng, and Hui Wang, "DMF-Net: A decoupling-style multi-band fusion model for real-time full-band speech enhancement," arXiv preprint arXiv:2203.00472, 2022.
[5] Andong Li, Chengshi Zheng, Lu Zhang, and Xiaodong Li, "Glance and gaze: A collaborative learning framework for single-channel speech enhancement," Applied Acoustics, vol. 187, 2022.
[6] Hendrik Schröter, Tobias Rosenkranz, Alberto Escalante Banuelos, Marc Aubreville, and Andreas Maier, "CLCNet: Deep learning-based noise reduction for hearing aids using complex linear coding," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[7] Wolfgang Mack and Emanuël A. P. Habets, "Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters," IEEE Signal Processing Letters, vol. 27, 2020.
[8] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech," in SSW, 2016.
[9] Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, et al., "ICASSP 2022 deep noise suppression challenge," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[10] Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le, "Don't decay the learning rate, increase the batch size," arXiv preprint arXiv:1711.00489, 2017.
[11] Hyeong-Seok Choi, Sungjin Park, Jie Hwan Lee, Hoon Heo, Dongsuk Jeon, and Kyogu Lee, "Real-time denoising and dereverberation with tiny recurrent U-Net," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
[12] Chandan K. A. Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, and Sriram Srinivasan, "Interspeech 2021 deep noise suppression challenge," in INTERSPEECH, 2021.
[13] Jean-Marc Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2018.
[14] Sebastian Braun, Hannes Gamper, Chandan K. A. Reddy, and Ivan Tashev, "Towards efficient models for real-time deep noise suppression," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
[15] Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie, "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement," in INTERSPEECH, 2020.
[16] Shubo Lv, Yihui Fu, Mengtao Xing, Jiayao Sun, Lei Xie, Jun Huang, Yannan Wang, and Tao Yu, "S-DCCRN: Super wide band DCCRN with learnable complex feature for speech enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[17] Shubo Lv, Yanxin Hu, Shimin Zhang, and Lei Xie, "DCCRN+: Channel-wise Subband DCCRN with SNR Estimation for Speech Enhancement," in INTERSPEECH, 2021.
[18] Jun Chen, Zilin Wang, Deyi Tuo, Zhiyong Wu, Shiyin Kang, and Helen Meng, "FullSubNet+: Channel attention FullSubNet with complex spectrograms for speech enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[19] ITU, "Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs," ITU-T Recommendation P.862.2, 2007.
[20] Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, 2011.
[21] Yi Hu and Philipos C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, 2007.
[22] Chandan K. A. Reddy, Vishak Gopal, and Ross Cutler, "DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.