
Multiple Input Audio Denoising Using Deep Neural Networks

*Nelson Yalta (Waseda University), Kuniaki Noda (Waseda University), Kazuhiro Nakadai (Honda Research Institute Japan Co., Ltd.), and Tetsuya Ogata (Waseda University)

1. INTRODUCTION

Robots should have auditory functions [1] in order to interact naturally. Maintaining robustness and adaptability in dynamically changing everyday environments is the principal topic of the methods proposed in [2]. Robots must be able to adapt, recognize, and separate source information by reducing the noise in the source signal in order to work properly in the real world. This process is called denoising. In this paper, we evaluate the use of a Deep Neural Network (DNN) for a denoising task, both on its own and after a previous denoising stage performed by existing source separation methods.

2. SOUND SOURCE SEPARATION

2.1. Independent Component Analysis
Blind Source Separation (BSS) is the separation of a set of source signals from a set of unknown mixed signals. The typical solution to this problem is to apply Independent Component Analysis (ICA). ICA is a method for separating a multivariate signal into additive subcomponents. This is done by assuming that the subcomponents are non-Gaussian signals and that they are statistically independent of each other. An important point is that if N sources are present, at least N observations are needed to recover the original signals.

2.2. Speech Enhancement and Sound Source Separation for dynamic environments
Geometrically constrained High-order Decorrelation based Source Separation with Adaptive Step-size control (GHDSS-AS) is a sound source separation technique proposed by Nakadai et al. [2]. GHDSS-AS can be applied to track dynamic changes in a robot audition system.
GHDSS is a hybrid algorithm that combines beamforming and BSS: using a microphone array, it performs decorrelation between the sound source signals and directivity formation toward the sound direction.
A Noise Reduction (NR) method called Histogram-based Recursive Level Estimation (HRLE) is also proposed. It enhances speech with a small number of parameters in consideration of dynamic environments.

2.3. Task
Since GHDSS is a linear separation technique, it does not handle nonlinearly mixed signals. Meanwhile, a model proposed for noisy nonlinear ICA based on maximum likelihood estimation [3], in which a generative model of the observed signal is assumed, is limited in capacity and scalability.

3. IMPLEMENTATION

3.1. Deep Neural Network (DNN)
A DNN is a neural network with multiple hidden layers of units between the input and output layers that can model complex nonlinear representations. One of its principal characteristics is the ability to self-organize sensory features from a large amount of training data, with which its recognition performance exceeds that of conventional models on the speech recognition task [4]. Moreover, it can show high performance as a model for Noise Reduction tasks [5].
In this paper, we aim to use a DNN to achieve both Sound Source Separation (SSS) and NR performance beyond that of the prior methods.

3.2. Extraction Method of Acoustic Features
We prepare the data using Mel filter bank features extracted from the audio data. For the analysis of continuous speech, we perform the following:
First, the recording is performed with an array of N microphones.
The recorded audio signal is then converted into Mel Filter Bank (MFB) features of D dimensions.
The obtained MFB features are cut into segments with a sliding window of T frames.
Finally, the features from the N channels are concatenated.
The resulting D x T x N noisy acoustic feature becomes the input to the DNN described in the next section.

3.3. Proposed Model
We use a DNN trained with supervised learning, as shown in Fig. 1. To model the separation filter, we use the MFB features from the 1-channel clean data as the target; as input, depending on the task, we use either the 8-channel signal data or the 2-channel HRLE + GHDSS output data. The DNN has multiple layers whose dimension is compressed from the input layer to the output layer.
Fig. 1. DNN - Proposed model
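
The input layout of Section 3.2 and the layer sizes reported in Section 4.2 can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the paper does not give the window hop, the flattening order, the activation function, or the weight initialization, so the one-frame hop, channel-major flattening, ReLU hidden units, linear output, and small random weights below are all hypothetical choices, and the function names are invented for this example. The sketch only checks that the dimensions fit together (27 x 11 = 297 per channel, 297 x 8 = 2376, 297 x 2 = 594); no training is performed.

```python
import numpy as np

D, T = 27, 11                 # MFB dimensions and window length (Section 4.2)
HIDDEN = (1600, 800, 400)     # hidden layer sizes (Section 4.2)


def stack_windows(mfb, T=T):
    """Turn MFB features of shape (channels, frames, D) into DNN inputs.

    Returns an array of shape (num_windows, channels * T * D).
    Assumptions: a hop of one frame and channel-major flattening.
    """
    n_ch, n_frames, d = mfb.shape
    n_win = n_frames - T + 1
    if n_win < 1:
        raise ValueError("need at least T frames")
    # Each window keeps T consecutive frames of every channel.
    windows = np.stack([mfb[:, i:i + T, :] for i in range(n_win)])
    return windows.reshape(n_win, n_ch * T * d)


def init_dnn(n_in, n_out=D * T, hidden=HIDDEN, seed=0):
    """Create (weight, bias) pairs for n_in -> 1600 -> 800 -> 400 -> n_out.

    Assumption: small Gaussian weights; the paper does not specify this.
    """
    rng = np.random.default_rng(seed)
    sizes = (n_in, *hidden, n_out)
    return [(rng.normal(0.0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    """Forward pass: ReLU hidden layers (an assumption) and a linear output."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    return h @ w + b


# Synthetic stand-in for 8-channel noisy MFB features (50 frames).
rng = np.random.default_rng(1)
noisy_8ch = rng.normal(size=(8, 50, D))

x = stack_windows(noisy_8ch)      # 8-channel input: 2376 dimensions
params = init_dnn(x.shape[1])
y = forward(params, x)            # estimate of 1-channel features: 297 dimensions
print(x.shape, y.shape)
```

Passing a 2-channel array instead (for example, a stand-in for the HRLE + GHDSS output) gives 594-dimensional inputs with the same helper, which matches the second experiment's input size.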
4.1. Audio Dataset
For the training and test stages we use the JNAS dataset, which includes almost 70,000 spoken sentences in Japanese from different speakers, with different noise levels and different utterance lengths.
The dataset was prepared using a Hearbo robot equipped with an 8-channel microphone array (Fig. 2). A speech sentence was sampled every 10 deg. around the robot, starting at 0 deg. The fan noise from the robot was used as the noise input.

Fig. 2. Hearbo

4.2. Training the DNN
For our experiments, we extract 27-dimensional MFB features from the audio data and use a sliding window of 11 frames, so the input has 297 dimensions per channel (27 x 11) and the output has 297 dimensions. For the first experiment we use the 8-channel noisy data, so the input has 2376 dimensions (297 x 8); for the second experiment, which uses the HRLE + GHDSS cleaned output as input, the DNN input has 594 dimensions (297 x 2).

We evaluated multiple types of DNN and found the most suitable structure to have three hidden layers with 1600, 800, and 400 neurons, respectively. As training data, we use randomly selected samples from the JNAS dataset, without information about the speaker location angle.

4.3. Results
Fig. 3 shows the results of our experiments: the speech recognition accuracy of the Julius engine for multiple audio inputs from different directions. Both the GHDSS-HRLE and the DNN methods achieve almost the same average result. However, applying the DNN for denoising after a previous GHDSS-HRLE stage improves on the accuracy of GHDSS-HRLE alone.

Fig. 3. Results

5. CONCLUSION
In this paper, we proposed an SSS method using a DNN. As shown in the results, the combined DNN + GHDSS-HRLE method improved the recognition rate over using either method alone. However, it is important to note that the DNN alone can achieve the recognition rate of two different processes in a single DNN process, with no information other than the noisy signal and the clean signal.

This work was supported by MEXT Grant-in-Aid for Scientific Research (A) 15H01710.

[1] K. Nakadai, T. Lourens, H. G. Okuno, and H. Kitano, "Active audition for humanoid," Proc. Natl. Conf. Artif. Intell., pp. 832-839, 2000.
[2] K. Nakadai, G. Ince, K. Nakamura, and H. Nakajima, "Robot audition for dynamic environments," 2012 IEEE Int. Conf. Signal Process. Commun. Comput. (ICSPCC 2012), pp. 125-130, 2012.
[3] S. Maeda and S. Ishii, "A noisy nonlinear independent component analysis," Proc. 2004 14th IEEE Signal Process. Soc. Work. Mach. Learn. Signal Process., pp. 173-182, 2004.
[4] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition," IEEE Signal Process. Mag., no. November, pp. 82-97, 2012.
[5] X. Feng, Y. Zhang, and J. Glass, "Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition," ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., pp. 1759-1763, 2014.