Department of Electronic Engineering
Master's Thesis
Winner Roedily
M10702803
Advisor: Dr. 阮聖彰
July 21, 2020
Evaluation of Real-Time Noise Classifier
based on CNN-LSTM and MFCC for Smartphones
ABSTRACT
Recent studies demonstrate various methods to classify noises present in daily human
activity. Most of these methods utilize multiple audio features that require heavy
computation, which increases the latency. This paper presents a real-time sound classifier
for smartphones that uses only the Mel-Frequency Cepstral Coefficients (MFCCs)
as the feature vector. By relying on this single feature and an augmented audio dataset,
this system drastically reduces the computational complexity and achieves 92.06%
accuracy. The system utilizes the TarsosDSP library for feature extraction and a
TensorFlow Lite model for classification, and the most suitable number of MFCCs is
determined experimentally. The results show that the developed system can classify the
noises with higher accuracy and a shorter processing time compared with other
architectures. Additionally, this system consumes only about 0.03 Watts of power.
Table of Contents
Abstract .............................................................................................................................i
1 Introduction ..................................................................................................................1
2 Related Works...............................................................................................................8
3 Proposed Method ........................................................................................................21
4 Experimental Results ..................................................................................................35
5 Conclusions..................................................................................................................52
References....................................................................................................................... 53
List of Tables
List of Figures
3.2 MFCC heatmap for the sound of (a) car horn, (b) crowd, (c) dog bark, (d) siren, (e) traffic, and (f) mixed noises ....................................................................................24
4.1 Classification accuracy comparison between the proposed system and two other architectures ....................................................................................................37
4.2 Elapsed time in classifying a six-second-long audio file using three different architectures ....................................................................................................40
4.3 Confusion matrix for augmented test data using 30 MFCCs .......................... 42
4.4 Screen capture of the application's power usage ............................................44
4.5 Classification result from (a) the beginning (nothing), (b) an audio file, and (c) microphone......................................................................................................47
Chapter 1
Introduction
1.1 Hearing Loss and Hearing Aids

The ear is a sophisticated human organ that works daily to receive sound from its
surroundings in an audible range. The human ear typically receives sound waves from the
environment at the eardrum, then passes the vibration through the middle and inner parts
of the ear, causing the hair cells in the cochlea to move, which generates nerve impulses
to the brain. Hearing problems occur mainly because of the dysfunction of the cochlea.
For reasons such as aging, exposure to deafening sounds, drug consumption, and some
infections, the hair cells in human ears are reduced in number [1].
In the latest statistics from the World Health Organization (WHO), around 466 million
people worldwide suffer from hearing loss [2]. Though these numbers are large, the
majority (80%) of adults aged 55-74 years old refuse to use hearing aid products. The
low adoption rate is due to various reasons, such as the price, misfit and discomfort, and
periodic maintenance [3]. One of the most compelling reasons, which caught our attention,
is the presence of environmental noises that are not filtered out by conventional hearing
aids [4].

Figure 1.2: Block diagram of the feedforward ANC system
Among the existing noise-reduction techniques, Active Noise Cancellation (ANC) is the most popular one and is still present in the market nowadays.
The ANC cancels out the noises by combining the sound wave present at the moment
with its inverse [5], as seen in Figure 1.1. Instead of removing one particular noise, this
system removes all ambient sound. Such a system is not suitable for hearing aid
applications since the ANC will attenuate all sound, including the information itself. A
better approach for such an application is first to classify the noise and then apply a
removal method suited to that particular noise type.
Hearing aids come with different models and technologies such as In-the-Ear,
In-the-Canal, Completely-in-the-Canal, Behind-the-Ear, and Receiver-in-Canal.
Behind-the-Ear (BTE) hearing aids are the most popular type since they are less susceptible to
feedback problems due to the greater separation between the microphones and receivers.
However, BTE devices that come with earmolds may need periodic maintenance to preserve
their performance [6].
From a technical point of view, a hearing aid comprises three essential parts: a
microphone (some types have dual microphones), an amplifier, and a speaker [7], all of
which are present in smartphones nowadays. According to the Global Mobile Market Report,
the number of smartphone users worldwide in 2017 was 2.7 billion and reached 3.2 billion
in 2019, as shown in Fig. 2. Based on this trend, we may conclude that smartphones are
getting more affordable in the coming years. This fact brings an opportunity to bring
hearing aids closer to the people by deploying such applications on their mobile devices.
1.2 Noise Classification
In recent years, researchers have tried to implement deep learning in noise reduction, for
instance, the Deep Denoising Autoencoder (DDAE). Lai et al. [8] presented DDAE-NR in
their research, which separated the system into a Noise Classifier and a Noise Removal
stage. Their results were satisfying both in classification and denoising. However, this
system is not suitable for real-time applications.

The idea of utilizing machine learning for noise classification has led to several kinds
of applications, such as the cochlear implant, where computer engineering and the
biomedical field contribute to solving health issues to
improve human daily life. Researchers tried different approaches to improve the
performance and accuracy of noise classification for the cochlear implant. In 2014, Saki
et al. [9] implemented the Random Forest Tree Classifier in a cochlear implant and
achieved a 10% improvement in accuracy compared to the previous version, which used
the Gaussian Mixture Model (GMM). However, they only tested this method on three kinds
of noises, which still needed further improvement. Five years later, Alavi et al. [10]
improved the classifier by utilizing GMM as the classifier model and MFCC as the feature.
The classification result was very satisfying, with the classification accuracy reaching 100%.
Singh et al. [11] also achieved high accuracy by utilizing a 127×64 log mel-spectrogram
as the feature vector. Hassan et al. [12] also brought a remarkable result from their
research. With 5000 epochs, their model achieved 94.6% accuracy. Reducing the number
of target classes supported the achievement of such results, as done in their research,
where they classified only five out of ten classes from the dataset.
All the studies mentioned above implemented the convolutional neural network
(CNN) for classifying the noises. Sang et al. [13] introduced another method called the
convolutional recurrent neural network (CRNN). They trained it on an urban sound
dataset and achieved 79.06% accuracy for their CRNN8 architecture, which is composed of
eight CNN layers, one RNN layer, and one fully connected layer.
1.3 Organization of this Thesis
This thesis presents a real-time noise classifier for smartphones as a
stepping stone to overcome the issue mentioned in [4]. We divide the entire process into
model training and application development. The challenge present in this
development is to achieve high classification accuracy while keeping the latency low
at the same time. Compared to desktops, mobile applications are resource-limited, and
heavy computation will increase the latency of the overall process. TensorFlow Lite
provides the solution to deploy a machine learning model on a mobile device for lighter
computation and minimal resource consumption. Since TensorFlow Lite was made as a
compact version of the TensorFlow model, some of the operators in TensorFlow are not
present in TensorFlow Lite. The most apparent incompatibility is in the Recurrent Neural
Network model. Due to such an issue, we need to use the library's experimental version
of the LSTM cells.
By designing and developing the proposed mobile application, this thesis offers two
contributions:
• High-accuracy noise classification by utilizing only one audio feature, mel-frequency cepstral coefficients (MFCC).
• A low computation complexity, low latency, and low battery consumption mobile application.
The rest of this thesis is organized as follows. Chapter 2 reviews the
existing studies regarding noise classification. Then, Chapter 3 describes the proposed
noise classifier and the determination of the most suitable number of MFCCs to achieve the
expected results. Next, Chapter 4 presents the performance of the developed application
and compares the accuracy for each number of MFCCs selected for the test. Finally,
Chapter 5 provides the conclusions as well as the future direction of this thesis.
Chapter 2
Related Works
The architecture adopted in this thesis incorporates both convolutional and recurrent
neural networks, which means that it preserves both sparsity and sequentiality. In this
chapter, we present the most common techniques behind this combination. First, we
need to understand the characteristics of the MFCC, which we present in Section 2.1. The
procedure for extracting MFCCs is also an essential process for this application,
and Section 2.2 discloses this matter. Section 2.3 elaborates on the method regarding the
sparsity of the extracted features, followed by the sequentiality in Section 2.4. Also,
Section 2.5 presents the motivation behind this work.
2.1 MFCC Characteristics
There are several types of "worth-extracting" information contained in raw audio data,
such as mel frequency power spectra [14], log-mel spectrograms [15, 11], Mel-Frequency
Cepstral Coefficients (MFCC) [8, 12, 16, 17], and many more. MFCC is the most popular
and widely used for noise classification and vocal representation since MFCC accurately
represents the shape of the vocal tract that manifests itself in the envelope of the
short-time power spectrum.
We can observe MFCC as a representation of how our brain perceives sounds that
propagate to our ears by analyzing the periodogram. This periodogram contains the power
spectrum of an audio frame, where the estimation performs similarly to the human cochlea
in identifying which frequencies are present in the frame. The human cochlea cannot
discern the difference between two closely spaced frequencies; hence, we implement a Mel
filterbank to imitate this characteristic. We are only concerned with roughly how much
energy occurs in each frequency region, and the Mel scale tells us exactly how to space
the filterbanks.

The Mel scale relates the perceived frequency (pitch) of a pure tone to its actual measured
frequency. Human ears discern small changes in pitch much better at low frequencies
than at high ones, and the Mel scale makes the features match this characteristic more
closely. With $f$ as the frequency in Hertz and $m$ as the frequency in mels, the formula
for converting from Hertz to mel presents as:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{2.1}$$

and the inverse conversion from mel back to Hertz presents as:

$$f = 700\left(10^{m/2595} - 1\right) \tag{2.2}$$
Another characteristic that we need to consider is the fact that human ears perceive
loudness on a logarithmic scale. This condition means that significant variations in
energy may not sound significantly different if the sound begins at a high sound-pressure
level.
MFCCs were originally introduced in automatic speech recognition
systems, such as systems that can automatically recognize numbers spoken into a
telephone. MFCCs are also increasingly finding uses in music information retrieval
applications such as genre classification, audio similarity measures, and many more. The
mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of
a sound; therefore, MFCCs are the coefficients that collectively make up an MFC [20].

Figure 2.1. Sampled audio file
Deciding how many MFCCs to extract from a single audio frame is challenging. A
system can achieve higher classification accuracy by increasing the number of MFCCs to
extract. On the other hand, this will increase the dimensionality, which requires more
computation and causes latency. Jacoby's research [21] shows how the number of MFCCs
(25 to 40) affects the accuracy of a model trained using the UrbanSound8K dataset.

2.2 MFCC Extraction

In summary, the extraction of MFCCs from an audio signal comprises the following steps:
• Sample the audio signal into short, overlapping frames.
• Take the Discrete Fourier Transform of each frame and compute the periodogram estimate of the power spectrum.
• Map the powers of the spectrum obtained above onto the mel scale, using triangular-overlapping windows.
• Take the log of the energy at each of the mel frequencies.
• Take the discrete cosine transform of the list of the mel log powers, as if it were a signal.
Like any other audio feature extraction, MFCC extraction starts with sampling, and
we can visualize the sampled audio file as shown in Figure 2.1. Slicing an audio signal
into frames can be tricky since an audio signal is continually changing. If the frame
duration is too short, we will not have enough samples to get a reliable spectral estimate.
If it is too long, we would receive too much variation within the frame, which might cause
us to lose track of how the signal changes over time.
The next step is to calculate the periodogram estimate of the power spectrum by
taking the Discrete Fourier Transform (DFT) of each frame. A periodogram sample of one
frame is presented in Figure 2.2. If we denote the DFT of the sampled frame as $S_i(k)$,
the periodogram estimate presents as:

$$P_i(k) = \frac{1}{N}\left|S_i(k)\right|^2 \tag{2.3}$$

where $P_i$ is the power spectrum of frame $i$, $S_i$ is the DFT of frame $i$, and $N$ is
the number of samples.
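As a small illustration (our own sketch, not code taken from TarsosDSP), Equation 2.3 maps directly to a few lines of Java, given the real and imaginary FFT outputs of one frame:

    /**
     * Periodogram estimate of Equation 2.3: P_i(k) = |S_i(k)|^2 / N.
     * re and im hold the real and imaginary parts of the frame's DFT;
     * n is the number of samples in the frame.
     */
    static double[] periodogram(double[] re, double[] im, int n) {
        double[] p = new double[re.length];
        for (int k = 0; k < re.length; k++) {
            p[k] = (re[k] * re[k] + im[k] * im[k]) / n;  // squared magnitude, scaled by N
        }
        return p;
    }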
After we receive the result, we can compute the Mel-spaced filterbank by applying
triangular filters to the periodogram estimate, where we can visualize the result as
shown in Figure 2.3. To calculate the filterbank energies, we multiply each filterbank with
the power spectrum, then add up the coefficients. This calculation will leave us with n
filterbank energies, one for each of the n filters. To construct the filterbank, we first
choose a lower and an upper frequency, where the lower frequency can go as low as zero
while the upper frequency should not be higher than half the sampling rate. Using
Equation 2.1, we convert the upper and lower frequencies to Mels, then we divide the range
between these two numbers into evenly spaced points. After that, we
convert them back to the frequency scale (Hz) and round those frequencies to the nearest
FFT bin. With $m$ denoting the filter index, $k$ the FFT bin, and $f(m)$ the bin
corresponding to the $m$-th mel point, the triangular, mutually overlapping filterbanks
present as:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\[4pt] \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[4pt] \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\[4pt] 0, & k > f(m+1) \end{cases} \tag{2.4}$$
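To make the construction concrete, the sketch below builds the filterbank from Equations 2.1, 2.2, and 2.4. This is our own minimal illustration, not the TarsosDSP implementation, and it assumes that consecutive mel points fall on distinct FFT bins:

    public final class MelFilterbank {

        // Equation 2.1: Hertz to mel.
        static double hzToMel(double hz) {
            return 2595.0 * Math.log10(1.0 + hz / 700.0);
        }

        // Equation 2.2: mel back to Hertz.
        static double melToHz(double mel) {
            return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0);
        }

        /** Builds numFilters triangular filters (Equation 2.4) over fftSize/2 + 1 bins. */
        static double[][] build(int numFilters, int fftSize, double sampleRate,
                                double lowerHz, double upperHz) {
            // numFilters + 2 evenly spaced points on the mel scale, mapped back to FFT bins.
            double lowMel = hzToMel(lowerHz);
            double highMel = hzToMel(upperHz);
            int[] f = new int[numFilters + 2];
            for (int i = 0; i < f.length; i++) {
                double mel = lowMel + (highMel - lowMel) * i / (numFilters + 1);
                f[i] = (int) Math.floor((fftSize + 1) * melToHz(mel) / sampleRate);
            }
            double[][] h = new double[numFilters][fftSize / 2 + 1];
            for (int m = 1; m <= numFilters; m++) {
                for (int k = f[m - 1]; k <= f[m]; k++) {     // rising slope of Equation 2.4
                    h[m - 1][k] = (k - f[m - 1]) / (double) (f[m] - f[m - 1]);
                }
                for (int k = f[m]; k <= f[m + 1]; k++) {     // falling slope of Equation 2.4
                    h[m - 1][k] = (f[m + 1] - k) / (double) (f[m + 1] - f[m]);
                }
            }
            return h;
        }
    }

Multiplying each row of this matrix with the periodogram of a frame and summing the products yields the n filterbank energies described above.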
The final step of this extraction process is to take the log of each of the energies
calculated in the filterbank, which leaves us with the log filterbank energies. To get the
MFCCs, we then take the discrete cosine transform of these log filterbank energies and
keep the desired number of resulting coefficients.

There are several variations in extracting MFCC features, such as differences in
the shape or spacing of the windows used to map the scale [24], or the addition of dynamics
(difference) coefficients. The number of filters, the shape of the filters, how the filters are
spaced, and how the power spectrum is warped may all affect the performance of the MFCC itself.
2.3 Convolutional Neural Network (CNN)
Over the last few years, CNN has made a breakthrough in image, text, and audio
classification. CNN has been around since the early 1990s. In the years from the late
1990s to the early 2010s, CNN was in incubation. As more and more data and computing
power became available, researchers became more interested in exploring the tasks that
CNNs can solve.
Hassan et al. [12] presented in their research how they implemented a 64-filter
convolution layer followed by max-pooling and two fully-connected layers, with MFCCs
as their feature vector. With 5000 epochs, their model achieved 94.6% accuracy. However,
using two fully-connected layers in series is not suitable for real-time smartphone
applications due to its heavy computation and large model size, even after conversion into
a TensorFlow Lite file (around 200 MB). On the other hand, Singh et al. [11] implemented
four convolutional layers and a max-pool layer followed by a fully-connected layer. This
architecture is comparatively lighter.
A Convolutional Neural Network (CNN) is a deep learning algorithm that can take
in an input subject, assign importance (learnable weights and biases) to various
aspects/objects in the subject, and differentiate one from the other. The pre-processing
required in a CNN is much lower compared to other classification algorithms.
While the filters in primitive methods are hand-engineered, with enough training, CNN
can learn these filters on its own.
Convolutional layers convolve the input and pass the result to the next layer. The
convolution operation reduces the number of free parameters, allowing the network to be
deeper with fewer parameters. There are two kinds of convolution layers, temporal
and spatial convolution, and we utilize the temporal one in this development. This layer
creates a convolution kernel that is convolved with the layer input over a single temporal
dimension to produce a tensor of outputs. When programming this temporal CNN, the input
takes the form of a three-dimensional tensor (batch size, time steps, features per step).

CNN can successfully capture the spatial and temporal dependencies in an input
subject by applying the relevant filters. The architecture fits the dataset better due to the
reduction in parameters and the reusability of weights. In other words, we can
train the network to understand the sophistication of the input subject better.
2.4 Long Short-Term Memory (LSTM)
LSTM is a kind of recurrent neural network (RNN) that uses a gating mechanism for better
memory management. This architecture is suitable for dealing with sequential information
such as audio since every sample of this kind of information is related to its predecessor.
Numerous studies [13, 17, 26] proved that LSTM increases performance in dealing with
such data due to its gating design.
The key idea of RNN is that the recurrent connections between the hidden layers
allow the memory of previous inputs to be retained in the internal state, which can affect
the outputs. However, RNN mainly has two issues to solve in the training phase: the
vanishing gradient and exploding gradient problems. When the derivatives of the activation
function shrink at every step, the gradient decays exponentially through time. This
condition makes it hard for the model to learn the correlation between temporally distant
inputs. Meanwhile, when the gradient grows exponentially during training, the exploding
gradient problem makes the training unstable.
By design, RNN takes two inputs at each time step: an input vector and a hidden
state. The next RNN step takes the second input vector and the first hidden state to create the
output of that step. Therefore, in order to capture semantic meanings in long sequences,
we need to run RNN over many time steps, turning the unrolled RNN into a very deep
network.
LSTM layers comprise recurrently connected memory blocks, where one memory
cell contains three multiplicative gates. The gates perform continuous analogs of write,
read, and reset operations, which enable the network to utilize the temporal information
over a period of time. Each LSTM cell governs what to remember, what to forget, and how
to update the memory using these gates. By doing so, the LSTM network solves the
gradient problems over the input sequence data. LSTM units are typically arranged in
layers, so that the output of each unit is the input to the units of the next layer. In this way,
the network becomes richer in its representation of temporal context.
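For reference, a standard formulation of these three gates (forget $f$, input $i$, output $o$) and the cell update, following the common LSTM literature rather than any equation printed in this thesis, is:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad h_t = o_t \odot \tanh(c_t)$$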
Figure 2.4 Classification accuracy of UrbanSound8K dataset by varying MFCCs
2.5 Motivations
Based on the background mentioned in Chapter 1, hearing loss is still a big
issue in our daily life; ironically, most people refuse to wear the hearing aids on the
market due to various reasons, including the presence of background noises. Although
much research in noise classification shows the possibility of achieving high
accuracy, we need to reduce the complexity and latency to implement such a system on a
mobile device.
Most of the noise classification methods utilize multiple features from an audio
signal to increase accuracy. Such a method indeed increases the classification accuracy,
as proved by Dang et al. [27] in their research. However, it also increases the computation
complexity, which is not suitable for mobile applications. For instance, Alamdari et al.
included several different audio features in their feature sets. The main drawback of their
unsupervised noise classifier was the latency, precisely because it utilized multiple audio
features, and this latency drastically increased when the classifier ran on limited hardware.

The accuracies we can achieve by altering the number of MFCCs we use vary, as shown
in Fig. 2.4. Based on this fact, we understand that using MFCC alone is promising for
developing a reliable system. Therefore, this thesis aims to classify daily noises with low
computation complexity, latency, and power consumption while maintaining the accuracy
at the same time by using only MFCC as the input feature.
Chapter 3
Proposed Method
The proposed method comprises two significant blocks. The first block implements a
DSP library called TarsosDSP, which handles feature extraction from the incoming audio
signal, as presented in Figure 3.1. From this block, we receive a feature vector of size
m frames × n MFCCs. This feature vector is then passed to the second block, which
is our classifier model in the form of a TensorFlow Lite model. We elaborate on these
two blocks in the following sections.

Figure 3.1 Block diagram of feature extraction flow

3.1 Feature Extraction
The feature extraction process in this application relies purely on the TarsosDSP [28]
library. TarsosDSP is a Java library for audio processing. It aims to provide an easy-to-use
interface to practical audio processing algorithms, implemented in pure Java and without
any other external dependencies. This library tries to hit the sweet spot between being
capable enough to get an actual task done while remaining compact and simple. Among
other features, it provides pitch detection algorithms, resampling, filters, pure synthesis,
some audio effects, and a pitch-shifting algorithm.

The extraction starts with capturing and sampling the raw audio waveform into audio
frames. TarsosDSP provides two functions to handle such processes, one for each source
(an audio file or the microphone). The frame size and overlap determine the number of
frames it captures per second.
The extraction process in TarsosDSP needs several predefined parameters for its
AudioDispatcher, such as the sample rate, frame size, frame overlap, cepstrum coefficients,
mel filters, and frequency thresholds. Table 3.1 shows the values we assigned for those
parameters in this thesis, and based on these parameters, we determined the most suitable
number of MFCCs for our input features.

TABLE 3.1
VALUES FOR AUDIODISPATCHER PARAMETERS

Parameters          From File    From Microphone
Sample rate         8 kHz        8 kHz
Lower frequency     0 Hz         0 Hz
Higher frequency    4 kHz        4 kHz

During the feature extraction process, we also applied a silence-detection function by
calculating the energy in decibel sound pressure level (dBSPL) of a buffer $b$ of length
$n$, which presents as:

$$SPL_{dB} = 20 \log_{10}\!\left(\sqrt{\frac{1}{n}\sum_{i=0}^{n-1} b(i)^2}\right)$$

The commonly used threshold is -70 dB [28], and it works well on loud noises such as
explosions, gunshots, car horns, or other impulses. Since our noise classes also include
continuous murmur and chatter, we needed to lower the threshold to -95 dB.
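For illustration, a minimal sketch of this extraction pipeline is shown below. It is our own sketch, not the exact thesis code: the frame size and overlap are placeholder values chosen to yield roughly 102 frames per second (see Chapter 4), the mel-filter count is an assumption, and the class names follow the TarsosDSP Android distribution as we know it.

    import be.tarsos.dsp.AudioDispatcher;
    import be.tarsos.dsp.AudioEvent;
    import be.tarsos.dsp.AudioProcessor;
    import be.tarsos.dsp.io.android.AudioDispatcherFactory;
    import be.tarsos.dsp.mfcc.MFCC;

    public class FeatureExtractor {
        static final int SAMPLE_RATE = 8000;       // Table 3.1
        static final int FRAME_SIZE = 256;         // placeholder frame size (samples)
        static final int OVERLAP = 178;            // hop of 78 samples -> ~102 frames/s
        static final int NUM_MFCC = 30;            // chosen in Chapter 4
        static final int NUM_MEL_FILTERS = 40;     // assumption
        static final float LOW_FREQ = 0f;          // Table 3.1
        static final float HIGH_FREQ = 4000f;      // Table 3.1

        public static AudioDispatcher fromMicrophone() {
            AudioDispatcher dispatcher =
                    AudioDispatcherFactory.fromDefaultMicrophone(SAMPLE_RATE, FRAME_SIZE, OVERLAP);
            final MFCC mfcc = new MFCC(FRAME_SIZE, SAMPLE_RATE, NUM_MFCC,
                    NUM_MEL_FILTERS, LOW_FREQ, HIGH_FREQ);
            dispatcher.addAudioProcessor(mfcc);    // computes the MFCCs of every frame
            dispatcher.addAudioProcessor(new AudioProcessor() {
                @Override
                public boolean process(AudioEvent audioEvent) {
                    float[] coefficients = mfcc.getMFCC();  // 30 MFCCs of the current frame
                    // append to the m-frames x 30-MFCCs feature vector for the classifier
                    return true;
                }
                @Override
                public void processingFinished() { }
            });
            return dispatcher;  // start with: new Thread(dispatcher).start();
        }
    }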
Figure 3.2. MFCC heatmap for the sound of (a) car horn, (b) crowd, (c) dog bark, (d)
siren, (e) traffic, and (f) mixed noises
TarsosDSP can work very well without any dependencies on the Android libraries.
However, for the loading and sampling process, we need to prepare FFMPEG in our asset
directory. FFMPEG is the leading multimedia framework that can decode, encode,
transcode, mux, demux, stream, filter, and play pretty much anything that humans and
machines have created. It supports the most obscure ancient formats up to the cutting
edge, no matter whether they were designed by some standards committee, the community,
or a corporation. FFMPEG in our application handles the decoding and streaming before
the feature extraction takes place.

To pass the sampled audio between threads, our application relies on Java's
BlockingQueue, a queue that
additionally supports operations that wait for the queue to become non-empty when
retrieving an element, and wait for space to become available in the queue when storing
an element. All queuing methods achieve their effects atomically using internal locks or
other forms of concurrency control.

Java BlockingQueue does not accept null values and throws a NullPointerException if
we try to store a null value in the queue. Its interface is part of the Java Collections
Framework, and its implementations mainly differ in how they store the objects placed
in the queue: as their names suggest, ArrayBlockingQueue is backed by an array, while
LinkedBlockingQueue is backed by linked nodes. There are also other differences between
these two types of queues: linked queues typically have higher throughput than array-based
queues but less predictable performance in most concurrent applications. The
SynchronousQueue is a special case: during an offer, if no other thread is currently
performing a take or poll, the offer will fail, and during a take, if no other thread is
performing an offer concurrently, it will also fail. This unique handshake mechanism is well
suited for a queue with high response requirements and threads from a non-fixed thread
pool.
BlockingQueue provides several methods to insert and retrieve elements:
• put is a method that we use to insert elements into the queue. If the queue is full, it waits for space to become available.
• add is a method that will return true if the insertion was successful, and throws an IllegalStateException if it failed.
• offer is a method that will try to insert an element into the queue, waiting for an optionally specified time for space to become available and returning false on failure.
• take is a method that retrieves and removes the element from the head of the queue. If the queue is empty, it will wait for an element to be available.
• poll is a method that retrieves and removes the element from the head of the queue and waits for a specified time if necessary for an element to become available.

There are two types of BlockingQueue. The first type is the unbounded queue, which we
create without specifying the queue capacity. In an unbounded queue, the capacity is set
to Integer.MAX_VALUE, so all operations that add an element to an unbounded queue will
never block; thus, it could grow to a massive size. The crucial part is the
requirement for the consumer to be able to retrieve the messages as quickly as the
producers fill the queue. Otherwise, the memory could fill up, and the program will throw
an OutOfMemory exception.
The second type of BlockingQueue is the bounded queue. We can create such a
queue by passing the capacity as an integer argument to the constructor. Passing this
value means that when a producer tries to add an element to a queue that is already full,
depending on the method that we use to add, the queue will block the
incoming data until space becomes available. Using a bounded queue is an excellent way
to design concurrent programs.
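As an illustration of this producer-consumer pattern (our own minimal sketch, not the thesis code), a bounded ArrayBlockingQueue can ferry audio buffers from a capture thread to the feature-extraction thread:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class AudioPipeline {
        // Bounded queue: at most 32 audio buffers in flight between the threads.
        private final BlockingQueue<float[]> queue = new ArrayBlockingQueue<>(32);

        public void start() {
            Thread producer = new Thread(() -> {
                try {
                    while (true) {
                        float[] frame = captureFrame();   // e.g., samples from the microphone
                        queue.put(frame);                 // blocks when the queue is full
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        float[] frame = queue.take();     // blocks when the queue is empty
                        extractMfcc(frame);               // feature extraction happens here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producer.start();
            consumer.start();
        }

        private float[] captureFrame() { /* placeholder for the audio source */ return new float[256]; }
        private void extractMfcc(float[] frame) { /* placeholder for TarsosDSP processing */ }
    }

The bounded capacity gives the back-pressure described above: if extraction falls behind, put blocks the producer instead of letting memory grow without limit.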
3.2 Noise Classification
Before we start training and designing our machine learning model, we need to know
and prepare the dataset we are going to use. The dataset needs to contain the
audio classes we need, which are urban noises. After gathering the dataset, we start
designing and training the classifier model.
3.2.1. Datasets
While there are many datasets available, in this research, we used only two of the most
popular urban noise datasets: UrbanSound8K and FSDKaggle2019 [29], which we present
in the following paragraphs.
UrbanSound8K contains 8732 labeled sound excerpts of urban sounds from 10 classes:
air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot,
jackhammer, siren, and street music. The classes are drawn from the urban sound taxonomy.
In the UrbanSound8K dataset, the files are pre-sorted into ten folds (folders named fold1 to
fold10) to help with the reproduction of and comparison with automatic classification results.
FSDKaggle2019 contains 29,266 audio files annotated with 80 labels of the AudioSet
Ontology. Fonseca et al. used this dataset for Task 2 of the Detection and Classification
of Acoustic Scenes and Events (DCASE) Challenge 2019. All the audio clips present as
uncompressed PCM with a bit depth of 16 bits, a 44.1 kHz sampling rate, and a mono
channel for each audio file.
The audio content of FSDKaggle2019 comes from two sources:
1. Audio clips uploaded by users of Freesound
2. The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)
Regarding the labeling, these audio data use a vocabulary of 80 labels from
Google's AudioSet Ontology. These labels cover diverse topics: Guitar and other Musical
Instruments, Human locomotion, Hands, Human group actions, Insect, Domestic animals,
Glass, Liquid, and many more. The audio clips have variable lengths (roughly from 0.3 to
30 seconds). The labeling in this dataset is also varied. Specifically, FSDKaggle2019
features three types of label quality, one for each set in the dataset: correct but potentially
incomplete labels in the curated train set, noisy labels in the noisy train set, and correct
and exhaustive labels in the test set.
FSDKaggle2019 comprises two train sets and one test set. The idea is to limit the
supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus
encouraging approaches that can leverage the larger amount of noisy data.

In the curated train set, the duration of the audio clips ranges from 0.3 to 30 seconds
due to the diversity of the sound categories and the preferences of Freesound users when
recording/uploading sounds. Labels are correct but potentially incomplete. A few of these
audio clips may present additional acoustic material beyond the provided ground truth
labels.
On the other hand, the duration of the audio clips in the noisy train set ranges from
one second to 15 seconds, with the vast majority lasting 15 seconds. Table 3.2 presents the
details of each set of this dataset.

TABLE 3.2
THREE SETS OF THE FSDKAGGLE2019 DATASET

Sets                Curated train set   Noisy train set   Test set
Clips/class         75                  300               50-150
Total clips         4970                19,815            4481
Avg. labels/clip    1.2                 1.2               1.4
Duration (hours)    10.5                80                12.9

One of the main challenges in developing deep learning solutions is the availability of a
suitable dataset. While deep learning methodologies have proved themselves in many
different contexts, it is easy to get carried away and ignore some practical realities of a
deep neural network-based solution. A network is only as good as the data used to train
it, and training an effective neural network requires a large annotated dataset.

Today's neural networks require large datasets in order to converge to an accurate
model; the more sophisticated the network, the higher the need for data. Nevertheless,
acquiring good quality data is a complicated and costly process, and accurately annotating
the data to match a particular problem adds further cost and complexity to the process [30].
To improve the classification accuracy, we merged the least significant classes into
a new class that was distinguishable among the rest, leaving us with five classes in
total. Those five classes are the sounds of car horn, crowd, dog bark, siren, and traffic.
We not only augmented the UrbanSound8K with 448 clips from FSDKaggle2019 but
also filtered out the audio files shorter than three seconds in duration to attain a
uniform dataset.
3.2.2. Classifier Model
In this thesis, our classifier model adopts the CNN-LSTM model by Sang et al. [13] with
a few modifications. Instead of using raw audio waveforms, our system uses MFCCs as
the input, and we only classify the noises into five classes. Based on their results, we
decided to build our model with two CNN layers (64 and 128 filters), one stacked RNN
cell comprising two TFLiteLSTMCells, followed by one fully-connected layer. The
TFLiteLSTMCell comes from the experimental API of TensorFlow Lite. It provides hints,
and it also makes the variables suitable for the TFLite converter. This design allows the
model to reach acceptable accuracy while maintaining high performance at the same time,
since the architecture keeps the computation light.

As presented in Figure 3.2, the process starts with feeding the feature vector into our
first pair of Conv1D (64 filters) and MaxPooling1D layers. The second pair comprises the
same elements with a different filter size for the second Conv1D (128 filters). For
optimization, we initialize the weights using the Xavier/He initialization [31].
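On the application side, the converted model runs through the TensorFlow Lite Interpreter. The sketch below is a minimal illustration of that step; the model file name, the input layout (1 batch × m frames × 30 MFCCs), and the helper names are our assumptions, not the exact thesis code.

    import android.content.Context;
    import android.content.res.AssetFileDescriptor;
    import org.tensorflow.lite.Interpreter;
    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class NoiseClassifier {
        private static final int NUM_CLASSES = 5;  // car horn, crowd, dog bark, siren, traffic
        private final Interpreter interpreter;

        public NoiseClassifier(Context context) throws Exception {
            // Memory-map the converted model from the assets folder.
            AssetFileDescriptor fd = context.getAssets().openFd("noise_classifier.tflite");
            FileChannel channel = new FileInputStream(fd.getFileDescriptor()).getChannel();
            MappedByteBuffer model = channel.map(FileChannel.MapMode.READ_ONLY,
                    fd.getStartOffset(), fd.getDeclaredLength());
            interpreter = new Interpreter(model);
        }

        /** features: [1][frames][30] MFCC feature vector; returns the per-class scores. */
        public float[] classify(float[][][] features) {
            float[][] output = new float[1][NUM_CLASSES];
            interpreter.run(features, output);
            return output[0];
        }
    }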
When we initialize our weights randomly, the values are probably close to zero,
given the probability distributions with which we initialized them. If the weights are close
to zero, the gradients in the upstream layers vanish due to the multiplication of small values.
On the other hand, if the weights are larger than one, the multiplied values grow too large,
which is commonly known as an explosion. In the Xavier/He initialization, the weights are
initialized keeping in mind the size of the previous layer, which helps in attaining a global
minimum of the cost function faster and more efficiently. The weights are still random
but differ in range depending on the size of the previous layer of neurons. This mechanism
provides a controlled initialization; hence, the gradient descent becomes faster and more
efficient.
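For reference, in the usual formulation (following the common deep learning literature and [31], not an equation printed in this thesis), the variance of the initial weights is scaled by the layer sizes:

$$\mathrm{Var}(W) = \frac{2}{n_{\mathrm{in}} + n_{\mathrm{out}}} \ \text{(Xavier)}, \qquad \mathrm{Var}(W) = \frac{2}{n_{\mathrm{in}}} \ \text{(He)}$$

where $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ are the numbers of input and output units of the layer.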
We trained our model using Adam [32] as the optimizer. We may observe Adam as a
combination of two earlier methods: Adam uses the squared gradients to scale the learning
rate like RMSprop, and it takes advantage of momentum by using the moving average of
the gradient instead of the gradient itself, like SGD with momentum. SGD has proved itself
as an efficient and effective optimization method that was central in many machine
learning success stories.
Adam uses estimations of the first and second moments of the gradient to adapt the
learning rate for each weight of the network. Regarding the loss function, categorical
cross-entropy is suitable for multi-class classification, while binary cross-entropy is
suitable for multi-label binary classification tasks. Formally, the latter loss is equal to the
average of the per-class binary cross-entropy. Using categorical cross-entropy would
prohibit us from seeing the dominant noise in the environment together with the other
noises that come along during the sampling, which is why our model relies on the binary
formulation.
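As a reference, the update rule from the original Adam paper [32], with gradient $g_t$, decay rates $\beta_1, \beta_2$, and step size $\alpha$, is:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$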
Chapter 4
Experimental Results
Since this thesis is about developing a mobile application, we also need to consider the
resource consumption of the device. While these considerations are all related, latency is
also essential since this thesis also covers audio signal processing. In this thesis, we
execute two kinds of experiments. The first one is to find a suitable number of MFCCs to
use in our system in order to determine the fixed cepstral length of the feature vector.
Along with this first experiment, we also tested the same number of MFCCs, epochs,
and dataset on two other noise classifier models. We meant to run this test as a benchmark
against our proposed model. The second experiment is to observe the resource and power
consumption while running the application and to compare the results when the application
runs on different models of smartphones. We also ran the same benchmarking process here
to see the amount of time needed for the application to run. The overall process comprises the
feature extraction process and the classification process. The feature extraction process
should be static since we use the same method for extracting the MFCCs in the proposed
architecture and the other two. In this chapter, we provide the experimental results along
with their analysis.
4.1 MFCCs Determination

Figure 4.1 Classification accuracy comparison between the proposed system and two other architectures
Increasing the feature vector size (dimensionality) will increase the chances for higher
accuracy; however, it will also demand heavier computation. As mentioned in the MFCC
extraction process, the size of the feature we extract depends on the values we put in the
sampling frame and overlapping frame size. These amounts determine the number of audio
frames our application needs to sample, and each sampling for each frame requires
computation time and memory allocation for storing the extracted values temporarily.
Decreasing the feature vector size will increase the performance of the system, but it might
also compromise the accuracy of the model itself. Therefore, before we started the
development of the entire system, we first determined the most suitable number of MFCCs.
We ran four experiments to determine the most suitable number of coefficients, from
30 to 45. We decided not to go over 45 since, as mentioned in Jacoby's research, the
accuracy goes down beyond that amount of MFCCs. Figure 4.1 presents the results of the
experiments, and we found that 30 is the most suitable one since it leads to the highest
accuracy and the shortest computation time, as seen in Figure 4.2. Based on this result,
we use 30 MFCCs throughout the rest of the development.
Finding the number of frames we need to use for the feature vector is also tricky due
to the workflow of the TarsosDSP library. As presented in Table 3.1, the acceptable data
types for the frame size and frame overlap size are integers, which makes it harder for us to
perform the tuning. Since we train our model with the longest duration being a six-second-long
audio file, we tried to get 600 frames per six seconds, which would make 100
frames per second. However, we cannot achieve exactly that amount due to the resolution,
and we decided to use the closest one, which is 102 frames per second. This amount is
suitable for our purpose.
The trend of the classification accuracy does not scale linearly with the number of
MFCCs. As we can see in Figure 4.1, the accuracy somehow goes down when we utilized
40 MFCCs. According to the results from Jacoby's research, he found a relatively even
distribution for some sounds and a drastic change in some other cases. The sound of the
car horn is a notable example, where the classification accuracy drastically increases as
the band count increases from 10 to 50. This phenomenon makes sense for a very tonal
sound, such as the sound of a car horn, as the resolution of audible frequencies increases
with the number of available Mel bands. Another kind of sound that is also tonal is the
sound of the siren. The classification accuracy for this sound starts relatively high even
at small numbers of Mel bands.
From observing the trend and behavior of the MFCC feature, we understand that the
classification accuracy obtained from MFCCs fluctuates based on the tonal characteristics
of the soundwave. Therefore, we decided to modify our dataset by changing the labels of
some audio files. Based on the classification trends, we understand that some labels have
characteristics similar to the others. For instance, the sounds of a car engine and a
motorcycle have a minimal difference. Regrouping these similar audio files increases
the classification accuracy, and it also helps us in determining the number of MFCCs we
need to use.
Figure 4.2 Elapsed time in classifying a six-second-long audio file using three different architectures

4.2 The Runtime of the Developed Application
In the past few years, smartphones have become a high-demand product in the global
market. Many developers came out with their own products and new device architectures.
These vastly diverse architectures become a challenge for us in developing a system that can
work on the majority of the products. To validate the performance of our developed
application, we measured the average runtime for each audio file sample (six samples) on
four different smartphones, listed in Table 4.1.

Figure 4.2 presents the runtime comparisons between our developed system and the
other classifiers. Based on these results, the system consumed an average runtime of 666
ms for extracting 30 MFCCs and classifying a six-second-long audio file. Compared to
the other architectures, our proposed system consumed less processing time.
TABLE 4.1
SPECIFICATIONS OF THE TESTED SMARTPHONES

TABLE 4.2
RESOURCES CONSUMPTION OF THE DEVELOPED SYSTEM IN A NOISY ENVIRONMENT

Device     CPU usage    Memory usage
Phone D    7%           109.7 MB
Figure 4.3 Confusion matrix for augmented test data using 30 MFCCs
Regarding our noise classifier model, we managed to achieve 92.06% accuracy; for
a brief review and analysis, we present the results in a confusion matrix, as seen in Figure
4.3. Each of these five labels has its own unique characteristics, such as a continuous
medium-range frequency for the car horn, a murmur sound for the crowd, a short and
low-range frequency for the dog bark, a flange-like audio signal for the siren, and a
continuous motor sound for vehicles. These different tonal characteristics affect the
classification accuracy since one type of audio is classified well with a certain number of
MFCCs, while another type may perform better with a different number.
As we can see in the confusion matrix, the classifications for car horn and siren are
not as accurate as the rest. This issue occurs because there are fewer audio files for those
two labels than for the other labels; besides, the audio files themselves are not clean enough
to gain higher accuracy. As we mentioned before, the audio signals present in the dataset
often contain other sounds in the background.

4.3 Resource Consumption during Classification

Figure 4.4 Screen capture of the application's power usage

We conducted the experiments to measure the average CPU and memory usage during the
classification process, as presented in Table 4.2.
One challenging objective in this development is to create an application that can run
over an extended period. Therefore, we also need to examine the battery consumption
during the runtime. After testing the application for one hour in a noisy environment, we
retrieved the result by screen-capturing the battery usage information, as shown in Figure
4.4. Based on the captured information, we found that the device's estimated power use
during the execution time is around 0.03 Watts, as converted from the 6 mAh written in
the snapshot. This information also shows that the total CPU runtime is around 5 minutes.
Bear in mind that the CPU foreground value shows the duration during which the
application was running while one of its Activities was in the foreground. It might also
include the time when a Service from the application was in the foreground, which displays
an ongoing notification. CPU total includes all of the CPU usage (services and broadcast
receivers in the background and activities in the foreground). According to the results
presented in Figure 4.4, we can see that the CPU total time is only three seconds more
than the CPU foreground time. This result means that our application does not have many
processes that run in the background. Most of the exhaustive processes are for presenting
the graph.
Keep awake measures the length of time that the application has used wake locks or
alarms to keep the device awake when it would otherwise have been asleep. In a way, this
is potentially the most significant drain on the battery. Sleeping uses much less power
than staying awake, so if an application keeps a wake lock for a long time, it is keeping
the device in a high-power mode all the time, even if the application is not doing any
significant work. In our results, we can see that our application does not require any
wake locks to keep the device awake; hence, our application conserves energy for
further use.
From our captured results, we may assume that our application consumes a low amount
of power from the device battery. However, this result is only a rough estimation of the
real battery consumption. The CPU does not do the usage calculations by itself. It may
have hardware features to make the task less cumbersome, but it is mostly the job of the
operating system, and the details of these implementations vary (especially in the case
of multicore systems). The general idea is to see how long the queue of tasks our CPU
needs to finish is. The operating system may take a look at the scheduler periodically to
estimate the load.
After all, we still need to refer back to the number of cores the device has to
interpret the CPU load, since CPU load and battery consumption are related. On a quad-core
system, a value higher than 25% would mean that the system has fully utilized one
core for the application and, as a result, would be considered high CPU usage, although
this is still fine in short bursts. All the threads used in the application should be named in
order to use the provided information in the best possible way. It is always good to check
whether any thread consumes more CPU time than expected.

Figure 4.5 Classification result from (a) the beginning (nothing), (b) an audio file, and (c) microphone

4.4 Developed Application Overview
After the development, we ran a quick test to see how the application behaves. Figure
4.5(a) shows the overview of the application the moment it starts running. When the
application starts, it first asks for permission to access the device's microphone and
to write and read the internal storage. The application needs access to the internal
storage to move our audio file samples to a folder it creates, called CSV. This folder serves
as the application's default storage for the CSV files of extracted features.
This application gives two options for the source of the audio signal: an
audio file or the microphone. The purpose of classifying from an audio file is to see how good
this system is at classifying a known noise label. In the top-right corner of the screen, we
put a button called CSV, which bypasses the classification function and saves the
extracted features as a CSV file instead. Saving the features as CSV serves the purpose
of further research, or of comparing our extracted features with MFCCs extracted by
other libraries. If we activate the CSV button, the From Microphone radio button is
disabled since saving the features from the microphone to a CSV usually caused the
application to crash due to the silence-detector function, which wipes the features
whenever the signal falls below the threshold.
As we can see in Figure 4.5(b), the system can classify the sound of a siren accurately.
It also shows that the system classifies the sound as the sound of traffic in a smaller portion.
Such a classification occurs because the sound present in the audio file also contains the
sound of an engine (probably the ambulance); however, the siren sound dominates the
environment. During the test, we noticed a little bit of latency before the results showed up
after we clicked on the SELECT FILE button. This latency is longer than the computation
time mentioned in Figure 4.2. The plotter causes this latency: it calculates and plots the
graph to show the results in an interactive view. According to the debugger report, the
processing time remains consistent with the results presented in Figure 4.2.
Figure 4.5(c) shows an overview of the classification from the device's microphone.
It behaves similarly to the classification from an audio file, with the difference that the
graph continually changes whenever there is a sound above the silence threshold.
The microphone button changes into a stop sign since we added a function so that we can
stop this process at any time. The application shows the classification result every second.
This one-second interval is neither a delay nor a latency: as mentioned in Chapter 3, we
capture 102 frames every second. We can change this capturing duration to a shorter or
longer time according to our needs; we decided to use a one-second duration as a
reasonable default.
We include our model file in the asset folder, together with the audio file samples.
This TensorFlow Lite model is compact and relatively small in size thanks to the
conversion from the full TensorFlow model.
4.5 Discussion
In this thesis, we surveyed other state-of-the-art systems for noise classification
and compared them with the developed application. The real-time noise classifier by
Alamdari et al. achieved satisfying results with its classification method. However, that
system uses multiple features, such as band-related features, which demand a significant
amount of computation. According to their results, the system required 26% CPU
consumption on an octa-core 2.35 GHz CPU, whereas our system requires less when it
runs on a similar specification, as seen in Table 4.2. Our developed classifier is designed
in such a way as to not only overcome the latency and high computation requirements but
also achieve high accuracy. Also, merging the least significant classes expanded the range
of variation in our noise classes. This merging also allows us to have a higher resolution
for the tonal characteristics, which affects the decision on the number of MFCCs we need
to use.
Although our system has achieved high classification accuracy, there are still
possibilities to increase these values by using a cleaner audio dataset. One of the methods
is to curate and re-label the noisier clips before training.

This work also allows us to try to implement this classifier in a noise-removal system
by treating the noisy sound we receive according to the type of noise present at the
moment. Since we developed the system for a smartphone, it may provide an opportunity
to reach a broad range of devices. Some compatibility issues remain, and we will try to fix
them in future development. Currently, we have tried our application on the following
devices:
• Lenovo K6 Note
• ASUS Z01KD
• ASUS X017DA
• Google Pixel 4
Chapter 5
Conclusions
In this thesis, we developed a real-time noise classification system for smartphones
that surpasses the performance of the other two architectures with 92.06% accuracy while
only using one audio feature (MFCC). This system presents a stepping stone for
overcoming the problem of current hearing aids in eliminating annoying environmental
noises, which discourages people from using such a helpful device. When the system
receives an audio signal that is above the defined threshold, the noise classifier model can
perform the classification by using only MFCCs as the input features. Also, this system
works with lower CPU consumption and less processing time (666 ms) compared with the
other architectures. Therefore, this developed system can support the further development
of smartphone-based hearing assistance applications.
References
[2] World Health Organization, "Deafness and hearing loss," World Health Organization, 2019.
[3] A. McCormack and H. Fortnum, “Why do people fitted with hearing aids not wear
them?,” International journal of audiology, vol. 52, no. 5, pp. 360-368, 2013.
[5] C.-Y. Chang, A. Siswanto, C.-Y. Ho, T.-K. Yeh, Y.-R. Chen and S. M. Kuo,
products,” IEEE Consumer Electronics Magazine, vol. 5, no. 4, pp. 34-43, 2016.
[6] AUDI-LAB, “The Advantages & Disadvantages of Hearing Aid Types,” AUDI-
[8] Y.-H. Lai, Y. Tsao, X. Lu, F. Chen, Y.-T. Su, K.-C. Chen, Y.-H. Chen, L.-C. Chen,
Speech Intelligibility for Cochlear Implant Recipients,” Ear and Hearing, vol. 39,
p. 1, 2018.
2014.
[13] J. Sang, S. Park and J. Lee, “Convolutional Recurrent Neural Networks for Urban
[14] X. Lu, Y. Tsao, S. Matsuda and C. Hori, “Speech enhancement based on deep
[18] A. Samal, D. Parida, M. R. Satapathy and M. N. Mohanty, “On the Use of MFCC
[20] M. Xu, L.-Y. Duan, J. Cai, L.-T. Chia, C. Xu and Q. Tian, “HMM-Based Audio
[22] J. Salamon, C. Jacoby and J. P. Bello, “A Dataset and Taxonomy for Urban Sound
[23] M. Sahidullah and G. Saha, “Design, analysis and experimental evaluation of block
MFCC,” Journal of Computer Science and Technology, vol. 16, no. 6, pp. 582-589,
2001.
[26] J. Dai, S. Liang, W. Xue, C. Ni and W. Liu, “Long short-term memory recurrent
neural network based segment features for music genre classification,” in 2016 10th
Tianjin, 2016.
[28] J. Six, Digital Sound Processing and Java - Documentation for the TarsosDSP
Consumer Devices and Services 3—Getting More From Your Datasets With Data
2020.
[31] S. K. Kumar, “On weight initialization in deep neural networks,” ArXiv, vol.
abs/1704.08863, 2017.
[33] L. D. Shapiro, “Boston Chapter Attuned to Hearing Health Care [Society News],”