Department of Electronic Engineering
Master's Thesis
Winner Roedily
M10702803
Advisor: Dr. 阮聖彰
July 21, 2020
Evaluation of Real-Time Noise Classifier
based on CNN-LSTM and MFCC for Smartphones
ABSTRACT
Recent studies demonstrate various methods to classify noises present in daily human
activity. Most of these methods utilize multiple audio features that require heavy
computation, which increases the latency. This paper presents a real-time sound classifier
for smartphones that uses only the Mel-Frequency Cepstral Coefficients (MFCCs)
as the feature vector. By relying on this single feature and an augmented audio dataset,
this system drastically reduces the computational complexity and achieves 92.06%
accuracy. The system utilizes the TarsosDSP library for feature extraction and a
TensorFlow Lite model for classification, and the most suitable number of MFCCs is
determined experimentally. The results show that the developed system can classify the
noises with higher accuracy and a shorter processing time compared with other
architectures. Additionally, this system consumes only about 0.03 Watts of power.
Table of Contents
Abstract .............................................................................................................................i
1 Introduction ..................................................................................................................1
2 Related Works...............................................................................................................8
3 Proposed Method ........................................................................................................21
4 Experimental Results ..................................................................................................35
5 Conclusions..................................................................................................................52
References....................................................................................................................... 53
List of Tables
List of Figures
3.2 MFCC heatmap for the sound of (a) car horn, (b) crowd, (c) dog bark, (d) siren, (e) traffic, and (f) mixed noises ....................................................................................24
4.1 Classification accuracy comparison between the proposed system and two other architectures ....................................................................................................37
4.2 Elapsed time in classifying a six-second-long audio file using three different architectures ....................................................................................................40
4.3 Confusion matrix for augmented test data using 30 MFCCs .......................... 42
4.4 Screen capture of the application's power usage ............................................44
4.5 Classification result from (a) the beginning (nothing), (b) an audio file, and (c) microphone......................................................................................................47
Chapter 1
Introduction
1.1 Hearing Loss and Hearing Aids

The ear is a sophisticated human organ that works daily to receive sound from its
surroundings in an audible range. The human ear typically receives sound waves from the
environment at the eardrum, then passes the vibration through the middle and inner parts
of the ear, causing the hair cells in the cochlea to move, which generates nerve impulses
to the brain. Hearing problems occur mainly because of the dysfunction of the cochlea.
For reasons such as aging, exposure to deafening sounds, drug consumption, and some
infections, the hair cells in human ears are reduced in number [1].
In the latest statistics from the World Health Organization (WHO), around 466 million
people worldwide suffer from hearing loss [2]. Though these numbers are large, the
majority (80%) of adults aged 55-74 years old refuse to use hearing aid products. The
low adoption rate is due to various reasons, such as the price, misfit and discomfort, and
periodic maintenance [3]. One of the most compelling reasons, which caught our attention,
is the presence of environmental noises that are not filtered out by conventional hearing
aids [4].

Figure 1.2: Block diagram of the feedforward ANC system
Among the existing noise-reduction techniques, Active Noise Cancellation (ANC) is the most popular one and is still present in the market nowadays.
The ANC cancels out the noises by combining the sound wave present at the moment
with its inverse [5], as seen in Figure 1.1. Instead of removing one particular noise, this
system removes all ambient sound. Such a system is not suitable for hearing aid
applications since the ANC will attenuate all sound, including the information itself. A
better approach for such an application is first to classify the noise and then apply a
removal method suited to that particular noise type.
Hearing aids come with different models and technologies such as In-the-Ear,
In-the-Canal, Completely-in-the-Canal, Behind-the-Ear, and Receiver-in-Canal.
Behind-the-Ear (BTE) hearing aids are the most popular type since they are less susceptible to
feedback problems due to the greater separation between the microphones and receivers.
However, BTE devices that come with earmolds may need periodic maintenance to preserve
their performance [6].
From a technical point of view, a hearing aid comprises three essential parts: a
microphone (some types have dual microphones), an amplifier, and a speaker [7], all of
which are present in smartphones nowadays. According to the Global Mobile Market Report,
the number of smartphone users worldwide in 2017 was 2.7 billion and reached 3.2 billion
in 2019, as shown in Fig. 2. Based on this trend, we may conclude that smartphones are
getting more affordable in the coming years. This fact brings an opportunity to bring
hearing aids closer to the people by deploying such applications on their mobile devices.
1.2 Noise Classification
In recent years, researchers have tried to implement deep learning in noise reduction, for
instance, the Deep Denoising Autoencoder (DDAE). Lai et al. [8] presented DDAE-NR in
their research, which separated the system into a Noise Classifier and a Noise Removal
stage. Their results were satisfying both in classification and denoising. However, this
system is not suitable for real-time applications.

The idea of utilizing machine learning for noise classification has led to several kinds
of applications, such as the cochlear implant, where computer engineering and the
biomedical field contribute to solving health issues to
improve human daily life. Researchers tried different approaches to improve the
performance and accuracy of noise classification for the cochlear implant. In 2014, Saki
et al. [9] implemented the Random Forest Tree Classifier in a cochlear implant and
achieved a 10% improvement in accuracy compared to the previous version, which used
the Gaussian Mixture Model (GMM). However, they only tested this method on three kinds
of noises, which still needed further improvement. Five years later, Alavi et al. [10]
improved the classifier by utilizing GMM as the classifier model and MFCC as the feature.
The classification result was very satisfying, with the classification accuracy reaching 100%.
Singh et al. [11] also achieved high accuracy by utilizing a 127×64 log mel-spectrogram
as the feature vector. Hassan et al. [12] also brought a remarkable result from their
research. With 5000 epochs, their model achieved 94.6% accuracy. Reducing the number
of target classes supported the achievement of such results, as done in their research,
where they classified only five out of ten classes from the dataset.
All the studies mentioned above implemented the convolutional neural network
(CNN) for classifying the noises. Sang et al. [13] introduced another method called the
convolutional recurrent neural network (CRNN). They trained it on an urban sound
dataset and achieved 79.06% accuracy for their CRNN8 architecture, which is composed of
eight CNN layers, one RNN layer, and one fully connected layer.
1.3 Organization of this Thesis
This thesis presents a real-time noise classifier for smartphones as a
stepping stone to overcome the issue mentioned in [4]. We divide the entire process into
model training and application development. The challenge present in this
development is to achieve high classification accuracy while keeping the latency low
at the same time. Compared to desktops, mobile applications are resource-limited, and
heavy computation will increase the latency of the overall process. TensorFlow Lite
provides the solution to deploy a machine learning model on a mobile device for lighter
computation and minimal resource consumption. Since TensorFlow Lite was made as a
compact version of the TensorFlow model, some of the operators in TensorFlow are not
present in TensorFlow Lite. The most apparent incompatibility is in the Recurrent Neural
Network model. Due to such an issue, we need to use the library's experimental version
of the LSTM cells.
By designing and developing the proposed mobile application, this thesis offers two
contributions:
• High-accuracy noise classification by utilizing only one audio feature, mel-frequency cepstral coefficients (MFCC).
• A low computation complexity, low latency, and low battery consumption mobile application.
The rest of this thesis is organized as follows. Chapter 2 reviews the
existing studies regarding noise classification. Then, Chapter 3 describes the proposed
noise classifier and the determination of the most suitable number of MFCCs to achieve the
expected results. Next, Chapter 4 presents the performance of the developed application
and compares the accuracy for each number of MFCCs selected for the test. Finally,
Chapter 5 provides the conclusions as well as the future direction of this thesis.
Chapter 2
Related Works
The architecture adopted in this thesis incorporates both convolutional and recurrent
neural networks, which means that it preserves both sparsity and sequentiality. In this
chapter, we present the most common techniques behind this combination. First, we
need to understand the characteristics of the MFCC, which we present in Section 2.1. The
procedure for extracting MFCCs is also an essential process for this application,
and Section 2.2 discloses this matter. Section 2.3 elaborates on the method regarding the
sparsity of the extracted features, followed by the sequentiality in Section 2.4. Also,
Section 2.5 presents the motivation behind this work.
2.1 MFCC Characteristics
There are several types of "worth-extracting" information contained in raw audio data,
such as mel frequency power spectra [14], log-mel spectrograms [15, 11], Mel-Frequency
Cepstral Coefficients (MFCC) [8, 12, 16, 17], and many more. MFCC is the most popular
and widely used for noise classification and vocal representation since MFCC accurately
represents the shape of the vocal tract that manifests itself in the envelope of the
short-time power spectrum.
We can observe MFCC as a representation of how our brain perceives sounds that
propagate to our ears by analyzing the periodogram. This periodogram contains the power
spectrum of an audio frame, where the estimation performs similarly to the human cochlea
in identifying which frequencies are present in the frame. The human cochlea cannot
discern the difference between two closely spaced frequencies; hence, we implement a Mel
filterbank to imitate this characteristic. We are only concerned with roughly how much
energy occurs in each frequency region, and the Mel scale tells us exactly how to space
the filterbanks.

The Mel scale relates the perceived frequency (pitch) of a pure tone to its actual measured
frequency. Human ears discern small changes in pitch much better at low frequencies
than at high ones, and the Mel scale makes the features match this characteristic more
closely. With $f$ as the frequency in Hertz and $m$ as the frequency in mels, the formula
for converting from Hertz to mel presents as:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{2.1}$$

and the inverse conversion from mel back to Hertz presents as:

$$f = 700\left(10^{m/2595} - 1\right) \tag{2.2}$$
Another characteristic that we need to consider is the fact that human ears perceive
loudness on a logarithmic scale. This condition means that significant variations in
energy may not sound significantly different if the sound begins at a high sound-pressure
level.
MFCCs were originally introduced in automatic speech recognition
systems, such as systems that can automatically recognize numbers spoken into a
telephone. MFCCs are also increasingly finding uses in music information retrieval
applications such as genre classification, audio similarity measures, and many more. The
mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of
a sound; therefore, MFCCs are the coefficients that collectively make up an MFC [20].

Figure 2.1. Sampled audio file
Deciding how many MFCCs to extract from a single audio frame is challenging. A
system can achieve higher classification accuracy by increasing the number of MFCCs to
extract. On the other hand, this will increase the dimensionality, which requires more
computation and causes latency. Jacoby's research [21] shows how the number of MFCCs
(25 to 40) affects the accuracy of a model trained using the UrbanSound8K dataset.

2.2 MFCC Extraction

In summary, the extraction of MFCCs from an audio signal comprises the following steps:
• Sample the audio signal into short, overlapping frames.
• Take the Discrete Fourier Transform of each frame and compute the periodogram estimate of the power spectrum.
• Map the powers of the spectrum obtained above onto the mel scale, using triangular-overlapping windows.
• Take the log of the energy at each of the mel frequencies.
• Take the discrete cosine transform of the list of the mel log powers, as if it were a signal.
Like any other audio feature extraction, MFCC extraction starts with sampling, and
we can visualize the sampled audio file as shown in Figure 2.1. Slicing an audio signal
into frames can be tricky since an audio signal is continually changing. If the frame
duration is too short, we will not have enough samples to get a reliable spectral estimate.
If it is too long, we would receive too much variation within the frame, which might cause
us to lose track of how the signal changes over time.
The next step is to calculate the periodogram estimate of the power spectrum by
taking the Discrete Fourier Transform (DFT) of each frame. A periodogram sample of one
frame is presented in Figure 2.2. If we denote the DFT of the sampled frame as $S_i(k)$,
the periodogram estimate presents as:

$$P_i(k) = \frac{1}{N}\left|S_i(k)\right|^2 \tag{2.3}$$

where $P_i$ is the power spectrum of frame $i$, $S_i$ is the DFT of frame $i$, and $N$ is
the number of samples.
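As a small illustration (our own sketch, not code taken from TarsosDSP), Equation 2.3 maps directly to a few lines of Java, given the real and imaginary FFT outputs of one frame:

    /**
     * Periodogram estimate of Equation 2.3: P_i(k) = |S_i(k)|^2 / N.
     * re and im hold the real and imaginary parts of the frame's DFT;
     * n is the number of samples in the frame.
     */
    static double[] periodogram(double[] re, double[] im, int n) {
        double[] p = new double[re.length];
        for (int k = 0; k < re.length; k++) {
            p[k] = (re[k] * re[k] + im[k] * im[k]) / n;  // squared magnitude, scaled by N
        }
        return p;
    }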
After we receive the result, we can compute the Mel-spaced filterbank by applying
triangular filters to the periodogram estimate, where we can visualize the result as
shown in Figure 2.3. To calculate the filterbank energies, we multiply each filterbank with
the power spectrum, then add up the coefficients. This calculation will leave us with n
filterbank energies, one for each of the n filters. To construct the filterbank, we first
choose a lower and an upper frequency, where the lower frequency can go as low as zero
while the upper frequency should not be higher than half the sampling rate. Using
Equation 2.1, we convert the upper and lower frequencies to Mels, then we divide the range
between these two numbers into evenly spaced points. After that, we
convert them back to the frequency scale (Hz) and round those frequencies to the nearest
FFT bin. With $m$ denoting the filter index, $k$ the FFT bin, and $f(m)$ the bin
corresponding to the $m$-th mel point, the triangular, mutually overlapping filterbanks
present as:

$$H_m(k) = \begin{cases} 0, & k < f(m-1) \\[4pt] \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[4pt] \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\[4pt] 0, & k > f(m+1) \end{cases} \tag{2.4}$$
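To make the construction concrete, the sketch below builds the filterbank from Equations 2.1, 2.2, and 2.4. This is our own minimal illustration, not the TarsosDSP implementation, and it assumes that consecutive mel points fall on distinct FFT bins:

    public final class MelFilterbank {

        // Equation 2.1: Hertz to mel.
        static double hzToMel(double hz) {
            return 2595.0 * Math.log10(1.0 + hz / 700.0);
        }

        // Equation 2.2: mel back to Hertz.
        static double melToHz(double mel) {
            return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0);
        }

        /** Builds numFilters triangular filters (Equation 2.4) over fftSize/2 + 1 bins. */
        static double[][] build(int numFilters, int fftSize, double sampleRate,
                                double lowerHz, double upperHz) {
            // numFilters + 2 evenly spaced points on the mel scale, mapped back to FFT bins.
            double lowMel = hzToMel(lowerHz);
            double highMel = hzToMel(upperHz);
            int[] f = new int[numFilters + 2];
            for (int i = 0; i < f.length; i++) {
                double mel = lowMel + (highMel - lowMel) * i / (numFilters + 1);
                f[i] = (int) Math.floor((fftSize + 1) * melToHz(mel) / sampleRate);
            }
            double[][] h = new double[numFilters][fftSize / 2 + 1];
            for (int m = 1; m <= numFilters; m++) {
                for (int k = f[m - 1]; k <= f[m]; k++) {     // rising slope of Equation 2.4
                    h[m - 1][k] = (k - f[m - 1]) / (double) (f[m] - f[m - 1]);
                }
                for (int k = f[m]; k <= f[m + 1]; k++) {     // falling slope of Equation 2.4
                    h[m - 1][k] = (f[m + 1] - k) / (double) (f[m + 1] - f[m]);
                }
            }
            return h;
        }
    }

Multiplying each row of this matrix with the periodogram of a frame and summing the products yields the n filterbank energies described above.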
The final step of this extraction process is to take the log of each of the energies
calculated in the filterbank, which leaves us with the log filterbank energies. To get the
MFCCs, we then take the discrete cosine transform of these log filterbank energies and
keep the desired number of resulting coefficients.

There are several variations in extracting MFCC features, such as differences in
the shape or spacing of the windows used to map the scale [24], or the addition of dynamics
(difference) coefficients. The number of filters, the shape of the filters, how the filters are
spaced, and how the power spectrum is warped may all affect the performance of the MFCC itself.
2.3 Convolutional Neural Network (CNN)
Over the last few years, CNN has made a breakthrough in image, text, and audio
classification. CNN has been around since the early 1990s. In the years from the late
1990s to the early 2010s, CNN was in incubation. As more and more data and computing
power became available, researchers became more interested in exploring the tasks that
CNNs can solve.
Hassan et al. [12] presented in their research how they implemented a 64-filter
convolution layer followed by max-pooling and two fully-connected layers, with MFCCs
as their feature vector. With 5000 epochs, their model achieved 94.6% accuracy. However,
using two fully-connected layers in series is not suitable for real-time smartphone
applications due to its heavy computation and large model size, even after conversion into
a TensorFlow Lite file (around 200 MB). On the other hand, Singh et al. [11] implemented
four convolutional layers and a max-pool layer followed by a fully-connected layer. This
architecture is comparatively lighter.
A Convolutional Neural Network (CNN) is a deep learning algorithm that can take
in an input subject, assign importance (learnable weights and biases) to various
aspects/objects in the subject, and differentiate one from the other. The pre-processing
required in a CNN is much lower compared to other classification algorithms.
While the filters in primitive methods are hand-engineered, with enough training, CNN
can learn these filters on its own.
Convolutional layers convolve the input and pass the result to the next layer. The
convolution operation reduces the number of free parameters, allowing the network to be
deeper with fewer parameters. There are two kinds of convolution layers, temporal
and spatial convolution, and we utilize the temporal one in this development. This layer
creates a convolution kernel that is convolved with the layer input over a single temporal
dimension to produce a tensor of outputs. When programming this temporal CNN, the input
takes the form of a three-dimensional tensor (batch size, time steps, features per step).

CNN can successfully capture the spatial and temporal dependencies in an input
subject by applying the relevant filters. The architecture fits the dataset better due to the
reduction in parameters and the reusability of weights. In other words, we can
train the network to understand the sophistication of the input subject better.
2.4 Long Short-Term Memory (LSTM)
LSTM is a kind of recurrent neural network (RNN) that uses a gating mechanism for better
memory management. This architecture is suitable for dealing with sequential information
such as audio since every sample of this kind of information is related to its predecessor.
Numerous studies [13, 17, 26] proved that LSTM increases performance in dealing with
such data due to its gating design.
The key idea of RNN is that the recurrent connections between the hidden layers
allow the memory of previous inputs to be retained in the internal state, which can affect
the outputs. However, RNN mainly has two issues to solve in the training phase: the
vanishing gradient and exploding gradient problems. When the derivatives of the activation
function shrink at every step, the gradient decays exponentially through time. This
condition makes it hard for the model to learn the correlation between temporally distant
inputs. Meanwhile, when the gradient grows exponentially during training, the exploding
gradient problem makes the training unstable.
By design, RNN takes two inputs at each time step: an input vector and a hidden
state. The next RNN step takes the second input vector and the first hidden state to create the
output of that step. Therefore, in order to capture semantic meanings in long sequences,
we need to run RNN over many time steps, turning the unrolled RNN into a very deep
network.
LSTM layers comprise recurrently connected memory blocks, where one memory
cell contains three multiplicative gates. The gates perform continuous analogs of write,
read, and reset operations, which enable the network to utilize the temporal information
over a period of time. Each LSTM cell governs what to remember, what to forget, and how
to update the memory using these gates. By doing so, the LSTM network solves the
gradient problems over the input sequence data. LSTM units are typically arranged in
layers, so that the output of each unit is the input to the units of the next layer. In this way,
the network becomes richer in its representation of temporal context.
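For reference, a standard formulation of these three gates (forget $f$, input $i$, output $o$) and the cell update, following the common LSTM literature rather than any equation printed in this thesis, is:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad h_t = o_t \odot \tanh(c_t)$$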
Figure 2.4 Classification accuracy of UrbanSound8K dataset by varying MFCCs
2.5 Motivations
Based on the background mentioned in Chapter 1, hearing loss is still a big
issue in our daily life; ironically, most people refuse to wear the hearing aids on the
market due to various reasons, including the presence of background noises. Although
much research in noise classification shows the possibility of achieving high
accuracy, we need to reduce the complexity and latency to implement such a system on a
mobile device.
Most of the noise classification methods utilize multiple features from an audio
signal to increase accuracy. Such a method indeed increases the classification accuracy,
as proved by Dang et al. [27] in their research. However, it also increases the computation
complexity, which is not suitable for mobile applications. For instance, Alamdari et al.
included several different audio features in their feature sets. The main drawback of their
unsupervised noise classifier was the latency, precisely because it utilized multiple audio
features, and this latency drastically increased when the classifier ran on limited hardware.

The accuracies we can achieve by altering the number of MFCCs we use vary, as shown
in Fig. 2.4. Based on this fact, we understand that using MFCC alone is promising for
developing a reliable system. Therefore, this thesis aims to classify daily noises with low
computation complexity, latency, and power consumption while maintaining the accuracy
at the same time by using only MFCC as the input feature.
Chapter 3
Proposed Method
The proposed method comprises two significant blocks. The first block implements a
DSP library called TarsosDSP, which handles feature extraction from the incoming audio
signal, as presented in Figure 3.1. From this block, we receive a feature vector of size
m frames × n MFCCs. This feature vector is then passed to the second block, which
is our classifier model in the form of a TensorFlow Lite model. We elaborate on these
two blocks in the following sections.

Figure 3.1 Block diagram of feature extraction flow

3.1 Feature Extraction
The feature extraction process in this application relies purely on the TarsosDSP [28]
library. TarsosDSP is a Java library for audio processing. It aims to provide an easy-to-use
interface to practical audio processing algorithms, implemented in pure Java and without
any other external dependencies. This library tries to hit the sweet spot between being
capable enough to get an actual task done while remaining compact and simple. Among
other features, it provides pitch detection algorithms, resampling, filters, pure synthesis,
some audio effects, and a pitch-shifting algorithm.

The extraction starts with capturing and sampling the raw audio waveform into audio
frames. TarsosDSP provides two functions to handle such processes, one for each source
(an audio file or the microphone). The frame size and overlap determine the number of
frames it captures per second.
The extraction process in TarsosDSP needs several predefined parameters for its
AudioDispatcher, such as the sample rate, frame size, frame overlap, cepstrum coefficients,
mel filters, and frequency thresholds. Table 3.1 shows the values we assigned for those
parameters in this thesis, and based on these parameters, we determined the most suitable
number of MFCCs for our input features.

TABLE 3.1
VALUES FOR AUDIODISPATCHER PARAMETERS

Parameters          From File    From Microphone
Sample rate         8 kHz        8 kHz
Lower frequency     0 Hz         0 Hz
Higher frequency    4 kHz        4 kHz

During the feature extraction process, we also applied a silence-detection function by
calculating the energy in decibel sound pressure level (dBSPL) of a buffer $b$ of length
$n$, which presents as:

$$SPL_{dB} = 20 \log_{10}\!\left(\sqrt{\frac{1}{n}\sum_{i=0}^{n-1} b(i)^2}\right)$$

The commonly used threshold is -70 dB [28], and it works well on loud noises such as
explosions, gunshots, car horns, or other impulses. Since our noise classes also include
continuous murmur and chatter, we needed to lower the threshold to -95 dB.
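For illustration, a minimal sketch of this extraction pipeline is shown below. It is our own sketch, not the exact thesis code: the frame size and overlap are placeholder values chosen to yield roughly 102 frames per second (see Chapter 4), the mel-filter count is an assumption, and the class names follow the TarsosDSP Android distribution as we know it.

    import be.tarsos.dsp.AudioDispatcher;
    import be.tarsos.dsp.AudioEvent;
    import be.tarsos.dsp.AudioProcessor;
    import be.tarsos.dsp.io.android.AudioDispatcherFactory;
    import be.tarsos.dsp.mfcc.MFCC;

    public class FeatureExtractor {
        static final int SAMPLE_RATE = 8000;       // Table 3.1
        static final int FRAME_SIZE = 256;         // placeholder frame size (samples)
        static final int OVERLAP = 178;            // hop of 78 samples -> ~102 frames/s
        static final int NUM_MFCC = 30;            // chosen in Chapter 4
        static final int NUM_MEL_FILTERS = 40;     // assumption
        static final float LOW_FREQ = 0f;          // Table 3.1
        static final float HIGH_FREQ = 4000f;      // Table 3.1

        public static AudioDispatcher fromMicrophone() {
            AudioDispatcher dispatcher =
                    AudioDispatcherFactory.fromDefaultMicrophone(SAMPLE_RATE, FRAME_SIZE, OVERLAP);
            final MFCC mfcc = new MFCC(FRAME_SIZE, SAMPLE_RATE, NUM_MFCC,
                    NUM_MEL_FILTERS, LOW_FREQ, HIGH_FREQ);
            dispatcher.addAudioProcessor(mfcc);    // computes the MFCCs of every frame
            dispatcher.addAudioProcessor(new AudioProcessor() {
                @Override
                public boolean process(AudioEvent audioEvent) {
                    float[] coefficients = mfcc.getMFCC();  // 30 MFCCs of the current frame
                    // append to the m-frames x 30-MFCCs feature vector for the classifier
                    return true;
                }
                @Override
                public void processingFinished() { }
            });
            return dispatcher;  // start with: new Thread(dispatcher).start();
        }
    }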
Figure 3.2. MFCC heatmap for the sound of (a) car horn, (b) crowd, (c) dog bark, (d)
siren, (e) traffic, and (f) mixed noises
TarsosDSP can work very well without any dependencies on the Android libraries.
However, for the loading and sampling process, we need to prepare FFMPEG in our asset
directory. FFMPEG is the leading multimedia framework that can decode, encode,
transcode, mux, demux, stream, filter, and play pretty much anything that humans and
machines have created. It supports the most obscure ancient formats up to the cutting
edge, no matter whether they were designed by some standards committee, the community,
or a corporation. FFMPEG in our application handles the decoding and streaming before
the feature extraction takes place.

To pass the sampled audio between threads, our application relies on Java's
BlockingQueue, a queue that
additionally supports operations that wait for the queue to become non-empty when
retrieving an element, and wait for space to become available in the queue when storing
an element. All queuing methods achieve their effects atomically using internal locks or
other forms of concurrency control.

Java BlockingQueue does not accept null values and throws a NullPointerException if
we try to store a null value in the queue. Its interface is part of the Java Collections
Framework, and its implementations mainly differ in how they store the objects placed
in the queue: as their names suggest, ArrayBlockingQueue is backed by an array, while
LinkedBlockingQueue is backed by linked nodes. There are also other differences between
these two types of queues: linked queues typically have higher throughput than array-based
queues but less predictable performance in most concurrent applications. The
SynchronousQueue is a special case: during an offer, if no other thread is currently
performing a take or poll, the offer will fail, and during a take, if no other thread is
performing an offer concurrently, it will also fail. This unique handshake mechanism is well
suited for a queue with high response requirements and threads from a non-fixed thread
pool.
BlockingQueue provides several methods to insert and retrieve elements:
• put is a method that we use to insert elements into the queue. If the queue is full, it waits for space to become available.
• add is a method that will return true if the insertion was successful, and throws an IllegalStateException if it failed.
• offer is a method that will try to insert an element into the queue, waiting for an optionally specified time for space to become available and returning false on failure.
• take is a method that retrieves and removes the element from the head of the queue. If the queue is empty, it will wait for an element to be available.
• poll is a method that retrieves and removes the element from the head of the queue and waits for a specified time if necessary for an element to become available.

There are two types of BlockingQueue. The first type is the unbounded queue, which we
create without specifying the queue capacity. In an unbounded queue, the capacity is set
to Integer.MAX_VALUE, so all operations that add an element to an unbounded queue will
never block; thus, it could grow to a massive size. The crucial part is the
requirement for the consumer to be able to retrieve the messages as quickly as the
producers fill the queue. Otherwise, the memory could fill up, and the program will throw
an OutOfMemory exception.
The second type of BlockingQueue is the bounded queue. We can create such a
queue by passing the capacity as an integer argument to the constructor. Passing this
value means that when a producer tries to add an element to a queue that is already full,
depending on the method that we use to add, the queue will block the
incoming data until space becomes available. Using a bounded queue is an excellent way
to design concurrent programs.
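As an illustration of this producer-consumer pattern (our own minimal sketch, not the thesis code), a bounded ArrayBlockingQueue can ferry audio buffers from a capture thread to the feature-extraction thread:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class AudioPipeline {
        // Bounded queue: at most 32 audio buffers in flight between the threads.
        private final BlockingQueue<float[]> queue = new ArrayBlockingQueue<>(32);

        public void start() {
            Thread producer = new Thread(() -> {
                try {
                    while (true) {
                        float[] frame = captureFrame();   // e.g., samples from the microphone
                        queue.put(frame);                 // blocks when the queue is full
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        float[] frame = queue.take();     // blocks when the queue is empty
                        extractMfcc(frame);               // feature extraction happens here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producer.start();
            consumer.start();
        }

        private float[] captureFrame() { /* placeholder for the audio source */ return new float[256]; }
        private void extractMfcc(float[] frame) { /* placeholder for TarsosDSP processing */ }
    }

The bounded capacity gives the back-pressure described above: if extraction falls behind, put blocks the producer instead of letting memory grow without limit.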
3.2 Noise Classification
Before we start training and designing our machine learning model, we need to know
and prepare the dataset we are going to use. The dataset needs to contain the
audio classes we need, which are urban noises. After gathering the dataset, we start
designing and training the classifier model.
3.2.1. Datasets
While there are many datasets available, in this research, we used only two of the most
popular urban noise datasets: UrbanSound8K and FSDKaggle2019 [29], which we present
in the following paragraphs.
UrbanSound8K contains 8732 labeled sound excerpts of urban sounds from 10 classes:
air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot,
jackhammer, siren, and street music. The classes are drawn from the urban sound taxonomy.
In the UrbanSound8K dataset, the files are pre-sorted into ten folds (folders named fold1 to
fold10) to help with the reproduction of and comparison with automatic classification results.
FSDKaggle2019 contains 29,266 audio files annotated with 80 labels of the AudioSet
Ontology. Fonseca et al. used this dataset for Task 2 of the Detection and Classification
of Acoustic Scenes and Events (DCASE) Challenge 2019. All the audio clips present as
uncompressed PCM with a bit depth of 16 bits, a 44.1 kHz sampling rate, and a mono
channel for each audio file.
The audio content of FSDKaggle2019 comes from two sources:
1. Audio clips uploaded by users of Freesound
2. The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)
Regarding the labeling, these audio data use a vocabulary of 80 labels from
Google's AudioSet Ontology. These labels cover diverse topics: Guitar and other Musical
Instruments, Human locomotion, Hands, Human group actions, Insect, Domestic animals,
Glass, Liquid, and many more. The audio clips have variable lengths (roughly from 0.3 to
30 seconds). The labeling in this dataset is also varied. Specifically, FSDKaggle2019
features three types of label quality, one for each set in the dataset: correct but potentially
incomplete labels in the curated train set, noisy labels in the noisy train set, and correct
and exhaustive labels in the test set.
FSDKaggle2019 comprises two train sets and one test set. The idea is to limit the
supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus
encouraging approaches that can leverage the larger amount of noisy data.

In the curated train set, the duration of the audio clips ranges from 0.3 to 30 seconds
due to the diversity of the sound categories and the preferences of Freesound users when
recording/uploading sounds. Labels are correct but potentially incomplete. A few of these
audio clips may present additional acoustic material beyond the provided ground truth
labels.
On the other hand, the duration of the audio clips in the noisy train set ranges from
one second to 15 seconds, with the vast majority lasting 15 seconds. Table 3.2 presents the
details of each set of this dataset.

TABLE 3.2
THREE SETS OF THE FSDKAGGLE2019 DATASET

Sets                Curated train set   Noisy train set   Test set
Clips/class         75                  300               50-150
Total clips         4970                19,815            4481
Avg. labels/clip    1.2                 1.2               1.4
Duration (hours)    10.5                80                12.9

One of the main challenges in developing deep learning solutions is the availability of a
suitable dataset. While deep learning methodologies have proved themselves in many
different contexts, it is easy to get carried away and ignore some practical realities of a
deep neural network-based solution. A network is only as good as the data used to train
it, and training an effective neural network requires a large annotated dataset.

Today's neural networks require large datasets in order to converge to an accurate
model; the more sophisticated the network, the higher the need for data. Nevertheless,
acquiring good quality data is a complicated and costly process, and accurately annotating
the data to match a particular problem adds further cost and complexity to the process [30].
To improve the classification accuracy, we merged the least significant classes into
a new class that was distinguishable among the rest, leaving us with five classes in
total. Those five classes are the sounds of car horn, crowd, dog bark, siren, and traffic.
We not only augmented the UrbanSound8K with 448 clips from FSDKaggle2019 but
also filtered out the audio files shorter than three seconds in duration to attain a
uniform dataset.
3.2.2. Classifier Model
In this thesis, our classifier model adopts the CNN-LSTM model by Sang et al. [13] with
a few modifications. Instead of using raw audio waveforms, our system uses MFCCs as
the input, and we only classify the noises into five classes. Based on their results, we
decided to build our model with two CNN layers (64 and 128 filters), one stacked RNN
cell comprising two TFLiteLSTMCells, followed by one fully-connected layer. The
TFLiteLSTMCell comes from the experimental API of TensorFlow Lite. It provides hints,
and it also makes the variables suitable for the TFLite converter. This design allows the
model to reach acceptable accuracy while maintaining high performance at the same time,
since the architecture keeps the computation light.

As presented in Figure 3.2, the process starts with feeding the feature vector into our
first pair of Conv1D (64 filters) and MaxPooling1D layers. The second pair comprises the
same elements with a different filter size for the second Conv1D (128 filters). For
optimization, we initialize the weights using the Xavier/He initialization [31].
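On the application side, the converted model runs through the TensorFlow Lite Interpreter. The sketch below is a minimal illustration of that step; the model file name, the input layout (1 batch × m frames × 30 MFCCs), and the helper names are our assumptions, not the exact thesis code.

    import android.content.Context;
    import android.content.res.AssetFileDescriptor;
    import org.tensorflow.lite.Interpreter;
    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class NoiseClassifier {
        private static final int NUM_CLASSES = 5;  // car horn, crowd, dog bark, siren, traffic
        private final Interpreter interpreter;

        public NoiseClassifier(Context context) throws Exception {
            // Memory-map the converted model from the assets folder.
            AssetFileDescriptor fd = context.getAssets().openFd("noise_classifier.tflite");
            FileChannel channel = new FileInputStream(fd.getFileDescriptor()).getChannel();
            MappedByteBuffer model = channel.map(FileChannel.MapMode.READ_ONLY,
                    fd.getStartOffset(), fd.getDeclaredLength());
            interpreter = new Interpreter(model);
        }

        /** features: [1][frames][30] MFCC feature vector; returns the per-class scores. */
        public float[] classify(float[][][] features) {
            float[][] output = new float[1][NUM_CLASSES];
            interpreter.run(features, output);
            return output[0];
        }
    }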
When we initialize our weights randomly, the values are probably close to zero,
given the probability distributions with which we initialized them. If the weights are close
to zero, the gradients in the upstream layers vanish due to the multiplication of small values.
On the other hand, if the weights are larger than one, the multiplied values grow too large,
which is commonly known as an explosion. In the Xavier/He initialization, the weights are
initialized keeping in mind the size of the previous layer, which helps in attaining a global
minimum of the cost function faster and more efficiently. The weights are still random
but differ in range depending on the size of the previous layer of neurons. This mechanism
provides a controlled initialization; hence, the gradient descent becomes faster and more
efficient.
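For reference, in the usual formulation (following the common deep learning literature and [31], not an equation printed in this thesis), the variance of the initial weights is scaled by the layer sizes:

$$\mathrm{Var}(W) = \frac{2}{n_{\mathrm{in}} + n_{\mathrm{out}}} \ \text{(Xavier)}, \qquad \mathrm{Var}(W) = \frac{2}{n_{\mathrm{in}}} \ \text{(He)}$$

where $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ are the numbers of input and output units of the layer.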
We trained our model using Adam [32] as the optimizer. We may observe Adam as a
combination of two earlier methods: Adam uses the squared gradients to scale the learning
rate like RMSprop, and it takes advantage of momentum by using the moving average of
the gradient instead of the gradient itself, like SGD with momentum. SGD has proved itself
as an efficient and effective optimization method that was central in many machine
learning success stories.
Adam uses estimations of the first and second moments of the gradient to adapt the
learning rate for each weight of the network. Regarding the loss function, categorical
cross-entropy is suitable for multi-class classification, while binary cross-entropy is
suitable for multi-label binary classification tasks. Formally, the latter loss is equal to the
average of the per-class binary cross-entropy. Using categorical cross-entropy would
prohibit us from seeing the dominant noise in the environment together with the other
noises that come along during the sampling, which is why our model relies on the binary
formulation.
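As a reference, the update rule from the original Adam paper [32], with gradient $g_t$, decay rates $\beta_1, \beta_2$, and step size $\alpha$, is:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$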
Chapter 4
Experimental Results
Since this thesis is about developing a mobile application, we also need to consider the
resource consumption of the device. While these considerations are all related, latency is
also essential since this thesis also covers audio signal processing. In this thesis, we
execute two kinds of experiments. The first one is to find a suitable number of MFCCs to
use in our system in order to determine the fixed cepstral length of the feature vector.
Along with this first experiment, we also tested the same number of MFCCs, epochs,
and dataset on two other noise classifier models. We meant to run this test as a benchmark
against our proposed model. The second experiment is to observe the resource and power
consumption while running the application and to compare the results when the application
runs on different models of smartphones. We also ran the same benchmarking process here
to see the amount of time needed for the application to run. The overall process comprises the
feature extraction process and the classification process. The feature extraction process
should be static since we use the same method for extracting the MFCCs in the proposed
architecture and the other two. In this chapter, we provide the experimental results along
with their analysis.
4.1 MFCCs Determination

Figure 4.1 Classification accuracy comparison between the proposed system and two other architectures
Increasing the feature vector size (dimensionality) will increase the chances for higher
accuracy; however, it will also demand heavier computation. As mentioned in the MFCC
extraction process, the size of the feature we extract depends on the values we put in the
sampling frame and overlapping frame size. These amounts determine the number of audio
frames our application needs to sample, and each sampling for each frame requires
computation time and memory allocation for storing the extracted values temporarily.
Decreasing the feature vector size will increase the performance of the system, but it might
also compromise the accuracy of the model itself. Therefore, before we started the
development of the entire system, we first determined the most suitable number of MFCCs.
We ran four experiments to determine the most suitable number of coefficients, from
30 to 45. We decided not to go over 45 since, as mentioned in Jacoby's research, the
accuracy goes down beyond that amount of MFCCs. Figure 4.1 presents the results of the
experiments, and we found that 30 is the most suitable one since it leads to the highest
accuracy and the shortest computation time, as seen in Figure 4.2. Based on this result,
we use 30 MFCCs throughout the rest of the development.
Finding the number of frames we need to use for the feature vector is also tricky due
to the workflow of the TarsosDSP library. As presented in Table 3.1, the acceptable data
types for the frame size and frame overlap size are integers, which makes it harder for us to
perform the tuning. Since we train our model with the longest duration being a six-second-long
audio file, we tried to get 600 frames per six seconds, which would make 100
frames per second. However, we cannot achieve exactly that amount due to the resolution,
and we decided to use the closest one, which is 102 frames per second. This amount is
suitable for our purpose.
The trend of the classification accuracy does not scale linearly with the number of
MFCCs. As we can see in Figure 4.1, the accuracy somehow goes down when we utilized
40 MFCCs. According to the results from Jacoby's research, he found a relatively even
distribution for some sounds and a drastic change in some other cases. The sound of the
car horn is a notable example, where the classification accuracy drastically increases as
the band count increases from 10 to 50. This phenomenon makes sense for a very tonal
sound, such as the sound of a car horn, as the resolution of audible frequencies increases
with the number of available Mel bands. Another kind of sound that is also tonal is the
sound of the siren. The classification accuracy for this sound starts relatively high even
at small numbers of Mel bands.
From observing the trend and behavior of the MFCC feature, we understand that the
classification accuracy obtained from MFCCs fluctuates based on the tonal characteristics
of the soundwave. Therefore, we decided to modify our dataset by changing the labels of
some audio files. Based on the classification trends, we understand that some labels have
characteristics similar to the others. For instance, the sounds of a car engine and a
motorcycle have a minimal difference. Regrouping these similar audio files increases
the classification accuracy, and it also helps us in determining the number of MFCCs we
need to use.
Figure 4.2 Elapsed time in classifying a six-second-long audio file using three different architectures

4.2 The Runtime of the Developed Application
In the past few years, smartphones have become a high-demand product in the global
market. Many developers came out with their own products and new device architectures.
These vastly diverse architectures become a challenge for us in developing a system that can
work on the majority of the products. To validate the performance of our developed
application, we measured the average runtime for each audio file sample (six samples) on
four different smartphones, listed in Table 4.1.

Figure 4.2 presents the runtime comparisons between our developed system and the
other classifiers. Based on these results, the system consumed an average runtime of 666
ms for extracting 30 MFCCs and classifying a six-second-long audio file. Compared to
the other architectures, our proposed system consumed less processing time.
TABLE 4.1
SPECIFICATIONS OF THE TESTED SMARTPHONES

TABLE 4.2
RESOURCES CONSUMPTION OF THE DEVELOPED SYSTEM IN A NOISY ENVIRONMENT

Device     CPU usage    Memory usage
Phone D    7%           109.7 MB
Figure 4.3 Confusion matrix for augmented test data using 30 MFCCs
Regarding our noise classifier model, we managed to achieve 92.06% accuracy; for
a brief review and analysis, we present the results in a confusion matrix, as seen in Figure
4.3. Each of these five labels has its own unique characteristics, such as a continuous
medium-range frequency for the car horn, a murmur sound for the crowd, a short and
low-range frequency for the dog bark, a flange-like audio signal for the siren, and a
continuous motor sound for vehicles. These different tonal characteristics affect the
classification accuracy since one type of audio is classified well with a certain number of
MFCCs, while another type may perform better with a different number.
As we can see in the confusion matrix, the classifications for car horn and siren are
not as accurate as the rest. This issue occurs because there are fewer audio files for those
two labels than for the other labels; besides, the audio files themselves are not clean enough
to gain higher accuracy. As we mentioned before, the audio signals present in the dataset
often contain other sounds in the background.

4.3 Resource Consumption during Classification

Figure 4.4 Screen capture of the application's power usage

We conducted the experiments to measure the average CPU and memory usage during the
classification process, as presented in Table 4.2.
One challenging objective in this development is to create an application that can run
over an extended period. Therefore, we also need to examine the battery consumption
during the runtime. After testing the application for one hour in a noisy environment, we
retrieved the result by screen-capturing the battery usage information, as shown in Figure
4.4. Based on the captured information, we found that the device's estimated power use
during the execution time is around 0.03 Watts, as converted from the 6 mAh written in
the snapshot. This information also shows that the total CPU runtime is around 5 minutes.
Bear in mind that the CPU foreground value shows the duration during which the
application was running while one of its Activities was in the foreground. It might also
include the time when a Service from the application was in the foreground, which displays
an ongoing notification. CPU total includes all of the CPU usage (services and broadcast
receivers in the background and activities in the foreground). According to the results
presented in Figure 4.4, we can see that the CPU total time is only three seconds more
than the CPU foreground time. This result means that our application does not have many
processes that run in the background. Most of the exhaustive processes are for presenting
the graph.
Keep awake measures the length of time that the application has used wake locks or
alarms to keep the device awake when it would otherwise have been asleep. In a way, this
is potentially the most significant drain on the battery. Sleeping uses much less power
than staying awake, so if an application keeps a wake lock for a long time, it is keeping
the device in a high-power mode all the time, even if the application is not doing any
significant work. In our results, we can see that our application does not require any
wake locks to keep the device awake; hence, our application conserves energy for
further use.
From our captured results, we may assume that our application consumes a low amount
of power from the device battery. However, this result is only a rough estimation of the
real battery consumption. The CPU does not do the usage calculations by itself. It may
have hardware features to make the task less cumbersome, but it is mostly the job of the
operating system, and the details of these implementations vary (especially in the case
of multicore systems). The general idea is to see how long the queue of tasks our CPU
needs to finish is. The operating system may take a look at the scheduler periodically to
estimate the load.
After all, we still need to refer back to the number of cores the device has to
interpret the CPU load, since CPU load and battery consumption are related. On a quad-core
system, a value higher than 25% would mean that the system has fully utilized one
core for the application and, as a result, would be considered high CPU usage, although
this is still fine in short bursts. All the threads used in the application should be named in
order to use the provided information in the best possible way. It is always good to check
whether any thread consumes more CPU time than expected.

Figure 4.5 Classification result from (a) the beginning (nothing), (b) an audio file, and (c) microphone

4.4 Developed Application Overview
After the development, we ran a quick test to see how the application behaves. Figure
4.5(a) shows the overview of the application the moment it starts running. When the
application starts, it first asks for permission to access the device's microphone and
to write and read the internal storage. The application needs access to the internal
storage to move our audio file samples to a folder it creates, called CSV. This folder serves
as the application's default storage for the CSV files of extracted features.
This application gives two options for the source of the audio signal: an
audio file or the microphone. The purpose of classifying from an audio file is to see how good
this system is at classifying a known noise label. In the top-right corner of the screen, we
put a button called CSV, which bypasses the classification function and saves the
extracted features as a CSV file instead. Saving the features as CSV serves the purpose
of further research, or of comparing our extracted features with MFCCs extracted by
other libraries. If we activate the CSV button, the From Microphone radio button is
disabled since saving the features from the microphone to a CSV usually caused the
application to crash due to the silence-detector function, which wipes the features
whenever the signal falls below the threshold.
As we can see in Figure 4.5(b), the system can classify the sound of a siren accurately.
It also shows that the system classifies the sound as the sound of traffic in a smaller portion.
Such a classification occurs because the sound present in the audio file also contains the
sound of an engine (probably the ambulance); however, the siren sound dominates the
environment. During the test, we noticed a little bit of latency before the results showed up
after we clicked on the SELECT FILE button. This latency is longer than the computation
time mentioned in Figure 4.2. The plotter causes this latency: it calculates and plots the
graph to show the results in an interactive view. According to the debugger report, the
processing time remains consistent with the results presented in Figure 4.2.
Figure 4.5(c) shows an overview of the classification from the device's microphone.
It behaves similarly to the classification from an audio file, with the difference that the
graph continually changes whenever there is a sound above the silence threshold.
The microphone button changes into a stop sign since we added a function so that we can
stop this process at any time. The application shows the classification result every second.
This one-second interval is neither a delay nor a latency: as mentioned in Chapter 3, we
capture 102 frames every second. We can change this capturing duration to a shorter or
longer time according to our needs; we decided to use a one-second duration as a
reasonable default.
We include our model file in the asset folder, together with the audio file samples.
This TensorFlow Lite model is compact and relatively small in size thanks to the
conversion from the full TensorFlow model.
4.5 Discussion
In this thesis, we surveyed other state-of-the-art systems for noise classification
and compared them with the developed application. The real-time noise classifier by
Alamdari et al. achieved satisfying results with its classification method. However, that
system uses multiple features, such as band-related features, which demand a significant
amount of computation. According to their results, the system required 26% CPU
consumption on an octa-core 2.35 GHz CPU, whereas our system requires less when it
runs on a similar specification, as seen in Table 4.2. Our developed classifier is designed
in such a way as to not only overcome the latency and high computation requirements but
also achieve high accuracy. Also, merging the least significant classes expanded the range
of variation in our noise classes. This merging also allows us to have a higher resolution
for the tonal characteristics, which affects the decision on the number of MFCCs we need
to use.
Although our system has achieved high classification accuracy, there are still
possibilities to increase these values by using a cleaner audio dataset. One of the methods
is to curate and re-label the noisier clips before training.

This work also allows us to try to implement this classifier in a noise-removal system
by treating the noisy sound we receive according to the type of noise present at the
moment. Since we developed the system for a smartphone, it may provide an opportunity
to reach a broad range of devices. Some compatibility issues remain, and we will try to fix
them in future development. Currently, we have tried our application on the following
devices:
• Lenovo K6 Note
• ASUS Z01KD
• ASUS X017DA
• Google Pixel 4
Chapter 5
Conclusions
In this thesis, we developed a real-time noise classification system for smartphones
that surpasses the performance of the other two architectures with 92.06% accuracy while
only using one audio feature (MFCC). This system presents a stepping stone for
overcoming the problem of current hearing aids in eliminating annoying environmental
noises, which discourages people from using such a helpful device. When the system
receives an audio signal that is above the defined threshold, the noise classifier model can
perform the classification by using only MFCCs as the input features. Also, this system
works with lower CPU consumption and less processing time (666 ms) compared with the
other architectures. Therefore, this developed system can support the further development
of smartphone-based hearing assistance applications.
References
[2] World Health Organization, "Deafness and hearing loss," World Health Organization, 2019.
[3] A. McCormack and H. Fortnum, “Why do people fitted with hearing aids not wear
them?,” International journal of audiology, vol. 52, no. 5, pp. 360-368, 2013.
[5] C.-Y. Chang, A. Siswanto, C.-Y. Ho, T.-K. Yeh, Y.-R. Chen and S. M. Kuo,
products,” IEEE Consumer Electronics Magazine, vol. 5, no. 4, pp. 34-43, 2016.
[6] AUDI-LAB, “The Advantages & Disadvantages of Hearing Aid Types,” AUDI-
[8] Y.-H. Lai, Y. Tsao, X. Lu, F. Chen, Y.-T. Su, K.-C. Chen, Y.-H. Chen, L.-C. Chen,
Speech Intelligibility for Cochlear Implant Recipients,” Ear and Hearing, vol. 39,
p. 1, 2018.
2014.
[13] J. Sang, S. Park and J. Lee, “Convolutional Recurrent Neural Networks for Urban
[14] X. Lu, Y. Tsao, S. Matsuda and C. Hori, “Speech enhancement based on deep
[18] A. Samal, D. Parida, M. R. Satapathy and M. N. Mohanty, “On the Use of MFCC
[20] M. Xu, L.-Y. Duan, J. Cai, L.-T. Chia, C. Xu and Q. Tian, “HMM-Based Audio
[22] J. Salamon, C. Jacoby and J. P. Bello, “A Dataset and Taxonomy for Urban Sound
[23] M. Sahidullah and G. Saha, “Design, analysis and experimental evaluation of block
MFCC,” Journal of Computer Science and Technology, vol. 16, no. 6, pp. 582-589,
2001.
[26] J. Dai, S. Liang, W. Xue, C. Ni and W. Liu, “Long short-term memory recurrent
neural network based segment features for music genre classification,” in 2016 10th
Tianjin, 2016.
[28] J. Six, Digital Sound Processing and Java - Documentation for the TarsosDSP
Consumer Devices and Services 3—Getting More From Your Datasets With Data
2020.
[31] S. K. Kumar, “On weight initialization in deep neural networks,” ArXiv, vol.
abs/1704.08863, 2017.
[33] L. D. Shapiro, “Boston Chapter Attuned to Hearing Health Care [Society News],”