
National Taiwan University of Science and Technology

Department of Electronic and Computer Engineering

Master's Thesis

Evaluation of Real-Time Noise Classifier
based on CNN-LSTM and MFCC for Smartphones

Winner Roedily
M10702803

Advisor: Dr. Shanq-Jang Ruan

July 21, 2020
Evaluation of Real-Time Noise Classifier
based on CNN-LSTM and MFCC for Smartphones

Student: Winner Roedily Advisor: Prof. Shanq-Jang Ruan

Submitted to Department of Electronic and Computer Engineering


College of Electrical Engineering and Computer Science
National Taiwan University of Science and Technology

ABSTRACT

Recent studies demonstrate various methods to classify the noises present in daily human activity. Most of these methods utilize multiple audio features that require heavy computation, which increases the latency. This thesis presents a real-time sound classifier for smartphones that utilizes only the Mel-Frequency Cepstral Coefficient (MFCC) as the feature vector. By relying on this single feature and an augmented audio dataset, the system drastically reduces the computational complexity and achieves 92.06% accuracy. The system utilizes the TarsosDSP library for feature extraction and a Convolutional Neural Network – Long Short-Term Memory (CNN-LSTM) network for both classification and for determining the number of MFCCs. The results show that the developed system can classify the noises with higher accuracy and shorter processing time compared with other architectures. Additionally, the system consumes only about 0.03 W of power, which makes it suitable for future commercial use.

Keywords: Noise classification, MFCC, CNN-LSTM, TarsosDSP library, Android

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
1 Introduction
  1.1 Hearing Loss and Hearing Aids
  1.2 Noise Classification
  1.3 Organization of this Thesis
2 Related Works
  2.1 MFCC Characteristics
  2.2 MFCC Extraction
  2.3 Convolutional Neural Network (CNN)
  2.4 Long Short-Term Memory (LSTM)
  2.5 Motivations
3 Proposed Method
  3.1 Feature Extraction
  3.2 Noise Classification
    3.2.1 Datasets
    3.2.2 Noise Classifier Model
4 Experimental Results
  4.1 MFCC Determination
  4.2 The Runtime of the Developed Application
  4.3 Resource Consumption during Classification
  4.4 Developed Application Overview
  4.5 Discussion
5 Conclusions
References

List of Tables

3.1 VALUES FOR AUDIODISPATCHER PARAMETERS
3.2 THREE SETS OF FSDKAGGLE2019 DATASET
4.1 SPECIFICATIONS OF THE TESTED SMARTPHONES
4.2 RESOURCE CONSUMPTION OF THE DEVELOPED SYSTEM IN A NOISY ENVIRONMENT

List of Figures

1.1 Block diagram of the feedforward ANC system
2.1 Sampled audio files
2.2 Periodogram sample of one frame
2.3 Mel-spaced filterbank
2.4 Classification accuracy of UrbanSound8K dataset by varying MFCCs
3.1 Block diagram of feature extraction flow
3.2 MFCC heatmap for the sound of (a) car horn, (b) crowd, (c) dog bark, (d) siren, (e) traffic, and (f) mixed noises
3.3 Block diagram of the proposed classifier
4.1 Classification accuracy comparison between the proposed system and two other architectures
4.2 Elapsed time in classifying a six-seconds long audio file using three different architectures
4.3 Confusion matrix for augmented test data using 30 MFCCs
4.4 Screen capture of the application's power usage
4.5 Classification result from (a) the beginning (nothing), (b) an audio file, and (c) microphone

Chapter 1

Introduction

1.1 Hearing Loss and Hearing Aids

The ear is a sophisticated human organ that works daily to receive sound from its surroundings within the audible range. The human ear typically receives sound waves from the environment at the eardrum, then passes the vibration through the middle and inner parts of the ear, causing the hair cells in the cochlea to move and generate nerve impulses to the brain. Hearing problems occur mainly because of the dysfunction of the cochlea. Due to aging, exposure to deafening sounds, drug consumption, and some infections, the hair cells in human ears are reduced in number [1].

According to the latest statistics from the World Health Organization (WHO), around 466 million people worldwide suffer from hearing loss [2]. Although this number is large, the majority (80%) of adults aged 55 to 74 refuse to use hearing aid products. This low adoption rate is due to various reasons, such as the price, misfit and discomfort, and periodic maintenance [3]. One of the most compelling reasons that caught our attention is the presence of environmental noises that are not filtered out by conventional hearing aids [4].

Figure 1.1: Block diagram of the feedforward ANC system

Different approaches attempt to remove these environmental noises. Active Noise

Cancellation (ANC) is the most popular one and is still present in the market nowadays.

The ANC cancels out the noises by combining the sound wave present at the moment

with its inverse [5], as seen in Figure 1.1. Instead of removing one particular noise, this

system removes all ambient sound. Such a system is not suitable for hearing aid

applications since the ANC will attenuate all sound, including the information itself. A

better approach for such an application is first to classify the noise and then apply a

particular treatment to it without harming the information.

Hearing aids come with different models and technologies such as In-the-Ear, In-the-Canal, Completely-in-the-Canal, Behind-the-Ear, and Receiver-in-Canal. Behind-the-Ear (BTE) hearing aids are the most popular type since they are less susceptible to

feedback problems due to the greater separation between the microphones and receivers.

However, BTE models that come with earmolds may need periodic maintenance to preserve their acoustic seal [6].

From a technical point of view, a hearing aid comprises three essential parts: a microphone (some types have two), an amplifier, and a speaker [7], all of which are present in smartphones nowadays. According to the Global Mobile Market Report, the number of smartphone users worldwide was 2.7 billion in 2017 and reached 3.2 billion in 2019. Based on this trend, we may conclude that smartphones are becoming more affordable in the coming years. This fact brings an opportunity to bring hearing aids closer to the people by delivering such applications to their mobile devices.

1.2 Noise Classification

In recent years, researchers tried to implement deep learning in noise reduction, for

instance, Deep Denoising Autoencoder (DDAE). Lai et al. [8] presented DDAE-NR in

their research, which separated the system into Noise Classifier and Noise Removal. Their

result was satisfying both in classification and denoising. However, this system is not

suitable for a mobile device due to its complexity.

The idea of utilizing machine learning for noise classification has led to several kinds of implementations, such as cochlear implants. Such collaboration between computer engineering and the biomedical field contributes to solving health issues and improving people's daily lives. Researchers have tried different approaches to improve the performance and accuracy of noise classification for cochlear implants. In 2014, Saki

performance and accuracy of noise classification for this cochlear implant. In 2014, Saki

et al. [9] implemented the Random Forest Tree Classifier in a cochlear implant and

achieved a 10% improvement in accuracy compared to the previous version, which used

Gaussian Mixture Model (GMM). However, they only tested this method on three kinds

of noises, which still needs further improvements. Five years later, Alavi et al. [10]

improved the classifier by utilizing GMM as the classifier model and MFCC as the feature.

The classification result was very satisfying, with the classification accuracy reaching 100%

except for babble noise.

Singh et al. [11] presented an eight-layer VGG-like network in their research by

utilizing the augmented UrbanSound8K dataset. This architecture achieved 80.2%

accuracy by utilizing 127×64 log mel-spectrogram as the feature vector. Hassan et al. [12]

also brought a remarkable result from their research. With 5000 epochs, their model

achieved 94.6% accuracy. This result was supported by reducing the number of target classes: they classified only five out of the ten classes in the dataset.

All the studies mentioned above implemented convolutional neural networks (CNN) for classifying the noises. Sang et al. [13] introduced another method called the convolutional recurrent neural network (CRNN) for classifying the UrbanSound8K dataset and achieved 79.06% accuracy with their CRNN8 architecture, which is composed of eight CNN layers, one RNN layer, and one fully connected layer.

1.3 Organization of this Thesis

This thesis introduces a real-time supervised noise classifier on a smartphone as a

stepping stone to overcome the issue mentioned in [4]. We divide the entire process into

model training and application development. The challenge in this development is to achieve high classification accuracy while keeping the latency low at the same time. Compared to desktop applications, mobile applications are resource-limited, and

heavy computation will increase the latency of the overall process. TensorFlow Lite

provides the solution to deploy a machine learning model on a mobile device for lighter

computation and minimal resource consumption. Since TensorFlow Lite was made as a compact version of TensorFlow, some of the operators in TensorFlow are not

present in TensorFlow Lite. The most apparent incompatibility is in the Recurrent Neural

Network model. Due to such an issue, we need to use the library’s experimental version

for our model to perform as expected in our developed application.

By designing and developing the proposed mobile application, this thesis offers two

significant contributions as follows:

• High accuracy noise classification by utilizing only one audio feature, mel-

frequency cepstral coefficient (MFCC),

• Low computation complexity, low latency, and low battery consumption mobile

application to classify daily urban noises.

The remainder of this thesis is organized as follows: Chapter 2 elaborates on the

existing studies regarding noise classification. Then, Chapter 3 describes the proposed

noise classifier and how we determine the most suitable number of MFCCs to achieve the

expected results. Next, Chapter 4 presents the performance of the developed application

and compares the accuracy of each number of MFCCs selected for the test. Finally, Chapter 5 provides the conclusions as well as the future directions of this thesis.

Chapter 2

Related Works

Based on the brief explanation in Chapter 1, CRNN is a classification method that

incorporates both convolution and recurrent neural networks, which means that it

preserves both sparsity and sequentiality. In this chapter, we present the most common

audio features and the neural networks for further discussion.

Before elaborating on the development and analysis of the mobile application, we

need to understand the characteristics of the MFCC, which we present in Section 2.1. The

procedure for extracting MFCCs is also an essential process for this application, and Section 2.2 discusses this matter. Section 2.3 elaborates on the method that handles the sparsity of the extracted features, followed by the sequentiality in Section 2.4. Finally, Section 2.5 states the motivation of this work.

2.1 MFCC Characteristics

There are several types of “worth-extracting” information contained in raw audio data,

such as mel frequency power spectra [14], log-mel spectrogram [15, 11], Mel-Frequency

Cepstral Coefficient (MFCC) [8, 12, 16, 17], and many more. MFCC is the most popular

and widely used for noise classification and vocal representation since MFCC accurately

represents the shape of the vocal tract that manifests itself in the envelope of the short-

time power spectrum [18].

We can observe MFCC as a representation of how our brain perceives sounds that

propagate to our ears by analyzing the periodogram. This periodogram contains the power spectrum of an audio frame, and its estimation performs a role similar to the human cochlea, which identifies which frequencies are present in the frame. The human cochlea cannot discern the difference between two closely spaced frequencies; hence we implement a Mel filterbank to imitate this characteristic. We are only concerned with roughly how much energy occurs in each frequency region, and the Mel scale tells us exactly how to space the filterbanks.

The Mel scale relates perceived frequency (pitch) of a pure tone to its actual measured

frequency. Human ears discern small changes in pitch much better at low frequencies

than at high frequencies, and the Mel scale makes the features match this characteristic more closely. With f as the frequency in Hertz and m as the frequency in mels, the formulas for converting from frequency to the Mel scale and vice versa are:

M(f) = 1125 \ln(1 + f/700)    (2.1)

M^{-1}(m) = 700\,(e^{m/1125} - 1)    (2.2)
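To make the conversion concrete, the two formulas above translate directly into a pair of helper methods. The following is a minimal sketch; the class and method names are ours and not part of any library:

/** Minimal sketch of the Mel-scale conversion in Eq. 2.1 and Eq. 2.2. */
public final class MelScale {

    /** Converts a frequency in Hertz to Mels (Eq. 2.1). */
    public static double hertzToMel(double f) {
        return 1125.0 * Math.log(1.0 + f / 700.0);
    }

    /** Converts a frequency in Mels back to Hertz (Eq. 2.2). */
    public static double melToHertz(double m) {
        return 700.0 * (Math.exp(m / 1125.0) - 1.0);
    }

    public static void main(String[] args) {
        // Example: 4 kHz (the upper frequency used later in Table 3.1) is about 2142 Mels.
        double mel = hertzToMel(4000.0);
        System.out.printf("4000 Hz = %.1f Mel, back to %.1f Hz%n", mel, melToHertz(mel));
    }
}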

Another characteristic that we need to consider is the fact that human ears perceive

loudness on a logarithmic scale. Such a condition means that the significant variations in

energy may not sound significantly different if the sound begins with a high sound-pressure level.

Audio engineers commonly use MFCCs as features in speech recognition systems [19], such as systems that automatically recognize numbers spoken into a telephone. MFCCs are also increasingly finding uses in music information retrieval applications such as genre classification, audio similarity measures, and many more. In the early 2000s, the European Telecommunications Standards Institute defined a standardized MFCC algorithm for mobile phones.

Figure 2.1. Sampled audio file

2.2 MFCC Extraction

Mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of

a sound; therefore, MFCCs are the coefficients that collectively make up an MFC [20].

Deciding on how many MFCCs to extract from a single audio frame is challenging. A

system can achieve higher classification accuracy by increasing the number of MFCCs to extract. On the other hand, this will increase the dimensionality, which requires more computation and causes latency. Jacoby's research [21] shows how the number of MFCCs (25 to 40) affects the accuracy of a model trained using the

UrbanSound8K dataset [22].

MFCCs are commonly derived as follows [23]:

• Take the Fourier transform of a windowed excerpt of a signal.

• Map the powers of the spectrum obtained above onto the mel scale, using

triangular-overlapping windows.

• Take the logs of the powers at each of the mel frequencies.

Figure 2.2: Periodogram sample of one frame

• Take the discrete cosine transform of the list of the mel log powers, as if it were a signal.

• The MFCCs are the amplitudes of the resulting spectrum.

Like any other audio feature extraction, MFCC extraction starts with sampling, and

we can visualize the sampled audio file as shown in Figure 2.1. Sampling an audio signal

into frames can be tricky since an audio signal is continually changing. If the frame

duration is too short, we will not have enough samples to get a reliable spectral estimate.

If it is too long, there would be too much variation within the frame, which might lead us to

overfit the classifier model.

The next step is to calculate the periodogram estimate of the power spectrum by

taking the Discrete Fourier Transform for each frame. A periodogram sample of one

frame is presented in Figure 2.2. If we consider the DFT of the sampled frame as 𝑆𝑖 (𝑘),

the periodogram-based power estimate is given as:

P_i(k) = \frac{1}{N} |S_i(k)|^2    (2.3)

where P_i is the power spectrum of frame i, S_i is the DFT of frame i, and N is the number of samples.

Figure 2.3: Mel-spaced filterbank
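As an illustration of Eq. 2.3, the sketch below computes the periodogram-based power for one frame, assuming the DFT of the frame is already available as interleaved real/imaginary values (as many FFT routines produce). The names here are hypothetical and not tied to any particular library:

/** Minimal sketch of the periodogram estimate in Eq. 2.3 for a single frame. */
public final class Periodogram {

    /**
     * @param dft interleaved DFT output of one frame: {re0, im0, re1, im1, ...}
     * @param n   number of samples in the frame (N in Eq. 2.3)
     * @return power spectrum P_i(k) = |S_i(k)|^2 / N for each bin k
     */
    public static double[] powerSpectrum(float[] dft, int n) {
        double[] power = new double[dft.length / 2];
        for (int k = 0; k < power.length; k++) {
            double re = dft[2 * k];
            double im = dft[2 * k + 1];
            power[k] = (re * re + im * im) / n;   // |S_i(k)|^2 / N
        }
        return power;
    }
}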

After we receive the result, we can compute the Mel-spaced filterbank by applying

the triangular filters to the periodogram estimate, where we can visualize the result as

shown in Figure 2.3. To calculate the filterbank energies, we multiply each filterbank with

the power spectrum, then add up the coefficients. This calculation leaves us with one value per filter, indicating how much energy falls into each filterbank.

To calculate the filterbank, we first need to choose a lower and an upper frequency, where the lower frequency can go as low as zero while the upper frequency should not be higher than half the sampling rate. Using Eq. 2.1, we convert the lower and upper frequencies to Mels, then we place equally spaced points between these two values according to the number of filters we want (m filters require m + 2 boundary points). After obtaining these points, we convert them back to the frequency scale (Hz) and round them to the nearest FFT bin. Denoting the bin of the m-th boundary point as f(m), the formula for the triangular, overlapping filters is:

H_m(k) = \begin{cases}
0, & k < f(m-1) \\
\dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\
\dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\
0, & k > f(m+1)
\end{cases}    (2.4)

The final step for this extraction process is to take the log of each of the energies

calculated in the filterbank, which leaves us with the log filterbank energies. To get the MFCCs, we compute the DCT of these log filterbank energies.
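To tie the steps above together, the sketch below builds the triangular filterbank of Eq. 2.4, applies it to a power spectrum, takes the logs, and computes a DCT-II to obtain MFCCs. It is a simplified illustration of the standard procedure under our own naming, not the exact TarsosDSP implementation:

/** Simplified MFCC computation from a single power spectrum (Eqs. 2.1, 2.2, and 2.4). */
public final class SimpleMfcc {

    /** Boundary FFT bins f(0..m+1) for m triangular filters between lowHz and highHz. */
    static int[] filterBins(int m, double lowHz, double highHz, int fftSize, double sampleRate) {
        double lowMel = 1125.0 * Math.log(1.0 + lowHz / 700.0);
        double highMel = 1125.0 * Math.log(1.0 + highHz / 700.0);
        int[] bins = new int[m + 2];
        for (int i = 0; i < m + 2; i++) {
            double mel = lowMel + (highMel - lowMel) * i / (m + 1);   // equally spaced in Mel
            double hz = 700.0 * (Math.exp(mel / 1125.0) - 1.0);       // back to Hz
            bins[i] = (int) Math.floor((fftSize + 1) * hz / sampleRate); // map to an FFT bin
        }
        return bins;
    }

    /** Log filterbank energies followed by a DCT-II, keeping the first numCoeffs coefficients. */
    static double[] mfcc(double[] power, int[] bins, int numCoeffs) {
        int m = bins.length - 2;
        double[] logEnergy = new double[m];
        for (int j = 1; j <= m; j++) {                 // triangular filters of Eq. 2.4
            double energy = 0.0;
            for (int k = bins[j - 1]; k <= bins[j + 1] && k < power.length; k++) {
                double w = (k <= bins[j])
                        ? (k - bins[j - 1]) / (double) Math.max(1, bins[j] - bins[j - 1])
                        : (bins[j + 1] - k) / (double) Math.max(1, bins[j + 1] - bins[j]);
                energy += w * power[k];
            }
            logEnergy[j - 1] = Math.log(Math.max(energy, 1e-10));
        }
        double[] coeffs = new double[numCoeffs];       // unscaled DCT-II of the log energies
        for (int c = 0; c < numCoeffs; c++) {
            double sum = 0.0;
            for (int j = 0; j < m; j++) {
                sum += logEnergy[j] * Math.cos(Math.PI * c * (j + 0.5) / m);
            }
            coeffs[c] = sum;
        }
        return coeffs;
    }
}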

There are several variations in extracting MFCC features, such as differences in the shape or spacing of the windows used to map the scale [24], or the addition of dynamic features such as "delta" and "delta-delta" (first- and second-order frame-to-frame difference) coefficients. The number of filters, the shape of the filters, how the filters are spaced, and how the power spectrum is warped may all affect the performance of the MFCC itself.

2.3 Convolutional Neural Network (CNN)

Over the last few years, CNNs have made breakthroughs in image, text, and audio classification. CNNs have been around since the early 1990s; from the late 1990s to the early 2010s, they were in incubation. As more and more data and computing power became available, researchers became more interested in exploring tasks that a CNN can handle.

Hassan et al. [12] presented in their research how they implemented a 64-filter

convolution layer followed by max-pooling and two fully-connected layers with MFCCs

as their feature vector. With 5000 epochs, their model achieved 94.6% accuracy. However,

using two fully-connected layers in series is not suitable for real-time smartphone

applications due to its heavy computation and large model size, even after being converted into

a TensorFlow Lite file (around 200 MB). On the other hand, Singh et al. [11] implemented

four convolutional layers and a max-pool layer followed by a fully-connected layer. This

architecture achieved 80.2% accuracy on an augmented UrbanSound8K dataset with a

log-mel spectrogram as the feature vector.

A Convolutional Neural Network (CNN) is a deep learning algorithm that can take

in an input subject, assign importance (learnable weights and biases) to various

aspects/objects in the subject, and be able to differentiate one from the other. The pre-

processing required by a CNN is much lower compared to other classification algorithms.

While the filters in primitive methods are hand-engineered, with enough training, CNN

can learn these filters by itself.

Convolutional layers convolve the input and pass its result to the next layer. The

convolution operation reduces the number of free parameters, allowing the network to be deeper with fewer parameters. There are two kinds of convolution layers, temporal and spatial, and we utilize the temporal one in this development. This layer creates a convolution kernel that is convolved with the layer input over a single temporal

dimension to produce a tensor output. When programming this temporal CNN, the input

is a tensor with shape (l audio files) × (m frames) × (n MFCCs).

CNN can successfully capture the spatial and temporal dependencies in an input

subject by applying the relevant filters. The architecture fits the dataset better due to parameter reduction and the reusability of weights. In other words, we can

train the network to understand the sophistication of the input subject better.

2.4 Long Short-Term Memory (LSTM)

LSTM is a kind of recurrent neural network that uses a gating mechanism for better

modeling of long-term dependencies in the data [25]. A recurrent neural network is

suitable for dealing with sequential information such as audio since every sample of this

kind of information is related to its predecessor. Numerous studies [13, 17, 26] have proved that LSTM improves performance in dealing with such data due to its high efficiency in capturing long-term dependencies.

The key idea of the RNN is that the recurrent connections between the hidden layers allow the memory of previous inputs to be retained in the internal state, which can affect the outputs.

However, RNN mainly has two issues to solve in the training phase: vanishing gradient

and exploding gradients. When computing the derivatives of the activation function in the back-propagation process, long-term components may go exponentially fast to zero. This condition makes it hard for the model to learn the correlation between temporally distant

inputs. Meanwhile, when the gradient grows exponentially during training, the exploding

gradient problem occurs.

By design, RNN takes two inputs at each time step: an input vector and a hidden

state. The next RNN step takes the second input vector and first hidden state to create the

output of that step. Therefore, in order to capture semantic meanings in long sequences,

we need to run the RNN over many time steps, turning the unrolled RNN into a very deep network.

LSTM layers comprise recurrently connected memory blocks where one memory

cell contains three multiplicative gates. The gates perform continuous analogs of write,

read, and reset operations, which enable the network to utilize the temporal information over a period of time. Each LSTM cell governs what to remember, what to forget, and how to update the memory using its gates. By doing so, the LSTM network solves the problem

of exploding or vanishing gradients. By aligning multiple LSTM cells, we can process

sequential input data. LSTM units are typically arranged in layers, so the output of each unit is the input of other units. In this way, the network becomes richer

and captures more dependencies.

Figure 2.4 Classification accuracy of UrbanSound8K dataset by varying MFCCs

2.5 Motivations

Based on the background in Chapter 1, hearing loss is still a big issue in our daily lives; ironically, most people refuse to wear the hearing aids on the market due to various reasons, including the presence of background noises. Although much research on noise classification shows that high accuracy is achievable, we need to reduce the complexity and latency to implement such a system on a

mobile device.

Most of the noise classification methods utilize multiple features from an audio

signal to increase accuracy. Such a method indeed increases the classification accuracy,

as proved by Dang et al. [27] in their research. However, it also increases the computation

complexity, which is not suitable for mobile applications. For instance, Alamdari et al.

[15] introduced a real-time smartphone application for unsupervised noise classification

by utilizing band-periodicity, band-entropy, and mel-frequency spectral coefficients as

the feature sets. The main drawback of this unsupervised noise classifier was the latency

since it utilized multiple audio features. This latency drastically increased when it

detected a new noise class.

According to the characteristics of MFCC, we learned that the resemblance between this feature and human cochlea behavior is uncanny. The accuracies we can achieve by altering the number of MFCCs we use vary, as shown in Fig. 2.4. Based on this fact, we understand that using MFCC alone is a promising way to develop a reliable system to classify noises.

The objective of this proposed method is to develop a smartphone application to

classify daily noises with low computation complexity, latency, and power consumption

while maintaining the accuracy at the same time by using only MFCC as the input feature.

Chapter 3

Proposed Method

In this thesis, we propose a power-efficient noise classifier designed for a smartphone. The proposed method comprises two significant blocks. The first block implements a DSP library called TarsosDSP, which handles feature extraction from the incoming audio signal, as presented in Figure 3.1. From this block, we receive a feature vector of size m frames × n MFCCs. This feature vector is then passed to the second block, which is our classifier model in the form of a TensorFlow Lite model. We elaborate on these two blocks in the rest of this chapter.

Figure 3.1 Block diagram of feature extraction flow

3.1 Feature Extraction

The feature extraction process in this application relies purely on the TarsosDSP [28] library. TarsosDSP is a Java library for audio processing. It aims to provide an easy-to-use interface to implement audio processing algorithms as simply as possible in pure Java and without any other external dependencies. This library tries to hit the sweet spot between being capable enough to get an actual task done while remaining compact and simple enough to serve as a demonstration of how DSP algorithms work. TarsosDSP

features an implementation of a percussion onset detector and several pitch-detection

algorithms. It also includes a Goertzel DTMF decoding algorithm, a time stretch

algorithm, resampling, filters, pure synthesis, some audio effects, and a pitch-shifting

algorithm.

The extraction starts with capturing and sampling the raw audio waveform into audio

frames. TarsosDSP provides two functions to handle such processes, each according to

its source. The frame size and overlap determine the number of frames it captures, proportional to the duration and sample rate.

The extraction process in TarsosDSP needs several predefined parameters for its

AudioDispatcher, such as sample rate, frame size, frame overlap, cepstrum coefficient,

mel filter, and frequency thresholds. Table 3.1 shows the values we assigned for those parameters in this thesis, and based on these parameters, we determined the most suitable number of MFCCs for our input features.

TABLE 3.1
VALUES FOR AUDIODISPATCHER PARAMETERS

Parameter          From File       From Microphone
Sample rate        8 kHz           8 kHz
Frame size         400 samples     640 samples
Frame overlap      322 samples     564 samples
Mel filters        40 mels         40 mels
Lower frequency    0 Hz            0 Hz
Higher frequency   4 kHz           4 kHz

During the feature extraction process, we also applied a silence-detection function by calculating the energy in decibel sound pressure level (dBSPL) of a buffer b of length n:

\mathrm{dBSPL} = 20 \log_{10} \left( \frac{\sqrt{\sum_{i=0}^{n} b[i]^2}}{n} \right)    (3.1)

Then, we compare this dBSPL with a defined threshold. If the dBSPL is below the threshold, we consider the corresponding frame as silence and purge it. The commonly used threshold is -70 dB [28], and it works well on loud noises such as explosions, gunshots, car horns, or other impulses. Since our noise classes also include continuous murmur and chatter, we lowered the threshold to -95 dB.
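As a rough illustration of how these parameters are wired together in TarsosDSP on Android, the sketch below creates an AudioDispatcher from the microphone and attaches an MFCC processor with the "From Microphone" values of Table 3.1, plus the -95 dBSPL silence gate. The class and constructor names follow our reading of the TarsosDSP API and may differ between library versions, so treat this as an assumption rather than the thesis' actual code:

import be.tarsos.dsp.AudioDispatcher;
import be.tarsos.dsp.AudioEvent;
import be.tarsos.dsp.AudioProcessor;
import be.tarsos.dsp.io.android.AudioDispatcherFactory;
import be.tarsos.dsp.mfcc.MFCC;

public final class MicMfccExtractor {

    public static AudioDispatcher start() {
        int sampleRate = 8000;      // Table 3.1, "From Microphone"
        int frameSize = 640;        // samples per frame
        int frameOverlap = 564;     // overlapping samples
        int numMfccs = 30;          // cepstrum coefficients (see Section 4.1)
        int numMelFilters = 40;
        float lowFreq = 0f;
        float highFreq = 4000f;     // half of the sample rate

        AudioDispatcher dispatcher =
                AudioDispatcherFactory.fromDefaultMicrophone(sampleRate, frameSize, frameOverlap);

        final MFCC mfcc = new MFCC(frameSize, sampleRate, numMfccs,
                                   numMelFilters, lowFreq, highFreq);
        dispatcher.addAudioProcessor(mfcc);      // computes MFCCs for every frame
        dispatcher.addAudioProcessor(new AudioProcessor() {
            @Override
            public boolean process(AudioEvent audioEvent) {
                // Silence gate based on Eq. 3.1: skip frames quieter than -95 dBSPL.
                if (audioEvent.getdBSPL() > -95.0) {
                    float[] coefficients = mfcc.getMFCC();   // one frame of 30 MFCCs
                    // ... enqueue the coefficients for the classifier (see below) ...
                }
                return true;
            }

            @Override
            public void processingFinished() { }
        });

        new Thread(dispatcher, "mfcc-extraction").start();
        return dispatcher;
    }
}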

Figure 3.2: MFCC heatmap for the sound of (a) car horn, (b) crowd, (c) dog bark, (d) siren, (e) traffic, and (f) mixed noises

TarsosDSP works very well without any dependencies on the Android libraries. However, for the loading and sampling process, we need to prepare FFmpeg in our asset directory. FFmpeg is a leading multimedia framework that can decode, encode, transcode, mux, demux, stream, filter, and play nearly anything that humans and machines have created, from the most obscure ancient formats up to the cutting edge, whether designed by a standards committee, the community, or a corporation. In our application, FFmpeg handles the decoding and streaming before the extraction process begins.

In the extraction process, we implement the BlockingQueue mechanism provided by the Java concurrency utilities (java.util.concurrent) due to its thread-safe characteristics. BlockingQueue is a Queue that

additionally supports operations that wait for the queue to become non-empty when

retrieving an element, and wait for space to become available in the queue when storing

an element. All queuing methods achieve their effects atomically using internal locks or

other forms of concurrency control. Such a locking mechanism makes multi-threaded operations safer.

A Java BlockingQueue does not accept null values and throws a NullPointerException if we try to store a null value in the queue. Its interface is part of the Java collections

framework, and we use it primarily for implementing the producer-consumer problem.

Java provides several BlockingQueue implementations such as ArrayBlockingQueue,

LinkedBlockingQueue, PriorityBlockingQueue, SynchronousQueue, and several more.

The main difference between ArrayBlockingQueue and LinkedBlockingQueue lies in their internal backing structures: as their names suggest, ArrayBlockingQueue is backed by an array, while LinkedBlockingQueue is backed by linked nodes. There are also other differences in these types of

BlockingQueue. Linked queues typically have higher throughput than an array-based

queue, but have less predictable performance in most concurrent applications. The

SynchronousQueue is a special BlockingQueue that performs a direct hand-off during offers. If no other thread is currently performing take or poll, the offer will fail; during a take, if no other thread is performing offer concurrently, it will also fail. This unique hand-off mechanism is well suited for a queue with high response requirements and threads from a non-fixed thread

pool.

While implementing a producer-consumer problem in the BlockingQueue mechanism, there are some essential methods that we need to be familiar with:

• put inserts an element into the queue. If the queue is full, it waits for space to become available.

• add returns true if the insertion was successful and throws an IllegalStateException if it failed.

• offer tries to insert an element into the queue; if the queue is full, it returns false immediately, while its timed variant waits for available space within a specified duration.

• take retrieves and removes the element at the head of the queue. If the queue is empty, it waits for an element to become available.

• poll retrieves and removes the element at the head of the queue, waiting for a specified time if necessary for an element to become available. It returns null after a timeout.

Generally, we can distinguish two types of BlockingQueue by their capacity: unbounded and bounded queues. An unbounded queue sets the capacity of the BlockingQueue to Integer.MAX_VALUE. All operations that add an element to the

unbounded queue will never block; thus, it could grow to a massive size. The crucial part

when designing a producer-consumer application using unbounded BlockingQueue is the

requirement for the consumer to be able to retrieve the messages as quickly as the

producers fill the queue. Otherwise, the memory could fill up, and the program will throw

an OutOfMemoryError.

The second type of BlockingQueue is the bounded queue. We can create such a

queue by passing the capacity as an integer argument to the constructor. When a producer tries to add an element to a queue that is already full, the call will block the incoming data until space becomes available, depending on the method used to add. Using a bounded queue is an excellent way to design concurrent programs.
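The following minimal sketch shows the bounded producer-consumer pattern described above, with an ArrayBlockingQueue carrying extracted feature frames from an extraction thread to a classification thread. The class, field, and frame-count values are illustrative only, not the thesis' actual code:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class FeatureQueueDemo {

    public static void main(String[] args) {
        // Bounded queue: the producer blocks in put() when 64 frames are already waiting.
        BlockingQueue<float[]> frames = new ArrayBlockingQueue<>(64);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 1000; i++) {
                    float[] mfccFrame = new float[30];   // stand-in for one extracted frame
                    frames.put(mfccFrame);               // waits if the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "extractor");

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 1000; i++) {
                    float[] mfccFrame = frames.take();   // waits if the queue is empty
                    // ... feed mfccFrame to the classifier ...
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "classifier");

        producer.start();
        consumer.start();
    }
}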

3.2 Noise Classification

Before we start training and designing our machine learning model, we need to know

and prepare the dataset we are going to use. The dataset needs to contain the audio classes we need, which are urban noises. After we gather the dataset, we start designing the classifier model.

3.2.1. Datasets

While there are many datasets available, in this research, we used only two of the most

popular urban noise datasets: UrbanSound8K and FSDKaggle2019 [29], which contain

10 and 80 classes, respectively.

UrbanSound8K contains 8732 labeled sound excerpts of urban sound from 10 classes:

air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot,

jackhammer, siren, and street music. The classes are from the urban sound taxonomy. In

the UrbanSound8K dataset, the files are pre-sorted into ten folds (folders named fold1 to

fold10) to help in the reproduction of and comparison with automatic classification results.

FSDKaggle2019 (Freesound Dataset Kaggle 2019) is an audio dataset containing

29,266 audio files annotated with 80 labels of the AudioSet Ontology. Fonseca et al. used

this dataset for Task 2 of the Detection and Classification of Acoustic Scenes and Events

(DCASE) Challenge 2019. All the audio clips are provided as uncompressed PCM with a 16-bit depth, a 44.1 kHz sampling rate, and a mono channel.

FSDKaggle2019 employs audio clips from the following sources:

1. Freesound Dataset (FSD): a dataset from the MTG-UPF based on Freesound

content organized with the AudioSet Ontology

2. The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

Regarding the labeling, these audio data are labeled with a vocabulary of 80 labels from

Google’s AudioSet Ontology. These labels cover diverse topics: Guitar and other Musical

Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human

locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid,

Motor vehicle (road), Mechanism, Doors, and a variety of Domestic sounds.

Compared with UrbanSound8K, the audio clips in FSDKaggle2019 have variable

lengths (roughly from 0.3 to 30 seconds). The labeling in this dataset is also varied.

Specifically, FSDKaggle2019 features three types of label quality, one for each set in the

dataset:

• Curated train set: correct (but potentially incomplete) labels

• Noisy train set: noisy labels

• Test set: correct and complete labels

FSDKaggle2019 comprises two train sets and one test set. The idea is to limit the

supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus

promoting approaches to deal with label noise.

In the curated train set, the duration of the audio clips ranges from 0.3 to 30 seconds

due to the diversity of the sound categories and the preferences of Freesound users when

recording/uploading sounds. Labels are correct but potentially incomplete. A few of these

audio clips may present additional acoustic material beyond the provided ground truth

labels.

On the other hand, the duration of the audio clips in the noisy train set ranges from one second to 15 seconds, with the vast majority lasting 15 seconds. Table 3.2 presents the details of each set of this dataset.

TABLE 3.2
THREE SETS OF FSDKAGGLE2019 DATASET

Params \ Sets      Curated train set   Noisy train set   Test set
Clips/class        75                  300               50-150
Total clips        4970                19,815            4481
Avg. labels/clip   1.2                 1.2               1.4
Duration (hours)   10.5                80                12.9

A key aspect of developing and successfully deploying neural network-based solutions is the availability of a suitable dataset. While deep learning methodologies have proved themselves in many different contexts, it is easy to get carried away and ignore some practical realities of a deep neural network-based solution. A network is only as good as the data used to train it, and training an effective neural network requires a large annotated dataset.
Today’s neural networks require large datasets in order to converge on an accurate

representation of multidimensional data distribution. The deeper and the more

sophisticated the network, the higher the need for data. Nevertheless, acquiring good

quality data is a complicated and costly process, and accurately annotating the data to

match a particular problem adds further cost and complexity to the process [30].

To improve the classification accuracy, we merged the least significant classes into a new class, which was distinguishable among the rest, leaving us with five classes in

total. Those five classes are the sounds of car horns, crowd, dog bark, siren, and traffic.

We not only augmented UrbanSound8K with 448 clips from FSDKaggle2019 but also filtered out the audio files shorter than three seconds in duration to attain a uniform dataset.


Figure 3.3 Block diagram of the proposed classifier

3.2.2. Noise Classifier Model

In this thesis, our classifier model adopts the CNN-LSTM model by Sang et al. [13] with

a few modifications. Instead of using raw audio waveforms, our system uses MFCCs as

the input, and we only classify the noises into five classes. Based on their results, we

decided to build our model with two CNN layers (64 and 128 filters), one stack of RNN cells comprising two TFLiteLSTMCells, followed by one fully-connected layer.

TFLiteLSTMCell is a recurrent network cell that is specially designed only for

TensorFlow Lite. It provides hints, and it also makes the variables suitable for the TFLite

ops (transposed and separated). This architecture is suitable enough to provide an

acceptable accuracy while maintaining high performance, since this application needs to run on a smartphone with low latency.
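On the application side, running the converted model reduces to loading the .tflite file and calling the TensorFlow Lite Interpreter with the 102 × 30 feature vector. The sketch below is a simplified illustration based on the standard TensorFlow Lite Java API; the file name, tensor shapes, and class names are our assumptions for illustration, not the thesis' exact code:

import org.tensorflow.lite.Interpreter;
import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public final class NoiseClassifier {

    private final Interpreter interpreter;

    public NoiseClassifier(String modelPath) throws Exception {
        // Memory-map the converted TensorFlow Lite model file.
        try (FileInputStream fis = new FileInputStream(modelPath);
             FileChannel channel = fis.getChannel()) {
            MappedByteBuffer model =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            interpreter = new Interpreter(model);
        }
    }

    /** @param features one window of 102 frames x 30 MFCCs */
    public float[] classify(float[][] features) {
        float[][][] input = new float[][][] { features };   // batch of one
        float[][] output = new float[1][5];                  // scores for the five noise classes
        interpreter.run(input, output);
        return output[0];
    }
}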

As presented in Figure 3.3, the process starts with feeding the feature vector into our first pair of Conv1D (64 filters) and MaxPooling1D layers. The second pair comprises the same elements with a different number of filters for the second Conv1D (128 filters). For optimization


purposes, we use He initialization to avoid vanishing and exploding gradients in each

convolutional layer, which works well with ReLU activation [31].

When we initialize our weights randomly, the values are probably close to zero,

given the probability distributions with which we initialized them. If the weights are close

to zero, the gradients in upstream layers vanish, due to the multiplication of small values.

On the other hand, if the weights are higher than one, the multiplication becomes too intense, which is commonly known as an explosion. In the Xavier/He initialization, the weights are

initialized, keeping in mind the size of the previous layer, which helps in attaining a global

minimum of the cost function faster and more efficiently. The weights are still random

but differ in range depending on the size of the previous layer of neurons. This mechanism

provides a controlled initialization; hence gradient descent becomes faster and more

efficient.
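For reference, He initialization [31] draws each weight from a zero-mean distribution whose variance is scaled by the size of the previous layer (standard formulation; the notation here is ours):

W \sim \mathcal{N}\left(0, \; \frac{2}{n_{\mathrm{in}}}\right)

where n_in is the number of input units feeding the layer; the Xavier variant instead scales the variance by 2/(n_in + n_out).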

We trained our model using Adam [32] as the optimizer. We may observe Adam as

a combination of RMSprop and Stochastic Gradient Descent with momentum. Stochastic

gradient-based optimization is of core practical importance in many fields of science and

engineering. Adam uses the squared gradients to scale the learning rate like RMSprop,

and it takes advantage of momentum by using the moving average of the gradient instead

of the gradient itself like SGD with momentum. SGD proved itself as an efficient and

effective optimization method that was central in many machine learning success stories.

Adam uses estimations of the first and second moments of the gradient to adapt the

learning rate for each weight of the neural network.
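For reference, the standard Adam update from [32] keeps exponential moving averages of the gradient g_t and of its element-wise square, corrects their bias, and scales the step for each weight individually:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

The first moment m_t plays the role of momentum, while the second moment v_t provides the RMSprop-like per-weight scaling described above.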

Instead of using categorical cross-entropy, we used binary cross-entropy as our loss

function since a raw audio signal in the environment is an overlapping combination of

different audio signals propagating from various sources. Categorical cross-entropy is

suitable for multi-class classification, while binary cross-entropy is suitable for multi-


label classification. Binary cross-entropy is a loss function commonly used in binary classification tasks; formally, the multi-label loss is the average of the binary cross-entropy computed independently over each label. Using categorical cross-entropy would prevent us from seeing the dominant noise in the environment together with the other noises that come along during the sampling.
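To make the distinction explicit, with C classes, target vector y, and predicted scores ŷ, the two losses take the following standard forms (the notation is ours, not from the thesis):

L_{\mathrm{categorical}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c

L_{\mathrm{binary}} = -\frac{1}{C} \sum_{c=1}^{C} \left[ y_c \log \hat{y}_c + (1 - y_c) \log (1 - \hat{y}_c) \right]

The categorical form assumes a softmax output where exactly one class is active, whereas the binary form treats each of the five noise labels as an independent sigmoid output, so several overlapping noises can be detected in the same window.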

Chapter 4

Experimental Results

Accuracy is an important parameter that we need to observe in a classification task. Since

this thesis is about developing a mobile application, we also need to consider the

computation complexity, resource usage, and power consumption. While these

considerations are all related, latency is also essential since this thesis also covers audio

signal processing. In this thesis, we conduct two kinds of experiments. The first one is to find a suitable number of MFCCs to use in our system, fixing the number of cepstral coefficients while achieving high accuracy.

Along with this first experiment, we also tested the same number of MFCCs, epochs, and dataset on two other noise classifier models. We ran this test as a benchmark for the classification accuracy achieved by our proposed model. The second experiment is to observe the resource and power consumption while running the application and to compare the results when the application runs on different models of smartphones. We also ran the same benchmarking process here to see the

amount of time needed for the application to run. The overall process comprises the


feature extraction process and the classification process. The feature extraction time should be the same for all models since we use the same method for extracting the MFCCs in the proposed

architecture and the other two. In this chapter, we provide the experimental results along

with some comprehensive discussions regarding the performance evaluation.

Figure 4.1: Classification accuracy comparison between the proposed system and two other architectures (accuracy (%) versus number of MFCCs, 30 to 45; series: Proposed, Hassan, Singh)

4.1 MFCC Determination

Determining the size of a feature vector is crucial in classification. Increasing the

size (dimensionality) will increase the chances of higher accuracy; however, it will also demand heavier computation. As mentioned in the MFCC extraction process,

the size of the feature we extract depends on the value we put in the sampling frame and

overlapping frame size. These amounts determine the number of audio frames our

application needs to sample, and each sampling for each frame requires computation time

and memory allocation for storing the extracted value temporarily. Decreasing the feature

vector size will increase the performance of the system, but it might also compromise the

accuracy of the model itself. Therefore, before we start the development of the entire

application, we need to determine the dimension of our feature vector.


We ran four experiments to determine the most suitable number of coefficients from

30 to 45. We decided not to go beyond 45 since, as mentioned in Jacoby's research, the accuracy goes down past that number of MFCCs. Figure 4.1 presents the results of the

experiments, and we found that 30 is the most suitable one since it leads to the highest

accuracy and the shortest computation time needed, as seen in Figure 4.2. Based on this

result, our feature vector size will be 102 × 30.

Finding the number of frames we need to use for the feature vector is also tricky due to the workflow of the TarsosDSP library. As presented in Table 3.1, the acceptable data types for the frame size and frame overlap are integers, which makes it harder for us to perform the tuning. Since we train our model with audio files of at most six seconds, we aimed for 600 frames per six seconds, which would give 100 frames per second. However, we cannot achieve exactly this amount due to the integer resolution of these parameters (with an 8 kHz sample rate, a 400-sample frame and a 322-sample overlap give a hop of 78 samples, or roughly 8000/78 ≈ 102.6 frames per second), and we decided to use the closest achievable rate, which is 102 frames per second. This amount is sufficient to achieve high accuracy and low processing time.

The trend of the classification accuracy does not scale linearly with the number of MFCCs. As we can see in Figure 4.1, the accuracy goes down when we utilized

40 MFCCs. According to the results from Jacoby’s research, he found a relatively even

distribution for some sounds and a drastic change in some other cases. The sound of the

car horn is a notable example, where the classification accuracy drastically increases as the number of Mel bands grows from 10 to 50. This phenomenon makes sense for a very tonal

sound, such as the sound of a car horn, as the resolution of audible frequencies increases

with the number of available Mel bands. Another kind of sound that is also tonal is the

sound of the siren. The classification accuracy for this sound starts relatively high at near

75%, and gradually decreases with the increasing Mel bands.


From observing the trend and behavior of the MFCC feature, we understand that the

classification accuracy from utilizing MFCC fluctuates based on the tonal characteristic

of the soundwave. Therefore, we decided to modify our dataset by changing the label of

some audio files. Based on the classification trends, we understand that some labels have

a similar characteristic to the others. For instance, the sounds of a car engine and a motorcycle differ minimally. Regrouping these similar audio files increases the classification accuracy, and it also helps us in determining the number of MFCCs we need to use for our feature vector.

Figure 4.2: Elapsed time (processing time in seconds) in classifying a six-seconds long audio file using three different architectures, for 30 to 45 MFCCs (series: Proposed, Hassan, Singh)

4.2 The Runtime of the Developed Application

In the past few years, smartphones have become high-demand products in the global market. Many manufacturers have released products with new device architectures. This vast architectural diversity makes it a challenge for us to develop a system that can work on the majority of products. To validate the performance of our developed application, we measured the average runtime of each audio file sample (six samples) on four different smartphones with different specifications, as shown in Table 4.1.

Figure 4.2 presents the runtime comparisons between our developed system and the

other classifiers. Based on these results, the system consumed an average runtime of 666

ms for extracting 30 MFCCs and classifying a six-second long audio file. Compared to

the other architectures, our proposed system consumed less processing time.


TABLE 4.1
SPECIFICATIONS OF THE TESTED SMARTPHONES

Index     CPU                                   RAM    OS (Android version)
Phone A   Snapdragon 630, Octa-core 2.2 GHz     4 GB   8.0
Phone B   Snapdragon 630, Octa-core 2.2 GHz     4 GB   9
Phone C   Snapdragon 820, Quad-core 2.2 GHz     4 GB   8.0
Phone D   Snapdragon 855, Octa-core 2.84 GHz    6 GB   10

TABLE 4.2
RESOURCE CONSUMPTION OF THE DEVELOPED SYSTEM IN A NOISY ENVIRONMENT

Index     Max. CPU Usage   Max. Memory Consumption
Phone A   10%              254.7 MB
Phone B   14%              76.5 MB
Phone C   33%              151.9 MB
Phone D   7%               109.7 MB
Average   16%              148.2 MB


Figure 4.3 Confusion matrix for augmented test data using 30 MFCCs

Regarding our noise classifier model, we managed to achieve 92.06% accuracy; for

a brief review and analysis, we present the results in a confusion matrix, as seen in Figure

4.3. Each of these five labels has its own unique characteristics, such as a continuous medium-range frequency for the car horn, a murmuring sound for the crowd, short and low-range frequencies for the dog bark, a flange-like audio signal for the siren, and a continuous motor sound for traffic. These different tonal characteristics affect classification accuracy since one type of audio may be classified well with a certain number of MFCCs, while that same number could degrade the accuracy of another type.

As we can see in the confusion matrix, the classifications for car horn and siren are

not as accurate as the rest. This issue occurs because there are fewer audio files for those two labels than for the other labels; besides, the audio files themselves are not clean enough to achieve higher accuracy. As we mentioned before, the audio signal present in the environment is an overlapping combination of several audio signals from different sources that propagate at the same moment.

Figure 4.4 Screen capture of the application’s power usage

4.3 Resource Consumption during Classification

In this thesis, to assess the resource consumption of the developed application, we

conducted the experiments to measure the average CPU and memory usage during the

classification in a noisy environment. Table 4.2 presents the results of the experiments.

One challenging objective in this development is to create an application that can run

over an extended period. Therefore, we also need to examine the battery consumption

during the runtime. After testing the application for one hour in a noisy environment, we

retrieved the result by screen-capturing the battery usage information, as shown in Figure

4.4. Based on the captured information, we found that the device's estimated power use during the execution time is around 0.03 Watts, converted from the 6 mAh written in the snapshot (6 mAh drawn over one hour corresponds to an average current of 6 mA, which at a nominal voltage of about 5 V is roughly 0.03 W). This information also shows that the total CPU runtime is around 5 minutes and 43 seconds out of one hour of application runtime.

Bear in mind that the CPU foreground time shows how long the application was running while one of its Activities was in the foreground. It might also include time when a Service from the application was in the foreground, which displays an ongoing notification. The CPU total includes all of the CPU usage (services and broadcast receivers in the background and activities in the foreground). According to the results presented in Figure 4.4, we can see that the CPU total time is only three seconds more than the CPU foreground time. This result means that our application does not have many processes that run in the background. Most of the heavy processing is for presenting the graph that shows the classification results.

Keep awake measures the length of time that this application has used wake locks or

alarms to keep the device awake when it would otherwise have been asleep. In a way, this

is potentially the most significant drain on the battery. Sleeping uses much less power

than staying awake, so if an application keeps a wake lock for a long time, it is keeping

the device in a high-power mode all the time, even if the application is not doing any

significant work. In our results, we can see that our application does not require any wake locks to keep the device awake; hence, it conserves energy for other uses.
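
For contrast, the sketch below shows what acquiring a partial wake lock looks like on Android. It is illustrative only; the developed application deliberately contains no such call, which is why the Keep awake entry in Figure 4.4 stays at zero.

import android.content.Context;
import android.os.PowerManager;

final class WakeLockExample {
    // Keeping the device awake like this is what the developed application avoids.
    static void runWithWakeLock(Context context, Runnable work) {
        PowerManager pm = (PowerManager) context.getSystemService(Context.POWER_SERVICE);
        PowerManager.WakeLock lock =
                pm.newWakeLock(PowerManager.PARTIAL_WAKE_LOCK, "noiseclassifier:demo");
        lock.acquire(10 * 60 * 1000L); // keep the CPU on for at most 10 minutes
        try {
            work.run();
        } finally {
            lock.release(); // always release, or the battery drain persists
        }
    }
}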

The captured results suggest that our application consumes little power from the device battery. However, this is only a rough estimate of the real battery consumption. The CPU does not calculate its own usage; it may have hardware features that make the task less cumbersome, but the measurement is mostly the job of the operating system, and the details of these implementations vary (especially on multicore systems). The general idea is to look at how long the queue of tasks waiting for the CPU is; the operating system may inspect the scheduler periodically to determine how much work remains.

Ultimately, we still need to refer back to the number of cores each device has when interpreting the CPU load, since CPU load and battery consumption are related. On a quad-core system, a value higher than 25% means that the application has fully utilized the equivalent of one core and would therefore be considered high CPU usage, although this is still acceptable in short bursts. All the threads used in the application should be named so that the profiling information can be used in the best possible way, and it is always worth checking whether any of the light threads are unexpectedly consuming more CPU cycles.
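
The per-core interpretation above can be made concrete with a short sketch; the 33% value is Phone C's maximum from Table 4.2, and the normalization simply divides the total percentage by the core count.

final class CpuLoadInterpretation {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors(); // e.g. 4 on Phone C
        double totalCpuPercent = 33.0;                          // Phone C in Table 4.2
        double coresBusy = totalCpuPercent / 100.0 * cores;
        System.out.printf("%.0f%% total CPU on %d cores = about %.2f cores busy%n",
                totalCpuPercent, cores, coresBusy);
        // On a quad-core device, anything above 25% therefore means more than
        // one fully utilized core.
    }
}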

Figure 4.5 Classification result from (a) the beginning (nothing), (b) an audio file, and
(c) microphone

4.4 Developed Application Overview

After the development, we ran a quick test to see how the application behaves. Figure 4.5(a) shows the application at the moment it starts running. When the application starts, it first asks for permission to access the device's microphone and to read and write the internal storage. Internal-storage access is needed to move our sample audio files into a folder the application creates, called CSV, which serves as the application's default storage for CSV files and audio files.
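
A minimal sketch of that permission flow is shown below, assuming the standard AndroidX runtime-permission API; the request code and helper class are placeholders of ours, not the application's actual source.

import android.Manifest;
import android.app.Activity;
import android.content.pm.PackageManager;
import androidx.core.app.ActivityCompat;
import androidx.core.content.ContextCompat;

final class PermissionHelper {
    static final int REQUEST_CODE = 42;
    static final String[] PERMISSIONS = {
            Manifest.permission.RECORD_AUDIO,
            Manifest.permission.READ_EXTERNAL_STORAGE,
            Manifest.permission.WRITE_EXTERNAL_STORAGE
    };

    /** Requests microphone and storage permissions if any are still missing. */
    static void requestIfNeeded(Activity activity) {
        for (String permission : PERMISSIONS) {
            if (ContextCompat.checkSelfPermission(activity, permission)
                    != PackageManager.PERMISSION_GRANTED) {
                ActivityCompat.requestPermissions(activity, PERMISSIONS, REQUEST_CODE);
                return; // results arrive in onRequestPermissionsResult()
            }
        }
    }
}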

This application gives two options for the source of the audio signal: an audio file or the microphone. The purpose of classifying from an audio file is to see how well the system classifies a known noise label. In the top-right corner of the screen, we placed a button labeled CSV, which bypasses the classification function and saves the extracted features as a CSV file instead. Saving the features as CSV serves further research, for example to compare our extracted features with MFCCs extracted by other libraries. If the CSV button is activated, the From Microphone radio button is disabled, since saving the features from the microphone to a CSV usually caused the application to crash due to the silence detector function, which wipes the features whenever the application detects silence.
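
The CSV export can be sketched as follows; the file layout (one frame per row, comma-separated coefficients) is an assumption for illustration rather than the application's exact format.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Locale;

final class MfccCsvWriter {
    /** Writes one second of extracted MFCC frames into the application's CSV folder. */
    static void save(float[][] mfccFrames, File csvFolder, String name) throws IOException {
        File out = new File(csvFolder, name + ".csv");
        try (FileWriter writer = new FileWriter(out)) {
            for (float[] frame : mfccFrames) {
                StringBuilder row = new StringBuilder();
                for (int i = 0; i < frame.length; i++) {
                    if (i > 0) row.append(',');
                    row.append(String.format(Locale.US, "%.6f", frame[i]));
                }
                writer.write(row.append('\n').toString());
            }
        }
    }
}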

As we can see in Figure 4.5(b), the system classifies the sound of a siren accurately. It also assigns a smaller portion of the result to traffic. This happens because the audio file also contains engine sound (probably from the ambulance), although the siren dominates the recording. During the test, we noticed a slight latency before the results showed up after we clicked the SELECT FILE button, longer than the computation time reported in Figure 4.2. This latency is caused by the plotter, which calculates and plots the graph to show the results in an interactive view. According to the debugger report, the processing time still matches the results presented in Figure 4.2.

Figure 4.5(c) shows an overview of classification from the device's microphone. It behaves similarly to classification from an audio file, except that the graph keeps changing whenever there is sound above the silence threshold. The microphone button changes into a stop button, since we added a function that lets us stop the process at any time. The application shows a classification result every second; this one-second interval is neither a delay nor latency. As mentioned in Chapter 3, we capture 102 frames every second, and this capturing duration can be changed to a shorter or longer time as desired. We chose a one-second duration for research purposes.
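
A minimal sketch of this microphone path using the TarsosDSP Android API is given below. The sample rate, frame size, and overlap are placeholders chosen only so that the hop (frame size minus overlap) yields roughly 102 frames per second; the actual values, the number of mel filters, and the filter-bank range follow Chapter 3.

import be.tarsos.dsp.AudioDispatcher;
import be.tarsos.dsp.AudioEvent;
import be.tarsos.dsp.AudioProcessor;
import be.tarsos.dsp.io.android.AudioDispatcherFactory;
import be.tarsos.dsp.mfcc.MFCC;

final class MicrophoneMfccStream {
    static AudioDispatcher start(final int numCoefficients) {
        final int sampleRate = 22050;  // placeholder
        final int bufferSize = 1024;   // placeholder
        final int overlap = 808;       // placeholder -> hop of 216 samples, ~102 frames/s

        AudioDispatcher dispatcher =
                AudioDispatcherFactory.fromDefaultMicrophone(sampleRate, bufferSize, overlap);
        final MFCC mfcc = new MFCC(bufferSize, sampleRate, numCoefficients,
                40, 300f, sampleRate / 2f);

        dispatcher.addAudioProcessor(mfcc);
        dispatcher.addAudioProcessor(new AudioProcessor() {
            @Override
            public boolean process(AudioEvent audioEvent) {
                float[] coefficients = mfcc.getMFCC(); // one MFCC vector per frame
                // accumulate ~102 frames, then hand the batch to the CNN-LSTM model
                return true;
            }

            @Override
            public void processingFinished() { }
        });
        new Thread(dispatcher, "mfcc-stream").start();
        return dispatcher;
    }
}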


We include our model file in the asset folder, together with the sample audio files. This TensorFlow Lite model is relatively small in size due to the compression performed by the TFLite Converter.
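
Loading the model follows the usual TensorFlow Lite pattern on Android, sketched below; the file name is a placeholder, and the asset is typically stored uncompressed in the APK so that it can be memory-mapped.

import android.content.Context;
import android.content.res.AssetFileDescriptor;
import org.tensorflow.lite.Interpreter;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class NoiseModelLoader {
    /** Memory-maps the converted model from the asset folder and wraps it in an Interpreter. */
    static Interpreter load(Context context) throws IOException {
        AssetFileDescriptor fd = context.getAssets().openFd("noise_classifier.tflite");
        try (FileInputStream input = new FileInputStream(fd.getFileDescriptor());
             FileChannel channel = input.getChannel()) {
            MappedByteBuffer model = channel.map(FileChannel.MapMode.READ_ONLY,
                    fd.getStartOffset(), fd.getDeclaredLength());
            return new Interpreter(model);
        }
    }
}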

4.5 Discussion

In this thesis, we surveyed other state-of-the-art systems for noise classification and compared them with the developed application. The real-time noise classifier by Alamdari et al. [15] has the advantage of not being limited to a fixed set of classes, since it uses an unsupervised classification method. However, that system uses multiple features such as band-periodicity, band-entropy, and mel-frequency spectral coefficients, which require a significant amount of computation. According to their results, the system required 26% CPU consumption on an octa-core 2.35 GHz CPU, whereas our system requires less when running on a similarly specified device, as seen in Table 4.2. Our classifier is designed not only to overcome the latency and high computation requirements but also to filter out white noise at the same time.

Supporting our proposed architecture, the dataset augmentation [30] contributes to

achieving high accuracy. Also, merging the least significant classes expanded the range

of variation in our noise classes. This merging also gives us a higher resolution for the tonal characteristics, which affects the decision on the number of MFCCs we need to use.
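
As an illustration of the kind of waveform-level augmentation meant here (the exact recipe follows [30]), the sketch below applies a random time shift and additive white noise to a raw audio buffer; the parameters are illustrative only.

import java.util.Random;

final class AudioAugmentation {
    private static final Random RNG = new Random(42);

    /** Circularly shifts the signal by a random offset of up to maxShift samples. */
    static float[] randomTimeShift(float[] signal, int maxShift) {
        int shift = RNG.nextInt(2 * maxShift + 1) - maxShift;
        float[] out = new float[signal.length];
        for (int i = 0; i < signal.length; i++) {
            out[i] = signal[Math.floorMod(i - shift, signal.length)];
        }
        return out;
    }

    /** Adds zero-mean Gaussian noise scaled by noiseLevel (e.g. 0.005). */
    static float[] addWhiteNoise(float[] signal, float noiseLevel) {
        float[] out = new float[signal.length];
        for (int i = 0; i < signal.length; i++) {
            out[i] = signal[i] + noiseLevel * (float) RNG.nextGaussian();
        }
        return out;
    }
}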

Although our system has achieved high classification accuracy, there is still room to improve it further by using a cleaner audio dataset. One method we may try is synthesizing audio, mimicking multiple variations of sound within a single class by varying the frequency and timbre.
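
A minimal sketch of such synthesis is a frequency-modulated tone that sweeps between two frequencies, roughly mimicking a siren; varying the endpoints, the sweep rate, and added harmonics would vary both frequency and timbre. All parameters below are illustrative.

final class SirenSynthesizer {
    /** Generates a tone whose instantaneous frequency oscillates between f1 and f2. */
    static float[] synthesize(int sampleRate, double seconds,
                              double f1, double f2, double sweepHz) {
        int n = (int) (sampleRate * seconds);
        float[] out = new float[n];
        double phase = 0.0;
        for (int i = 0; i < n; i++) {
            double t = (double) i / sampleRate;
            double freq = f1 + (f2 - f1) * 0.5 * (1 + Math.sin(2 * Math.PI * sweepHz * t));
            phase += 2 * Math.PI * freq / sampleRate;
            out[i] = (float) (0.8 * Math.sin(phase));
        }
        return out;
    }
}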

This work also opens the possibility of implementing the classifier in a noise-removal system, treating the received noisy sound according to the type of noise present at the moment. Since we developed the system for a smartphone, it may provide an opportunity to replace hearing-aid devices in the future.


We developed this application for research purposes only. We found several incompatibilities when we ran the application on various smartphones, and we will try to fix these issues in future development. So far, we have tested the application only on the following smartphones:

• Lenovo K6 Note

• ASUS Z01KD

• ASUS X017DA

• Sony Xperia XZs G8232

• Google Pixel 4

Chapter 5

Conclusions

We developed a real-time supervised noise classifier that runs on a smartphone and surpasses the performance of the other two architectures, achieving 92.06% accuracy while using only one audio feature (MFCC). This system serves as a stepping stone toward solving the current hearing-aid problem of eliminating annoying environmental noises, which discourages people from using such a helpful device. When the system receives an audio signal above the defined threshold, the noise classifier model performs the classification using only MFCCs as the input features. In addition, this system works with lower CPU consumption and a shorter processing time (666 ms) compared with the other architectures. Therefore, the developed system can support the further development of a noise-removal system suitable for modern smartphones.

References

[1] K. R. Borisagar, R. M. Thanki and B. S. Sedani, Speech Enhancement Techniques

for Digital Hearing Aids, Switzerland: Springer, 2019.

[2] World Health Organization, “Deafness and hearing loss,” World Health

Organization, 20 March 2019. [Online]. Available: https://www.who.int/news-

room/fact-sheets/detail/deafness-and-hearing-loss. Accessed on: September 16,

2019.

[3] A. McCormack and H. Fortnum, “Why do people fitted with hearing aids not wear

them?,” International journal of audiology, vol. 52, no. 5, pp. 360-368, 2013.

[4] Å. Skagerstrand, S. Stenfelt, S. Arlinger and J. Wikström, “Sounds perceived as

annoying by hearing-aid users in their daily soundscape,” International Journal of

Audiology, vol. 53, no. 4, pp. 259-269, 2014.

[5] C.-Y. Chang, A. Siswanto, C.-Y. Ho, T.-K. Yeh, Y.-R. Chen and S. M. Kuo,

“Listening in a Noisy Environment: Integration of active noise control in audio

products,” IEEE Consumer Electronics Magazine, vol. 5, no. 4, pp. 34-43, 2016.

[6] AUDI-LAB, “The Advantages & Disadvantages of Hearing Aid Types,” AUDI-

LAB, [Online]. Available: https://www.audi-lab.com/types-of-hearing-aids-

advantages-disadvantages. Accessed on: January 20, 2020.

[7] National Institute of Deafness and Other Communication Disorders (NIDCD),

“Hearing Aids,” National Institutes of Health, Bethesda, 2013.

[8] Y.-H. Lai, Y. Tsao, X. Lu, F. Chen, Y.-T. Su, K.-C. Chen, Y.-H. Chen, L.-C. Chen,

L. Li and C.-H. Lee, “Deep Learning–Based Noise Reduction Approach to Improve

Speech Intelligibility for Cochlear Implant Recipients,” Ear and Hearing, vol. 39,

p. 1, 2018.

[9] F. Saki and N. Kehtarnavaz, “Background Noise Classification using Random

Forest Tree Classifier for Cochlear Implant Applications,” in 2014 IEEE

International Conference on Acoustic, Speech and Signal Processing (ICASSP),

2014.

[10] Z. Alavi and B. Azimi, “Application of Environment Noise Classification towards

Sound Recognition for Cochlear Implant Users,” in 2019 6th International

Conference on Electrical and Electronics Engineering (ICEEE), 2019.

[11] J. Singh and R. Joshi, “Background Sound Classification in Speech Audio

Segments,” in 2019 International Conference on Speech Technology and Human-

Computer Dialogue (SpeD), Timisoara, 2019.

[12] S. U. Hassan, M. Z. Khan, M. U. Ghani Khan and S. Saleem, “Robust Sound

Classification for Surveillance using Time Frequency Audio Features,” in 2019

International Conference on Communication Technologies (ComTech), 2019.

[13] J. Sang, S. Park and J. Lee, “Convolutional Recurrent Neural Networks for Urban

Sound Classification using Raw Waveforms,” in 2018 26th European Signal

Processing Conference (EUSIPCO), Rome, 2018.

[14] X. Lu, Y. Tsao, S. Matsuda and C. Hori, “Speech enhancement based on deep

denoising Auto-Encoder,” Proc. Interspeech, pp. 436-440, 2013.

[15] N. Alamdari and N. Kehtarnavaz, “A Real-Time Smartphone App for Unsupervised

Noise Classification in Realistic Audio Environments,” in 2019 IEEE International

Conference on Consumer Electronics (ICCE), Las Vegas, 2019.

[16] F. Saki, A. Sehgal, I. Panahi and N. Kehtarnavaz, “Smartphone-based Real-time

Classification of Noise Signals Using Subband Features and Random Forest

Classifier,” in 2016 IEEE International Conference on Acoustics, Speech and

Signal Processing (ICASSP), Shanghai, 2016.

[17] S. Hyun, I. Choi and N. K. Soo, “Acoustic Scene Classification Using Parallel Combination of LSTM and CNN,” in Detection and Classification of Acoustic Scenes and Events 2016, Budapest, 2016.

[18] A. Samal, D. Parida, M. R. Satapathy and M. N. Mohanty, “On the Use of MFCC

Feature Vector Clustering for Efficient Text Dependent Speaker Recognition,” in

Proceedings of the International Conference on Frontiers of Intelligent Computing:

Theory and Applications (FICTA) 2013, 2014.

[19] T. Ganchev, N. Fakotakis and K. George, “Comparative evaluation of various

MFCC implementations on the speaker verification task,” in 10th International

Conference on Speech and Computer (SPECOM 2005), 2005.

[20] M. Xu, L.-Y. Duan, J. Cai, L.-T. Chia, C. Xu and Q. Tian, “HMM-Based Audio

Keyword Generation,” in 5th Pacific Rim Conference on Multimedia, Tokyo, 2004.

[21] C. B. Jacoby, “Automatic Urban Sound Classification Using Feature Learning

Techniques,” New York University, New York, 2014.

[22] J. Salamon, C. Jacoby and J. P. Bello, “A Dataset and Taxonomy for Urban Sound

Research,” in MM ’14: Proceedings of the 22nd ACM international conference on

Multimedia, Orlando, 2014.

[23] M. Sahidullah and G. Saha, “Design, analysis and experimental evaluation of block

based transformation in MFCC computation for speaker recognition,” Speech

Communication, vol. 54, no. 4, pp. 543-565, 2012.

[24] F. Zheng, G. Zhang and Z. Song, “Comparison of different implementations of

MFCC,” Journal of Computer Science and Technology, vol. 16, no. 6, pp. 582-589,

2001.

[25] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural

Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[26] J. Dai, S. Liang, W. Xue, C. Ni and W. Liu, “Long short-term memory recurrent

neural network based segment features for music genre classification,” in 2016 10th

International Symposium on Chinese Spoken Language Processing (ISCSLP),

Tianjin, 2016.

[27] A. Dang, T. H. Vu and J.-C. Wang, “Acoustic scene classification using

convolutional neural networks and multi-scale multi-feature extraction,” in 2018

IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, 2018.

[28] J. Six, Digital Sound Processing and Java - Documentation for the TarsosDSP

Audio Processing Library, Belgium: IPEM, 2015.

[29] E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis and X. Serra, “FSDKaggle2019,”

Zenodo, New York, 2020.

[30] P. Corcoran, C. Costacke, V. Varkarakis and J. Lemley, “Deep Learning for

Consumer Devices and Services 3—Getting More From Your Datasets With Data

Augmentation,” IEEE Consumer Electronics Magazine, vol. 9, no. 3, pp. 48-54,

2020.

[31] S. K. Kumar, “On weight initialization in deep neural networks,” ArXiv, vol.

abs/1704.08863, 2017.

[32] D. P. Kingma and J. L. Ba, “ADAM: A Method for Stochastic Optimization,” in

International Conference on Learning Representation, 2014.

[33] L. D. Shapiro, “Boston Chapter Attuned to Hearing Health Care [Society News],”

IEEE Consumer Electronics Magazine, vol. 8, no. 3, pp. 5-6, 2019.

