
Group 17

EURECOM AML 2022: Challenge 2


Anomalous Sound Detection
CHALLENGE:

How to solve it?


1. Data exploration
Introduction
Waveform
Spectrograms
Chromagram
Mel scale
2. Model selection: Autoencoder
CNN
Batch consideration
3. Model performance
Data exploration: introduction
Each recording is a single-channel audio clip of approximately 10 seconds that
includes both a target machine's operating sound and environmental noise.

The data come from ToyADMOS and the MIMII Dataset, which consist of the
normal/anomalous operating sounds of six types of toy/real machines.

We will work only on the Slide rail.


Data exploration: introduction
Other types of toy/real machines:
Toy-car (ToyADMOS)
Toy-conveyor (ToyADMOS)
Valve (MIMII Dataset)
Pump (MIMII Dataset)
Fan (MIMII Dataset)
Slide rail (MIMII Dataset)
Data exploration: introduction
The audio files are time-series data, so they give information about the
amplitude of the sound over time.

To visualize the sequence, the waveform of these signals is plotted in the
next slide.
Data exploration: waveform
However, in deep learning models the common practice is to convert the
audio into a spectrogram, which is:
a concise snapshot of an audio wave;
an image → well suited as input to CNN-based architectures
developed for handling images.

The spectrograms of the two signals (normal and anomalous) are shown in the
next two slides.
Data exploration: spectrogram
Data exploration: chromagram
Another way to get information about the differences between the two kinds of
signals is the chromagram.

It describes perceptual 'differences'/'distances' between pitches within an
octave, and the perceptual sameness of pitches separated by one or more full
octaves.

Pitch can be understood as the relative highness/lowness of a sound: the
higher the sound, the higher the pitch, and the lower the sound, the lower
the pitch.
Data exploration: chromagram
While normal sounds produce a chromagram that spans all notes, the
chromagrams of anomalous sounds focus almost exclusively on notes B and G.

This confirms that anomalous sounds are characterized by higher frequencies.
Data exploration: Mel scale
Humans are better at detecting differences in lower frequencies than in higher
frequencies.
For example, we can easily tell the difference between 500 and 1,000 Hz,
but we will hardly be able to tell the difference between 10,000 and 10,500 Hz.

The mel scale is a scale of pitches judged by listeners to be equally distant
from one another.

We can use the librosa library to produce a linear transformation matrix (a
mel filter bank), then use it to plot a new spectrogram (next two slides).
Data exploration: Mel scale
Data exploration: test vs train
In this challenge it is important to take into consideration that the training
set is composed only of normal sounds, while the test set also contains
anomalous sounds, as can be seen from the spectrograms.
Model selection: Autoencoder
Two flavours:
Autoencoder
CNN Autoencoder
Model selection: Autoencoder
In this challenge the difficulty lies in learning from unlabeled data.

Fortunately, this can be solved through the use of an autoencoder.

An autoencoder is an unsupervised artificial neural network that learns
how to efficiently compress and encode data,
then learns how to reconstruct the data back from the reduced
encoded representation to a representation that is as close to the
original input as possible.
Model selection: Autoencoder
The autoencoder method works very well for anomaly detection because the
encoding operation relies on the correlated features of normal data to
compress it; an anomalous input breaks these correlations, so it is
reconstructed poorly and yields a high reconstruction error.
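This scoring idea can be sketched as follows; the tiny dense network and the 64-dimensional input frames are purely illustrative assumptions, not the model used in the challenge:

```python
import torch
import torch.nn as nn

# Minimal sketch (not the team's actual network): an autoencoder trained on
# normal frames; at test time, a high reconstruction error flags an anomaly.
model = nn.Sequential(
    nn.Linear(64, 16), nn.ReLU(),   # encoder: compress to a small code
    nn.Linear(16, 64),              # decoder: reconstruct the input
)

x = torch.randn(8, 64)                  # a batch of hypothetical 64-dim frames
recon = model(x)
score = ((x - recon) ** 2).mean(dim=1)  # per-sample anomaly score (MSE)
print(score.shape)  # torch.Size([8])
```

In practice one thresholds `score`: samples above the threshold are declared anomalous.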
Model selection: Autoencoder
As mentioned before, we will use spectrograms in our CNN AE.

To get the CNN to process them, we defined and fixed some parameters of
the spectrogram:
n_mels;
n_fft;
hop_length.

That way, we were able to get as much information as possible.


Model selection: Autoencoder
But why are convolutional autoencoders suitable for image data?

Instead of stacking the data, convolutional autoencoders keep the spatial
information of the input image data as it is,
and extract information in what is called the convolution layer.


Model selection: Autoencoder
Noise reduction
The idea is to train a model with noisy data as the inputs and the
corresponding clean data as the outputs. Here we can see a normal signal:
Model selection: Autoencoder
Noise reduction
While here we have an anomalous signal:
Model selection: Autoencoder
It involves the:
Encoder
Conv2d (Convolutional layer)
BatchNorm2d (Batch normalization)
ReLU (Activation layer)
Latent space
Decoder
ConvTranspose2d (Transposed Convolutional layer)
BatchNorm2d (Batch normalization)
ReLU (Activation layer)
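The layer recipe above can be sketched in PyTorch as follows; the channel counts, kernel sizes, and input resolution are illustrative assumptions, not the team's exact configuration:

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: Conv2d -> BatchNorm2d -> ReLU, downsampling with stride 2.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),  # the encoder output is the latent space
        )
        # Decoder: ConvTranspose2d -> BatchNorm2d -> ReLU, mirroring the encoder.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(4, 1, 64, 64)   # batch of single-channel spectrogram patches
out = ConvAE()(x)
print(out.shape)  # same spatial size as the input
```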
Model selection: Autoencoder
The encoding process compresses the input values to get to the core layer.
The decoding process reconstructs the information to produce the
outcome.
The decoding process mirrors the encoding process.
Model selection: Autoencoder
Noise reduction: convolutional layers
The convolution creates many small pieces called feature maps or features;
These preserve the relationship between pixels in the input image.
After scanning through the original image, each feature produces a
filtered image with high scores and low scores:

perfect match → high score in that square;

low match or no match → low or zero score.
Hyperparameters:
Padding
Strides
Model selection: Autoencoder
Noise reduction: convolutional layers hyperparameters
Padding:
Controls the kernel size and the output size in an "independent" way.
Without padding, border regions of the input image receive "less
attention" than central parts.
In some cases we would like to preserve the size of the input; if we use
a convolutional filter without padding, the output of the filter will be
smaller than the input image.
One way to solve this is to add some fake pixels around the border that
contribute to the computation.
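A small PyTorch illustration of this size effect (the 28×28 input is an arbitrary toy size):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)   # a toy single-channel image

# Without padding, a 3x3 kernel shrinks the output by 2 in each dimension...
no_pad = nn.Conv2d(1, 1, kernel_size=3, padding=0)
print(no_pad(x).shape)    # (1, 1, 26, 26)

# ...while padding=1 adds a ring of "fake pixels" and preserves the size.
same_pad = nn.Conv2d(1, 1, kernel_size=3, padding=1)
print(same_pad(x).shape)  # (1, 1, 28, 28)
```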
Model selection: Autoencoder
Noise reduction: convolutional layers hyperparameters
Padding:
Model selection: Autoencoder
Noise reduction: convolutional layers hyperparameters
Strides:
The idea is to skip some positions of the kernel as it slides over the
image:
useful to reduce the computational cost;
makes the extraction of features coarser, less fine-grained.
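A short illustration of the stride effect, again with an arbitrary toy input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)   # a toy single-channel image

# stride=2 skips every other kernel position: the output is roughly halved
# in each spatial dimension, at about a quarter of the compute.
conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
print(conv(x).shape)  # (1, 1, 14, 14)
```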
Model selection: Autoencoder
Noise reduction: ReLUs
Rectified Linear Unit (ReLU);

This step is the same as in typical neural networks;

It rectifies any negative value to zero so as to guarantee the math will
behave correctly.
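A one-line illustration of the rectification:

```python
import torch

# ReLU simply clamps negative values to zero, element-wise.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(torch.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000])
```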
Model selection: Autoencoder
Batch normalization
Batch normalization is important, especially with networks that have no
shortcut connections, because it helps to smooth out the geometry of the
loss function at the expense of some computation:

It makes the loss landscape significantly smoother;

It allows a larger range of learning rates;

It gives faster convergence.
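A quick illustration of the normalization effect (the tensor shapes are arbitrary):

```python
import torch
import torch.nn as nn

# Batch normalization rescales each channel to roughly zero mean and unit
# variance over the mini-batch (then applies a learnable affine map).
bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 16, 16) * 5 + 10   # badly scaled activations
y = bn(x)
print(y.mean().item(), y.std().item())   # close to 0 and 1
```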
Model selection: Autoencoder
Batch size
When we do optimization we could consider the whole dataset at every step,
but this can be very costly. Instead we can consider a portion of the data:
this is what we call a mini-batch.

Small batches promote flatness of the loss.

Flat minimizers correlate well with smaller test error.

Smaller test error is a good proxy for good generalization.
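A minimal sketch of mini-batching with PyTorch's DataLoader (sizes are arbitrary):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Mini-batching: the optimizer sees a small slice of the data per step,
# instead of the whole dataset at once.
data = TensorDataset(torch.randn(100, 64))
loader = DataLoader(data, batch_size=16, shuffle=True)

(batch,) = next(iter(loader))
print(batch.shape)  # one mini-batch of 16 samples, not all 100
print(len(loader))  # number of optimization steps per epoch: ceil(100 / 16)
```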


Model performances
We tried three flavours of autoencoder:
Vanilla Autoencoder
CNN Autoencoder
CNN Autoencoder with batch normalization and mini-batches

Considering what we said before, it is not surprising that the CNN
Autoencoder with batch normalization is the one that performed best.

Accuracy:
Autoencoder: 76%
CNN Autoencoder: 88%
CNN Autoencoder + Batch Normalization: 92%
THE TEAM

Nour Thlijani Giulio Corallo Yash Agarwalla Valentina Lonardo
