
Visualize the training of a time-distributed convolution
Abstract :
Nowadays, we commonly use neural networks to solve our problems. Unfortunately, despite their effectiveness, they are often perceived as black boxes capable of providing an answer to a problem without explaining their decisions, which raises many ethical and legal problems. Fortunately, there is a field called explainability that helps us understand these results. This aspect of machine learning allows us to understand the decision process of a model and gives us the possibility to verify the relevance of its results. In this article, we focus on the learning performed by a time-distributed convRNN carrying out anomaly detection on video data.
Introduction :
Currently, neural networks are quite complex and opaque structures, which makes them difficult to debug or improve when performance does not meet our goals. We perform real-time anomaly detection in videos. We consider as an anomaly any action that may have consequences for the safety or security of the people present in the video, such as fights, gunshots, accidents, etc.; this can be seen as a form of motion detection or action recognition. For this, we use a time-distributed convRNN processing image sequences. This model is composed of VGG19 for the convolution part and a GRU for the RNN part. This architecture has already been the subject of a French publication [0].
Figure: Model architecture.
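Below is a minimal Keras sketch of this kind of architecture, given for orientation only: the sequence length, frame size, GRU width, dropout rate and layer name are illustrative assumptions, not the exact values of the published model.

    import tensorflow as tf
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG19

    SEQ_LEN, H, W, C = 16, 224, 224, 3   # frames per sequence and frame size (assumed values)

    # Convolutional sub-model applied to every frame of the sequence.
    cnn = VGG19(include_top=False, weights="imagenet", input_shape=(H, W, C))

    inputs = layers.Input(shape=(SEQ_LEN, H, W, C))
    x = layers.TimeDistributed(cnn, name="td_vgg19")(inputs)        # (SEQ_LEN, 7, 7, 512)
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)  # (SEQ_LEN, 512)
    x = layers.GRU(64)(x)                                           # temporal part (size assumed)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(2, activation="softmax")(x)              # normal / anomaly

    model = models.Model(inputs, outputs)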

In the data preparation step, we trimmed the videos containing an anomaly in order to keep only the anomalous part. This is not the case for the videos we are trying to predict: in a video, several different actions can take place, and in our case we can find normal sequences before and after the anomaly sequences. In addition, not all videos have the same duration or the same number of frames per second (fps). As a result, the actions captured can be very different depending on the step chosen between the extracted frames: a very small step will form a sequence summarizing a short action or an action fragment, whereas a very large step will form a sequence summarizing the entire action or even several different actions, as sketched below.
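As an illustration, here is a minimal OpenCV sketch of such frame extraction; the function name, the step of 5 and the sequence length of 16 are assumptions chosen for the example.

    import cv2

    def extract_sequence(video_path, step=5, seq_len=16):
        """Keep one frame every `step` frames until `seq_len` frames are collected."""
        cap = cv2.VideoCapture(video_path)
        frames, idx = [], 0
        while len(frames) < seq_len:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                frames.append(cv2.resize(frame, (224, 224)))  # resize to the model input size
            idx += 1
        cap.release()
        return frames

A larger step covers a longer portion of the video with the same number of frames, which is exactly why the extracted sequence can summarize anything from an action fragment to several actions.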
As for a model using images, it is possible to visualize the input data to ensure that it contains the action or the movement that we want to identify. This visualization only allows us to check that the predicted class is the correct one; it gives no other indication.
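Such a check can be as simple as displaying the frames of a sequence side by side; the following small sketch (assuming matplotlib and BGR frames coming from OpenCV) is one way to do it.

    import cv2
    import matplotlib.pyplot as plt

    def show_sequence(frames):
        """Display the frames of one input sequence side by side."""
        fig, axes = plt.subplots(1, len(frames), figsize=(2 * len(frames), 2))
        for ax, frame in zip(axes, frames):
            ax.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV frames are BGR
            ax.axis("off")
        plt.show()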
In projects using convolutional networks that process images, it is common to visualize the characteristics perceived by the model in order to ensure the relevance of the learning. But what about video data? Given that a video can be seen as a list of images, is it possible to apply such a procedure to models using time-distributed convolutions? This is the question we answer in this article. To do so, we start by listing different visualization techniques, then detail our procedure, and finally present a set of results before concluding.
Related work :
In 2019, Christoph Molnar published a book titled "Interpretable Machine Learning: A Guide for Making Black Box Models Explainable", highlighting various explainability and visualization techniques [1]. On the one hand, we find techniques independent of the model used, such as LIME (Local Interpretable Model-Agnostic Explanations), proposed on August 9, 2016 by Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin [2], or SHAP (SHapley Additive exPlanations), proposed in November 2017 by Scott M. Lundberg and Su-In Lee [3]. On the other hand, we have techniques specific to certain models, such as convolutional networks, for which we find visualizations like convolution filters [4], saliency maps [5,6] and activation maps [7,8,9], etc.
Many libraries exist to visualize these characteristics. In 2017, Raghavendra Kotikalapudi proposed keras-vis [10], a public library for visualizing the convolution filters of each layer, their evolution during training, and activation maps. Later, in 2020, Philippe Remy developed another library called keract [11] to perform similar processing. Also in 2020, Gotkowski, Karol et al. proposed another one allowing both 2D and 3D attention maps to be visualized [12]. Nowadays, the Keras library developed by François Chollet in 2015 [13] also includes some of these visualization techniques.
Approach :
Our model has an architecture that could be described as a binary convRNN because it has only 2 classes: the normal class, representing the absence of a problem, and the anomaly class (either a fight or a shooting). In terms of explainability, it is much easier to interpret the features learned by our convolution because they can be visualized, unlike those of our RNN. However, CNN-specific visualization libraries do not seem adapted to our network type. The first problem concerns our architecture.
Architecture comparison: 2D convolution / 3D convolution / time-distributed convolution
In the case of a model using 2D or 3D convolution layers, all layers are directly connected. However, in our architecture we use a time-distributed convolution, so we have a sub-model inside this layer, a sub-model that is only indirectly connected to the other layers, which poses some problems when propagating information.
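In Keras, the wrapped sub-model can still be reached through the TimeDistributed wrapper's `layer` attribute, as in this small sketch; it reuses the `model` and the layer name "td_vgg19" from the earlier architecture sketch, both of which are assumptions.

    # The convolution layers live inside the TimeDistributed wrapper, so they do not
    # appear directly in model.summary(); the wrapped sub-model is reached like this.
    sub_model = model.get_layer("td_vgg19").layer   # the VGG19 sub-model applied to each frame
    sub_model.summary()                             # inspect its convolution layers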
The second problem concerns the third dimension added by the time factor. Unlike models processing images or 3D objects, for which there is a single input datum and a single output visualization, here we are processing videos, so we have several input images but a single output visualization representing the whole of our input sequence. We wondered whether it was possible to produce a visualization for each image in order to understand the processing carried out by our sub-model. Our objective is to visualize the areas on which the model focuses its attention in order to predict an anomaly. For this we mainly use saliency maps as well as class activation maps.
To perform this kind of visualization, we need to propagate information through our network in order to obtain the final activation. This allows us to produce saliency maps by computing the gradient of this activation with respect to our input data, or activation maps by computing the gradient of this activation with respect to the output of the layer that we want to visualize.
Unfortunately, the use of a time-distributed layer makes propagation between the sub-model and the main model difficult. Moreover, extracting the sub-model would cut the connection with the following layers.
To produce the saliency maps, we therefore have to compute our gradient from a sequence and not from a single image. The result has the same dimensions as the input data, that is to say we obtain a list of gradients equal in length to our sequence. Then we just need to display these different gradients to obtain one saliency map per image, as sketched below.
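A minimal sketch of this per-frame saliency computation with tf.GradientTape is given below; the class index and the reduction over colour channels are assumptions made for the example.

    import numpy as np
    import tensorflow as tf

    def sequence_saliency(model, sequence, class_idx=1):
        """Gradient of the chosen class score w.r.t. the whole input sequence:
        the result contains one 2D saliency map per frame."""
        seq = tf.convert_to_tensor(sequence[np.newaxis, ...], dtype=tf.float32)
        with tf.GradientTape() as tape:
            tape.watch(seq)
            score = model(seq, training=False)[0, class_idx]   # e.g. the anomaly class
        grads = tape.gradient(score, seq)[0]                   # shape: (seq_len, H, W, C)
        return tf.reduce_max(tf.abs(grads), axis=-1).numpy()   # one map per frame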
Concerning the activation maps, the only solution is to use the output of the time-distributed layer. As previously explained, this layer adds a time factor to our data by allowing us to perform the same processing for each image making up our sequence, in our case applying VGG19 to each of them. The output of this layer can be viewed as a list of results containing one output per image. Our gradient has the same dimensions as the output of this layer. It is then possible to apply each gradient to the corresponding output to obtain our heatmap, and then project it onto the associated image.
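The following sketch illustrates this Grad-CAM-like computation on the output of the time-distributed layer; the layer name "td_vgg19" and the class index are assumptions carried over from the earlier sketches.

    import numpy as np
    import tensorflow as tf

    def sequence_gradcam(model, sequence, td_layer_name="td_vgg19", class_idx=1):
        """Grad-CAM computed on the output of the time-distributed layer:
        one heatmap per frame of the sequence."""
        grad_model = tf.keras.Model(model.inputs,
                                    [model.get_layer(td_layer_name).output, model.output])
        seq = tf.convert_to_tensor(sequence[np.newaxis, ...], dtype=tf.float32)
        with tf.GradientTape() as tape:
            td_out, preds = grad_model(seq, training=False)   # td_out: (1, seq, h, w, channels)
            score = preds[0, class_idx]
        grads = tape.gradient(score, td_out)                  # same shape as td_out
        weights = tf.reduce_mean(grads, axis=(2, 3))          # one weight per channel and per frame
        cams = tf.nn.relu(tf.reduce_sum(
            td_out * weights[:, :, tf.newaxis, tf.newaxis, :], axis=-1))
        return cams[0].numpy()                                # (seq, h, w): one heatmap per frame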
Experiments :
In this section, we present our results and the advantages and drawbacks of each approach explained above.
Figure: Saliency maps for a shooting video extracted from the test set.
This technique allows us to produce saliency maps using approaches like vanilla gradients or SmoothGrad, with or without guided gradient propagation, and using various gradients (positive, negative, etc.).
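As an example, a generic SmoothGrad variant can be obtained by averaging the vanilla saliency over noisy copies of the sequence, as in this sketch building on the sequence_saliency function above; the noise level and sample count are assumptions.

    import numpy as np

    def smoothgrad_saliency(model, sequence, class_idx=1, n_samples=20, noise=0.1):
        """Average the vanilla saliency over several noisy copies of the sequence."""
        sigma = noise * (sequence.max() - sequence.min())
        maps = [sequence_saliency(model,
                                  sequence + np.random.normal(0.0, sigma, sequence.shape),
                                  class_idx)
                for _ in range(n_samples)]
        return np.mean(maps, axis=0)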

Figure: Class activation maps (Grad-CAM) for a shooting video extracted from the test set.
With this kind of architecture it is only possible to produce activation maps for the last layer of the sub-model. The first thing we noticed is that from one image to the next the model does not focus on the same areas, even though the images are successive. In order to make our sequences easier to read, we extracted the contours of these heatmaps using OpenCV, as sketched below.
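A minimal OpenCV sketch of this contour extraction is shown below; the threshold of 0.5 and the normalization step are assumptions for the example.

    import cv2
    import numpy as np

    def heatmap_contours(heatmap, frame, threshold=0.5):
        """Threshold a per-frame heatmap and draw the resulting contours on the frame."""
        hm = cv2.resize(heatmap.astype(np.float32), (frame.shape[1], frame.shape[0]))
        hm = (hm - hm.min()) / (hm.max() - hm.min() + 1e-8)   # normalize to [0, 1]
        mask = (hm >= threshold).astype(np.uint8) * 255
        contours, _ = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
        return cv2.drawContours(frame.copy(), contours, -1, (0, 255, 0), 2)

Keeping the full contour hierarchy (RETR_TREE) is what produces the nested contours mentioned below when a major activation zone is surrounded by a minor one.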
Figure: Example of visualization for the shooting and fight classes.
This new visualization allowed us to notice areas of low activation that were difficult to perceive using the Grad-CAM method. On the other hand, it also has some drawbacks: the contour detection is not very precise, and contours can end up nested inside others when a major activation zone is surrounded by a minor one.
Figure: Activation map and contour visualization on an image extracted from the shooting-class test video.
Figure: Activation map and contour visualization on an image extracted from the fight-class validation video.
These images perfectly illustrate the advantages and drawbacks of this technique. In the first image we can notice that the gun is perceived as a weak activation zone, difficult to see on the activation map but very clear thanks to the contours. We can also notice a major activation zone surrounded by a minor zone on the left of the image. The second image represents a person slapping another. With the activation map it seems that the model perceives the action but is mistaken about the label; however, by visualizing the contours we realize that it completely misses the action. At present, this contour visualization does not allow us to know the intensity of the activations.
Thanks to this visualization, we were also able to observe the influence of the other layers (RNN, Dense, Dropout) on the characteristics learned by our convolution, an influence induced by the backpropagation carried out in this type of model. This allows us to better tune the parameters of these layers.
Figure: Activation maps of the penultimate convolution layer when varying the number of neurons in the GRU layer.
Figure: Activation maps showing the effect of varying the dropout rate.
Using these visualizations it is also possible to examine the characteristics of the normal class. In our binary model, the normal class represents the absence of anomaly. In this class we find a very varied set of actions such as working, walking, doing sports, etc. We can note that for this class, the movements carried out are generally slow, unlike an anomaly, which involves fast and sudden movement. For a human, this class would be chosen by default if no characteristics corresponding to an anomaly are found. However, this is not the behavior that our model adopts: to predict the normal class, it needs to find characteristics representing it.
Figure: Visualization of a normal video extracted from the test set, representing a location.
Figure: Four convolution filters for layers 1 and 2.
CONCLUSION:
Through this article we have shown that it is possible to apply some visualization techniques specific to CNN models (models processing images) to convRNN networks using a time-distributed convolution (models processing videos). As a next step, it would be interesting to improve the contour visualization so that it represents the intensity of each activation zone, and to develop visualization techniques specific to video data.
With the advancement of artificial intelligence, explainability has become a major issue that generates great interest in the scientific community. Due to this interest, many works are carried out each year, particularly concerning convolutional neural networks or other models using image data [14,15,16,17]. This research allows us to clarify and improve our understanding of this domain.
