
Description of the Architecture of the Solution
The first neural network is a convolutional neural network whose purpose is to
extract high-level features from the images and reduce the complexity of the
input. We use a pre-trained model called Inception, developed by Google.
Inception-v3 is trained on the ImageNet Large Scale Visual Recognition
Challenge dataset. This is a standard task in computer vision, where models try
to classify entire images into 1,000 classes such as "zebra," "dalmatian," and
"dishwasher."

We used this model to apply the technique of transfer learning. Modern object-
recognition models have millions of parameters and can take weeks to train
fully. Transfer learning avoids much of this work by taking a model already
trained on a set of categories such as ImageNet and retraining it from the
existing weights for new classes.
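
As a rough illustration, the sketch below loads a pre-trained Inception-v3 as a
frozen feature extractor with tf.keras. The framework choice and the use of
include_top=False with pooling="avg" are assumptions made here for
illustration, not details stated in the original text.

```python
# Sketch only: load Inception-v3 pre-trained on ImageNet and freeze it so
# it can be reused as a feature extractor (transfer learning).
import tensorflow as tf

# include_top=False removes the 1,000-class ImageNet classifier head;
# pooling="avg" exposes the 2,048-value output of the last pooling layer.
feature_extractor = tf.keras.applications.InceptionV3(
    weights="imagenet",
    include_top=False,
    pooling="avg",
)
feature_extractor.trainable = False  # keep the pre-trained weights fixed
```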

Figure 1: Inception model

The second neural network is a recurrent neural network whose purpose is to
make sense of the sequence of actions portrayed. This network has an LSTM cell
in the first layer, followed by two hidden layers (one with 1,024 neurons and
ReLU activation, the other with 50 neurons and sigmoid activation); the output
layer has three neurons with softmax activation, which gives us the final
classification.
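
A minimal sketch of this classifier follows, assuming tf.keras. The LSTM width
of 256 units is an assumption (the text fixes only the dense-layer sizes and
activations), and the 15 x 2,048 input shape comes from the methodology
described below.

```python
# Sketch only: recurrent classifier with an LSTM first layer, two dense
# hidden layers (1,024 ReLU and 50 sigmoid) and a 3-class softmax output.
import tensorflow as tf
from tensorflow.keras import layers

rnn_classifier = tf.keras.Sequential([
    layers.Input(shape=(15, 2048)),          # 15 frame feature maps per segment
    layers.LSTM(256),                        # LSTM width is an assumption
    layers.Dense(1024, activation="relu"),   # first hidden layer
    layers.Dense(50, activation="sigmoid"),  # second hidden layer
    layers.Dense(3, activation="softmax"),   # final three-class output
])
rnn_classifier.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```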

Figure 2: Recurrent neural network

Methodology
The first step is to extract the frames of the video. We extract a frame every 0.2
seconds and run it through the Inception model. Since we are using transfer
learning, we do not take the final classification of the Inception model. Instead,
we extract the result of the last pooling layer, which is a vector of 2,048 values
(a high-level feature map). At this point we have a feature map for a single
frame, but we want to give our system a sense of the sequence. To do so, we do
not base the final prediction on single frames; we take a group of frames so as
to classify not a frame but a segment of the video.
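
A sketch of this per-frame feature extraction step is given below. It assumes
OpenCV for reading the video and the tf.keras Inception-v3 from the earlier
sketch, and the helper name extract_frame_features is illustrative; only the
0.2-second sampling rate and the 2,048-value feature vector come from the text.

```python
# Sketch only: sample one frame every 0.2 s and turn each sampled frame
# into a 2,048-value feature vector with the frozen Inception-v3 model.
import cv2
import numpy as np
import tensorflow as tf

def extract_frame_features(video_path, feature_extractor, step_seconds=0.2):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_step = max(1, int(round(fps * step_seconds)))
    features, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_step == 0:
            # Inception-v3 expects 299x299 RGB input
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            rgb = cv2.resize(rgb, (299, 299)).astype(np.float32)
            batch = tf.keras.applications.inception_v3.preprocess_input(rgb[None])
            features.append(feature_extractor.predict(batch, verbose=0)[0])
        index += 1
    cap.release()
    return np.array(features)  # shape: (num_sampled_frames, 2048)
```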

We consider that analyzing three seconds of video at a time is enough to make a
good prediction of the activity happening at that moment. For this, we store
fifteen feature maps generated by the Inception model, the equivalent of three
seconds of video. We then concatenate this group of feature maps into a single
pattern, which becomes the input of our second neural network, the recurrent
one, to obtain the final classification of our system.
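
A minimal sketch of this grouping step follows, reusing the illustrative
extract_frame_features output and rnn_classifier from the earlier sketches
(all names are hypothetical). Note that it keeps the fifteen feature maps as a
15 x 2,048 sequence for the LSTM rather than flattening them, which is one
possible reading of "a single pattern."

```python
# Sketch only: stack fifteen consecutive feature maps (three seconds of
# video) into one pattern and classify the segment with the recurrent net.
import numpy as np

def classify_segments(frame_features, rnn_classifier, window=15):
    """frame_features: array of shape (num_frames, 2048) from Inception-v3."""
    predictions = []
    for start in range(0, len(frame_features) - window + 1, window):
        segment = frame_features[start:start + window]             # (15, 2048)
        probs = rnn_classifier.predict(segment[None], verbose=0)   # (1, 3)
        predictions.append(int(np.argmax(probs[0])))               # class index
    return predictions
```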
