AI Video Summarization Project Overview

The objective of this project was to take a video as input and summarize it in text form. We did this using pre-trained models for object detection, activity detection, and background detection, and finally fed the outputs of these models to another pre-trained model to generate a summary. All of these models are described in the slides.


Video Summarization

By:
Yash Funde (230940128010)
Pratik Avhad (230940128021)
Prerit Agarwal (230940128022)
Sahibpreet Singh (230940128026)

Guided by: Mr. Pramod Sharma
Contents

● Introduction
● Objective
● Overview
● Algorithm Flow
● Results
● Conclusion
● Challenges
● Future work
● References
Introduction

Video summarization is a critical area of research in computer vision and multimedia analysis, aimed at describing what is happening in a video in simple English. Many researchers have approached this objective by extracting the audio track and then applying speech-to-text techniques, but little work has been done on generating text directly from video frames without using any audio. This project is a simple attempt to achieve this objective of generating simple English sentences directly from video frames.
Objective

The project aims to deliver a software program that exploits the latest OpenCV libraries, transfer-learning techniques, and some openly available, accurate pretrained models. The program works towards determining the activities occurring in a video by extracting a sequence of frames and using sequence-to-sequence models to make predictions. In addition, it uses fairly accurate, openly available pretrained models for object detection (YOLO) and background detection on the video frame images, and finally a text generation model that produces meaningful sentences from the hint words output by the aforementioned models.
Overview
- The project involves creating a custom dataset and fine-tuning one of the best pre-trained architectures for detecting real-time activities (R(2+1)D), together with a background detection model (VGG-16), an object detection model (YOLO), and a pretrained sentence generation model (Key2Text).
- Creating a custom dataset became important because the pre-trained activity recognition model (R(2+1)D) was trained on the Kinetics dataset, so it could only classify similar kinds of videos. It could not classify normal, real-life activities such as walking, jumping, running, or eating.
Dataset Preparation
Dataset preparation is a crucial step in developing and evaluating effective algorithms. The choice and
curation of datasets significantly impact the performance and generalizability of models.

We built a custom dataset starting with basic single-activity videos such as a person walking, running, eating, and exercising. Twenty videos of each category were collected, and the model was fine-tuned on this dataset. The dataset was prepared keeping the following points in mind:

- Diversity of content: different poses, backgrounds, styles, ages, and genders to ensure the model's ability to generalize.
- Representativeness: videos were cropped to a uniform length (3 seconds), capturing only the part where the activity is demonstrated.
- Scalability: larger datasets may better reflect the challenges posed by diverse content.
R2plus1D - Activity Recognition
R(2+1)D, or ResNet (2+1)D, is a convolutional neural network (CNN) architecture designed for action recognition and video analysis tasks. It is an extension of the popular ResNet (Residual Network) architecture, which was initially proposed for image classification tasks. The "2+1D" in R(2+1)D signifies the factorization of 3D convolutions into 2D and 1D convolutions, offering a balance between computational efficiency and expressive power for video data.
R(2+1)D combines 2D and 1D convolutions in its architecture: 2D convolutions are applied spatially to capture information within individual frames, while 1D convolutions are applied temporally to capture motion and temporal dependencies across frames. This factorization reduces the computational cost compared to standard 3D convolutions, making it more practical for real-world applications.
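To make the factorization concrete, here is a minimal, illustrative PyTorch sketch (not taken from the project code) of a (2+1)D block: a 2D spatial convolution applied per frame followed by a 1D temporal convolution. The channel sizes and kernel size are arbitrary choices.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """A 3D convolution factorized into a spatial (1 x k x k) conv
    followed by a temporal (k x 1 x 1) conv, as in R(2+1)D."""
    def __init__(self, in_ch, mid_ch, out_ch, k=3):
        super().__init__()
        pad = k // 2
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, pad, pad))  # within-frame 2D conv
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(pad, 0, 0))   # across-frame 1D conv
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.spatial(x)))

block = Conv2Plus1D(in_ch=3, mid_ch=45, out_ch=64)
clip = torch.randn(1, 3, 9, 112, 112)   # a 9-frame clip, as in our pipeline
print(block(clip).shape)                # torch.Size([1, 64, 9, 112, 112])
```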
YOLO - Object Detection
YOLO, short for You Only Look Once, is a revolutionary object detection architecture that has
gained popularity for its efficiency in real-time applications.
The YOLO algorithm is a real-time object detection system that identifies objects in images and
videos by dividing the image into a grid and predicting bounding boxes and class probabilities
for each grid cell.

Key Features of YOLO:

1. Unified Detection
2. Grid-based Division
3. Bounding Box Prediction
4. Class Prediction
5. Single Forward Pass
6. Anchor Boxes
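As a hedged illustration of per-frame object detection (the slides do not say which YOLO version or framework was used), the sketch below assumes the `ultralytics` package and its small COCO-pretrained checkpoint `yolov8n.pt`; the video path is hypothetical.

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # assumed checkpoint, pretrained on COCO

cap = cv2.VideoCapture("test_video.mp4")  # hypothetical input video
ok, frame = cap.read()
if ok:
    results = model(frame)            # single forward pass over the frame
    for box in results[0].boxes:      # predicted bounding boxes
        cls_id = int(box.cls[0])
        conf = float(box.conf[0])
        print(model.names[cls_id], conf, box.xyxy[0].tolist())
cap.release()
```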
YOLO Output
YOLO - Loss function
VGG16 - Background Detection

We used a pretrained VGG16 model that had been trained on the large Places365 dataset, which contains over 1.8 million images covering 365 different background settings.

We tested the model on different images and it gave fairly good results, so we decided to use it directly (without any fine-tuning) in our project.
(Figure: example input image and model output.)
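A minimal sketch of how such a scene classifier can be wired up in PyTorch, assuming a VGG16 whose final layer has been replaced with 365 outputs and a locally available Places365 checkpoint. The weights file name, image path, and normalization constants are assumptions rather than the project's exact setup.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

NUM_SCENES = 365

model = models.vgg16()
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_SCENES)
state = torch.load("vgg16_places365.pth", map_location="cpu")  # assumed checkpoint
model.load_state_dict(state)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame.jpg").convert("RGB")   # hypothetical extracted frame
logits = model(preprocess(frame).unsqueeze(0))
scene_idx = logits.argmax(dim=1).item()          # index into the Places365 labels
```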
Algorithm Flow
Step -1 Inputs

This step involves reading the videos from the dataset folders and then performing feature extraction and preprocessing to make them ready to be sent as input to the model. The logic built to read the videos out of the data folders is sketched below.
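A minimal sketch, assuming a folder layout of the form dataset/<activity_name>/<video>.mp4 (the folder names and file extension are assumptions):

```python
from pathlib import Path
import cv2

DATASET_DIR = Path("dataset")  # assumed root folder

video_paths, labels = [], []
for class_dir in sorted(DATASET_DIR.iterdir()):
    if class_dir.is_dir():
        for video_path in sorted(class_dir.glob("*.mp4")):
            video_paths.append(video_path)
            labels.append(class_dir.name)  # e.g. "walking"

# Open the first video just to confirm it is readable.
cap = cv2.VideoCapture(str(video_paths[0]))
print(video_paths[0], "opened:", cap.isOpened())
cap.release()
```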
Step 2: Feature extraction

Feature extraction is a fundamental process in any machine learning task: it involves extracting helpful or informative features from the raw data. These features capture essential characteristics of the data, enabling more effective data preprocessing and model training.
We extracted relevant characteristic features from the videos using OpenCV, such as video length, frame count, and frames per second. These features helped us perform the preprocessing tasks effectively (a short sketch follows the bullets below).
● Extract Video Features: extract relevant features from the videos, such as the number of frames, frames per second, etc.
● Temporal Segmentation: divide videos into clips of 3 seconds each, then pick frames at fixed regular intervals (based on FPS).
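A small sketch of this step with OpenCV, reading the frame count and FPS and deriving the duration (the video path is hypothetical):

```python
import cv2

def video_features(path: str) -> dict:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    duration = frame_count / fps if fps else 0.0
    return {"fps": fps, "frame_count": frame_count, "duration_sec": duration}

print(video_features("dataset/walking/vid_01.mp4"))  # hypothetical path
```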
Step 3: Data Preprocessing

● Normalization and Resizing: normalizing the pixel values and resizing the frames to a uniform size before they are given as input to the model.
● Concatenation: concatenating frames sequentially into NumPy arrays.
● Transforms: applying the transformations required by each model.
● Creating Datasets: converting the arrays of frames and labels into PyTorch datasets.
● Train/Test Split: splitting the dataset into training and validation sets for training and validating the models (see the sketch below).
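A sketch of the dataset-creation and split steps, using stand-in arrays in place of the real preprocessed clips; the 112x112 frame size and the 64/16 split are illustrative assumptions.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# clips: (num_videos, frames, height, width, channels), values already in [0, 1]
clips = np.random.rand(80, 9, 112, 112, 3).astype(np.float32)   # stand-in data
labels = np.random.randint(0, 4, size=80)                        # 4 activity classes

# R(2+1)D expects (channels, frames, height, width) per clip.
x = torch.from_numpy(clips).permute(0, 4, 1, 2, 3)
y = torch.from_numpy(labels).long()

dataset = TensorDataset(x, y)
train_set, val_set = random_split(dataset, [64, 16])
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)
```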
Data Preprocessing

Based on the features obtained from the feature extraction step, each training video was cut to its first 3 seconds, and 3 frames per second were sampled at regular intervals, giving a total sequence length of 9 (3 × 3) frames. Each frame was then normalized by dividing by 255 to bring the pixel values into the range 0 to 1, and resized to a fixed size to ensure uniformity of the inputs to the model. Every video from every category was converted into a sequence of 9 timesteps, appended to an input-feature list, and mapped to its respective label.
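A hedged sketch of this sampling and normalization step, with a 112x112 resize target assumed for illustration:

```python
import cv2
import numpy as np

def load_clip(path: str, seconds: int = 3, frames_per_sec: int = 3,
              size: tuple = (112, 112)) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = int(fps // frames_per_sec)           # spacing between sampled frames
    frames = []
    for i in range(int(seconds * fps)):
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0 and len(frames) < seconds * frames_per_sec:
            frame = cv2.resize(frame, size)
            frames.append(frame.astype(np.float32) / 255.0)  # normalize to [0, 1]
    cap.release()
    return np.stack(frames)                     # shape: (9, H, W, 3)

clip = load_clip("dataset/walking/vid_01.mp4")  # hypothetical path
print(clip.shape)
```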
Data Preprocessing

At validation time, the input video is cut into n sub-videos of 3 seconds each, and a prediction is made separately on each sub-video.

For background detection and object detection, the initial feature extraction and preprocessing were the same up to the point where the features are extracted. These models did not require any specific transforms apart from resizing the frames.

Outputs were taken for one frame per second, with the intention that if at some point the object (i.e., the person) is not present while our activity recognition model predicts the activity as "person walking", we could refine the result to indicate that the activity lasts for a shorter duration than 3 seconds.
Step 4: Model Training

- In our first approach we tried an openly available dataset, UCF50. We fine-tuned the model with different FPS settings and regularization such as dropout and weight decay, and the training results were impressive: we achieved almost 98% accuracy on validation, but the model did not perform as intended on real-world videos.
- After visualizing some of the dataset content, we realized that the videos were not representative of our use cases. The custom dataset was therefore prepared with 20 videos for each of 4 classes (Person Walking, Person Running, Person Eating, Person Exercising), and the model was fine-tuned on this data.
- The parameters of all layers of the pretrained model (R(2+1)D) were frozen except the last classifying (fully connected) layer. Attempts were made to add multiple layers to the classifier with heavy regularization, but the results did not improve; they remained the same or worsened.
- Weight decay and early-stopping regularization were applied during training in order to achieve the best results and cap overfitting. We were able to achieve an accuracy of about 75-80% on the validation set. As our dataset was small, we were satisfied with this accuracy and moved forward (a minimal fine-tuning sketch is shown below).
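This sketch follows the bullets above: a frozen R(2+1)D backbone, a new 4-class head, Adam with weight decay, and simple early stopping on validation loss. The hyperparameter values and the train_loader/val_loader names (from the earlier dataset sketch) are illustrative assumptions, not the project's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

model = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 4)   # only this layer is trained

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4, weight_decay=1e-3)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(50):
    model.train()
    for clips, targets in train_loader:          # from the dataset sketch above
        optimizer.zero_grad()
        loss = criterion(model(clips), targets)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(c), t).item() for c, t in val_loader)

    if val_loss < best_val:                      # early stopping bookkeeping
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```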
Step 5: Concatenating the outputs and preprocessing

(Diagram: the outputs of the background detection model and the activity recognition model, the intermediate output concatenation logic, and the final results.)
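A hedged sketch of this concatenation step: the activity label, detected objects, and predicted background are merged into one keyword list per 3-second segment, which is later fed to the sentence generation model. The exact merging rules used in the project are not spelled out in the slides, so this is only illustrative.

```python
def build_keywords(activity: str, objects: list, background: str) -> list:
    """Merge per-segment model outputs into a single keyword list."""
    keywords = [activity, background]
    keywords += sorted(set(objects))          # de-duplicate detected objects
    return keywords

segment_keywords = build_keywords(
    activity="person walking",
    objects=["person", "dog"],
    background="park",
)
print(segment_keywords)   # ['person walking', 'park', 'dog', 'person']
```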
Step 6: Sentence Generation

We found an openly available sentence generation model named `Key2Text`. This model is designed to take keywords as input and generate sentences as output.

The model is based on the T5 (Text-to-Text Transfer Transformer) architecture, a transformer model pre-trained on a large corpus of text data that can be fine-tuned for various NLP tasks.
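A hedged sketch of this step using the Hugging Face transformers API; the checkpoint name below is a placeholder, not the exact Key2Text checkpoint used in the project, and the keyword joining format may differ from the real model's expected input.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "some-user/key2text-t5"   # hypothetical T5-based keywords-to-text checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

keywords = ["person walking", "park", "dog"]
inputs = tokenizer(" ".join(keywords), return_tensors="pt")
outputs = model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```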
Web Deployment
Web deployment is the process of deploying the code from source control to a hosting platform, usually a cloud or local server. The process can be either manual or automated.

Here, we used a Flask app and a template, and deployed our code on port 5000 to make it usable through a friendly user interface.

The template is written in HTML, and the interactivity is handled by JavaScript, which controls how each function behaves, how input is passed from the user to the source code, and how output is returned to the user.

We used a dynamic webpage that shows status updates such as "uploading" and "evaluating" and displays the output without refreshing the page.
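A minimal Flask sketch of this setup; the route names, template name, and the `summarize_video` helper are illustrative assumptions rather than the project's actual code.

```python
from flask import Flask, request, jsonify, render_template

app = Flask(__name__)

def summarize_video(path: str) -> str:
    # Placeholder for the full pipeline: activity recognition, object and
    # background detection, keyword concatenation, sentence generation.
    return "A person is walking in a park."

@app.route("/")
def index():
    return render_template("index.html")      # assumed template name

@app.route("/summarize", methods=["POST"])
def summarize():
    video = request.files["video"]
    path = "/tmp/" + video.filename
    video.save(path)
    # Returning JSON lets the page show the result without a refresh.
    return jsonify({"summary": summarize_video(path)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```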
Creation of Docker Image
To encapsulate the application, its dependencies, and the runtime environment into a single package, we created a Docker image of the application. Docker images are portable and can easily be shared between different systems and environments.
Creating a Docker container for the Flask application involved several steps (a Dockerfile sketch follows the list):
1. Write a Dockerfile: the Dockerfile contains the instructions for building the Docker image (install dependencies, run the Flask file, etc.).
2. Build the Docker image: create the Docker image on the local machine.
3. Run the Docker container: test the application using the Docker container.
4. Push the image to Docker Hub (a cloud-based repository) to access it globally from any machine.
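A sketch of what such a Dockerfile might look like; the base image tag and the file names (requirements.txt, app.py) are assumptions.

```dockerfile
# Sketch of a Dockerfile for the Flask app (file names assumed).
FROM python:3.10-slim

WORKDIR /app

# Install Python dependencies first so they are cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code (Flask app, models, templates).
COPY . .

EXPOSE 5000
CMD ["python", "app.py"]
```

The image can then be built with `docker build -t video-summarizer .`, tested locally with `docker run -p 5000:5000 video-summarizer`, and pushed to Docker Hub with `docker tag` and `docker push`.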
Results - Activity Recognition
Confusion Matrix
Overall Output on Test Video

Test video link:

[Link]
Conclusion

The objective of this project was successfully achieved. The models made correct predictions most of the time and, together, generated correct natural-language sentences describing the scenes for simple videos with the specific characteristics expected by the model.
Challenges

- Limited dataset
- Only single-activity detection
- Still only applicable to simple videos
Future Work

- The project can be extended in many directions by applying techniques such as pose detection, facial recognition, landmark detection, and gait detection, and can be tweaked to give very refined outputs and detect multiple activities in a video. Face-related activities such as eating, talking, and drinking could be identified separately, and whole-body activities such as walking, running, and jumping could be classified separately.
- Refining and enlarging the dataset.
- The background detection model could also be fine-tuned if the software needs to be applied in a specific setting.
- Applying a GPT summarizer to generate a precise summary from the sentence outputs.
References

[Link]
[Link]r2plus1d_18.html
[Link]
[Link]
[Link]
[Link]
THANK YOU!
