AI Video Summarization Project Overview

The objective of this project was to take a video as input and summarize it in text form. We did this using pre-trained models for object detection, activity detection, and background detection, and finally fed the outputs of these models to another pre-trained model to generate a summary. All of these models are described in the slides.


Video Summarization

By:
Yash Funde (230940128010)
Pratik Avhad (230940128021)
Prerit Agarwal (230940128022)
Sahibpreet Singh (230940128026)

Guided by: Mr. Pramod Sharma
Contents

● Introduction
● Objective
● Overview
● Algorithm Flow
● Results
● Conclusion
● Challenges
● Future work
● References
Introduction

Video summarization is a critical area of research in computer vision and multimedia analysis, aimed at describing what is happening in a video in simple English. Many researchers have approached this objective by extracting the audio track and then applying speech-to-text techniques, but little work has been done on generating text directly from video frames without using any audio. This project is a simple attempt to achieve this objective of generating simple English sentences directly from video frames.
Objective

The project aims to deliver a software program that exploits the latest OpenCV libraries, transfer-learning techniques, and some openly available, accurate pretrained models. The program works towards determining the activities occurring in a video by extracting a sequence of frames and using sequence-to-sequence models to make predictions. In addition, it uses fairly accurate, openly available pretrained models for object detection (YOLO) and background detection on the video frame images, and finally a text generation model that produces meaningful sentences from the hint words output by the aforementioned models.
Overview
- The project involves creating a custom dataset and fine-tuning one of the best pre-trained architectures for detecting real-time activities (R(2+1)D), together with a background detection model (VGG-16), an object detection model (YOLO), and a pretrained sentence generation model (Key2Text).
- Creating a custom dataset became important because the pre-trained activity recognition model (R(2+1)D) was trained on the Kinetics dataset, so it could only classify similar kinds of videos. It could not classify normal, real-life activities such as walking, jumping, running, or eating.
Dataset Preparation
Dataset preparation is a crucial step in developing and evaluating effective algorithms. The choice and
curation of datasets significantly impact the performance and generalizability of models.

We built a custom dataset starting with basic single-activity videos such as a person walking, running, eating, and exercising. Twenty videos of each category were collected, and the model was fine-tuned on this dataset. The dataset was prepared keeping the following points in mind:

- Diversity of content: different poses, backgrounds, styles, ages, and genders to ensure the model's ability to generalize.
- Representativeness: videos were cropped to a uniform length (3 seconds), capturing only the part where the activity is demonstrated.
- Scalability: larger datasets may better reflect the challenges posed by diverse content.
R2plus1D - Activity Recognition
R(2+1)D, or ResNet (2+1)D, is a convolutional neural network (CNN) architecture designed for action recognition and video analysis tasks. It is an extension of the popular ResNet (Residual Network) architecture, which was initially proposed for image classification tasks. The "2+1D" in R(2+1)D signifies the factorization of 3D convolutions into 2D and 1D convolutions, offering a balance between computational efficiency and expressive power for video data.
R(2+1)D combines 2D and 1D convolutions in its architecture: 2D convolutions are applied spatially to capture information within individual frames, while 1D convolutions are applied temporally to capture motion and temporal dependencies across frames. This factorization reduces the computational cost compared to standard 3D convolutions, making it more practical for real-world applications.
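To make the factorization concrete, here is a minimal, illustrative PyTorch sketch (not taken from the project code) of a (2+1)D block: a 2D spatial convolution applied per frame followed by a 1D temporal convolution. The channel sizes and kernel size are arbitrary choices.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """A 3D convolution factorized into a spatial (1 x k x k) conv
    followed by a temporal (k x 1 x 1) conv, as in R(2+1)D."""
    def __init__(self, in_ch, mid_ch, out_ch, k=3):
        super().__init__()
        pad = k // 2
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, pad, pad))  # within-frame 2D conv
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(pad, 0, 0))   # across-frame 1D conv
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.spatial(x)))

block = Conv2Plus1D(in_ch=3, mid_ch=45, out_ch=64)
clip = torch.randn(1, 3, 9, 112, 112)   # a 9-frame clip, as in our pipeline
print(block(clip).shape)                # torch.Size([1, 64, 9, 112, 112])
```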
YOLO - Object Detection
YOLO, short for You Only Look Once, is a revolutionary object detection architecture that has
gained popularity for its efficiency in real-time applications.
The YOLO algorithm is a real-time object detection system that identifies objects in images and
videos by dividing the image into a grid and predicting bounding boxes and class probabilities
for each grid cell.

Key Features of YOLO:

1. Unified Detection
2. Grid-based Division
3. Bounding Box Prediction
4. Class Prediction
5. Single Forward Pass
6. Anchor Boxes
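As a hedged illustration of per-frame object detection (the slides do not say which YOLO version or framework was used), the sketch below assumes the `ultralytics` package and its small COCO-pretrained checkpoint `yolov8n.pt`; the video path is hypothetical.

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # assumed checkpoint, pretrained on COCO

cap = cv2.VideoCapture("test_video.mp4")  # hypothetical input video
ok, frame = cap.read()
if ok:
    results = model(frame)            # single forward pass over the frame
    for box in results[0].boxes:      # predicted bounding boxes
        cls_id = int(box.cls[0])
        conf = float(box.conf[0])
        print(model.names[cls_id], conf, box.xyxy[0].tolist())
cap.release()
```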
YOLO Output
YOLO - Loss function
VGG16 - Background Detection

We used a pretrained VGG16 model that had been trained on the large Places365 dataset, which contains over 1.8 million images covering 365 different background settings.

We tested the model on different images and it gave fairly good results, so we decided to use it directly (without any fine-tuning) in our project.
(Figure: example input image and model output.)
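A minimal sketch of how such a scene classifier can be wired up in PyTorch, assuming a VGG16 whose final layer has been replaced with 365 outputs and a locally available Places365 checkpoint. The weights file name, image path, and normalization constants are assumptions rather than the project's exact setup.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

NUM_SCENES = 365

model = models.vgg16()
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_SCENES)
state = torch.load("vgg16_places365.pth", map_location="cpu")  # assumed checkpoint
model.load_state_dict(state)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame.jpg").convert("RGB")   # hypothetical extracted frame
logits = model(preprocess(frame).unsqueeze(0))
scene_idx = logits.argmax(dim=1).item()          # index into the Places365 labels
```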
Algorithm Flow
Step -1 Inputs

This step involves reading the videos from the dataset folders and then performing feature extraction and preprocessing to make them ready to be sent as input to the model. The logic built to read the videos out of the data folders is sketched below.
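A minimal sketch, assuming a folder layout of the form dataset/<activity_name>/<video>.mp4 (the folder names and file extension are assumptions):

```python
from pathlib import Path
import cv2

DATASET_DIR = Path("dataset")  # assumed root folder

video_paths, labels = [], []
for class_dir in sorted(DATASET_DIR.iterdir()):
    if class_dir.is_dir():
        for video_path in sorted(class_dir.glob("*.mp4")):
            video_paths.append(video_path)
            labels.append(class_dir.name)  # e.g. "walking"

# Open the first video just to confirm it is readable.
cap = cv2.VideoCapture(str(video_paths[0]))
print(video_paths[0], "opened:", cap.isOpened())
cap.release()
```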
Step 2: Feature extraction

Feature extraction is a fundamental process in any machine learning task: it involves extracting helpful or informative features from the raw data. These features capture essential characteristics of the data, enabling more effective data preprocessing and model training.
We extracted relevant characteristic features from the videos using OpenCV, such as video length, frame count, and frames per second. These features helped us perform the preprocessing tasks effectively (a short sketch follows the bullets below).
● Extract Video Features: extract relevant features from the videos, such as the number of frames, frames per second, etc.
● Temporal Segmentation: divide videos into clips of 3 seconds each, then pick frames at fixed regular intervals (based on FPS).
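A small sketch of this step with OpenCV, reading the frame count and FPS and deriving the duration (the video path is hypothetical):

```python
import cv2

def video_features(path: str) -> dict:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    duration = frame_count / fps if fps else 0.0
    return {"fps": fps, "frame_count": frame_count, "duration_sec": duration}

print(video_features("dataset/walking/vid_01.mp4"))  # hypothetical path
```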
Step 3: Data Preprocessing

● Normalization and Resizing: normalizing the pixel values and resizing the frames to a uniform size before they are given as input to the model.
● Concatenation: concatenating frames sequentially into NumPy arrays.
● Transforms: applying the transformations required by each model.
● Creating Datasets: converting the arrays of frames and labels into PyTorch datasets.
● Train/Test Split: splitting the dataset into training and validation sets for training and validating the models (see the sketch below).
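A sketch of the dataset-creation and split steps, using stand-in arrays in place of the real preprocessed clips; the 112x112 frame size and the 64/16 split are illustrative assumptions.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# clips: (num_videos, frames, height, width, channels), values already in [0, 1]
clips = np.random.rand(80, 9, 112, 112, 3).astype(np.float32)   # stand-in data
labels = np.random.randint(0, 4, size=80)                        # 4 activity classes

# R(2+1)D expects (channels, frames, height, width) per clip.
x = torch.from_numpy(clips).permute(0, 4, 1, 2, 3)
y = torch.from_numpy(labels).long()

dataset = TensorDataset(x, y)
train_set, val_set = random_split(dataset, [64, 16])
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)
```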
Data Preprocessing

Based on the features obtained from the feature extraction step, each training video was cut to its first 3 seconds, and 3 frames per second were sampled at regular intervals, giving a total sequence length of 9 (3 × 3) frames. Each frame was then normalized by dividing by 255 to bring the pixel values into the range 0 to 1, and resized to a fixed size to ensure uniformity of the inputs to the model. Every video from every category was converted into a sequence of 9 timesteps, appended to an input-feature list, and mapped to its respective label.
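A hedged sketch of this sampling and normalization step, with a 112x112 resize target assumed for illustration:

```python
import cv2
import numpy as np

def load_clip(path: str, seconds: int = 3, frames_per_sec: int = 3,
              size: tuple = (112, 112)) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = int(fps // frames_per_sec)           # spacing between sampled frames
    frames = []
    for i in range(int(seconds * fps)):
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0 and len(frames) < seconds * frames_per_sec:
            frame = cv2.resize(frame, size)
            frames.append(frame.astype(np.float32) / 255.0)  # normalize to [0, 1]
    cap.release()
    return np.stack(frames)                     # shape: (9, H, W, 3)

clip = load_clip("dataset/walking/vid_01.mp4")  # hypothetical path
print(clip.shape)
```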
Data Preprocessing

At validation time, the input video is cut into n sub-videos of 3 seconds each, and a prediction is made separately on each sub-video.

For background detection and object detection, the initial feature extraction and preprocessing were the same up to the point where the features are extracted. These models did not require any specific transforms apart from resizing the frames.

Outputs were taken for one frame per second, with the intention that if at some point the object (i.e., the person) is not present while our activity recognition model predicts the activity as "person walking", we could refine the result to indicate that the activity lasts for a shorter duration than 3 seconds.
Step 4: Model Training

- In our first approach we tried an openly available dataset, UCF50. We fine-tuned the model with different FPS settings and regularization such as dropout and weight decay, and the training results were impressive: we achieved almost 98% accuracy on validation, but the model did not perform as intended on real-world videos.
- After visualizing some of the dataset content, we realized that the videos were not representative of our use cases. The custom dataset was therefore prepared with 20 videos for each of 4 classes (Person Walking, Person Running, Person Eating, Person Exercising), and the model was fine-tuned on this data.
- The parameters of all layers of the pretrained model (R(2+1)D) were frozen except the last classifying (fully connected) layer. Attempts were made to add multiple layers to the classifier with heavy regularization, but the results did not improve; they remained the same or worsened.
- Weight decay and early-stopping regularization were applied during training in order to achieve the best results and cap overfitting. We were able to achieve an accuracy of about 75-80% on the validation set. As our dataset was small, we were satisfied with this accuracy and moved forward (a minimal fine-tuning sketch is shown below).
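This sketch follows the bullets above: a frozen R(2+1)D backbone, a new 4-class head, Adam with weight decay, and simple early stopping on validation loss. The hyperparameter values and the train_loader/val_loader names (from the earlier dataset sketch) are illustrative assumptions, not the project's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

model = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 4)   # only this layer is trained

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4, weight_decay=1e-3)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(50):
    model.train()
    for clips, targets in train_loader:          # from the dataset sketch above
        optimizer.zero_grad()
        loss = criterion(model(clips), targets)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(c), t).item() for c, t in val_loader)

    if val_loss < best_val:                      # early stopping bookkeeping
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```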
Step 5: Concatenating the outputs and preprocessing

(Diagram: the outputs of the background detection model and the activity recognition model, the intermediate output concatenation logic, and the final results.)
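A hedged sketch of this concatenation step: the activity label, detected objects, and predicted background are merged into one keyword list per 3-second segment, which is later fed to the sentence generation model. The exact merging rules used in the project are not spelled out in the slides, so this is only illustrative.

```python
def build_keywords(activity: str, objects: list, background: str) -> list:
    """Merge per-segment model outputs into a single keyword list."""
    keywords = [activity, background]
    keywords += sorted(set(objects))          # de-duplicate detected objects
    return keywords

segment_keywords = build_keywords(
    activity="person walking",
    objects=["person", "dog"],
    background="park",
)
print(segment_keywords)   # ['person walking', 'park', 'dog', 'person']
```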
Step 6: Sentence Generation

We found an openly available sentence generation model named `Key2Text`. This model is designed to take keywords as input and generate sentences as output.

The model is based on the T5 (Text-to-Text Transfer Transformer) architecture, a transformer model pre-trained on a large corpus of text data that can be fine-tuned for various NLP tasks.
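A hedged sketch of this step using the Hugging Face transformers API; the checkpoint name below is a placeholder, not the exact Key2Text checkpoint used in the project, and the keyword joining format may differ from the real model's expected input.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "some-user/key2text-t5"   # hypothetical T5-based keywords-to-text checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

keywords = ["person walking", "park", "dog"]
inputs = tokenizer(" ".join(keywords), return_tensors="pt")
outputs = model.generate(**inputs, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```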
Web Deployment
Web deployment is the process of deploying the code from source control to a hosting platform, usually a cloud or local server. The process can be either manual or automated.

Here, we used a Flask app and a template, and deployed our code on port 5000 to make it usable through a friendly user interface.

The template is written in HTML, and the interactivity is handled by JavaScript, which controls how each function behaves, how input is passed from the user to the source code, and how output is returned to the user.

We used a dynamic webpage that shows status updates such as "uploading" and "evaluating" and displays the output without refreshing the page.
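A minimal Flask sketch of this setup; the route names, template name, and the `summarize_video` helper are illustrative assumptions rather than the project's actual code.

```python
from flask import Flask, request, jsonify, render_template

app = Flask(__name__)

def summarize_video(path: str) -> str:
    # Placeholder for the full pipeline: activity recognition, object and
    # background detection, keyword concatenation, sentence generation.
    return "A person is walking in a park."

@app.route("/")
def index():
    return render_template("index.html")      # assumed template name

@app.route("/summarize", methods=["POST"])
def summarize():
    video = request.files["video"]
    path = "/tmp/" + video.filename
    video.save(path)
    # Returning JSON lets the page show the result without a refresh.
    return jsonify({"summary": summarize_video(path)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```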
Creation of Docker Image
To encapsulate the application, its dependencies, and the runtime environment into a single package, we created a Docker image of the application. Docker images are portable and can easily be shared between different systems and environments.
Creating a Docker container for the Flask application involved several steps (a Dockerfile sketch follows the list):
1. Write a Dockerfile: the Dockerfile contains the instructions for building the Docker image (install dependencies, run the Flask file, etc.).
2. Build the Docker image: create the Docker image on the local machine.
3. Run the Docker container: test the application using the Docker container.
4. Push the image to Docker Hub (a cloud-based repository) to access it globally from any machine.
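A sketch of what such a Dockerfile might look like; the base image tag and the file names (requirements.txt, app.py) are assumptions.

```dockerfile
# Sketch of a Dockerfile for the Flask app (file names assumed).
FROM python:3.10-slim

WORKDIR /app

# Install Python dependencies first so they are cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code (Flask app, models, templates).
COPY . .

EXPOSE 5000
CMD ["python", "app.py"]
```

The image can then be built with `docker build -t video-summarizer .`, tested locally with `docker run -p 5000:5000 video-summarizer`, and pushed to Docker Hub with `docker tag` and `docker push`.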
Results - Activity Recognition
Confusion Matrix
Overall Output on Test Video

Test video link:

[Link]
Conclusion

The objective of this project was successfully achieved. The models made correct predictions most of the time and, together, generated correct natural-language sentences describing the scenes for simple videos with the specific characteristics expected by the model.
Challenges

- Limited dataset
- Only single-activity detection
- Still only applicable to simple videos
Future Work

- The project can be extended in many directions by applying techniques such as pose detection, facial recognition, landmark detection, and gait detection, and can be tweaked to give very refined outputs and detect multiple activities in a video. Face-related activities such as eating, talking, and drinking could be identified separately, and whole-body activities such as walking, running, and jumping could be classified separately.
- Refining and enlarging the dataset.
- The background detection model could also be fine-tuned if the software needs to be applied in a specific setting.
- Applying a GPT summarizer to generate a precise summary from the sentence outputs.
References

[Link]
[Link]r2plus1d_18.html
[Link]
[Link]
[Link]
[Link]
THANK YOU!
