
Vision Assistant

Submitted in partial fulfillment of the requirement for the award of


Degree of Bachelor of Technology in
Computer Engineering Discipline

Submitted To

SVKM’s NMIMS,
Mukesh Patel School of Technology Management & Engineering,
Shirpur Campus, Dist. Dhule (M.H.)

Submitted By :
Pulkit Agarwal – 70021118001
Aditya Shankarnarayan – 70021118052
Saket Singh – 70021118053

Under The Supervision Of:

Mr. Sachin Bhandari


(Asst. Professor, Computer Engineering)

DEPARTMENT OF COMPUTER ENGINEERING


Mukesh Patel School of Technology Management & Engineering
SESSION: 2021-22
CERTIFICATE

This is to certify that the work embodied in this Project entitled “Vision
Assistant” being submitted by

Pulkit Agarwal – 70021118001


Aditya Shankarnarayan – 70021118052
Saket Singh – 70021118053

for partial fulfillment of the requirement for the award of “Bachelor of


Technology in Computer Engineering” discipline to “SVKM’s
NMIMS, Mumbai (M.H.)” during the academic year 2021-22 is a record
of a bona fide piece of work carried out by them under my supervision and
guidance in the “Department of Computer Engineering”, MPSTME,
Shirpur (M.H.).
APPROVED & SUPERVISED BY:

Mr. Sachin Bhandari


(Asst. Professor, Computer Engineering)

FORWARDED BY:

(Dr. Kamal Mehta)                         (Dr. Kamal Mehta)

H.O.D., CS                                Associate Dean,
MPSTME, Shirpur Campus                    MPSTME, Shirpur Campus

DEPARTMENT OF COMPUTER ENGINEERING


Mukesh Patel School of Technology Management & Engineering
CERTIFICATE OF APPROVAL

The Project entitled “Vision Assistant” being submitted by


Pulkit Agarwal – 70021118001
Aditya Shankarnarayan – 70021118052
Saket Singh – 70021118053

has been examined by us and is hereby approved for the award of degree
“Bachelor of Technology in Computer Engineering Discipline”, for
which it has been submitted. It is understood that by this approval the
undersigned do not necessarily endorse or approve any statement made,
opinion expressed or conclusion drawn therein, but approve the project
only for the purpose for which it has been submitted.

(Internal Examiner)                                        (External Examiner)

Date: Date:

DEPARTMENT OF COMPUTER ENGINEERING


Mukesh Patel School of Technology Management & Engineering
DECLARATION

We,
Pulkit Agarwal – 70021118001
Aditya Shankarnarayan – 70021118052
Saket Singh – 70021118053

the students of Bachelor of Technology in Computer Engineering
discipline, Session: 2021-22, MPSTME, Shirpur Campus, hereby
declare that the work presented in this Project entitled “Vision Assistant” is
the outcome of our own work, is bona fide and correct to the best of our
knowledge, and has been carried out taking care of engineering ethics. The work
presented does not infringe any patented work and has not been submitted to any
other university or anywhere else for the award of any degree or any professional
diploma.

Pulkit Agarwal
SAP ID.: 70021118001
Aditya Shankarnarayan
SAP ID.: 70021118052
Saket Singh
SAP ID.: 70021118053

Date:

DEPARTMENT OF COMPUTER ENGINEERING


Mukesh Patel School of Technology Management & Engineering

ACKNOWLEDGEMENT
After the completion of this Major Project work, words are not enough to express our
feelings about all those who helped us to reach our goal; above all is our
indebtedness to The Almighty for providing us this moment in life.

It is a great pleasure and a moment of immense satisfaction for us to express our
profound gratitude to Prof. Sachin Bhandari, Asst. Professor, Computer
Engineering Department, MPSTME, Shirpur, whose constant encouragement enabled
us to work enthusiastically. His perpetual motivation, patience and excellent
expertise in discussions during the progress of the project work have benefited us to
an extent which is beyond expression. His depth and breadth of knowledge of the
Computer Engineering field made us realize that theoretical knowledge always helps
to develop efficient operational software, which is a blend of all core subjects of the
field. We are highly indebted to him for his invaluable guidance and ever-ready
support in the successful completion of this project in time. Working under his
guidance has been a fruitful and unforgettable experience.

We express our sincere thanks and gratitude to Dr. Kamal Mehta, Head of
Department, Computer Engineering Department, MPSTME, Shirpur, for providing the
necessary infrastructure and help to complete the project work successfully.

We also extend our deepest gratitude to Dr. Akshay Malhotra, Director, SVKM’S
NMIMS, Shirpur Campus and Dr. Kamal Mehta, Associate Dean, SVKM’S NMIMS,
Shirpur Campus for providing all the necessary facilities and a truly encouraging
environment to bring out the best of our endeavors.

We sincerely express our grateful thanks to all members of the staff of the
Computer Engineering Department and all those who have imparted to us the
technical knowledge of computer technology during the various stages of B.Tech.
Computer Engineering.

We would like to acknowledge all our friends, who have contributed directly or
indirectly to this Major Project work.

The successful completion of a Major Project is generally not an individual effort. It


is an outcome of the cumulative effort of a number of persons, each having their own
importance to the objective. This section is a vote of thanks and gratitude towards all
those persons who have directly or indirectly contributed in their own special way
towards the completion of this project.

Pulkit Agarwal - 70021118001


Aditya Shankarnarayan - 70021118052
Saket Singh - 70021118053
TABLE OF CONTENTS

Sr. No.   Chapter

1   INTRODUCTION
    1.1 Purpose
    1.2 Scope
    1.3 Overview

2   LITERATURE SURVEY

3   PROBLEM DEFINITION & PROPOSED SOLUTION
    3.1 Problem Statement
    3.2 Proposed Solution

4   DESIGN
    4.1 Architectural Diagram
    4.2 Data Flow Diagram
    4.3 Use Case Diagram
    4.4 Activity Diagram
    4.5 State Chart Diagram
    4.6 Sequence Diagram
    4.7 Collaboration Diagram
    4.8 Deployment Diagram

5   RESULT ANALYSIS & IMPLEMENTATION
    5.1 Pre-Processing
    5.2 Model 1
    5.3 Model 2
    5.4 Model 3
    5.5 Results
    5.6 Sample Outputs

6   TESTING
    6.1 Results
    6.2 Sample Outputs

7   CONCLUSION AND FUTURE WORK

    REFERENCES
LIST OF FIGURES

Sr. No.   Figure No.   Figure
1         4.1          Architecture Diagram
2         4.2.1        Data Flow Diagram Level 0
3         4.2.2        Data Flow Diagram Level 1
4         4.2.3        Data Flow Diagram Level 2
5         4.3          Use Case Diagram
6         4.4          Activity Diagram
7         4.5          State Chart Diagram
8         4.6          Sequence Diagram
9         4.7          Collaboration Diagram
10        4.8          Deployment Diagram
11        5.1          Pre-Processing
12        5.2          IV3 + LSTM
13        5.3          IV3 + LSTM + Attention
14        5.4          IV3 + Transformer
LIST OF TABLES

Sr. No.   Table No.   Table
1         2.1         Literature Survey
2         5.1         Training BLEU Scores
3         6.1         Testing BLEU Scores
CHAPTER 1

INTRODUCTION

1.1 PURPOSE

The purpose of our project is to provide the visually impaired with a new way to
experience the world. Our project is a vision assistant that describes the actions and
objects present in a recorded video. The main purpose of the system is to automate the
task of generating a caption for the video and reading it out to the user. It uses state-
of-the-art techniques to process and understand the actions and objects in the video
and to generate a sentence that accurately describes them. This sentence is then read
out in a natural-sounding voice for a better user experience and understandability.
Being simple and user-friendly, this approach can help the visually impaired
experience the world around them independently.

1.2 SCOPE

The goal is to design an application which covers all the functions of video
description and provides an interface of assistance to the user.

By using deep learning and natural language processing techniques, the project
performs:

1. Video captioning: recognising the different objects and actions in the video and
creating a meaningful sentence that describes it.

2. Text-to-speech conversion.

This project finds application in multiple domains, such as:

1. In the field of medical sciences, as an aid for visually impaired people.

2. In the field of defence, to keep a watch on the surroundings.

1.3 OVERVIEW

This project intends to provide a system that gives the visually impaired an audio
feed describing their surroundings. In Vision Assistant, the user records a video of the
surroundings. The system then generates a sentence describing the actions and objects
in the video while maintaining its context. Because of its utility and simplicity, Vision
Assistant can be used by the visually impaired. It uses state-of-the-art deep learning
techniques to generate accurate and meaningful sentences that maintain the context of
the video, and then produces natural-sounding audio of the sentence for a better user
experience and understandability. The project combines sentence generation and text-
to-speech features on one platform. Vision Assistant also has great potential for other
applications, such as security for national agencies and autonomous vehicles.

CHAPTER 2

LITERATURE SURVEY

1. Yuncheng Li (2016), "TGIF: A New Dataset and Benchmark on Animated GIF Description"
   Methods and Techniques: Encoder-Decoder Model.
   Salient Features: Uses multiple techniques to provide baseline results.
   Outcomes: Provided extensive benchmark results using three popular video description techniques, and showed promising results on improving movie description using the TGIF dataset.

2. Soheyla Amirian (2020), "Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap"
   Methods and Techniques: Review of image captioning and video captioning.
   Salient Features: Algorithmic overlap of image and video captioning.
   Outcomes: A concise review of both image captioning and video captioning that treats the two by emphasizing their algorithmic overlap.

3. Seung-Ho Han (2017), "Multiple Videos Captioning Model for Video Storytelling"
   Methods and Techniques: Encoder-Decoder Model.
   Salient Features: The model uses video features as an input for one video clip and generates a caption for the input video clip.
   Outcomes: Sentences produced by their method are better able to construct sentence structures, i.e., subject, verb, and object, compared to S2VT.

4. Jaeyoung Lee (2019), "Capturing Long-Range Dependencies in Video Captioning"
   Methods and Techniques: Encoder-Decoder Model.
   Salient Features: Uses non-local encoding with fusion and decoding.
   Outcomes: The performance of the model improved with additional time steps, and its applications are not limited to visual question answering and action recognition.

5. Jeong-Woo Son (2018), "Video Scene Title Generation based on Explicit and Implicit Relations among Caption Words"
   Methods and Techniques: Utilizes explicit and implicit relations among words occurring in closed captions using a stochastic matrix.
   Salient Features: The title of a video is generated from sentences in closed captions.
   Outcomes: The title of a video is automatically generated with sentences in closed captions.

6. Zhiwang Zhang (2019), "Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization"
   Methods and Techniques: Division and summarization.
   Salient Features: Division-and-summarization (DaS) framework for dense video captioning.
   Outcomes: Proposes a new two-stage LSTM network equipped with a new hierarchical attention mechanism.

7. Huanhou Xiao (2017), "Video Captioning With Adaptive Attention and Mixed Loss Optimization"
   Methods and Techniques: LSTM.
   Salient Features: Alleviates the exposure bias problem by directly optimizing the sentence-level metric using a reinforcement learning algorithm.
   Outcomes: The proposed RAAM method, which uses only a single feature, achieves competitive or even superior results compared to existing state-of-the-art models for video captioning.

8. Yang Yang (2018), "Video Captioning by Adversarial LSTM"
   Methods and Techniques: LSTM & GAN.
   Salient Features: A "generator" generates textual sentences, and a "discriminator" controls the accuracy of the sentences.
   Outcomes: An LSTM-GAN system architecture that is shown experimentally to significantly outperform existing methods on standard public datasets.

9. Lianli Gao (2019), "Video Captioning with Attention-based LSTM and Semantic Consistency"
   Methods and Techniques: LSTM.
   Salient Features: Uses an attention model with LSTM to provide better results.
   Outcomes: Experiments on the datasets demonstrate that the method, using a single feature, can achieve competitive or even better results than the state-of-the-art baselines for video captioning.

10. Chien-Yao Wang (2019), "Video Captioning Based on Joint Image–Audio Deep Learning Techniques"
    Methods and Techniques: Audio-video processing.
    Salient Features: Uses audio data along with video for better results.
    Outcomes: Shows improvements in terms of BLEU score compared to using only image data.

TABLE 2.1 : Literature Survey

CHAPTER 3
PROBLEM DEFINITION &
PROPOSED SOLUTION

3.1 PROBLEM STATEMENT

Many people do not possess the gift of sight. They must rely on others for their day-
to-day tasks. For them to be independent, there must be a new and self-reliant way of
safely viewing and understanding the world. The main challenge is describing the
surroundings in an audio format, which will help the visually impaired understand
the world easily. Improving video-to-sentence generation would be a great
achievement in assistive vision techniques.

3.2 PROPOSED SOLUTION

The input of this application is a video/GIF; a video file is a sequence of images
stitched together and played back at a fixed number of frames per second. The
intended solution selects every third frame of the video/GIF, up to a total of 30 frames
of shape 299 x 299, which are then passed to an Inception-V3 model pre-trained on
the ImageNet dataset to obtain 2048 features for each frame. The application then uses
an encoder-decoder deep learning model to obtain a probabilistic sentence for the
input video. The encoder-decoder model is a way of using recurrent neural networks
for sequence-to-sequence prediction problems. It was initially developed for machine
translation, although it has proven successful at related sequence-to-sequence
prediction problems such as text summarization and question answering. The
approach involves two recurrent neural networks: one to encode the input sequence,
called the encoder, and a second to decode the encoded input sequence into the target
sequence, called the decoder. For this application, the Transformer encoder-decoder
architecture was found to be the best, although it is more computationally expensive
than the LSTM-based encoder-decoder architecture. The output of this module is then
passed to Google Text-to-Speech to convert the caption into audio that is read out to
the end user (a minimal illustrative sketch of this step follows the list below).
● Any common video format can be used as an input.
● This project develops an end-to-end pipeline for assistive vision.
● This project uses multiple video captioning techniques, namely IV3 + LSTM, IV3 +
LSTM + Attention, and IV3 + Transformer. Video captioning is the task of
automatically captioning a video by understanding the actions and events in it, which
can also help in retrieving the video efficiently through text.
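As an illustration of the final text-to-speech step, the following is a minimal Python sketch. It assumes the gTTS package (an unofficial wrapper around Google Text-to-Speech) is installed; the function name speak_caption and the output file name are ours for illustration only, not the project's exact code.

# Minimal sketch of the caption-to-audio step, assuming the gTTS package is installed
# (pip install gTTS). Names used here are illustrative.
from gtts import gTTS

def speak_caption(caption, out_path="caption.mp3"):
    """Convert a generated caption into an MP3 file that can be played back to the user."""
    tts = gTTS(text=caption, lang="en")  # synthesize a natural-sounding English voice
    tts.save(out_path)                   # write the audio to disk
    return out_path

# Example usage:
# speak_caption("a man is riding a bicycle down the street")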

CHAPTER 4

DESIGN
4.1 ARCHITECTURAL DIAGRAM

FIG 4.1 : Architecture Diagram

4.2 DATA FLOW DIAGRAM

i. DFD Level 0

FIG 4.2.1 : Data Flow Diagram Level 0

ii. DFD Level 1


FIG 4.2.2 : Data Flow Diagram Level 1

iii. DFD Level 2

FIG 4.2.3 : Data Flow Diagram Level 2

4.3 USE CASE DIAGRAM


FIG 4.3 : Use Case Diagram

4.4 ACTIVITY DIAGRAM


FIG 4.4 : Activity Diagram

4.5 STATE CHART DIAGRAM

FIG 4.5 : State Chart Diagram


4.6 SEQUENCE DIAGRAM
FIG 4.6 : Sequence Diagram

4.7 COLLABORATION DIAGRAM

FIG 4.7 : Collaboration Diagram

4.8 DEPLOYMENT DIAGRAM


FIG 4.8 : Deployment Diagram

CHAPTER 5
RESULT ANALYSIS &
IMPLEMENTATION

5.1 PRE-PROCESSING

FIG 5.1 : Pre-Processing


We used the TGIF dataset for the models. The dataset contains two types of data:
the GIFs and the sentences describing them. Hence, the pre-processing was divided
into two parts; since the two types of data are different, the steps involved in
pre-processing them are also different.

1. Video Preprocessing:

Step 1: Extraction of frames.

Step 2: Repeat Step 1 every third frame, for at most thirty frames.

Step 3: For each GIF, store the number of frames extracted in a list.

Step 4: Pass each frame through InceptionV3 to get the encodings.

Step 5: Serialize the list of encodings and store it in Google Drive (pickle).
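A minimal sketch of these video pre-processing steps is shown below. It assumes OpenCV and TensorFlow/Keras are available; the helper name extract_gif_features and the file names are illustrative assumptions, not the project's exact code.

# Minimal sketch of Steps 1-5, assuming OpenCV (cv2) and TensorFlow/Keras are installed.
import pickle
import cv2
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# InceptionV3 without its classification head yields a 2048-dimensional vector per frame.
encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_gif_features(path, every_nth=3, max_frames=30):
    """Sample every third frame (at most 30), resize to 299x299, and encode with InceptionV3."""
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_nth == 0:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frame = cv2.resize(frame, (299, 299))
            frames.append(frame)
        idx += 1
    cap.release()
    batch = preprocess_input(np.array(frames, dtype=np.float32))
    return encoder.predict(batch, verbose=0)  # shape: (num_frames, 2048)

# Example: encode one GIF and serialize the features (Step 5).
# features = extract_gif_features("sample.gif")
# with open("features.pkl", "wb") as f:
#     pickle.dump(features, f)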

2. Text Preprocessing:

Step 1: Read all the sentences into a list.

Step 2: Expand every contracted word (example: won't → will not).

Step 3: Remove punctuation.

Step 4: Remove digits.

Step 5: Wrap every sentence with <SOS> and <EOS> tokens.

Step 6: Tokenize every sentence using a bag-of-words vocabulary.

Step 7: Pad every sentence and create the text corpus.

Step 8: Extract embeddings for each word of the sentence.
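A minimal sketch of the text pre-processing steps is given below, assuming TensorFlow/Keras utilities. The contraction map is a tiny illustrative subset and the helper names are ours, not the project's exact code.

# Minimal sketch of Steps 1-7, assuming TensorFlow/Keras.
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

CONTRACTIONS = {"won't": "will not", "can't": "can not"}  # illustrative subset only

def clean_sentence(s):
    s = s.lower()
    for short, full in CONTRACTIONS.items():            # Step 2: expand contractions
        s = s.replace(short, full)
    s = re.sub(r"[^a-z\s]", " ", s)                      # Steps 3-4: drop punctuation and digits
    return "<sos> " + " ".join(s.split()) + " <eos>"     # Step 5: add start/end tokens

def build_corpus(sentences, max_len=20):
    cleaned = [clean_sentence(s) for s in sentences]     # Step 1: sentences read into a list
    tokenizer = Tokenizer(filters="")                    # Step 6: fit a bag-of-words vocabulary
    tokenizer.fit_on_texts(cleaned)
    padded = pad_sequences(tokenizer.texts_to_sequences(cleaned),
                           maxlen=max_len, padding="post")  # Step 7: pad every sentence
    return tokenizer, padded

# Example:
# tokenizer, corpus = build_corpus(["A man won't stop dancing in the rain."])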

5.2 MODEL 1
FIG 5.2: IV3 + LSTM

The first model used was a basic encoder-decoder model consisting of InceptionV3
and LSTM layers. This architecture comprises two parts: one that reads the input
sequence and encodes it into a fixed-length vector, and a second that decodes the
fixed-length vector and outputs the predicted sequence. Long Short-Term Memory
networks, usually just called "LSTMs", are a special kind of RNN capable of learning
long-term dependencies. They work remarkably well on a large variety of problems
and are now widely used.

The model works in the following two steps:

1. The encoder LSTM processes the entire sequence of input frames and encodes it
into a context vector, which is the last hidden state of the LSTM. This is expected to
be a good summary of the input. All the intermediate states of the encoder are
ignored, and the final state is used as the initial hidden state of the decoder.

2. The decoder LSTM produces the words of the sentence one after another.
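The two-step encoder-decoder described above can be sketched in Keras as follows; the layer sizes and vocabulary size are illustrative assumptions rather than the exact values used in the project.

# Minimal Keras sketch of the IV3 + LSTM encoder-decoder (dimensions are illustrative).
from tensorflow.keras import layers, Model

NUM_FRAMES, FEAT_DIM = 30, 2048          # InceptionV3 features per frame
VOCAB_SIZE, MAX_LEN, UNITS = 10000, 20, 512

# Encoder: read the whole sequence of frame features; keep only the final LSTM states.
enc_in = layers.Input(shape=(NUM_FRAMES, FEAT_DIM))
_, state_h, state_c = layers.LSTM(UNITS, return_state=True)(enc_in)

# Decoder: generate the caption word by word, initialized with the encoder's final states.
dec_in = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, 256)(dec_in)
dec_seq = layers.LSTM(UNITS, return_sequences=True)(emb, initial_state=[state_h, state_c])
logits = layers.Dense(VOCAB_SIZE, activation="softmax")(dec_seq)

model = Model([enc_in, dec_in], logits)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")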

5.3 MODEL 2
FIG 5.3: IV3 + LSTM + Attention

The second model combines InceptionV3, Long Short-Term Memory, and an
attention layer. It improves upon the first architecture by adding the attention layer.
The LSTM generates encodings while maintaining sequence and context from the
sequence of images. The attention layer differentially weights the significance of each
part of the input data, providing additional context by generating a context vector for
every timestep. This helps in generating meaningful sentences with richer context.

The attention mechanism was initially developed for machine translation. Bahdanau
et al. (2015) came up with a simple but elegant idea: not only should all the input
words be taken into account in the context vector, but a relative importance should
also be assigned to each of them. In neural networks, attention is a technique that
mimics cognitive attention. The effect enhances some parts of the input data while
diminishing other parts, the idea being that the network should devote more focus to
the small but important parts of the data. Which parts of the data are more important
than others depends on the context and is learned by gradient descent.
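Below is a minimal sketch of a Bahdanau-style additive attention layer of the kind described above; it follows the standard formulation rather than the project's exact implementation.

# Minimal sketch of Bahdanau-style additive attention over the encoder outputs.
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = layers.Dense(units)  # projects the encoder outputs (one per frame)
        self.W2 = layers.Dense(units)  # projects the current decoder hidden state
        self.V = layers.Dense(1)       # scores each encoder timestep

    def call(self, decoder_state, encoder_outputs):
        query = tf.expand_dims(decoder_state, 1)                       # (batch, 1, units)
        score = self.V(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(query)))
        weights = tf.nn.softmax(score, axis=1)                         # attention over frames
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)     # context vector per timestep
        return context, weights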
5.4 MODEL 3

FIG 5.4: IV3 + Transformer

A transformer is a deep learning model that adopts the mechanism of self-attention,


differentially weighting the significance of each part of the input data. It is used
primarily in the fields of natural language processing (NLP) and computer vision
(CV).
Like recurrent neural networks (RNNs), transformers are designed to handle
sequential input data, such as natural language, for tasks such as translation and text
summarization. However, unlike RNNs, transformers do not necessarily process the
data in order. Rather, the attention mechanism provides context for any position in the
input sequence.
Transformers were introduced in 2017 by a team at Google Brain and are increasingly
the model of choice for NLP problems, replacing RNN models such as long short-
term memory (LSTM).
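As a simple illustration of the self-attention mechanism described above, the following sketches a single Transformer encoder block over the frame features using the Keras MultiHeadAttention layer; the dimensions are illustrative assumptions and positional encodings are omitted for brevity.

# Minimal sketch of one Transformer encoder block over InceptionV3 frame features
# (dimensions are illustrative; positional encoding is omitted for brevity).
from tensorflow.keras import layers, Model

NUM_FRAMES, FEAT_DIM, D_MODEL, HEADS = 30, 2048, 512, 8

frames = layers.Input(shape=(NUM_FRAMES, FEAT_DIM))
x = layers.Dense(D_MODEL)(frames)                       # project 2048-dim features to d_model

# Self-attention: every frame attends to every other frame, regardless of order.
attn = layers.MultiHeadAttention(num_heads=HEADS, key_dim=D_MODEL // HEADS)(x, x)
x = layers.LayerNormalization()(x + attn)               # residual connection + layer norm

ffn = layers.Dense(D_MODEL * 4, activation="relu")(x)   # position-wise feed-forward network
ffn = layers.Dense(D_MODEL)(ffn)
encoded = layers.LayerNormalization()(x + ffn)          # output passed on to the decoder

encoder_block = Model(frames, encoded)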

5.5 RESULTS

The Bilingual Evaluation Understudy score, or BLEU for short, is a metric for
evaluating a generated sentence against a reference sentence.

A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score
of 0.0.

The score was developed for evaluating the predictions made by automatic machine
translation systems. It is not perfect, but it offers five compelling benefits:

● It is quick and inexpensive to calculate.


● It is easy to understand.
● It is language independent.
● It correlates highly with human evaluation.
● It has been widely adopted.

The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper
“BLEU: a Method for Automatic Evaluation of Machine Translation“.

The approach works by counting matching n-grams in the candidate translation
against n-grams in the reference text, where a 1-gram or unigram is a single token and
a bigram comparison considers each word pair. The comparison is made regardless of
word order.
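The BLEU computation can be reproduced with a few lines of Python; the sketch below assumes the NLTK library is installed, and the example sentences are purely illustrative.

# Minimal sketch of computing BLEU-1 to BLEU-4 for one generated caption, assuming NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "man", "is", "riding", "a", "bicycle"]]  # list of tokenized reference captions
candidate = ["a", "man", "rides", "a", "bicycle"]           # tokenized generated caption
smooth = SmoothingFunction().method1                        # avoids zero scores on short sentences

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = sentence_bleu(reference, candidate, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")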

METHODS                          BLEU-1   BLEU-2   BLEU-3   BLEU-4

InceptionV3 + LSTM                08.14    16.17    19.05    19.11

InceptionV3 + LSTM + Attention    12.66    19.24    20.72    19.95

InceptionV3 + Transformer         33.53    25.53    25.83    20.90

TABLE 5.1 : Training BLEU Scores

5.6 SAMPLE OUTPUTS


CHAPTER 6

TESTING
6.1 RESULTS

METHODS                          BLEU-1   BLEU-2   BLEU-3   BLEU-4

InceptionV3 + LSTM                08.09    16.11    19.05    19.06

InceptionV3 + LSTM + Attention    12.33    19.04    20.59    19.87

InceptionV3 + Transformer         26.31    21.27    22.93    18.70

TABLE 6.1 : Testing BLEU Scores

6.2 SAMPLE OUTPUTS

Upon testing, we found that the Transformer performed best on the majority of the
metrics, followed by the LSTM with the attention layer, and then by the plain LSTM.
Transformers are the current state-of-the-art approach for natural language processing
tasks and gave the best results; they have a deep understanding of language, allowing
training to focus on learning the task at hand. The attention layer provides an
additional context vector that helps in generating better sentences by focusing only on
the important aspects of the video. The plain LSTM is much simpler than the other
models and therefore lags behind in performance.

CHAPTER 7

CONCLUSION & FUTURE WORK

Our project was trained only on a subset of the TGIF dataset, and yet we obtained
promising results. We worked on three models, all of which use InceptionV3 for
generating encodings. The first was an LSTM architecture. In the second, we added an
attention layer to the LSTM and noticed that the results improved significantly. The
third architecture, which uses a Transformer, is currently state-of-the-art and gave the
best results among the models we used. The sentences generated by our model were
accurate enough to provide useful additional information in a real-life scenario such
as vision assistance. Our model produces sensible sentences and captures the majority
of the context from the video, which yields meaningful descriptions. Using a
Transformer produced the best results, although it is computationally quite expensive.
In the future, training the model on the entire dataset or on other datasets should give
better results, and researching models that are faster and more lightweight while
giving the same or better results would help move towards near real-time sentence
generation.

REFERENCES

1. S. Amirian, K. Rasheed, T. R. Taha and H. R. Arabnia, "Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap," in IEEE Access, vol. 8, pp. 218386-218400, 2020, doi: 10.1109/ACCESS.2020.3042484.

2. C. Wang, P. Liaw, K. Liang, J. Wang and P. Chang, "Video Captioning Based on Joint Image–Audio Deep Learning Techniques," 2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin), 2019, pp. 127-131, doi: 10.1109/ICCE-Berlin47944.2019.8966173.

3. Z. Zhang, D. Xu, W. Ouyang and C. Tan, "Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization," in IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 9, pp. 3130-3139, Sept. 2020, doi: 10.1109/TCSVT.2019.29

4. L. Gao, Z. Guo, H. Zhang, X. Xu and H. T. Shen, "Video Captioning With Attention-Based LSTM and Semantic Consistency," in IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045-2055, Sept. 2017, doi: 10.1109/TMM.2017.2729019.

5. Y. Yang et al., "Video Captioning by Adversarial LSTM," in IEEE Transactions on Image Processing, vol. 27, no. 11, pp. 5600-5611, Nov. 2018, doi: 10.1109/TIP.2018.2855422.

6. H. Xiao and J. Shi, "Video Captioning With Adaptive Attention and Mixed Loss Optimization," in IEEE Access, vol. 7, pp. 135757-135769, 2019, doi: 10.1109/ACCESS.2019.2942000.

7. J. Lee, Y. Lee, S. Seong, K. Kim, S. Kim and J. Kim, "Capturing Long-Range Dependencies in Video Captioning," 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 1880-1884, doi: 10.1109/ICIP.2019.8803143.

8. S.-H. Han, B.-W. Go and H.-J. Choi, "Multiple Videos Captioning Model for Video Storytelling," 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), 2019, pp. 1-4, doi: 10.1109/BIGCOMP.2019.8679213.

9. Y. Li et al., "TGIF: A New Dataset and Benchmark on Animated GIF Description," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4641-4650, doi: 10.1109/CVPR.2016.502.

10. J. Son, W. Park, S. Lee and S. Kim, "Video scene title generation based on explicit and implicit relations among caption words," 2018 20th International Conference on Advanced Communication Technology (ICACT), 2018, pp. 571-573, doi: 10.23919/ICACT.2018.8323836.
