Front Pages of Project Report (Black Book)
Submitted To
SVKM’s NMIMS,
Mukesh Patel School of Technology Management & Engineering,
Shirpur Campus, Dist. Dhule (M.H.)
Submitted By:
Pulkit Agarwal – 70021118001
Aditya Shankarnarayan – 70021118052
Saket Singh – 70021118053
This is to certify that the work embodied in this Project entitled “Vision
Assistant” being submitted by
FORWARDED BY:
has been examined by us and is hereby approved for the award of degree
“Bachelor of Technology in Computer Engineering Discipline”, for
which it has been submitted. It is understood that by this approval the
undersigned do not necessarily endorse or approve any statement made,
opinion expressed or conclusion drawn therein, but approve the project
only for the purpose for which it has been submitted.
Date: Date:
We,
Pulkit Agarwal – 70021118001
Aditya Shankarnarayan – 70021118052
Saket Singh – 70021118053
Pulkit Agarwal
SAP ID.: 70021118001
Aditya Shankarnarayan
SAP ID.: 70021118052
Saket Singh
SAP ID.: 70021118053
Date:
ACKNOWLEDGEMENT
After the completion of this Major Project work, words are not enough to express my feelings about all those who helped me reach my goal; above all is my indebtedness to the Almighty for providing me this moment in life.
1 INTRODUCTION
1.1 Purpose
1.2 Scope
1.3 Overview
2 LITERATURE SURVEY
3 PROBLEM DEFINITION & PROPOSED SOLUTION
4 DESIGN
4.1 Architectural Diagram
4.2 Data Flow Diagram
4.3 Use Case Diagram
4.4 Activity Diagram
4.5 State Chart Diagram
4.6 Sequence Diagram
4.7 Collaboration Diagram
4.8 Deployment Diagram
5 Result Analysis/Implementation
5.1 Preprocessing
5.2 Model 1
5.3 Model 2
5.4 Model 3
5.5 Results
6 Testing
6.1 Unit Testing
6.2 Integration and System Testing
7 Conclusion
References
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
1.1 PURPOSE
The purpose of our project is to provide the visually impaired with a new way to experience the world. Our project is a vision assistant that describes the actions and objects present in a recorded video. The main purpose of the system is to automate the task of generating a caption for the video and reading it out to the user. It uses state-of-the-art techniques to process and understand the actions and objects in the video and to generate a sentence that accurately describes them. This sentence is then read out in a natural-sounding voice for better user experience and understandability. Being simple and user-friendly, this approach can help the visually impaired experience the world around them independently.
1.2 SCOPE
The goal is to design an application which covers all the functions of video description and provides an interface of assistance to the user. Using deep learning techniques and natural language processing, the project generates a caption (sentence) describing a recorded video and converts that sentence into natural-sounding speech.
1.3 OVERVIEW
This project intends to provide a system that helps the visually impaired with an audio feed describing their surroundings. In Vision Assistant, the user records a video of the surroundings. The system then generates a sentence describing the actions and objects in the video while maintaining its context. Because of its utility and simplicity, Vision Assistant can be used by the visually impaired. It uses state-of-the-art deep learning techniques to generate accurate and meaningful sentences that maintain the context of the video, and then produces natural-sounding audio of the sentence for better user experience and understandability. This project combines sentence generation and text-to-speech features on one platform. Vision Assistant also has great potential for other applications, such as security for national agencies and autonomous vehicles.
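The overview describes a two-stage pipeline: the captioning model produces a sentence, which is then converted to natural-sounding speech. The report does not name its text-to-speech library, so the sketch below only illustrates the idea, with gTTS as an assumed choice.

```python
# Hedged sketch of the text-to-speech stage of the pipeline; gTTS is an
# assumed library choice, not necessarily the one used in the project.
from gtts import gTTS

def speak_caption(caption: str, out_path: str = "caption.mp3") -> str:
    """Convert the generated caption into an audio file the user can listen to."""
    gTTS(text=caption, lang="en").save(out_path)
    return out_path

# Example usage; the caption would normally come from the captioning model.
speak_caption("a man is playing a guitar")
```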
CHAPTER 2
LITERATURE SURVEY
The surveyed papers are summarized below; for each entry, the original table columns (Author Name, Title, Year Published, Methods and Techniques, Salient Features, Outcomes) are listed as labelled fields.

1. Seung-Ho Han, "Multiple Videos Captioning Model for Video Storytelling" (2017)
Methods and Techniques: Encoder-decoder model.
Salient Features: The model takes video features as the input for one video clip and generates a caption for that clip.
Outcomes: The sentences produced by their method are better able to construct sentence structures (subject, verb, and object) compared to S2VT.

2. Jeong-Woo Son, "Video Scene Title Generation based on Explicit and Implicit Relations among Caption Words" (2018)
Methods and Techniques: Utilizes explicit and implicit relations among words occurring in closed captions, using a stochastic matrix.
Salient Features: The title of a video is generated automatically from the sentences in its closed captions.
Outcomes: The title of a video is generated automatically from the sentences in its closed captions.

3. Zhiwang Zhang, "Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization" (2019)
Methods and Techniques: Division and summarization.
Salient Features: Division-and-summarization (DaS) framework for dense video captioning.
Outcomes: Proposes a new two-stage LSTM network equipped with a new hierarchical attention mechanism.

4. Lianli Gao, "Video Captioning with Attention-based LSTM and Semantic Consistency" (2019)
Methods and Techniques: LSTM.
Salient Features: Uses an attention model with LSTM to provide better results.
Outcomes: Experiments on the datasets demonstrate that the method, using a single feature, can achieve competitive or even better results than the state-of-the-art baselines for video captioning.

5. Chien-Yao Wang, "Video Captioning Based on Joint Image–Audio Deep Learning Techniques" (2019)
Methods and Techniques: Audio-video processing.
Salient Features: Uses audio data along with video for better results.
Outcomes: Showed improvements in BLEU score compared with using image data alone.
CHAPTER 3
PROBLEM DEFINITION &
PROPOSED SOLUTION
Many people do not possess the gift of sight. They must rely on others for their day-to-day tasks. For them to be independent, there must be a new and self-reliant way of safely viewing and understanding the world. The main challenge is describing the surroundings in an audio format, which would help the visually impaired understand the world easily. Improving video-to-sentence generation would be a great achievement in assistive vision techniques.
CHAPTER 4
DESIGN
4.1 ARCHITECTURAL DIAGRAM
4.2 DATA FLOW DIAGRAM
i. DFD Level 0
CHAPTER 5
RESULT ANALYSIS &
IMPLEMENTATION
1. Video Preprocessing:
Step 1: Extract a frame from the video clip.
Step 2: Repeat Step 1 every three timesteps, for at most thirty frames (a short preprocessing sketch follows these steps).
2. Text Preprocessing:
Expansion of contractions (e.g., "won't" becomes "will not").
Step 3: Removal of punctuation.
Step 5: Append all the sentences with <SOS> and <EOS> tokens.
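A minimal sketch of both preprocessing stages is given below, assuming OpenCV for frame sampling and plain Python string handling for the captions; the sampling logic, contraction map, and function names are illustrative rather than the report's exact code.

```python
# Hedged sketch of the preprocessing steps above; OpenCV and the small
# contraction map are assumptions, not the report's exact implementation.
import re
import cv2

def sample_frames(video_path, step=3, max_frames=30):
    """Keep every third frame of the clip, up to thirty frames (Steps 1-2)."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

CONTRACTIONS = {"won't": "will not", "can't": "cannot", "n't": " not"}

def clean_caption(text):
    """Expand contractions, remove punctuation, and add <SOS>/<EOS> tokens."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9 ]", "", text)   # removal of punctuation
    return f"<SOS> {text.strip()} <EOS>"
```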
5.2 MODEL 1
FIG 5.2: IV3 + LSTM
The first model used was a basic encoder-decoder model. It consisted of InceptionV3
and LSTM layers. This architecture is comprised of two models: one for reading the
input sequence and encoding it into a fixed-length vector, and a second for decoding
the fixed-length vector and outputting the predicted sequence. Long Short Term
Memory networks – usually just called “LSTMs” – are a special kind of RNN,
capable of learning long-term dependencies. They work tremendously well on a large
variety of problems, and are now widely used.
1. The encoder LSTM processes the entire sequence of input frames and encodes it into a context vector, which is the last hidden state of the LSTM. This is expected to be a good summary of the input. All the intermediate states of the encoder are ignored, and the final state is used as the initial hidden state of the decoder.
2. The decoder LSTM or RNN units produce the words in a sentence one after
another.
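A minimal sketch of this encoder-decoder arrangement follows, assuming Keras and a sequence of 2048-dimensional InceptionV3 features per frame; the layer sizes, vocabulary size, and caption length are illustrative assumptions rather than the report's exact settings.

```python
# Hedged sketch of Model 1: an encoder LSTM reads the InceptionV3 frame
# features and its final state seeds a decoder LSTM that emits the caption.
# All dimensions below are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_FRAMES, FEAT_DIM = 30, 2048          # assumed frames per clip, InceptionV3 feature size
VOCAB_SIZE, MAX_CAP_LEN, UNITS = 8000, 20, 512

# Encoder: process the whole frame sequence; keep only the final hidden state.
enc_in = layers.Input(shape=(NUM_FRAMES, FEAT_DIM))
_, state_h, state_c = layers.LSTM(UNITS, return_state=True)(enc_in)

# Decoder: generate the caption token by token, initialised with the encoder state.
dec_in = layers.Input(shape=(MAX_CAP_LEN,))
dec_emb = layers.Embedding(VOCAB_SIZE, 256)(dec_in)
dec_seq = layers.LSTM(UNITS, return_sequences=True)(dec_emb,
                                                    initial_state=[state_h, state_c])
word_probs = layers.Dense(VOCAB_SIZE, activation="softmax")(dec_seq)

model = Model([enc_in, dec_in], word_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```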
5.3 MODEL 2
FIG 5.3: IV3 + LSTM + Attention
The second model combines InceptionV3, Long Short-Term Memory, and an attention layer. It improves upon the first architecture by adding the attention layer.
The LSTM generates encodings while maintaining sequence and context from the
sequence of images. The attention layer differentially weights the significance of each
part of the input data, providing additional context by generating a context vector for
every timestep. This helps in generating meaningful sentences with richer context.
The attention mechanism was initially developed for machine translation by Bahdanau et al. (2015), who suggested that not only should all the input words be taken into account in the context vector, but relative importance should also be given to each one of them. In neural networks, attention is a technique that mimics cognitive attention. The
effect enhances some parts of the input data while diminishing other parts — the
thought being that the network should devote more focus to that small but important
part of the data. Learning which part of the data is more important than others
depends on the context and is trained by gradient descent.
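A minimal sketch of a Bahdanau-style additive attention layer, following the description above, is given below; it assumes Keras, and the layer and variable names are illustrative.

```python
# Hedged sketch of Bahdanau (additive) attention: each encoder timestep is
# scored against the current decoder state, the scores are softmaxed into
# weights, and a weighted sum of encoder outputs forms the context vector.
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = layers.Dense(units)   # projects encoder outputs (frame encodings)
        self.W2 = layers.Dense(units)   # projects the current decoder hidden state
        self.V = layers.Dense(1)        # one alignment score per timestep

    def call(self, enc_outputs, dec_state):
        # enc_outputs: (batch, timesteps, enc_units); dec_state: (batch, dec_units)
        dec_state = tf.expand_dims(dec_state, 1)
        scores = self.V(tf.nn.tanh(self.W1(enc_outputs) + self.W2(dec_state)))
        weights = tf.nn.softmax(scores, axis=1)        # relative importance of each frame
        context = tf.reduce_sum(weights * enc_outputs, axis=1)
        return context, weights
```

The context vector returned here is what the decoder consumes at every timestep, alongside the previously generated word.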
5.4 MODEL 3
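The report does not detail Model 3 in this section; from the results tables and the conclusion it pairs InceptionV3 frame features with a Transformer. The sketch below is therefore only an assumed arrangement, a standard Transformer-style decoder block with self-attention over the caption and cross-attention over the frame features, not the architecture actually used.

```python
# Assumed sketch of the Model 3 idea (InceptionV3 features + Transformer).
# Requires a recent TensorFlow (>= 2.10) for use_causal_mask; all sizes are
# illustrative assumptions, not the report's settings.
import tensorflow as tf
from tensorflow.keras import layers

class CaptionDecoderBlock(layers.Layer):
    def __init__(self, d_model=512, num_heads=8, ff_dim=2048):
        super().__init__()
        self.self_attn = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.cross_attn = layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.ffn = tf.keras.Sequential([layers.Dense(ff_dim, activation="relu"),
                                        layers.Dense(d_model)])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()
        self.norm3 = layers.LayerNormalization()

    def call(self, caption_emb, frame_features):
        # Masked self-attention over the partially generated caption.
        x = self.norm1(caption_emb +
                       self.self_attn(caption_emb, caption_emb, use_causal_mask=True))
        # Cross-attention: each caption position attends to the InceptionV3 frame features.
        x = self.norm2(x + self.cross_attn(x, frame_features))
        return self.norm3(x + self.ffn(x))
```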
5.5 RESULTS
The Bilingual Evaluation Understudy score, or BLEU for short, is a metric for evaluating a generated sentence against a reference sentence. A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0. The score was developed for evaluating the predictions made by automatic machine translation systems. It is not perfect, but it offers several compelling benefits: it is quick and inexpensive to calculate, easy to understand, language independent, and correlates reasonably well with human judgement.
The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper
“BLEU: a Method for Automatic Evaluation of Machine Translation“.
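A minimal sketch of how the BLEU-1 to BLEU-4 scores in the tables below can be computed for a single caption, assuming NLTK's sentence_bleu (the report does not name its evaluation library); the example sentences are placeholders.

```python
# Hedged sketch: computing BLEU-1..4 for one generated caption with NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "man", "is", "playing", "a", "guitar"]]   # tokenized ground-truth caption(s)
candidate = ["a", "man", "plays", "the", "guitar"]           # tokenized model output

smooth = SmoothingFunction().method1
for n in range(1, 5):
    # Uniform weights over the first n n-gram orders give BLEU-n.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = sentence_bleu(reference, candidate,
                          weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```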
METHODS                           BLEU-1   BLEU-2   BLEU-3   BLEU-4
InceptionV3 + LSTM                 08.14    16.17    19.05    19.11
InceptionV3 + LSTM + Attention     12.66    19.24    20.72    19.95
InceptionV3 + Transformer          33.53    25.53    25.83    20.90
CHAPTER 6
TESTING
6.1 RESULTS
METHODS                           BLEU-1   BLEU-2   BLEU-3   BLEU-4
InceptionV3 + LSTM                 08.09    16.11    19.05    19.06
InceptionV3 + LSTM + Attention     12.33    19.04    20.59    19.87
InceptionV3 + Transformer          26.31    21.27    22.93    18.70
Upon testing, we found that the Transformer performed best on the majority of the metrics, followed by the LSTM with the attention layer, and then by the plain LSTM. Transformers are the current state-of-the-art approach for Natural Language Processing tasks and gave the best results here; they have a deep understanding of language, which lets training focus on learning the task at hand. The attention layer provides an additional context vector that helps in generating better sentences by focusing only on the important aspects of the video. The plain LSTM is much simpler than the other models and therefore lags behind in performance.
CHAPTER 7
CONCLUSION
Our project was trained on only a subset of the TGIF dataset, yet it produced promising results. We worked on three models, all of which used InceptionV3 to generate frame encodings. The first was an LSTM encoder-decoder architecture. In the second, we added an attention layer to the LSTM, which noticeably improved the results. The third architecture uses a Transformer and is the current state of the art; it gave the best results among the models we used, although it is computationally quite expensive. The sentences generated by our models were accurate enough to serve as additional information in a real-life scenario such as vision assistance: the model produces sensible sentences and captures the majority of the context of the video, which yields meaningful descriptions. In the future, training the model on the entire TGIF dataset or on other datasets should give better results, and research into faster, lighter models with the same or better accuracy would move the system towards almost real-time sentence generation.
REFERENCES
Image and Video Caption Generation With Deep Learning: A Concise Review
and Algorithmic Overlap," in IEEE Access, vol. 8, pp. 218386-218400, 2020,
doi: 10.1109/ACCESS.2020.3042484.
Transactions on Image Processing, vol. 27, no. 11, pp. 5600-5611, Nov. 2018,
doi: 10.1109/TIP.2018.2855422.
6. H. Xiao and J. Shi, "Video Captioning With Adaptive Attention and
10. J. Son, W. Park, S. Lee and S. Kim, "Video scene title generation
based on explicit and implicit relations among caption words," 2018 20th
International Conference on Advanced Communication Technology (ICACT),
2018, pp. 571-573, doi: 10.23919/ICACT.2018.8323836.