
Detection And Prediction Of Object In Live Motion With

Voice Output

Bachelor of Technology
in

COMPUTER SCIENCE AND ENGINEERING

BABI NUNNAGUPPALA (20BCE2851)


S MUTHU KUMAR (20BCE2139)
SUBRAHMANYA ABHIRAM (20BCE2533)

Under the kind guidance of


Dr. ANNAPURNA JONNALAGADDA

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


VELLORE INSTITUTE OF TECHNOLOGY
October, 2022
Contents
1 Literature survey

2 SWOC/SWOT analysis

3 Proposed Model
3.1 Architecture / Framework
3.2 Methodology

4 Evaluation metrics

5 Results and analysis

6 Conclusions

7 Future scope

8 References
1 Literature survey
Paper - 1

A Survey on Recent Advances in AI and Vision-Based Methods for Helping and Guiding Visually Impaired People

[4] Published on: 23 February 2022

Existing models

CNN - Convolutional neural network

Gaps identified / Issues in the existing models

• Acquisition devices have a high cost.

• The adaptation of such models to specific conditions is difficult.

Paper - 2

Face detection and Recognition: A review

[2] Published on: February 2018

Existing models

face recognition algorithm

Gaps identified / Issues in the existing models

• The 1:N matching problem, where N is the number of records in the database.

• The privacy of the stored data must be maintained.


Paper - 3

Object Detection Based on the Improved Single Shot MultiBox Detector

[1] Published on: April 2019

Existing models

Single Shot MultiBox Detector

Gaps identified / Issues in the existing models

• Choosing the correct feature layers among the network's many layers is difficult.

• Noise increases in proportion to the increase in dimensionality.

Paper - 4

Object detection in real time based on improved single shot multibox detector algorithm

[3] Published on: 17 October 2020

Existing models

single shot multibox detector algorithm

Gaps identified / Issues in the existing models

• Problems can occur when an unknown object is present in the image.

• Occlusion and deformable objects remain difficult to handle.


2 SWOC/SWOT analysis
Strengths:

• Can be used in robotics and rovers; low cost and scope for future expansion.

Weaknesses:

• Security concerns.

Opportunities:

• Existing models are far from perfect.

• Strong demand among visually impaired people.

Threats:

• Could be used for the wrong purposes.

• Newer models may be more efficient.

• A malfunction could cause significant harm.

3 Proposed Model

3.1 Architecture / Framework

Figure 1: Information Optimization and Object–Obstacle Differentiation


3.2 Methodology

SINGLE SHOT DETECTOR (SSD)

SSD is an object detection model; object detection is different from image classification. Image classification only says what the picture or image contains, while object detection finds the different objects in the image and indicates where they are with bounding boxes. We use the Single Shot MultiBox Detector because it is a more efficient and faster algorithm than YOLO.
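A minimal sketch of running a pretrained SSD detector is given below; it assumes PyTorch and torchvision are installed, and the model variant, image path, and score threshold are illustrative choices rather than the project's exact configuration.

```python
# Minimal sketch: run a pretrained SSD on a single image (PyTorch/torchvision assumed).
import torch
from torchvision.io import read_image
from torchvision.models.detection import ssd300_vgg16, SSD300_VGG16_Weights

weights = SSD300_VGG16_Weights.DEFAULT
model = ssd300_vgg16(weights=weights)
model.eval()

image = read_image("street.jpg")                      # placeholder image path
batch = [weights.transforms()(image)]                 # preprocess to the model's expected input

with torch.no_grad():
    detections = model(batch)[0]                      # dict with 'boxes', 'labels', 'scores'

for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score > 0.5:                                   # illustrative confidence threshold
        print(weights.meta["categories"][int(label)], box.tolist(), float(score))
```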

Figure 2: Different layers in SSD

SSD has two components: a backbone model and an SSD head.

The backbone is a pre-trained image classification network used as a feature extractor, typically a network such as ResNet trained on ImageNet with its final fully connected classification layer removed. The SSD head is one or more convolutional layers added on top of this backbone; their outputs are interpreted as the bounding boxes and classes of objects at the spatial locations of the final layer's activations. We are therefore left with a deep neural network that can extract semantic meaning from the input image while preserving its spatial structure, albeit at a lower resolution. For a 224 × 224 input image, a ResNet-34 backbone produces 512 feature maps of size 7 × 7. SSD divides the image using a grid and makes each grid cell responsible for detecting objects in that region of the image; detecting an object means predicting the class and location of an object within that region.
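A minimal sketch of this backbone/head split is given below, assuming PyTorch and torchvision; the backbone is a ResNet-34 truncated before its pooling and classification layers, and the number of classes and anchors per cell are illustrative values.

```python
# Sketch: truncated ResNet-34 backbone plus one SSD-style prediction head.
import torch
import torch.nn as nn
from torchvision.models import resnet34

# Drop the average-pool and fully connected layers to keep only the feature extractor.
backbone = nn.Sequential(*list(resnet34(weights=None).children())[:-2])

num_classes = 21          # illustrative: 20 object classes + background
anchors_per_cell = 4      # illustrative number of anchor boxes per grid cell

# Head layers: for each grid cell, predict class scores and 4 box offsets per anchor.
class_head = nn.Conv2d(512, anchors_per_cell * num_classes, kernel_size=3, padding=1)
box_head = nn.Conv2d(512, anchors_per_cell * 4, kernel_size=3, padding=1)

x = torch.randn(1, 3, 224, 224)         # dummy input image
features = backbone(x)                  # -> (1, 512, 7, 7): a 7x7 grid of feature maps
print(features.shape)
print(class_head(features).shape)       # -> (1, anchors_per_cell * num_classes, 7, 7)
print(box_head(features).shape)         # -> (1, anchors_per_cell * 4, 7, 7)
```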
Anchor box

The anchor box with the highest degree of overlap with an object is responsible for predicting that object's class and location. This property is used both for training the network and, once the network has been trained, for predicting the detected objects and their locations. In practice, each anchor box is specified by an aspect ratio and a zoom level. Not all objects are square: some are shorter, some are longer, and some are wider, by varying degrees. The SSD architecture allows pre-defined aspect ratios for the anchor boxes to account for this. The different aspect ratios of the anchor boxes associated with each grid cell at each zoom/scale level can be specified using the ratios parameter.
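A minimal sketch of this matching rule, i.e. assigning each ground-truth object to the anchor box with the highest overlap, is given below; PyTorch is assumed, the box coordinates are illustrative, and the iou helper is written here only for the example.

```python
# Sketch: match each ground-truth box to the anchor with the highest IoU (overlap).
import torch

def iou(boxes_a, boxes_b):
    """Pairwise intersection-over-union for boxes in (x1, y1, x2, y2) format."""
    tl = torch.max(boxes_a[:, None, :2], boxes_b[None, :, :2])   # top-left of intersection
    br = torch.min(boxes_a[:, None, 2:], boxes_b[None, :, 2:])   # bottom-right of intersection
    inter = (br - tl).clamp(min=0).prod(dim=2)
    area_a = (boxes_a[:, 2:] - boxes_a[:, :2]).prod(dim=1)
    area_b = (boxes_b[:, 2:] - boxes_b[:, :2]).prod(dim=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

anchors = torch.tensor([[0., 0., 50., 50.], [25., 25., 75., 75.], [50., 50., 100., 100.]])
ground_truth = torch.tensor([[30., 30., 70., 70.]])

overlaps = iou(anchors, ground_truth)            # shape: (num_anchors, num_objects)
best_anchor = overlaps.argmax(dim=0)             # anchor responsible for each object
print(best_anchor)                               # -> tensor([1]); the middle anchor overlaps most
```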

Zoom Level

Anchor boxes do not have to be the same size as the grid cell; the user might be interested in finding both smaller and larger objects within a grid cell. The zooms parameter specifies how much the anchor boxes should be scaled up or down with respect to each grid cell.
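A minimal sketch of building anchor boxes for one grid cell from ratios and zooms, as described above, is given below; the function name, ratio values, and zoom values are illustrative.

```python
# Sketch: build anchor boxes (centre-x, centre-y, width, height) for one grid cell.
from itertools import product

def cell_anchors(cx, cy, cell_size, ratios=(1.0, 0.5, 2.0), zooms=(0.7, 1.0, 1.3)):
    """For a grid cell centred at (cx, cy), create one anchor per (ratio, zoom) pair."""
    anchors = []
    for ratio, zoom in product(ratios, zooms):
        w = cell_size * zoom * (ratio ** 0.5)     # wider boxes for ratio > 1
        h = cell_size * zoom / (ratio ** 0.5)     # taller boxes for ratio < 1
        anchors.append((cx, cy, w, h))
    return anchors

# Example: a 4x4 grid over a 300x300 image gives cells of size 75; anchors for one cell.
for anchor in cell_anchors(cx=37.5, cy=37.5, cell_size=75.0):
    print(anchor)
```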

MobileNet

This model is based on the MobileNet idea of depthwise separable convolutions, a form of factorized convolution that splits a standard convolution into a depthwise convolution and a 1 × 1 convolution, also called a pointwise convolution. The depthwise convolution applies a single filter to each input channel; the pointwise convolution then applies a 1 × 1 convolution to combine the outputs of the depthwise convolutions. A standard convolution both filters and combines the inputs into a new set of outputs in a single step, whereas the depthwise separable convolution splits this into two layers: a separate layer for filtering and a separate layer for combining. This factorization drastically reduces both the computation and the model size.
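A minimal sketch of a depthwise separable convolution block as described above is given below, assuming PyTorch; the channel sizes are illustrative.

```python
# Sketch: depthwise separable convolution = depthwise (per-channel) conv + 1x1 pointwise conv.
import torch
import torch.nn as nn

def depthwise_separable(in_channels, out_channels, stride=1):
    return nn.Sequential(
        # Depthwise: one filter per input channel (groups=in_channels), the filtering step.
        nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=stride,
                  padding=1, groups=in_channels, bias=False),
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        # Pointwise: 1x1 convolution that combines the filtered channels, the combining step.
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

block = depthwise_separable(32, 64)
x = torch.randn(1, 32, 112, 112)
print(block(x).shape)        # -> torch.Size([1, 64, 112, 112])
```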
4 Evaluation metrics
The model is trained using SGD with an initial learning rate of 0.001, momentum of 0.9, weight decay of 0.0005, and a batch size of 32. On the VOC2007 test set with an Nvidia Titan X, SSD achieves 59 FPS with an mAP of 74.3%, compared with Faster R-CNN at 7 FPS with an mAP of 73.2% and YOLO at 45 FPS with an mAP of 63.4%. This is the expected accuracy comparison for the different methods; SSD uses an input image size of 300 × 300 or 512 × 512.
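A minimal sketch of this training configuration (SGD with learning rate 0.001, momentum 0.9, weight decay 0.0005, batch size 32) is given below, assuming PyTorch; the model and dataset used here are stand-ins, not the actual SSD network or training data.

```python
# Sketch: SGD configured with the hyperparameters quoted above.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Conv2d(3, 16, kernel_size=3)            # stand-in for the SSD network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,
                            momentum=0.9,
                            weight_decay=0.0005)

dummy_data = TensorDataset(torch.randn(64, 3, 300, 300))   # stand-in for the training set
train_loader = DataLoader(dummy_data, batch_size=32, shuffle=True)
```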

Common Objects in Context (COCO) is used as the dataset. It is a large-scale object detection, segmentation, key-point detection, and captioning dataset consisting of 328K images; the first version of the MS COCO dataset was released in 2014.

Features and characteristics of the dataset:

The following are the major characteristics and features of the COCO dataset (a loading sketch is given after the list):


• Object segmentation
• Recognition in context
• Super pixel stuff segmentation
• 330K images (>200K labeled)
• 1.5 million object instances
• 80 object categories
• 91 stuff categories
• 5 captions per image
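A minimal sketch of loading COCO detection annotations is given below, assuming torchvision with pycocotools and a locally downloaded copy of the dataset; the directory paths are placeholders.

```python
# Sketch: load COCO-style detection data with torchvision (paths are placeholders).
from torchvision.datasets import CocoDetection
from torchvision import transforms

dataset = CocoDetection(
    root="coco/train2017",                          # directory of images (assumed downloaded)
    annFile="coco/annotations/instances_train2017.json",
    transform=transforms.ToTensor(),
)

image, targets = dataset[0]
print(image.shape)                                  # image tensor (C, H, W)
print(len(targets), "annotated object instances")   # each target has 'bbox' and 'category_id'
for t in targets[:3]:
    print(t["category_id"], t["bbox"])              # bbox in (x, y, width, height) format
```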
5 Results and analysis
Seamless voice output runs in parallel with object detection. Object detection itself is common, but speaking out what is being detected is new. When implementing the voice output on top of object detection, the speech was placed in a separate function running on its own thread so that both functionalities could work in parallel. The model detected multiple objects in front of it, but when several objects were detected the code continuously read the objects out alternately and over one another, so the output was not audible. To fix this, instead of iterating through all the detected objects, we chose the first object in the list, so that only one object at a time is detected and spoken. There was also a minor bug where the website sometimes encountered null values; a null check was added to cope with this problem.
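A minimal sketch of the fix described above, i.e. speaking only the first detected object on a separate thread with a null check, is given below; pyttsx3 is assumed as the text-to-speech library, and the function and variable names are illustrative rather than taken from the actual code.

```python
# Sketch: announce only the first detected object, on a separate thread, with a null check.
import threading
import pyttsx3

def speak(text):
    engine = pyttsx3.init()          # text-to-speech engine (pyttsx3 assumed installed)
    engine.say(text)
    engine.runAndWait()

def announce_detections(detected_labels):
    # Null/empty check: skip frames where detection returned nothing.
    if not detected_labels:
        return
    first_object = detected_labels[0]            # speak only the first object, not all of them
    thread = threading.Thread(target=speak, args=(first_object,))
    thread.start()                               # runs in parallel with the detection loop

announce_detections(["person", "chair", "bottle"])   # says "person" only
```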

6 Conclusions
An assistive system is proposed through which visually impaired persons can perceive their surroundings and objects in real time and navigate independently. Deep learning-based object detection, combined with various distance sensors, is used to make the user aware of obstacles and to provide safe navigation, with all information delivered to the user as audio. A list of objects highly relevant to visually impaired people is collected, and the dataset is prepared manually and used to train the deep learning model for multiple epochs. Images are augmented and manually annotated to achieve more robustness. The results demonstrate 95.19% object detection accuracy and 99.69% object recognition accuracy in real time. The proposed work takes 0.3 s for multi-instance and multi-object detection from the captured image, which in certain scenarios is less time than a non-visually impaired person would need. The proposed assistive system gives more information with higher accuracy in real time for visually challenged people, and it can easily differentiate between objects and obstacles appearing in front of the camera.

7 Future scope
Future work will focus on including more objects in the dataset, making the system more effective in assisting visually impaired people. More sensors will be added to detect, for example, descending stairs and other path conditions, giving a wider range of assistance to the visually impaired.
8 References

References
[1] Songmin Jia, Chentao Diao, Guoliang Zhang, Ao Dun, Yanjun Sun, Xiuzhi Li, and Xiangyin Zhang.
Object detection based on the improved single shot multibox detector. In Journal of Physics: Conference
Series, volume 1187, page 042041. IOP Publishing, 2019.

[2] Ashu Kumar, Amandeep Kaur, and Munish Kumar. Face detection techniques: a review. Artificial
Intelligence Review, 52(2):927–948, 2019.

[3] Ashwani Kumar, Zuopeng Justin Zhang, and Hongbo Lyu. Object detection in real time based on
improved single shot multi-box detector algorithm. EURASIP Journal on Wireless Communications
and Networking, 2020(1):1–18, 2020.

[4] Hélène Walle, Cyril De Runz, Barthélemy Serres, and Gilles Venturini. A survey on recent advances in AI and vision-based methods for helping and guiding visually impaired people. Applied Sciences, 12(5):2308, 2022.
