Voice Output
Bachelor of Technology
2 SWOC/SWOT analysis
3 Proposed Model
3.1 Architecture / Framework
3.2 Methodology
4 Evaluation metrics
6 Conclusions
7 Future scope
8 References
1 Literature survey
Paper - 1
A Survey on Recent Advances in AI and Vision-Based Methods for Helping and Guiding Visually
Impaired People
Existing models
Paper - 2
Existing models
Existing models
Paper - 4
Object detection in real time based on improved single shot multibox detector algorithm
Existing models
• Can be used in robotics and rovers at low cost, with possible expansion in the future.
Weakness:
• Security.
Opportunity:
Threat:
3 Proposed Model
SSD is an object detection model, which differs from image classification. Image classification says what
the image as a whole shows, while object detection finds the different objects in the image and localizes
them with bounding boxes. We use SSD (Single Shot MultiBox Detector) since it is a faster and more
efficient algorithm than YOLO.
The backbone model is essentially a trained image classification network used as a feature extractor. It is
typically a network such as ResNet, trained on ImageNet, from which the final fully connected classification
layer has been removed.
The SSD head is just one or more convolutional layers added to this backbone; its outputs are interpreted
as the bounding boxes and classes of objects at the spatial locations of the final layer's activations. We are
hence left with a deep neural network that can extract semantic meaning from the input image while
preserving its spatial structure, albeit at a lower resolution.
For an input image, the ResNet34 backbone produces a 7x7 grid of feature maps (512 channels). SSD divides
the image using a grid and makes each grid cell responsible for detecting objects in that region of the image.
Detecting an object means predicting the class and location of the object within that region.
Anchor box
The anchor box with the highest degree of overlap with an object is responsible for predicting that object's
class and location. This property is used both for training the network and, once the network has been
trained, for predicting the detected objects and their locations. In practice, each anchor box is specified by
an aspect ratio and a zoom level. Not all objects are square in shape: some are shorter, some are longer,
and some are wider, by varying degrees. The SSD architecture allows pre-defined aspect ratios of the
anchor boxes to account for this. The different aspect ratios can be specified using the ratios parameter of
the anchor boxes associated with each grid cell at each zoom/scale level.
Zoom Level
Anchor boxes do not have to be the same size as the grid cell; the user might be interested in finding both
smaller and larger objects within a grid cell. The zooms parameter specifies how much the anchor boxes
need to be scaled up or down with respect to each grid cell.
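A minimal sketch of how anchor boxes with ratios and zooms parameters could be generated per grid cell (an illustrative example, not the report's implementation; the square-root split of the aspect ratio keeps the box area constant across ratios):

```python
import itertools
import math

def anchor_boxes(grid_size, ratios, zooms):
    """Generate (cx, cy, w, h) anchors in relative [0, 1] image coordinates,
    one per (grid cell, aspect ratio, zoom level) combination."""
    cell = 1.0 / grid_size
    boxes = []
    for row, col in itertools.product(range(grid_size), repeat=2):
        cx = (col + 0.5) * cell  # anchor centred on the grid cell
        cy = (row + 0.5) * cell
        for ratio, zoom in itertools.product(ratios, zooms):
            w = cell * zoom * math.sqrt(ratio)  # wider when ratio > 1
            h = cell * zoom / math.sqrt(ratio)  # taller when ratio < 1
            boxes.append((cx, cy, w, h))
    return boxes

# A 4x4 grid with 3 aspect ratios and 2 zoom levels -> 4*4*3*2 = 96 anchors.
anchors = anchor_boxes(4, ratios=[0.5, 1.0, 2.0], zooms=[0.7, 1.0])
print(len(anchors))  # 96
```

With ratio 1.0 and zoom 1.0 the anchor coincides exactly with its grid cell; other combinations stretch or scale it.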
MobileNet
This model is based on the MobileNet architecture, which is built on depthwise separable convolutions, a
form of factorized convolution. A depthwise separable convolution factorizes a standard convolution into a
depthwise convolution followed by a 1 × 1 convolution, also called a pointwise convolution. The depthwise
convolution applies a single filter to each input channel; the pointwise convolution then applies a 1 × 1
convolution to combine the outputs of the depthwise convolution. A standard convolution both filters and
combines the inputs into a new set of outputs in one single step, whereas the depthwise separable convolution
splits this into two layers: one for filtering and one for combining. This factorization drastically reduces
both the computation and the model size.
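The parameter saving from this factorization can be seen directly in PyTorch, where setting groups equal to the number of input channels turns a convolution into a depthwise one (an illustrative example with arbitrary channel counts):

```python
import torch.nn as nn

in_ch, out_ch = 32, 64

# Standard 3x3 convolution: filters and combines channels in one step.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depthwise separable version: one 3x3 filter per input channel
# (groups=in_ch), then a 1x1 pointwise convolution to combine channels.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
    nn.Conv2d(in_ch, out_ch, kernel_size=1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 18496 2432
```

Here the separable form needs roughly 7.6x fewer parameters than the standard convolution for the same input/output shapes.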
4 Evaluation metrics
The model is trained using SGD with an initial learning rate of 0.001, momentum 0.9, weight decay 0.0005,
and batch size 32. On the VOC2007 test set with an Nvidia Titan X, SSD achieves 59 FPS with 74.3% mAP,
versus 7 FPS with 73.2% mAP for Faster R-CNN and 45 FPS with 63.4% mAP for YOLO. SSD uses an
input image size of 300 × 300 or 512 × 512.
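The training configuration above maps directly onto a PyTorch SGD optimizer; the model below is only a placeholder, not the actual SSD network:

```python
import torch

model = torch.nn.Linear(10, 4)  # placeholder standing in for the SSD network

# Hyperparameters as stated in the evaluation setup.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,           # initial learning rate
    momentum=0.9,
    weight_decay=0.0005,
)
batch_size = 32
```

A learning-rate schedule would typically decay this initial rate during training.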
Common Objects in Context (COCO) is used as the dataset. It is a large-scale object detection,
segmentation, key-point detection, and captioning dataset consisting of 328K images. The first version of
the MS COCO dataset was released in 2014.
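The mAP scores reported above are computed by matching predicted boxes to ground-truth boxes via intersection over union (IoU); a minimal sketch of that underlying measure:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping in a 1x1 patch: IoU = 1 / (4 + 4 - 1) = 1/7.
print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.1429
```

VOC-style mAP counts a prediction as correct when its IoU with a ground-truth box exceeds a threshold (0.5 for VOC; COCO averages over thresholds from 0.5 to 0.95).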
6 Conclusions
An assistive system is proposed for visually impaired persons through which they can perceive their
surroundings and objects in real time and navigate independently. Deep learning-based object detection,
assisted by various distance sensors, is used to make the user aware of obstacles and to provide safe
navigation, with all information delivered to the user as audio. A list of objects highly relevant to visually
impaired people is collected, and the dataset is prepared manually and used to train the deep learning model
for multiple epochs. Images are augmented and manually annotated to achieve more robustness. The results
demonstrate 95.19% object detection accuracy and 99.69% object recognition accuracy in real time. The
proposed work takes 0.3 s for multi-instance, multi-object detection from the captured image, which is less
than a non-visually-impaired person needs in certain scenarios. The proposed assistive system gives more
information with higher accuracy in real time for visually challenged people, and can easily differentiate
between objects and obstacles appearing in front of the camera.
7 Future scope
Future work will focus on including more objects in the dataset, which will make the system more effective
for assisting visually impaired people. More sensors will be added to detect, for example, descending stairs
and other trajectories, giving a wider range of assistance to the visually impaired.
8 References
[1] Songmin Jia, Chentao Diao, Guoliang Zhang, Ao Dun, Yanjun Sun, Xiuzhi Li, and Xiangyin Zhang.
Object detection based on the improved single shot multibox detector. In Journal of Physics: Conference
Series, volume 1187, page 042041. IOP Publishing, 2019.
[2] Ashu Kumar, Amandeep Kaur, and Munish Kumar. Face detection techniques: a review. Artificial
Intelligence Review, 52(2):927–948, 2019.
[3] Ashwani Kumar, Zuopeng Justin Zhang, and Hongbo Lyu. Object detection in real time based on
improved single shot multi-box detector algorithm. EURASIP Journal on Wireless Communications
and Networking, 2020(1):1–18, 2020.
[4] Hélène Walle, Cyril De Runz, Barthélemy Serres, and Gilles Venturini. A survey on recent advances
in ai and vision-based methods for helping and guiding visually impaired people. Applied Sciences,
12(5):2308, 2022.