
Proceedings of the Third International Conference on Smart Systems and Inventive Technology (ICSSIT 2020)

IEEE Xplore Part Number: CFP20P17-ART; ISBN: 978-1-7281-5821-1

Smart Cap: A Deep Learning and IoT Based Assistant for the Visually Impaired

Amey Hengle, Computer Engineering Department, PVG's COET, SPPU, Pune, India (ameyhengle22@gmail.com)
Atharva Kulkarni, Computer Engineering Department, PVG's COET, SPPU, Pune, India (atharva.j.kulkarni1998@gmail.com)
Nachiket Bavadekar, Computer Engineering Department, PVG's COET, SPPU, Pune, India (nachiket.bavadekar@gmail.com)
Niraj Kulkarni, Computer Engineering Department, PVG's COET, SPPU, Pune, India (nirajkulkarni2609@gmail.com)
Rutuja Udyawar, Data Science and AI, Optimum Data Analytics, Pune, India (rutuja.udyawar@odaml.com)

Abstract—India is home to the largest number of visually impaired people in the world, about 40 million, which accounts for 20% of the world's blind population. Moreover, more than 90% of these people have little to no access to the necessary assistive technologies. The paper proposes 'Smart Cap,' a first-person vision-based assistant, aimed at bringing the world as a narrative to the visually impaired people of India. The Smart Cap acts as a conversational agent bringing together the disciplines of the Internet of Things and Deep Learning, and provides features like face recognition, image captioning, text detection and recognition, and online newspaper reading. The hardware architecture consists of a Raspberry Pi, a USB webcam, a USB microphone, earphones, a power source, and extension cables. The user can interact with the Smart Cap by giving specific commands, which trigger the corresponding feature module that returns an audio output. The face recognition module is based on dlib's face recognition project. It is a two-step process of detecting a face in the image and identifying it. The image captioning task synthesizes an attention-based CNN-LSTM encoder-decoder model coupled with beam search for finding the best caption. Google's Vision API service is used for text detection and recognition. An additional feature of online newspaper reading is also provided, thus keeping the blind person up to date with the daily news.

Keywords—Assistive Technologies, Raspberry Pi, Face Recognition, Image Captioning, Text Recognition, OCR, News Scraping.

I. INTRODUCTION
IoT, artificial intelligence, and cloud services are the three digital pillars of contemporary industry. By leveraging these technologies, the process of gathering, storing, sharing, and computing data has become far easier. A. P. Pandian [1] and D. Nirmal [2] discuss how these technologies are revolutionizing the areas of automated logistics and electrical transmission and distribution systems, respectively. The same idea can be translated to assistive devices. Vision plays a crucial role in comprehending the world around us, as more than 85% of external information is obtained through the vision system. It largely influences our mobility, cognition, information access, and interaction with the environment as well as with other people. Blindness is a condition of lacking visual perception due to neurological or physiological factors. Partial blindness is a result of a lack of integration in the growth of the visual centre of the eye or the optic nerve, while total blindness is the full absence of visual light perception [3]. According to the World Health Organization's recent statistics, there are approximately 253 million people with vision impairments in the world, among whom 36 million are completely blind and 217 million have moderate-to-severe vision impairments. With significant advancements in computer vision and language models, assistive technologies hold great promise in easing the lives of these visually impaired people. The Smart Cap proposed in this paper will help visually impaired people in many ways, such as by describing the surroundings, recognizing familiar faces, reading out texts, and providing the latest information via an online newspaper.

The face recognition task is based on dlib's face recognition library [12]. It is a two-step process of face detection followed by face identification. Face detection is done using a Histogram of Oriented Gradients (HOG) and Linear Support Vector Machine (LSVM) model [13]. Feature extraction is carried out by dlib's facial landmark detectors [14]. Finally, a face is identified by comparing the extracted facial feature vector with the entries stored in the user's database. The image captioning module consists of an attention-based CNN-LSTM encoder-decoder architecture [15]. The Resnet-101 model [16] is used as the encoder, while an LSTM [15] decoder with attention, coupled with beam search, is used to generate the best possible caption for the input image. Google's Vision API is used for text detection and recognition [22]. The results are further improved using a combination of standard deviation, z-score, and k-means clustering. The online newspaper reading module uses the feedparser and newspaper3k libraries to download and parse articles from certain online newspapers. The Raspberry Pi houses all these modules. Whenever the user wants to use any of the functionalities mentioned above, the user issues a voice command to trigger the corresponding module. Depending on the command, the webcam retrofitted on the Smart Cap clicks a photo and sends it as an input to the respective module. All four modules return a result in text format, which is converted to audio by the pyttsx3 (python text-to-speech) library. Thus, in this work, a simple, cheap, and user-friendly smart assistant system is designed and implemented to improve the quality of life of both the blind and the visually impaired.
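To make this voice-driven pipeline concrete, the sketch below shows one possible top-level dispatch loop. It assumes the SpeechRecognition and pyttsx3 packages and uses placeholder handler functions standing in for the four feature modules; it is an illustrative sketch under those assumptions, not the authors' implementation.

```python
import speech_recognition as sr
import pyttsx3

# Placeholder handlers standing in for the four feature modules described above.
def run_face_recognition():  return "Your friend is in front of you."
def run_image_captioning():  return "A man is sitting at a table with a laptop."
def run_text_recognition():  return "The document reads: welcome to the library."
def run_newspaper_reading(): return "Here is today's top headline."

COMMANDS = {
    "who is in front of me":    run_face_recognition,
    "describe my surroundings": run_image_captioning,
    "read me the text":         run_text_recognition,
    "tell me the news":         run_newspaper_reading,
}

def speak(text):
    """Convert a text result to audio, as the paper does with pyttsx3."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

def listen_once(recognizer):
    """Record one utterance from the USB microphone and convert it to text."""
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio).lower()
    except sr.UnknownValueError:
        return ""

if __name__ == "__main__":
    recognizer = sr.Recognizer()
    while True:
        command = listen_once(recognizer)
        handler = COMMANDS.get(command)
        if handler is None:
            speak("Sorry, please repeat the command.")
        else:
            speak(handler())
```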


II. RELATED WORKS
For the blind, research on assistive technologies has traditionally been focused on three main areas: mobility assistance, information transmission, and computer access [4]. Mobility assistance focuses on navigation by scanning the user's immediate environment and conveying the gathered information back to the user via audio or tactile
feedback. Information transmission aims at optical character
recognition and information rendering from 2D and 3D
scenes. Computer access-based solutions are Braille output
terminals, voice synthesizers, and screen magnifiers. Over
the years, a variety of assistive devices have been developed,
ranging from devices worn on the fingers, feet, and arms, to
devices worn on the tongue, head, and waist [4]. As humans,
we mostly rely on head motion to gather information from the
environment. Hence, it is easy to deduce that head-mounted
devices provide the freedom of motion for environment
scanning, and thus, ease the process of gathering information.
With the emergence of Android and iOS operating systems,
several Android and iOS applications have been developed to
support the visually impaired people to perform their daily
tasks as well as aid them in navigation. Thus, Head-mounted
devices (HMDs) along with Android and iOS apps, are the
most popular type of assistive devices. An experimental
analysis of a sign-based wayfinding system for the blind is
proposed by Manduchi R. [5]. They use a camera cell phone
to detect specific colour markers using specialized computer
vision algorithms, to assist the blind person in navigation. A
smart infrared microcontroller-based electronic travel aid (ETA) is presented by Amjed S. Al-Fahoum et al. [3], which makes use of infrared sensors to scan a predetermined area around the blind person by emitting and reflecting infrared waves. The PIC microcontroller determines the direction and distance of the objects around the blind person and alerts the user about the obstacle's shape, material, and direction. However, both these works are only useful for navigation purposes and do not incorporate recent state-of-the-art deep learning techniques and cloud services to provide other essential features like face recognition, scene understanding, and optical character recognition. Kavya et al. [6] have developed an Android app that exercises Google's cloud services. The chatbot-based app uses Google's Vision API for object detection, landmark recognition, and optical character recognition, and relies on a voice-based chatbot trained with Google's Dialogflow to interact with the user. The
system proposed by A. Nishajith et al. [7] helps the blind to
navigate independently using real-time object detection and
identification. The project implements a Tensorflow object
detection API (ssd_mobilenet_v1_coco model) on a
Raspberry Pi. A Text to Speech Synthesiser (TTS) software
called eSpeak is used for converting the details of the detected
object from text to audio.
III. PROPOSED SYSTEM
The smart cap is an assistive device that takes audio commands and images as input and returns results in audio format. All the hardware components of the smart cap are listed in Table I. The use of an HD webcam ensures that the captured images are clear, and the 128 GB SD card ensures that no memory overrun takes place. The radio-frequency effect of the Raspberry Pi on the user is within an acceptable range, as it reports a highest Specific Absorption Rate, or SAR (0.5 cm gap), value of 1.19 W/kg, which is lower than the threshold SAR value of 1.6 W/kg [9]. Moreover, a backpack is used to house the Raspberry Pi to reduce its proximity to the user's head. The architecture diagram (Fig. 1.1) illustrates how the various components in the system interact with one another. Fig. 1.2 depicts the setup of the smart cap.

Fig. 1.1. Diagram of system architecture

TABLE I. HARDWARE COMPONENTS OF THE SMART CAP

Sr No | Component | Specification
1 | Raspberry Pi 4B | 2 GB RAM, 128 GB SD card
2 | Power Bank | 5.1 V and 3000 mA
3 | Web Camera | Microsoft LifeCam HD-6000, 720p HD
4 | USB Microphone | Harry and James noise-cancellation mini USB microphone
5 | USB Speakers/Headphones | Sony headphones, 3.5 mm jack
6 | Extension Cable | 1 m long male-to-female extension cable
7 | USB Dongle/Wi-Fi | Min. 1 Mbps speed


Fig. 1.2. Hardware components of Smart Cap

Fig. 1.3. User flow diagram depicting the working flow of the system

The workflow of the system is depicted in Fig. 1.3. The system starts by powering the Raspberry Pi, which initializes the webcam and loads all the machine learning models into memory. The system developed by Sarfraz M. et al. [8] depicts how interaction through an audio interface helps a blind person gain vital information, such as who is staring at him and who is in the room. It provides specific voice commands to trigger the respective functionality. Likewise, this paper proposes a similar real-time multimodal system that uses audio commands such as "who is in front of me" to trigger the face recognition module, "describe my surroundings" to trigger the scene description module, "read me the text" for the OCR module, and finally "tell me the news" for the online newspaper reading option. Once initialized, the smart cap listens continuously, awaiting the user's command. The audio command is converted to text using Google's speech-to-text library. If the user issues an invalid command, the smart cap instructs the user to re-input the command. When the user gives a valid command, the specific module is triggered. For face recognition, image captioning, and text recognition, the smart cap clicks a photo using the webcam. If the clicked photo does not conform to the clarity and orientation standards expected by the respective module, a new photo is clicked. This photo is sent as an input to the respective module, which processes the image and returns a result in the form of text. For the newspaper module, online news articles are scraped and downloaded into a text file. The text result returned by all the modules is converted to audio using the pyttsx3 (python text-to-speech) library. The entire processing takes place on the onboard Raspberry Pi. Finally, the earphones connected to the Raspberry Pi convey the audio output to the user.
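The capture-and-retake step described above can be sketched as follows. The variance-of-the-Laplacian blur check and its threshold are assumptions standing in for the unspecified clarity test, so the snippet is illustrative rather than the authors' code.

```python
import cv2

BLUR_THRESHOLD = 100.0  # hypothetical sharpness cutoff; tune on real captures

def capture_clear_photo(camera_index=0, max_attempts=5):
    """Grab a frame from the USB webcam, retaking it while it looks blurry."""
    cap = cv2.VideoCapture(camera_index)
    try:
        for _ in range(max_attempts):
            ok, frame = cap.read()
            if not ok:
                continue
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
            if sharpness >= BLUR_THRESHOLD:
                return frame
        return None  # caller can prompt the user and try again
    finally:
        cap.release()
```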
IV. SYSTEM DESCRIPTION
A. Face Recognition Module
From early research studies in the face recognition literature that involved handcrafted features extracted by domain experts in computer vision, to the pioneering work of the Viola-Jones object detection framework [10], and now the introduction of Faster R-CNN [11], face recognition has changed monumentally. However, the importance of accurate face identification has remained a constant, more so in applications aiding the visually impaired.

Face recognition is a two-step process of face detection followed by face identification. Dlib's HOG + Linear SVM based model [13], when deployed on the Raspberry Pi, outperforms its CNN-based counterpart in terms of response time. Hence, it is used for detecting a face in the input image. If a face is detected, it is passed to dlib's facial landmark detection model, which detects the (x, y) coordinates of 194 key points on the face [14]. These annotations are based on the 194-point HELEN dataset on which dlib's facial landmark predictor is trained [14]. These mapped facial features are stored in the form of a 128-dimensional NumPy array, or feature vector. The faces of people known to the user are stored locally on the Raspberry Pi. To speed up the prediction process, the feature vectors of these images are computed when the system is initialized. Finally, face recognition is done by comparing the input image's feature vector with the pre-stored images' feature vectors. The name associated with the feature vector having the highest similarity value is returned as the identified person.
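A minimal sketch of this detect-encode-compare flow is given below. It assumes the face_recognition package (a Python wrapper around dlib) and treats similarity as one minus the Euclidean face distance so that the paper's 0.7 threshold can be applied; it is illustrative, not the authors' exact implementation.

```python
from typing import Dict
import numpy as np
import face_recognition  # dlib-based; pip install face_recognition

SIMILARITY_THRESHOLD = 0.7  # threshold used in the paper for a known face

def load_known_faces(image_paths: Dict[str, str]) -> Dict[str, np.ndarray]:
    """Pre-compute one 128-d encoding per known person at system start-up."""
    known = {}
    for name, path in image_paths.items():
        image = face_recognition.load_image_file(path)
        encodings = face_recognition.face_encodings(image)
        if encodings:
            known[name] = encodings[0]
    return known

def identify(frame: np.ndarray, known: Dict[str, np.ndarray]) -> str:
    """Detect a face with the HOG detector and match it against stored encodings."""
    boxes = face_recognition.face_locations(frame, model="hog")
    encodings = face_recognition.face_encodings(frame, boxes)
    if not encodings:
        return "No face detected."
    # Similarity taken here as 1 - Euclidean face distance (an assumption).
    distances = face_recognition.face_distance(list(known.values()), encodings[0])
    similarities = 1.0 - distances
    best = int(np.argmax(similarities))
    if similarities[best] >= SIMILARITY_THRESHOLD:
        return f"{list(known.keys())[best]} is in front of you."
    return "An unknown person is in front of you."
```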
B. Image Captioning Module
Image captioning is the process of generating a textual description of an image based on the objects and actions in it. With the recent progress in computer vision and language models, image captioning can significantly help visually impaired people get a better understanding of their surroundings. The image captioning module proposed in this paper is based on the attention-based encoder-decoder architecture presented by Kelvin Xu et al. [15] in their research paper 'Show, Attend and Tell'.

1) Encoder:
Fig. 2 depicts the attention-based encoder-decoder model used in this study. An encoder is a neural network model that generates a fixed-size internal vector representation of the input image.


Fig. 2. Encoder-Decoder architecture

Instead of the VGG-16 model used by Kelvin Xu et al. [15], the Resnet-101 (Residual Convolutional Neural Network with 101 layers) model [16] is used as the encoder, which has been pretrained on the ImageNet image classification dataset. Resnet-101 is preferred over the VGG-16 model as it not only addresses the problem of vanishing gradients, but also has fewer learnable parameters, higher accuracy, and lower top-1 and top-5 error rates. The last two layers, viz. the pooling and the linear layer, are discarded, as the image needs to be encoded, not classified. The encoder encodes the input RGB image with three colour channels into a smaller image with more learned channels. This smaller encoded image contains a summarized representation of all the useful contents in the original image. Thus, the model iteratively creates a smaller and smaller representation of the original image, with each subsequent representation being more learned and having a greater number of channels than the previous one. The encoder finally outputs a feature vector encoding of size 14x14x2048.
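A minimal PyTorch sketch of such an encoder, assuming torchvision's pretrained ResNet-101 with an adaptive pooling layer to fix the 14x14 spatial size, is shown below; it illustrates the idea rather than reproducing the authors' code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionEncoder(nn.Module):
    """ResNet-101 feature extractor returning a 14 x 14 x 2048 encoding."""
    def __init__(self, encoded_size=14):
        super().__init__()
        resnet = models.resnet101(pretrained=True)
        # Drop the final average-pool and fully connected layers:
        # spatial features are needed, not class scores.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Resize the feature map to a fixed spatial size for the attention decoder.
        self.pool = nn.AdaptiveAvgPool2d((encoded_size, encoded_size))

    def forward(self, images):
        features = self.backbone(images)      # (batch, 2048, H/32, W/32)
        features = self.pool(features)        # (batch, 2048, 14, 14)
        return features.permute(0, 2, 3, 1)   # (batch, 14, 14, 2048)

# Example: a batch of two normalized 256 x 256 RGB images.
encoder = CaptionEncoder().eval()
with torch.no_grad():
    print(encoder(torch.randn(2, 3, 256, 256)).shape)  # torch.Size([2, 14, 14, 2048])
```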
2) Decoder with attention:
The decoder model is used to generate a word-by-word meaningful caption for the input image. A Long Short-Term Memory network (LSTM) [17] is used as the decoder, to which the previously generated word and the output of the encoder are fed as inputs. The problem with a decoder without attention is that, while generating a single word of the caption, the decoder looks at the entire image representation. This approach is not very efficient, as usually each word of the caption is obtained by focusing on different, specific parts of the image. Thus, a decoder with an attention network is employed, as proposed by Kelvin Xu et al. [15], so that while generating every word of the caption, the decoder takes into consideration the previous word as well as a specific subregion of the image, leading to a more accurate description. The attention mechanism identifies these specific subregions by taking the context and all the subregions as input and outputting their weighted arithmetic mean. The context refers to the output of the decoder at the previous time-step. Soft attention is used, as it is deterministic, where the weights of the pixels add up to 1. The attention network consists of two linear layers which transform the encoded image and the previous time-step output of the decoder, respectively, to the same dimensions. These outputs are added and ReLU-activated. A third linear layer applies softmax activation and flattens the output to generate the weighted average. The decoder uses this weighted representation of the image and the word generated at the previous time-step to generate the next word at each step [15].

3) Beam Search:
Caption generation is a natural language processing task which involves generating a sequence of words. When choosing the candidate words for the caption sequence, greedy prediction is a popular technique, which considers the highest-scoring word at each time-step. However, this method does not always yield optimal results, as the entire generated sequence is dependent on the first word. Thus, if the first word is wrong, the entire predicted caption is sub-optimal. To overcome this drawback, a better method is to use beam search, which keeps track of the k most likely words at each time-step, where k is a hyperparameter called the beam width. The step-by-step working of beam search is as follows:
1. At the first time-step, the beam search algorithm selects the k words with the highest conditional probability score out of all the words in the vocabulary.
2. At each successive time-step, it selects the k word combinations with the highest conditional probability score from the k x vocabulary-size possible word combinations, based on the k words selected at the previous time-step.
3. A sequence terminates when the <end> token for that caption is obtained.
4. After k sequences terminate, the caption with the best overall score is chosen.

4) Training:
The MSCOCO'14 dataset [18] is used for training the encoder-decoder model. 83K and 41K images, along with their captions, are used for training and validation, respectively. The input images are normalized, standardized, and resized to a fixed size of 256x256. The captions are padded with <start> and <end> tokens to mark the start and
end of the caption, respectively. The max caption length is set to 50. The model is trained using the cross-entropy loss and Adam optimizer. Beam search is used to select the optimal word at each time-step of the caption generation. The model's performance is evaluated using the automated Bilingual Evaluation Understudy (BLEU) [15].
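The beam-search decoding described above can be written generically as in the sketch below; the step function, the <start>/<end> token ids, and the scoring details are assumptions made for illustration and not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

START, END = 0, 1  # hypothetical vocabulary ids for the <start> and <end> tokens

def beam_search(step_fn: Callable[[Sequence[int]], Sequence[float]],
                beam_width: int = 3, max_len: int = 50) -> List[int]:
    """Keep the k highest-scoring partial captions at every time-step.

    step_fn maps a partial token sequence to per-word log-probabilities.
    """
    beams: List[Tuple[List[int], float]] = [([START], 0.0)]
    finished: List[Tuple[List[int], float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = step_fn(tokens)
            for word_id, lp in enumerate(log_probs):
                candidates.append((tokens + [word_id], score + lp))
        # Keep only the k best expansions across all current beams.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == END:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished beams if none terminated
    return max(finished, key=lambda c: c[1])[0]
```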
C. Text Recognition Module
Incorporating text detection and recognition in assistive devices has become much easier due to recent developments in computer vision, image processing, and cloud services. Image processing techniques like Sobel edge detection, Otsu binarization, and connected component extraction, used by N. Ezaki et al. [19], do not generalize well for all kinds of images with text. On the other hand, heavy-duty deep learning models like the combination of VGG16 and the Connectionist Text Proposal Network (CTPN) proposed by Lei Fei et al. [20], and EAST (Efficient and Accurate Scene Text detector) by X. Zhou et al. [21], are not suitable for deployment on the Raspberry Pi due to its limited computational capabilities. Hence, Google's Vision API [22] is employed, which gives accurate results without hampering the latency. The API is robust enough to detect text from documents as well as natural scene images. It returns a structured hierarchical response of the detected text, which is organized into pages, blocks, paragraphs, words, and symbols, along with their x and y coordinates.

However, the API returns all the text present in the image, which is not needed every time. The unnecessary extra text outside the region of interest is termed the outlier text region. Furthermore, the API is also not able to return the text in the proper top-to-bottom, left-to-right order when the text in the image is divided into two columns. Thus, to counter these problems, the z-score is utilized. The z-score describes the position of the text regions relative to their mean. To deal with the outlier text regions, the text regions with an absolute z-score value of more than 1.75 are eliminated. The z-score is calculated with respect to the x-coordinates of the text regions. If an outlier is detected, the z-score is calculated again for the remaining text regions. The z-score is also used to determine whether the text in an image is divided into two columns. If there is exactly one text region with an absolute z-score value greater than or equal to 1.3, the text in the image is said to be not divided into two columns; otherwise, the text is divided into two columns. If it is divided, then k-means clustering with a k value of 2 is used to categorize the text regions into their appropriate clusters. The text regions in both clusters are sorted according to their y-coordinate. The cluster containing the text region with the smallest x-coordinate of the two clusters is termed the left column cluster, and the other the right column cluster. Both clusters are merged to generate the final result. The z-score thresholds for determining outliers and detecting whether the text on a page is divided into columns are simply heuristic values, found by testing the algorithm on over 50 images.
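The outlier-removal and column-splitting procedure described above can be sketched as follows, assuming the google-cloud-vision client library and scikit-learn's KMeans; the block-extraction details are illustrative and not the authors' exact code.

```python
import numpy as np
from google.cloud import vision
from sklearn.cluster import KMeans

def detect_blocks(image_bytes):
    """Return (text, x, y) for each text block in the Vision API response."""
    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=image_bytes))
    blocks = []
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            words = ["".join(s.text for s in w.symbols)
                     for p in block.paragraphs for w in p.words]
            corner = block.bounding_box.vertices[0]
            blocks.append((" ".join(words), corner.x, corner.y))
    return blocks

def order_text(blocks, outlier_z=1.75, column_z=1.3):
    """Drop outlier regions and merge one- or two-column text in reading order."""
    xs = np.array([b[1] for b in blocks], dtype=float)
    z = np.abs((xs - xs.mean()) / xs.std())                       # z-score of x-coordinates
    blocks = [b for b, zi in zip(blocks, z) if zi <= outlier_z]   # remove outlier regions
    xs = np.array([b[1] for b in blocks], dtype=float)
    z = np.abs((xs - xs.mean()) / xs.std())                       # recompute after removal
    if np.sum(z >= column_z) == 1:                                # heuristic: single column
        return " ".join(b[0] for b in sorted(blocks, key=lambda b: b[2]))
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(xs.reshape(-1, 1))
    clusters = [[b for b, lab in zip(blocks, labels) if lab == c] for c in (0, 1)]
    clusters.sort(key=lambda col: min(b[1] for b in col))         # left column first
    return " ".join(b[0] for col in clusters
                    for b in sorted(col, key=lambda b: b[2]))
```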
D. Online Newspaper Module
Online newspapers have gained immense importance in the 21st century. In the past decade, many notable Python libraries and plugins have been developed for carrying out online newspaper scraping. Jiahao Wu [23] presents an approach that utilizes the BeautifulSoup library for scraping online articles. This library is used for extracting the content between HTML and XML tags in the source code of web pages. However, due to the latency problems of this library, it did not seem the best option. A computationally cost-efficient combination of the newspaper3k and feedparser libraries is used for extracting the online newspaper articles and parsing feeds like RSS (Really Simple Syndication), respectively. Initially, the module checks whether the JSON file contains an RSS link (RSS is a type of web feed allowing users to access content in a standardized format). If present, the RSS link is the first choice for accessing the online articles, as it gives consistent and correct data. If not, the regular URL mentioned in the JSON file is used to access the article. The module keeps downloading articles until a specified limit is reached. Once downloaded, the contents of the articles are parsed and stored in text files. Then, one by one, the contents of the text files are converted to audio, which the user can listen to and stay up to date with the daily news.
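A minimal sketch of this RSS-first download-and-parse loop is shown below; it uses the feedparser and newspaper3k libraries named above, while the sources.json schema (per-source "rss" and "url" keys) and the article limit are assumptions made for illustration rather than the authors' implementation.

```python
import json
import feedparser
from newspaper import Article

ARTICLE_LIMIT = 5  # hypothetical per-source download limit

def fetch_articles(config_path="sources.json"):
    """Download and parse articles for every source listed in a JSON file."""
    with open(config_path) as f:
        sources = json.load(f)
    texts = []
    for source in sources:
        if source.get("rss"):                       # prefer the RSS feed when present
            feed = feedparser.parse(source["rss"])
            links = [entry.link for entry in feed.entries[:ARTICLE_LIMIT]]
        else:                                       # otherwise fall back to the plain URL
            links = [source["url"]]
        for link in links:
            article = Article(link)
            article.download()
            article.parse()
            texts.append(article.title + "\n" + article.text)
    return texts  # each entry can be written to a text file and read out with pyttsx3
```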
V. RESULTS AND ANALYSIS
This section presents the results and analysis of the four modules explained above. The primary assumption for all use cases is that the input image is clear.

A. System Analysis
The performance of the proposed system is analysed based on its response time, memory usage, and processor usage for each module. Table II illustrates the system's performance analysis. The system takes about 13 seconds to initialize and load the models into memory. This is a one-time process which takes place after powering up the Raspberry Pi. The average response time of speech-to-text and text-to-speech is 1-3 seconds and 5-7 seconds, respectively. All the results are calculated at a constant internet speed of 1 Mbps. Thus, it can be inferred that the system gives fast and accurate results under ideal conditions. The memory management of the system is further optimized by overwriting the image, text file, and audio response for every iteration of the input command.

TABLE II. SYSTEM PERFORMANCE OBSERVATIONS

Module Name | Average Response Time (seconds) | Processor usage (%) | Memory usage (MEM%)
Face Recognition | 5.2 | 74.51 | 29.7
Image Captioning | 9.8 | 92.87 | 56.4
OCR | 7.4 | 73.9 | 33.7
Online Newspaper | 6.0 | 48.3 | 18.8

B. Face Recognition Results
Dlib's face recognition provides a highly accurate model, with a maximum accuracy of 99.38% for faces captured from multiple angles. The presence of a person in the input image is identified by using the HOG + LSVM object detector [13]. The system can correctly recognize the faces of people who are at most 14 feet away and facing the user. The results of the face recognition module are analysed by calculating the facial-similarity value in different use cases.


It is a floating-point value in the range [0, 1], calculated by comparing the person's face vector with the pre-stored face vectors. Fig. 3.1 depicts the use cases for different conditions of facial visibility. As depicted in Table III, the facial similarity value for these cases is greater than the threshold value (0.7). The second case is when an unknown person is in front of the user. In this case, the facial similarity is less than the threshold value (Fig. 3.2). In the third case, when the facial features of the person are not clearly visible, the facial landmark predictor [14] fails to extract the necessary features. This is depicted in Fig. 3.3, where a face is detected, but the system is unable to process it further. The facial-similarity values for use cases (a), (b), (c), and (d) are given in Table III.

Fig. 3.1. Use cases of identifying a known face in different conditions of facial visibility.

Fig. 3.2. Unknown Person

Fig. 3.3. Facial features not visible

TABLE III. FACIAL SIMILARITY COMPARISON

Use Case | Similarity value
(a) | 0.931
(b) | 0.844
(c) | 0.701
(d) | 0.496

C. Image Captioning Results
Being trained on a large dataset, the image captioning model can generate captions for indoor as well as outdoor settings. It is also capable of correctly detecting objects, animals, humans, and their actions. Fig. 4 depicts the results of the image captioning module under varied conditions. The model's performance is evaluated using the automated Bilingual Evaluation Understudy (BLEU) metric [15]. The BLEU score is calculated by comparing the generated caption with all the reference captions available for the input image. The BLEU-4 approach works by counting the matching 4-grams in the generated caption against the 4-grams in the reference captions. The results are tabulated in Table IV.

Fig. 4. Captions generated by the system in different use case scenarios.

TABLE IV. BLEU SCORE EVALUATIONS

Beam Width | Validation Set | Test Set
1 | 29.98 | 30.28
2 | 32.95 | 33.06
3 | 33.16 | 33.29
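For reference, a BLEU-4 score of the kind reported in Table IV can be computed with NLTK as sketched below; the captions shown are illustrative placeholders rather than outputs of the proposed model.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One generated caption and its tokenized reference captions (illustrative only).
references = [[
    "a man riding a wave on top of a surfboard".split(),
    "a surfer rides a large wave in the ocean".split(),
]]
hypotheses = ["a man is surfing on a wave".split()]

# BLEU-4: uniform weights over 1- to 4-gram precisions, with smoothing for short texts.
score = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {100 * score:.2f}")
```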

D. OCR Results
The OCR module utilizes Google's Vision API, which is powerful enough to detect text in an image with different fonts and orientations very accurately. The images are captured in a bird's-eye or top-down perspective. If no text region is present in the hierarchical response returned by the Vision API, the image is said to contain no text. Fig. 5.2 depicts the working of the text recognition module under normal conditions. As illustrated in Table V, the absolute z-score value of all the text regions is less than 1.75; thus, the text does not contain any outliers. Outlier detection is illustrated in Fig. 5.3, where the text 'Freewill' is termed an outlier, as its absolute z-score is greater than the threshold value (1.75). The outlier detection mechanism works best when the region of interest in the image contains many text blocks, as shown in Fig. 5.3. Fig. 5.1 portrays how the module detects that the text present in the image is divided into two columns and accordingly clusters the text regions into a left and a right cluster to generate the final result. This is because there does not exist exactly one text region whose absolute z-score is greater than or equal to 1.3, which suggests that the page is divided into two columns; this is depicted in Table VII. For text recognition from documents, it is assumed that the visually impaired person is holding the document in the correct orientation and close to the Smart Cap (a maximum of two feet apart).


Fig. 5.1. Example of image with text divided into two columns

Fig. 5.2. Example of image with simple text

Fig. 5.3. Example of image with outlier text

TABLE V. ANALYSIS OF IMAGE WITH SIMPLE TEXT

Sr no | Text region | X-coordinate | Y-coordinate | Abs Z-score | Cluster
1 | NAVNEET \n GOLDEN BOOK... | 333 | 280 | 0.892 | -
2 | With \n Useful Data… | 704 | 1606 | 0.116 | -
3 | For Schools and… | 447 | 2540 | 0.654 | -
4 | NAVNEET | 1554 | 2915 | 1.66 | -

TABLE VI. ANALYSIS OF IMAGE WITH OUTLIER TEXT

Sr no | Text region | X-coordinate | Y-coordinate | Abs Z-score | Cluster
1 | freewill | 597 | 477 | 1.859 | -
2 | HERSHEY'S \n COCOA | 1018 | 1747 | 0.189 | -
3 | 100% COCOA | 1218 | 2755 | 1.163 | -
4 | NATURAL \n UNSWEETENED | 1007 | 2968 | 0.136 | -
5 | NET WEIGHT: 225g | 1055 | 3348 | 0.369 | -

TABLE VII. ANALYSIS OF IMAGE WITH TEXT DIVIDED IN TWO COLUMNS

Sr no | Text region | X-coordinate | Y-coordinate | Abs Z-score | Cluster
1 | To cherish and follow the noble ideals... | 257 | 690 | 1.300 | 0
2 | A boy scribbling on a historical structure | 200 | 3274 | 1.368 | 0
3 | To develop the scientific temper... | 1685 | 785 | 0.407 | 1
4 | or | 1641 | 1804 | 0.355 | 1
5 | D | 2480 | 2578 | 1.358 | 1
6 | Hanging lemon, chilies | 1801 | 3199 | 0.546 | 1

VI. CONCLUSION
The study in this paper proposes how a more holistic approach of focusing on not just one, but multiple features is
essential in creating a multi-purpose and versatile assistive device. Furthermore, the paper exemplifies how the integration of deep learning and cloud services with small single-board computers like the Raspberry Pi can be used to develop a robust assistive device that can help visually impaired people tackle their real-life problems. Thus, a single system that brings together several distinct features, such as face recognition, image captioning, text recognition, and online newspaper reading, is put forth in this paper. The design, architecture, and working flow of the system have been described extensively. Each feature of the system is elucidated in detail, and its use cases and results have been illustrated using appropriate diagrams and tables. The system is cheap, easy to configure, and user-friendly, and the user does not require any special skill to operate it. In summary, the prototype for the visual assistive device discussed in this paper, coupled with some additional hardware and technological support, can play a crucial role in aiding the blind and visually impaired people.

VII. LIMITATIONS AND FUTURE WORK
The future work of the Smart Cap presented in this study can take multiple directions. To begin with, the addition of more robust hardware support like GPUs will not only improve the device's response time but also pave the way for the inclusion of faster and more accurate deep learning models. OCR can be coupled with Document Image Analysis (DIA) for getting more optimal results. The system's audio interface can be enhanced by providing multilingual support so that the user can operate the smart cap in his or her native language. Finally, real-time object detection can be achieved by adding proximity sensors.

REFERENCES
[1] A. P. Pandian, "Artificial Intelligence Application in Smart Warehousing Environment for Automated Logistics," Journal of Artificial Intelligence, vol. 1, no. 02, pp. 63-72, 2019.
[2] D. Nirmal, "Artificial Intelligence Based Distribution System Management and Control," Journal of Electronics, vol. 2, no. 02, pp. 137-147, 2020.
[3] Amjed S. Al-Fahoum, Heba B. Al-Hmoud, and Ausaila A. Al-Fraihat, "A Smart Infrared Microcontroller-Based Blind Guidance System," Active and Passive Electronic Components, vol. 2013, p. 7, 2013.
[4] R. Velázquez, "Wearable Assistive Devices for the Blind," in Wearable and Autonomous Biomedical Devices and Systems for Smart Environment, A. Lay-Ekuakille and S. C. Mukhopadhyay, Eds., Lecture Notes in Electrical Engineering, vol. 75, Springer, Berlin, Heidelberg, 2010.
[5] R. Manduchi, "Mobile Vision as Assistive Technology for the Blind: An Experimental Study," in Computers Helping People with Special Needs, ICCHP 2012, K. Miesenberger, A. Karshmer, P. Penaz, and W. Zagler, Eds., Lecture Notes in Computer Science, vol. 7383, Springer, Berlin, Heidelberg, 2012.
[6] Kavya S., Swathi, and Mimitha Shetty, "Assistance System for Visually Impaired using AI," International Journal of Engineering Research & Technology (IJERT), RTESIT 2019, vol. 7, issue 08, 2019.
[7] A. Nishajith, J. Nivedha, S. S. Nair, and J. Mohammed Shaffi, "Smart Cap - Wearable Visual Guidance System for Blind," 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, 2018, pp. 275-278, doi: 10.1109/ICIRCA.2018.8597327.
[8] M. Sarfraz, A. Constantinescu, M. Zuzej, et al., "A Multimodal Assistive System for Helping Visually Impaired in Social Interactions," Informatik Spektrum, vol. 40, pp. 540-545, 2017, https://doi.org/10.1007/s00287-017-1077-7.
[9] "WLU6331 WiFi Adapter RF Exposure Info (SAR), Raspberry Pi Trading," Dec. 24, 2014. Accessed: Aug. 1, 2020. [Online]. Available: https://fccid.io/2ABCB-WLU6331/RF-Exposure-Info/RF-Exposure-Info-SAR-pdf-2489354
[10] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 2001, pp. I-I, doi: 10.1109/CVPR.2001.990517.
[11] Z. Wei, Y. Sun, J. Wang, H. Lai, and S. Liu, "Learning Adaptive Receptive Fields for Deep Image Parsing Network," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 3947-3955, doi: 10.1109/CVPR.2017.420.
[12] Davis E. King, "Dlib-ml: A machine learning toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755-1758, 2009.
[13] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 2005, pp. 886-893, vol. 1, doi: 10.1109/CVPR.2005.177.
[14] Z. Wei, Y. Sun, J. Wang, H. Lai, and S. Liu, "Learning Adaptive Receptive Fields for Deep Image Parsing Network," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 3947-3955, doi: 10.1109/CVPR.2017.420.
[15] Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, "Show, attend and tell: neural image caption generation with visual attention," in Proceedings of the 32nd International Conference on Machine Learning (ICML'15), JMLR.org, 2015, pp. 2048-2057.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[17] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár, "Microsoft COCO: Common Objects in Context," Feb. 2015.
[19] N. Ezaki, M. Bulacu, and L. Schomaker, "Text detection from natural scene images: towards a system for visually impaired persons," Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, 2004, pp. 683-686, vol. 2, doi: 10.1109/ICPR.2004.1334351.
[20] Lei Fei, Kaiwei Wang, Shufei Lin, Kailun Yang, Ruiqi Cheng, and Hao Chen, "Scene text detection and recognition system for visually impaired people in real world," Proc. SPIE 10794, Target and Background Signatures IV, 107940S, 9 October 2018.
[21] X. Zhou et al., "EAST: An Efficient and Accurate Scene Text Detector," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 2642-2651, doi: 10.1109/CVPR.2017.283.
[22] "Document Text Tutorial," Google Inc. Accessed: Aug. 1, 2020. [Online]. Available: https://cloud.google.com/vision/docs/fulltext-annotations
[23] Jiahao Wu, "Web Scraping Using Python: A Step By Step Guide," 2019.
