

Article
Object Recognition System for the Visually Impaired: A Deep
Learning Approach using Arabic Annotation
Nada Alzahrani and Heyam H. Al-Baity *

Information Technology Department, College of Computer and Information Sciences, King Saud University,
Riyadh P.O. Box 145111, Saudi Arabia
* Correspondence: halbaity@ksu.edu.sa

Abstract: Object detection is an important computer vision technique that has increasingly attracted
the attention of researchers in recent years. The literature to date in the field has introduced a
range of object detection models. However, these models have largely been English-language-based,
and there is only a limited number of published studies that have addressed how object detection
can be implemented for the Arabic language. As far as we are aware, the generation of an Arabic
text-to-speech engine to utter objects’ names and their positions in images to help Arabic-speaking
visually impaired people has not been investigated previously. Therefore, in this study, we propose
an object detection and segmentation model based on the Mask R-CNN algorithm that is capable
of identifying and locating different objects in images, then uttering their names and positions in
Arabic. The proposed model was trained on the Pascal VOC 2007 and 2012 datasets and evaluated on
the Pascal VOC 2007 testing set. We believe that this is one of a few studies that uses these datasets
to train and test the Mask R-CNN model. The performance of the proposed object detection model
was evaluated and compared with previous object detection models in the literature, and the results
demonstrated its superiority and ability to achieve an accuracy of 83.9%. Moreover, experiments
were conducted to evaluate the performance of the incorporated translator and TTS engines, and
the results showed that the proposed model could be effective in helping Arabic-speaking visually
impaired people understand the content of digital images.
Keywords: artificial intelligence; object detection; visually impaired; deep learning; object recognition; Mask R-CNN; assistive technologies
1. Introduction
The development of computer vision systems has been necessitated by the requirement to make sense of the massive number of digital images from every domain. Computer vision involves the scientific study of developing ways that can enable computers to see, understand, and interpret the content of digital images, including videos and photographs.
To obtain a high-level understanding of digital images or videos, it is not sufficient to focus on classifying different images; instead, the categories and locations of objects contained within every image should be specified [1]. This task is referred to as object detection, which is a major area of interest within the field of computer vision. Object detection is the process of finding various individual objects in an image and identifying their location through bounding boxes with their class labels [2]. Additionally, object detection methods can be implemented either by employing traditional machine-learning techniques or using deep learning algorithms.
In recent years, many researchers have leveraged deep learning algorithms such as convolutional neural networks (CNNs) to solve object detection problems and proposed different state-of-the-art object detection models. These object detection models can be classified into two major categories: two-stage detectors such as R-CNN [3], Fast R-CNN [4], Faster R-CNN [5], Mask R-CNN [6], and G-R-CNN [7], and one-stage detectors such
as YOLO [8], SSD [9], YOLOv4 [10], and YOLOR [11]. A two-stage detector produces a set of region proposals of the image in the first stage; then, it passes these regions to the second stage for object classification. On the other hand, a one-stage detector takes the entire image and feeds it to a CNN within a single pass to generate the region proposals and identify and classify all objects in the image. These models have obtained highly accurate results in less computational time and have generally outperformed traditional machine learning algorithms [1]. This is due to the availability and capabilities of high-performance graphical processing units (GPUs) in modern computers, as well as the availability of large benchmark datasets for training models [1]. Object detection has been incorporated into diverse domains, from personal security to productivity in the workplace, including autonomous driving, smart video surveillance, face detection, and several people-counting applications.
Moreover, object detection methods can be employed to assist people with special needs, including those with visual disabilities, in understanding the content of images. A study conducted by Bourne et al. [12] estimated that, worldwide, there would be 38.5 million completely blind people by 2020, a number that will increase to 115 million by 2050. In 2017, an analysis by the General Authority of Statistics in Saudi Arabia [13] revealed that 7.1% of the Saudi population has disabilities, while 4% of them suffer from some form of visual disability. However, most object detection research dedicated to the visually impaired has been focused on the English language, while a limited number of related studies exist for Arabic. Therefore, Arabic-language object detection has emerged as a promising research topic. In one study, proposed by Al-muzaini et al. [14], an Arabic image dataset that consisted of images and captions was constructed by means of crowdsourcing, a professional translator, and Google Translate. This dataset was primarily built for testing different image captioning models. Another study [15] proposed an Arabic image captioning model using root-word-based RNN and CNN for describing the contents of images. However, as far as we are aware, the possibility of helping Arabic-speaking visually impaired people through the development of an Arabic text-to-speech engine that can utter objects' names and their positions in images has not been previously investigated.
In this paper, an object detection model based on the Mask R-CNN algorithm for
identifying and locating objects in images was developed. Furthermore, a text-to-speech
engine was incorporated into the proposed model to convert the predicted objects’ names
and their positions within the image into speech to help Arabic-speaking people with vision
impairment recognize the objects in images. The contribution of our study to the existing
literature is the incorporation of translation and text-to-speech engines into the model to
translate and utter the names and positions of the detected objects in images in Arabic. We
believe that this is the first approach that utilizes these services with an object detection
model to assist Arabic-speaking visually impaired people.

2. Related Work
Leaf disease detection for various fruits and vegetables is an agricultural application of object detection. One such application, based on Faster R-CNN, was proposed by Zhou et al. [16] and was used for the rapid detection of rice diseases, which appear on rice leaves. To manage the large number of bounding boxes produced by Faster R-CNN, the authors integrated it with the Fuzzy C-means (FCM) and K-means (KM) clustering algorithms to reset the size of the bounding boxes. The performance of the designed algorithm was verified against datasets of different rice diseases: rice blast, bacterial blight, and sheath blight. The results showed that FCM-KM-based Faster R-CNN performed better than the original Faster R-CNN. Moreover, the proposed algorithm successfully eliminated most of the background interference, including rice ears and disease-free rice leaves. However, it could not eliminate rice ears completely; therefore, further work is needed to apply threshold segmentation theory to the segmentation of plant leaf diseases so that diseased rice leaves can be segmented completely.

Similarly, Lakshmi et al. [17] proposed an effective deep learning model called DPD-DS for the automatic detection and segmentation of plant diseases, which is based on an improved pixelwise Mask R-CNN. The authors introduced a light-head region convolutional neural network (R-CNN) by adjusting the proportions of the anchors in the RPN network and modifying the backbone structure to save memory space and computational cost, thus increasing detection accuracy and computational speed. The proposed model comprised three parts: first, the classification of the plant's health as either healthy or diseased; second, the detection of the infected portion of the plant by using a bounding box; and finally, the generation of a mask for each infected region of the plant. The authors used two benchmark datasets containing leaf images of crop species to train the proposed model, namely the PlantVillage dataset and the Database of Leaf Images dataset. Furthermore, the model was tested and achieved a 0.7810 mAP, outperforming other state-of-the-art models, including SSD, Faster R-CNN, YOLOv2, and YOLOv3. However, the authors highlighted some of the proposed model's limitations, which could be addressed in the future, such as the fact that chinar-infected regions were not segmented properly in some images and that successful detection depends on exact ground-truth data (annotations).
Additionally, another agricultural application of object detection is fruit detection. Fruit detection and segmentation using Mask R-CNN can be used by modern fruit-picking robots to effectively pick fruit without damaging the plant. For example, Liu et al. [18] utilized and improved Mask R-CNN to detect cucumber fruit by integrating a logical green (LG) operator into the region proposal network (RPN). This is because the region of interest (RoI) only includes green regions, since the algorithm's purpose is to detect cucumber fruit alone. The backbone of the network consisted of the Resnet-101 architecture, which was further consolidated with a feature pyramid network (FPN) to extract features of different sizes from the input image. The developed model was tested and achieved 90.68% precision and 88.29% recall, exceeding the YOLO, original Mask R-CNN, and Faster R-CNN algorithms in terms of fruit detection. However, due to the Faster R-CNN structure, the developed model's average elapsed time was relatively long, so it could not perform real-time detection.
Another study, conducted by Zhan et al. [19], proposed an improved version of Mask R-CNN to detect and segment the damaged areas of an automobile in an accident. This can provide speedy solutions for insurance companies to process damage claims. To cope with the challenges in the detection of damaged areas of vehicles, the authors proposed some modifications to Mask R-CNN, such as lowering the number of layers in the original ResNet-101, providing better regularization to solve overfitting, and adjusting anchor box dimensions according to the requirements. The performance of the proposed model was compared with the original Mask R-CNN with promising results, attaining an AP of 0.83 compared to the original Mask R-CNN's 0.75. However, the authors acknowledged the limitations of their approach, including cases in which the segmentation was not completely accurate and the difficulty of segmenting areas in which the damage was not sufficiently visible.

3. Proposed Method
The framework of the proposed model was developed through several steps: data acquisition, data pre-processing, building the object detection model, specifying the positions of the detected objects in the image, generating the Arabic annotations and positions, text-to-speech (TTS) conversion, and, finally, model evaluation. Figure 1 shows the general framework of the proposed model.
Figure 1. The proposed model framework (شخص means person, حصان means horse).

3.1. Mask R-CNN Algorithm
Mask R-CNN [6] is an object detection and segmentation algorithm. It is built on top of the Faster R-CNN algorithm [5], which is an object detection algorithm that uses a region proposal network (RPN) with the CNN model. The essential difference between them is that Mask R-CNN operates one step ahead of the object detection mechanism provided by Faster R-CNN and provides pixel-level segmentation of the object, i.e., it not only detects given objects in an image, but also decides which pixel belongs to which object. In this study, Mask R-CNN was employed because of its simplicity, flexibility, and ability to provide flexible architecture designs. Figure 2 shows the Mask R-CNN architecture.


Figure 2. The Mask R-CNN architecture [6].

Mask R-CNN generally consists of the following modules:
• The backbone: This is a standard CNN, which is responsible for feature extraction. The initial layers of the network detect low-level features such as edges and corners, while subsequent layers detect high-level features such as a car or a person. The backbone network converts the input image into a feature map and passes it to the next module.
• Region proposal network (RPN): This is a lightweight neural network that uses a sliding window mechanism to scan the feature map and extracts regions that contain objects, known as regions of interest (ROIs). ROIs are fed to the RoIAlign method to generate fixed-sized ROIs and precisely locate the regions. The RPN produces two outputs for each ROI: the class (i.e., whether the region is background or foreground) and the bounding box refinement.
• ROI classifier and bounding box regressor: ROIs are passed through two different processes, the ROI classifier and the bounding box regressor. The ROI classifier is responsible for classifying the object in the ROI into a specific class, such as person or chair, while the bounding box regressor is responsible for predicting the bounding box for the ROI.
• Segmentation masks: The mask branch is a fully convolutional network (FCN), which is mainly responsible for assigning object masks to each detected object, i.e., the positive regions selected by the ROI classifier.
Mask R-CNN has several advantages, especially in terms of training simplicity. First, it enhances Faster R-CNN by increasing its speed while adding only minor overhead. Furthermore, the mask branch allows for faster systems and more rapid testing with only minor computational overhead. Moreover, Mask R-CNN can be easily adapted to various tasks. Human pose estimation, for instance, can be easily accomplished with Mask R-CNN using the same framework. Such techniques are key to assisting people with visual impairment in understanding image content. Mask R-CNN has also been widely used in different domains to produce segmented masks; possible uses in the computer vision field are video surveillance, autonomous car systems, and tumor detection. On the other hand, a prevalent issue in Mask R-CNN is that it ignores part of the background and marks it as foreground, resulting in target segmentation inaccuracy.
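To illustrate how these modules surface at inference time, the following is a minimal sketch that assumes the widely used open-source Matterport Keras implementation of Mask R-CNN (the mrcnn package); it is not presented as the exact implementation used in this work, and the weights file and image path are placeholders. The returned dictionary exposes the outputs of the modules described above: boxes, class IDs, scores, and per-object masks.

```python
# Minimal inference sketch with the open-source Matterport "mrcnn" package.
# The weights file and image path are hypothetical placeholders.
import mrcnn.model as modellib
from mrcnn.config import Config
import skimage.io

class VocInferenceConfig(Config):
    NAME = "voc"
    NUM_CLASSES = 1 + 20          # background + the 20 Pascal VOC classes
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    DETECTION_MIN_CONFIDENCE = 0.7

model = modellib.MaskRCNN(mode="inference", config=VocInferenceConfig(), model_dir="logs")
model.load_weights("mask_rcnn_voc.h5", by_name=True)   # placeholder trained weights

image = skimage.io.imread("example.jpg")
result = model.detect([image], verbose=0)[0]
# result["rois"]: bounding boxes, result["class_ids"]: predicted classes,
# result["scores"]: confidences, result["masks"]: per-object binary masks
```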
3.2. Specifying Positions of the Detected Objects
In this paper, we propose a new method for specifying the detected objects' positions within the image. This method determines the position of a predicted object by utilizing its bounding box, where each bounding box is defined in terms of its bottom-left and top-right coordinates in the image. These coordinates are retrieved to identify the center of the detected object. Using this defined center and based on the height and width of the image, the model specifies the nine possible positions of the predicted object, as shown in Figure 3.
Figure 3. Map of the possible detected objects’ positions regarding the image.
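As an illustration of the nine-region mapping described above, the following is a minimal sketch; the function name, the (x_min, y_min, x_max, y_max) box convention, and the example values are our own and are not taken from the paper's code.

```python
def describe_position(box, image_width, image_height):
    """Map a bounding box to one of the nine image regions of Figure 3.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates; the function
    returns a phrase such as "in the top of the right side".
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0

    # Split both axes into three equal bands.
    if cx < image_width / 3:
        horizontal = "the left side"
    elif cx < 2 * image_width / 3:
        horizontal = "the center"
    else:
        horizontal = "the right side"

    if cy < image_height / 3:
        vertical = "the top"
    elif cy < 2 * image_height / 3:
        vertical = "the mid"
    else:
        vertical = "the bottom"

    return f"in {vertical} of {horizontal}"

# Example: an object centred in the upper-right area of a 900x600 image.
print(describe_position((700, 50, 850, 150), 900, 600))  # -> "in the top of the right side"
```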

3.3. Generating the Annotations and Positions in the Arabic Language
Since there is no study in the literature that proposes object detection with Arabic annotations, this paper fills a gap by incorporating a translator API, which translates the English labels of the detected objects and their positions relative to the image into the Arabic language. The Google Translation API (Googletrans) was used to translate the class labels for the predicted objects and the positions of the detected objects into Arabic. The Googletrans library was chosen due to its speed and reliability [20]. This library uses the Google Translate Ajax API, which is responsible for detecting the language of the text and translating it into Arabic. However, since the Matplotlib library does not support the Arabic language, it displays the names of the objects to the user with the Arabic text written from left to right, whereas Arabic writing usually flows from right to left. Moreover, it renders the Arabic characters in an isolated form; thus, every character is rendered regardless of its surroundings, as in Figure 4.

Figure 4. The problem of rendering Arabic characters in isolated forms (شخص means person, حصان means horse).
To solve these issues, two main libraries were imported: a bidi algorithm, which enables the text to be written from right to left, and an Arabic reshaper, which reshapes the Arabic characters and replaces them with their correct shapes according to their surroundings.
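As a sketch of how these pieces fit together, the snippet below translates an English label with Googletrans and then reshapes and reorders it for display; it assumes the googletrans, arabic_reshaper, and python-bidi packages and an available network connection, and the label string is illustrative.

```python
# Minimal sketch of preparing an Arabic label for display with Matplotlib,
# assuming the googletrans, arabic_reshaper, and python-bidi packages.
from googletrans import Translator
import arabic_reshaper
from bidi.algorithm import get_display

translator = Translator()

def arabic_display_label(english_text):
    """Translate an English label/position into Arabic and reshape it for plotting."""
    arabic = translator.translate(english_text, src="en", dest="ar").text
    reshaped = arabic_reshaper.reshape(arabic)   # join characters into their contextual forms
    return get_display(reshaped)                 # reorder for right-to-left rendering

label = arabic_display_label("person in the top of the right side")
# `label` can now be passed to Matplotlib, e.g. ax.text(x, y, label)
```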
3.4. Text-to-Speech Conversion
Text-to-speech (TTS) has emerged as a promising new area of research that bridges
the gap between humans and machines. In this study, TTS is incorporated as a later step,
where the names of the detected objects and their positions are uttered. As promising as it
sounds, this process is not straightforward since every language has diverse and unique
features including punctuation and accent. Existing TTS engines vary in terms of their
performance, clarity, naturalness, punctuation consideration, and flexibility; examples include the Google text-to-speech (gTTS) API [21], the Google Cloud TTS API [21], and the AI Mawdoo3 TTS API [22].
For our model, we used the gTTS API, a widely adopted TTS engine developed by Google.
One of the remarkable facts about the gTTS API is its ease of use, where text input is easily
transformed into spoken words, in the form of an MP3 audio file. Additionally, it supports
several languages including Arabic and offers two audio speeds for speech, slow and fast.
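A minimal sketch of this step, assuming the gTTS package; the Arabic sentence and the output file name are illustrative.

```python
# Minimal sketch of uttering a detected object's name and position with gTTS.
from gtts import gTTS

sentence = "شخص في الجزء العلوي من الجانب الأيمن"   # "a person in the top of the right side"
speech = gTTS(text=sentence, lang="ar", slow=False)  # Arabic voice, normal speed
speech.save("detection.mp3")                         # MP3 output, as described above
```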

4. Materials and Methods


4.1. Experimental Environment
Google Colaboratory (Colab) [23] was selected as the platform for developing the
proposed system. It is a free cloud service introduced by Google to promote machine
learning and artificial intelligence research by providing hosted CPUs and GPUs. One
of the main features of Colab is that the essential Python libraries such as TensorFlow,
Scikit-Learn, and Matplotlib are already installed and can be imported when needed. Colab is built on the Jupyter Notebook environment [24], which was used in our study.

4.2. Dataset Acquisition


The Pascal Visual Object Classes (Pascal VOC) dataset [25] was used in this study. Pascal VOC is widely used in object detection as it encapsulates a large number of standardized images. The dataset was first introduced in the PASCAL Visual Object Classes challenge, which took place between 2005 and 2012. This dataset's primary goal is to capture realistic scenes that are full of objects to be recognized. Each image is associated with an annotation file in XML format that specifies the label, as well as the x and y coordinates of the bounding box for each object contained within the image, which makes this dataset suitable for supervised learning problems. This dataset has two variations, Pascal VOC 2007 and Pascal VOC 2012, both of which identify objects as belonging to one of 20 classes, including person, animal, or other objects such as sofa.
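For illustration, a VOC annotation file can be read with the Python standard library as sketched below; the file path is a placeholder and the tag names follow the usual VOC layout.

```python
# Minimal sketch of reading one Pascal VOC annotation file.
import xml.etree.ElementTree as ET

def load_voc_annotation(xml_path):
    """Return a list of (label, x_min, y_min, x_max, y_max) tuples for one image."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.iter("object"):
        label = obj.find("name").text
        box = obj.find("bndbox")
        coords = tuple(int(float(box.find(tag).text))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((label,) + coords)
    return objects

print(load_voc_annotation("VOC2007/Annotations/000005.xml"))  # placeholder path
```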
The Pascal VOC 2007 dataset includes a total of 9963 images that contain 24,640 annotated objects, while the Pascal VOC 2012 dataset has a total of 11,530 images with 27,450 annotated objects. In our study, each dataset was split into two halves, the first of which was used for training the model and the second for testing the model's performance. The designed model was trained and validated on the Pascal VOC 2007 and 2012 training datasets and tested on the Pascal VOC 2007 testing dataset. The Pascal VOC 2007 testing dataset was used since the VOC 2012 annotation files of the testing dataset are not publicly available.
Additionally, image resizing was applied to the dataset as an additional data pre-
processing technique; each image was resized to 1024 × 1024 pixels and padded with zeros
to make it a square, allowing several images to be batched together. The image resizing
was applied to reduce the training and inference time of the proposed model. Moreover,
the training dataset was expanded by using the image data augmentation technique,
which creates modified versions of images in the dataset to improve the performance
and ability of the model to generalize. There are various types of image augmentation,
such as perspective transformations, flipping, affine transformations, Gaussian noise,
contrast changes, hue/saturation changes, and blurring [26]. However, a horizontal flipping augmentation technique was applied to the training dataset so that 50% of all images were flipped horizontally, as in Figure 5. Horizontal flipping was selected because it suits the nature of the classes in the dataset, allowing the model to recognize an object regardless of its left–right orientation.

Figure 5. (a) Before and (b) after the application of the horizontal flipping augmentation technique.
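A minimal sketch of this augmentation, assuming the imgaug package (commonly paired with Mask R-CNN implementations); applying it to a single image is shown only for illustration, since during training the augmenter is passed to the data pipeline.

```python
# Minimal sketch of the 50% horizontal-flip augmentation with the imgaug package.
import imgaug.augmenters as iaa
import skimage.io

# Flip each training image (and, inside Mask R-CNN, its boxes/masks) with probability 0.5.
augmentation = iaa.Fliplr(0.5)

image = skimage.io.imread("example.jpg")        # placeholder image path
augmented = augmentation.augment_image(image)   # single-image usage for illustration
```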
4.3. Transfer Learning
Usually, a vast amount of labeled image data is needed to train a neural network from scratch to detect all the features in images, starting with the most generic before detecting more specific elements. Transfer learning is the process of taking advantage of a pre-trained model by loading weights that have already been trained on a vast labeled dataset and then using comparatively little training data to customize the model. Therefore, transfer learning was implemented in this work using a Mask R-CNN pre-trained on the MS COCO dataset [27] to reduce the training time. First, the Mask R-CNN model was created in training mode, and then the MS COCO weights were loaded into the model. However, all the backbone layers were frozen, and all the output layers of the loaded model were removed, including the output layers for the classification label, bounding boxes, and masks. Thus, new output layers were defined and trained on the Pascal VOC dataset, where the pre-trained weights from MS COCO were not used.

4.4. Parameter Settings
The designed object detection model is based on the Mask R-CNN algorithm, which uses Resnet101 as its backbone network for feature extraction. Resnet101 is a CNN that is 101 layers deep, and it is pretrained on more than a million images from the ImageNet database, which is a well-known benchmark dataset for visual recognition research [28].
The first step when designing the model was to define its configurations. These configurations determine the hyperparameters, such as the learning rate for training the model. The process of tuning the appropriate hyperparameters is crucial when defining a deep learning model since they impact the performance of the model under training. Table 1 shows some of the designed object detection model's parameters, as well as the hyperparameters that were tuned during the design phase of the object detection model.

Table 1. Some of the designed object detection model's parameters.

Parameter Value
LEARNING_RATE 0.001
LEARNING_MOMENTUM 0.9
DETECTION_MIN_CONFIDENCE 0.7
NUM_CLASSES 21
VALIDATION_STEPS 50
STEPS_PER_EPOCH 1405
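To make the setup of Sections 4.3 and 4.4 concrete, the following is a minimal sketch assuming the open-source Matterport mrcnn package, which uses the configuration names listed in Table 1; the COCO weights path is a placeholder, and this is not presented as the exact training script used in this study.

```python
# Minimal sketch of the transfer-learning setup with the Matterport "mrcnn" package.
# The COCO weights file is a placeholder.
from mrcnn.config import Config
import mrcnn.model as modellib

class VocConfig(Config):
    NAME = "voc"
    NUM_CLASSES = 21                 # background + 20 Pascal VOC classes
    LEARNING_RATE = 0.001
    LEARNING_MOMENTUM = 0.9
    DETECTION_MIN_CONFIDENCE = 0.7
    STEPS_PER_EPOCH = 1405
    VALIDATION_STEPS = 50
    IMAGES_PER_GPU = 1

config = VocConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")

# Load MS COCO weights but skip the output heads, which depend on the class count.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])
```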

4.5. Evaluation Index


In order to measure the performance of the designed object detection model, the
average precision (AP) and mean average precision (mAP) evaluation metrics were used:
• Average precision (AP):
The AP was computed for each class by using the 11-point interpolation method, where the precision values are interpolated at 11 equally spaced recall values. The interpolated precision at a given recall value is the highest precision observed at any recall greater than or equal to that value. This is given by the formula below:

AP = \frac{1}{11} \sum_{r \in R} p(r) \qquad (1)

where R indicates the 11 equally spaced recall values, and p represents the interpolated
precision.
• Mean average precision (mAP):
As mentioned above, the calculation of AP comprises only one class. However, in
object detection, there are usually N > 1 classes. The mAP is defined as the mean of the AP
across all N classes, as follows:
mAP = \frac{\sum_{i=1}^{N} AP_i}{N} \qquad (2)
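For illustration, Equations (1) and (2) can be computed as sketched below, assuming that matched precision and recall points have already been obtained for each class; the numbers in the example are arbitrary.

```python
# Minimal sketch of 11-point interpolated AP (Equation (1)) and mAP (Equation (2)).
import numpy as np

def average_precision_11pt(recalls, precisions):
    """recalls/precisions: arrays of matched (recall, precision) points for one class."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):            # the 11 equally spaced recall values
        mask = recalls >= r
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap

def mean_average_precision(per_class_points):
    """per_class_points: list of (recalls, precisions) pairs, one per class."""
    aps = [average_precision_11pt(r, p) for r, p in per_class_points]
    return float(np.mean(aps))

# Illustrative two-class example.
cls1 = ([0.1, 0.4, 0.8], [1.0, 0.8, 0.6])
cls2 = ([0.2, 0.5, 0.9], [0.9, 0.7, 0.5])
print(mean_average_precision([cls1, cls2]))
```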

5. Experimental Results and Analysis


The experiments with the developed object detection and segmentation model were
conducted in two stages. The first of these consisted of experiments designed to facilitate
the creation of the object detection and segmentation model based on the Mask R-CNN
algorithm using the Pascal VOC 2007 and 2012 datasets and evaluate its performance
using the Pascal VOC 2007 testing dataset. In the second stage, the experiments were
conducted to evaluate the accuracy of the designed method that is responsible for specifying
objects’ positions within an image, as well as to evaluate the accuracy of the translator and
TTS services.

5.1. The Object Detection and Segmentation Model Experiments


Following the tuning of the hyperparameters, a learning rate of 0.001 was selected, as it gave the best model performance. When the learning rate was set to a high value, such as 0.2, the weights could explode [29], whereas when a learning rate of 0.0001 was tried, the performance of the model on the testing dataset deteriorated. With regard to the number of epochs, 80 was selected, the point at which the training loss began to increase.
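Continuing the sketch given in Section 4.4, this training schedule could be expressed as follows with the Matterport mrcnn API; the dataset objects are placeholders.

```python
# Minimal sketch of the training run: 80 epochs at a learning rate of 0.001,
# training only the new output heads while the backbone stays frozen.
# `train_set` and `val_set` are placeholder mrcnn Dataset objects.
import imgaug.augmenters as iaa

model.train(train_set, val_set,
            learning_rate=config.LEARNING_RATE,     # 0.001 from Table 1
            epochs=80,
            layers="heads",                          # keep the backbone layers frozen
            augmentation=iaa.Fliplr(0.5))            # 50% horizontal flips (Section 4.2)
```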

5.1.1. Training and Evaluating the Model


The Tensorboard [30] was employed to visualize the losses after each epoch. This was
intended to facilitate the monitoring of the training process and the subsequent modification
of the hyperparameters. Figure 6 presents the performance of the developed Mask R-CNN
model during the training process on the Pascal VOC 2007 and 2012 training sets. It shows
the curve of the general training loss in relation to the number of epochs. This general
training loss comprises the sum of five loss values, as calculated automatically for each of
the RoIs. The five losses are as follows:
• rpn_class_loss: the extent to which the region proposal network (RPN) succeeds in separating the background from objects.
• rpn_bbox_loss: the degree to which the RPN succeeds in localizing objects.
• mrcnn_bbox_loss: the degree to which the Mask R-CNN head succeeds in localizing objects.
• mrcnn_class_loss: the extent to which Mask R-CNN succeeds in identifying the class for each object.
• mrcnn_mask_loss: the extent to which Mask R-CNN succeeds in segmenting objects.
As is evident in Figure 6, the general training loss decreased dramatically for the first 50 epochs, after which it changed very slowly. Each epoch took approximately 14 min; thus, the overall training process took 17 h, 30 min, and 28 s to complete 80 epochs, resulting in a final training loss of 0.2402.

Figure 6. The general training loss of the Mask R-CNN model per epoch.

As part of the parameter tuning for the model, multiple epochs were used to train the model and produced different weights, which were used to compute the mAP on the validation set to select the model weights that achieved the highest mAP.
Further experiments were performed to determine the learning rate that produced the model's best performance. The results are illustrated in Figure 7, which compares the performance in different epochs for learning rate values of 0.001 and 0.0001. As Figure 7 shows, the model performed better with a learning rate of 0.001, with an mAP of 83% achieved in the 80th epoch.

Figure 7. The mAP of the Mask R-CNN model at different epochs.

5.1.2. Testing the Model
As a result of the training process, the model was selected in the 80th epoch. The performance of the designed model was evaluated on the testing dataset using the AP and mAP metrics. The AP was measured separately for each class in the dataset, and the results are shown in Figure 8. Consequently, the mean average precision was calculated, showing that the model achieved good performance with an 83.9% mAP.

Figure 8. Average precision (AP) per class.

5.1.3. Ablation Experiments
To analyze the effectiveness and contributions of different techniques used in the proposed method, we conducted an ablation experiment on the proposed model, as listed in Table 2. We trained three Mask R-CNN variants on the Pascal VOC 2007 + 12 training set and present the evaluation results for the Pascal VOC 2007 testing set. In the first model, we reduced the number of layers in the Resnet backbone to 50 to examine the effect on the performance. In the second model, we removed the fourth convolutional layer in the computation graph of the mask head of the feature pyramid network (FPN); the FPN was used for feature extraction, taking a single-scale image of any size as an input and producing proportionally sized feature maps at several levels in a fully convolutional form. This variant is referred to as CN-Mask R-CNN. Finally, in the third model, we removed the second convolutional layer in the convolution block of the Resnet graph, which is referred to as CC-Mask R-CNN. Then, the object detection performance of the trained models was compared with our proposed model.

Table 2. Ablation experiments for the proposed model.

Method Backbone mAP Aero Bicycle Bird Boat Bottle Bus Car Cat Chair Cow Table Dog Horse Mbike Person Plant Sheep Sofa Train TV
Mask R-CNN Resnet-50 68.9% 0.69 0.79 0.82 0.67 0.59 0.65 0.75 0.89 0.65 0.76 0.65 0.73 0.78 0.59 0.57 0.66 0.65 0.65 0.62 0.62
CN-Mask R-CNN Resnet101 82.2% 0.94 0.34 0.96 0.92 0.88 0.8 0.97 0.98 0.9 0.87 0.9 0.93 0.94 0.4 0.62 0.45 0.88 0.92 0.97 0.87
CC-Mask R-CNN Resnet101 80.7% 0.92 0.11 0.87 0.85 0.8 0.98 0.97 0.97 0.6 0.77 0.98 0.83 0.88 0.99 0.98 0.34 0.76 0.82 0.94 0.73
Proposed Model Resnet101 83.9% 0.94 0.16 0.95 0.95 0.88 0.98 0.97 0.98 0.8 0.87 0.9 0.93 0.93 0.8 0.95 0.36 0.8 0.85 0.87 0.85

By analyzing the values of the mAP for the proposed model in this paper, we found that removing the fourth convolutional layer in the mask head affected the detection accuracy, leading to a 1.7% reduction in the mAP. It is worth noting that, after the backbone layers were changed from 50 to 101, the detection accuracy of the proposed method in this paper gradually improved by 15%.
5.1.4. Comparison with Previous State-of-the-Art Object Detection Models
A comparison between the performance of the developed Mask R-CNN model and
previous object detection models was conducted, including Faster R-CNN [5], SSD300
(the input image size was 300 × 300) [9], SSD512 (the input image size was 512 × 512) [9],
and YOLO v2 [31], and more recent models, including EEEA-Net-C2 [32], HSD [33] (the
input image size was 320 × 320) and Localize [34], as illustrated in Table 3. Faster R-CNN,
SSD300, and SSD512 were first trained on MS COCO and subsequently fine-tuned on the
Pascal VOC 2007 and 2012 datasets. Other models were trained on the Pascal VOC 2007
and 2012 datasets. As Table 3 clarifies, the developed Mask R-CNN outperformed the base
variant (Faster R-CNN) and the remaining state-of-the-art models on the Pascal VOC 2007
testing dataset.
Table 3. Comparative results on the Pascal VOC 2007 testing dataset.

Detection Models Trained on mAP


Faster R-CNN [5] 07 + 12 + COCO 78.8%
SSD 300 [9] 07 + 12 + COCO 79.6%
SSD 512 [9] 07 + 12 + COCO 81.6%
YOLOv2 [31] 07 + 12 78.6%
EEEA-Net-C2 [32] 07 + 12 81.8%
HSD [33] 07 + 12 81.7%
Localize [34] 07 + 12 81.5%
The proposed model 07 + 12 83.9%

5.2. The Arabic Annotations, Objects Positions, and TTS Experiments


To evaluate the proposed method for specifying the objects’ positions within an image,
as well as the translator and TTS services, several experiments were conducted based on
the evaluation method used in [35] with a sample of 100 images containing 340 objects
that were chosen randomly from the testing dataset. The performance evaluation was
completed by a human judge fluent in both Arabic and English. The judge examined four
outputs for each object in each image passed to the model and was asked to complete an
online datasheet with the results; those outputs were as follows:
(1) The accuracy of the predicted object’s position within an image, where the judge was
asked to use the map illustrated in Figure 3 as a reference and indicate if the detected
object’s position was in the expected position within the map.
(2) The accuracy of the translated label of the predicted object, where the judge was
responsible for indicating whether the translated Arabic label was “accurate” or
“inaccurate”.
(3) The accuracy of the translated position of the predicted object, where the judge was
also responsible for indicating if the translated Arabic position was “accurate” or
“inaccurate”.
(4) The Arabic speech of the predicted object label with its position, where the judge
was responsible for deciding whether the spoken object’s name and its position were
“clear” or “unclear”.
The results of the experiments were calculated for each output. The results of the first
output, which relates to the accuracy of the proposed method, were high: accurate positions
were detected for 300 out of a total of 340 objects. Regarding the translation service for the
Arabic labels and positions, the results were accurate for all the detected objects’ labels and
their positions. Finally, all the Arabic speech in the TTS service was categorized as “clear”.
Table 4 presents two samples of the resulting images with their correct object detections
and translations. The fourth column in Table 4 presents the names and positions of the
detected objects for Samples A and B in Arabic. For instance, the result of Sample A is:
“person in the top of the right side, person in the mid of the left side, chair in the mid of the
left side, dining table in the bottom of the center”. The result of Sample B is: “person in the
mid of the center, dog in the bottom of the left side, bicycle in the bottom of the center”.
Original Image the mid ofoutput,
The results
“clear”
the left
The
of the results
“clear”
or “unclear”.
side,
of thewere
experiments experiments
or “unclear”.
dining table
calculated were
in were
for calculated
the
each output.
bottom
forThe
each output.
results
ofresults
the
Thefirst
of the results of the first
ofcenter”.
results ofThe
accurate posi-result of Sample B
Thewhich
tions were
output,
resultsrelates
of the
tions were
detected
which
Theto
relates to
the accuracy
results
experiments
forto detected
300 of for
the
of
of thewere the bic
accuracy
proposed
experiments
calculated
300 of
out
Annotations
of the
of objects.
proposed
for method,
calculated
each
a total of
were method,
high:
forThe
output.
340 objects.
each
were
accurate
output.
Regarding
The
the
high:
posi-
first the first
the translation service
Detected Ob-
output, which output,
relates theout
which a total
relates
accuracy toofthe 340
theaccuracy
proposed Regarding
of method,
the proposed
werethe translation
method,
high: wereservice
accurate high:
posi- accurate posi-
is: “person in the mid
for theof the
Arabic center,
labels and dog
positions, in the
theaccurate
results bottom
were of
accurate the
for left
allobjects’
the side,
detected bicycle
objects’ in the bottom
for the Arabic labels
tions were detected
labels
labels and their
and
tions were
for 300positions,
detected
and their
positions.
out of afor
Finally,
the
total
positions.
labelsalland
results
300ofout340
theFinally,
were
ofobjects.
a total of
Arabic speech
for
Regarding
all the Arabic
all
340 objects.the
speech in the
in the TTS
wereservice
detected
theRegarding
translationthe
TTS
was
translation service
service
service was categorized
categorized
jects
of the center”.
for the Arabic
as “clear”.
as “clear”. Table 4 presents
positions,
Table
two4 samples
the results
presentsof
two
thesamples
accurate
of images
resulting
for
the resulting
all the detected objects’
for the Arabic labels and positions, the results were accurate for all the detected objects’
images
with their withobject
correct their correct object
labels and theirlabels and their
positions. positions.
Finally, all the Finally, all the Arabic
Arabic speech speech
in the TTS in the
service TTS
was service was categorized
categorized
detections detections andThe translations. The fourth column in Table 4names
presents andthe names and posi-
as “clear”. and astranslations.
Table “clear”.
4 presentsTable fourth
two4samples
presents column
oftwo in Table
thesamples
resulting 4the
presents
ofimagesresulting
with the images
their with
correct posi-
their correct object
object
tions of objects
the detected objects for Samples A and B in Arabic in the fourth column, together with their translations. For instance, the result of the detections of Sample A is: “person in the top of the right side, person in the mid of the left side, chair in the mid of the left side, dining table in the bottom of the center”. The result of Sample B is: “person in the mid of the center, dog in the bottom of the left side, bicycle in the bottom of the center”.

Table 4. Samples of detection results. (شخص means person, كرسي means chair, طاولة الطعام means dining table, دراجة means bicycle).

Sample | Original Image | Image with the Detected Objects with Arabic Annotations | Names and Positions of Detected Objects
A | [image] | [image] | شخص في الجزء العلوي من الجانب الأيمن، شخص في منتصف الجانب الأيسر، كرسي في منتصف الجانب الأيسر، طاولة الطعام في الجزء السفلي من المركز
B | [image] | [image] | شخص في منتصف المركز، الكلب في الجزء السفلي من الجانب الأيسر، دراجة في الجزء السفلي من المركز
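To make the position phrases shown above concrete, the following minimal Python sketch illustrates one way they could be derived from a detection's bounding box by dividing the image into a 3 x 3 grid; the helper position_phrase and the grid-based mapping are illustrative assumptions rather than a description of our exact implementation.

```python
# Minimal sketch (not the exact implementation used in our system): the image
# is split into a 3x3 grid and the centre of a detection's bounding box is
# mapped to a vertical third (top/mid/bottom) and a horizontal third
# (left side/center/right side), yielding phrases like those in Table 4.

VERTICAL = ["top", "mid", "bottom"]
HORIZONTAL = ["left side", "center", "right side"]

def position_phrase(box, image_width, image_height):
    """box = (x1, y1, x2, y2) in pixels; returns e.g. 'top of the right side'."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(cx / image_width * 3), 2)
    row = min(int(cy / image_height * 3), 2)
    return f"{VERTICAL[row]} of the {HORIZONTAL[col]}"

# Example: a chair whose box lies in the middle third vertically, left third horizontally
print("chair in the " + position_phrase((40, 200, 180, 400), 640, 480))
# -> chair in the mid of the left side
```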
5.3. Discussion of Results

This section comprises overall remarks regarding the evaluation of the generated model. Based on the analysis of our experimental results, we came to the following conclusions. Firstly, the training time is determined not only by the quantity of data involved, but also by the complexity of the architecture of a deep learning model. As Mask R-CNN is a large model, the training time was lengthy, with a completion time for each epoch of approximately 14 min. Secondly, the Mask R-CNN model proved its ability to provide highly accurate results when trained on the Pascal VOC 2007 and 2012 datasets, as compared with previous object detection models that employed Faster R-CNN and SSD300. Hence, we contend that our model could be beneficial for supporting Arabic-speaking visually impaired people with the recognition of objects in images.

6. Conclusions

To conclude, there has been a dramatic increase in interest in using deep learning approaches for object detection in recent years, and various state-of-the-art models have been proposed to
object detection in recent years, while various state-of-the-art models have been proposed to
enhance detection performance in terms of accuracy and speed. However, rising numbers
of Arabic-speaking visually impaired people mean that an Arabic language object detection
model is needed. Therefore, we successfully developed and tested a prototype system
that could help Arabic-speaking individuals with impaired vision recognize objects in
digital images.
The proposed model is based on the Mask R-CNN algorithm to classify and locate
different objects in an image. The model was trained and tested using the Pascal VOC
dataset. In addition, extensive experiments were performed to measure its effectiveness by
reducing the number of backbone layers in the Mask R-CNN to 50 and removing different layers, including the fourth convolutional layer in the computation graph of the mask head of the FPN and the second convolutional layer in the convolution block in the ResNet graph.
The results obtained were detailed and compared with the accuracy of the proposed model.
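As an illustration of how such a backbone reduction can be expressed with the Keras/TensorFlow Mask R-CNN implementation [27], the short sketch below selects a ResNet-50 backbone; the class name VOCConfig and the specific values shown are assumptions for illustration and do not reproduce our exact training configuration.

```python
# Hedged sketch of a reduced ResNet-50 backbone using the Keras/TensorFlow
# Mask R-CNN implementation [27]. Values are illustrative, not our exact settings.
from mrcnn.config import Config
from mrcnn import model as modellib

class VOCConfig(Config):
    NAME = "pascal_voc"
    NUM_CLASSES = 1 + 20      # background + the 20 Pascal VOC object classes
    BACKBONE = "resnet50"     # reduced backbone; the library default is "resnet101"
    IMAGES_PER_GPU = 2

config = VOCConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")
```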
The performance of the developed object detection model was evaluated and compared with other state-of-the-art object detection models using the Pascal VOC 2007 dataset, and the results revealed that the proposed Mask R-CNN achieved the highest accuracy, at 83.9%. Moreover, a new method for specifying the position of detected objects within an
image was introduced, and an Arabic-speaking text-to-speech engine was incorporated
to utter the detected objects’ names and their positions within an image for the visually
impaired. We believe that this is the first approach to incorporate these services with an
object detection model to assist the Arabic-speaking visually impaired.
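As a hedged example of this final step of the pipeline, the sketch below passes one of the Arabic phrases from Table 4 to an off-the-shelf text-to-speech package; gTTS serves purely as a stand-in here and is not necessarily the engine incorporated in our prototype.

```python
# Illustrative only: an Arabic position phrase produced by the system is
# synthesized to speech. gTTS is a freely available stand-in, not necessarily
# the TTS engine used in our prototype.
from gtts import gTTS

phrase_ar = "شخص في الجزء العلوي من الجانب الأيمن"  # "person in the top of the right side"
gTTS(text=phrase_ar, lang="ar").save("detection_ar.mp3")
```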
Despite its efficacy and reliability, the proposed model has several limitations that
indicate potential directions for future research. First, the proposed model lacks Arabic-language real-time video stream detection, which is vital to developing the means for
Arabic-speaking visually impaired people to recognize their surroundings. Second, another
noticeable limitation is found with certain object classes: although the model performs
well at detecting different objects within images, some classes such as bicycle are hard to
segment, given their complex shapes and curved design, as shown in Table 4 Sample B.
Third, the model lacks interactivity; to further enhance the user experience, an Arabic
language application should be developed that enables users to interact with the system.
Objects with complex shapes, such as bicycles, which are characterized by their varying forms and curved designs, were shown to have a low average precision rate of identification within our model. Therefore, as part of its future development, the model
should be trained on more examples of such images to enhance its detection performance.
Moreover, the proposed object detection model is intended to be deployed as a mobile-based application, designed to detect objects not only in images but also in real-time videos. In addition, it will be designed to predict the Arabic caption of an image that
describes its contents for Arabic-speaking visually impaired people. The Web Content
Accessibility Guidelines 2.1 (WCAG 2.1) produced by the World Wide Web Consortium
(W3C) [36] will be followed to ensure the accessibility of the application.

Author Contributions: Conceptualization, H.H.A.-B.; Methodology, H.H.A.-B.; Software, N.A.; Validation, N.A. and H.H.A.-B.; Formal analysis, N.A.; Investigation, N.A.; Resources, N.A.; Writing—
original draft, N.A.; Writing—review & editing, H.H.A.-B.; Supervision, H.H.A.-B.; Project adminis-
tration, H.H.A.-B. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The two datasets analyzed during the current study are publicly available. The Pascal VOC 2007 and Pascal VOC 2012 datasets are available in the PASCAL
visual object classes challenge 2007 and 2012 repository, http://host.robots.ox.ac.uk/pascal/VOC/
index.html (accessed on 15 July 2022).
Acknowledgments: The authors would like to acknowledge the Researchers Supporting Project
Number RSP-2021/287, King Saud University, Riyadh, Saudi Arabia, for their support in this work.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object Detection With Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30,
3212–3232. [CrossRef] [PubMed]
2. Pathak, A.R.; Pandey, M.; Rautaray, S. Application of Deep Learning for Object Detection. Procedia Comput. Sci. 2018, 132,
1706–1717. [CrossRef]
3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587. [CrossRef]
4. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile,
7–13 December 2015; pp. 1440–1448. [CrossRef]
5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer
Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [CrossRef]
7. Wang, J.; Hu, X. Convolutional Neural Networks with Gated Recurrent Connections. IEEE Trans. Pattern Anal. Mach. Intell. 2022,
44, 3421–3436. [CrossRef] [PubMed]
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016;
pp. 779–788. [CrossRef]
9. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer
Vision-ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany,
2016; pp. 21–37.
10. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020,
arXiv:2004.10934.
11. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. You Only Learn One Representation: Unified Network for Multiple Tasks. arXiv 2021,
arXiv:2105.04206.
12. Bourne, R.R.A.; Flaxman, S.R.; Braithwaite, T. Magnitude, Temporal Trends, and Projections of the Global Prevalence of
Blindness and Distance and near Vision Impairment: A Systematic Review and Meta-Analysis. Lancet Glob. Health 2017, 5,
e888–e897. [CrossRef]
13. Zeried, F.M.; Alshalan, F.A.; Simmons, D.; Osuagwu, U.L. Visual Impairment among Adults in Saudi Arabia. Clin. Exp. Optom.
2020, 103, 858–864. [CrossRef] [PubMed]
14. Al-muzaini, H.A.; Al-yahya, T.N.; Benhidour, H. Automatic Arabic Image Captioning Using RNN-LSTM-Based Language Model
and CNN. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 67–73. [CrossRef]
15. Jindal, V. Generating Image Captions in Arabic Using Root-Word Based Recurrent Neural Networks and Deep Neural Networks.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student
Research Workshop, New Orleans, LA, USA, 2–4 June 2018; pp. 144–151. [CrossRef]
16. Zhou, G.; Zhang, W.; Chen, A.; He, M.; Ma, X. Rapid Detection of Rice Disease Based on FCM-KM and Faster R-CNN Fusion.
IEEE Access 2019, 7, 143190–143206. [CrossRef]
17. Kavitha Lakshmi, R.; Savarimuthu, N. DPD-DS for Plant Disease Detection Based on Instance Segmentation. J. Ambient. Intell.
Humaniz. Comput. 2021. [CrossRef]
18. Liu, X.; Zhao, D.; Jia, W.; Ji, W.; Ruan, C.; Sun, Y. Cucumber Fruits Detection in Greenhouses Based on Instance Segmentation.
IEEE Access 2019, 7, 139635–139642. [CrossRef]
19. Zhang, Q.; Chang, X.; Bian, S.B. Vehicle-Damage-Detection Segmentation Algorithm Based on Improved Mask RCNN. IEEE
Access 2020, 8, 6997–7004. [CrossRef]
20. de Vries, E.; Schoonvelde, M.; Schumacher, G. No Longer Lost in Translation: Evidence That Google Translate Works for
Comparative Bag-of-Words Text Applications. Polit. Anal. 2018, 26, 417–430. [CrossRef]
21. Cambre, J.; Colnago, J.; Maddock, J.; Tsai, J.; Kaye, J. Choice of Voices: A Large-Scale Evaluation of Text-to-Speech Voice
Quality for Long-Form Content. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems,
Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–13.
22. Mawdoo3 AI. Available online: https://ai.mawdoo3.com/ (accessed on 10 October 2022).
23. Google Colaboratory. Available online: https://colab.research.google.com/notebooks/intro.ipynb (accessed on 15 September 2022).
24. Project Jupyter. Available online: https://www.jupyter.org (accessed on 18 October 2022).
25. The PASCAL Visual Object Classes Homepage. Available online: http://host.robots.ox.ac.uk/pascal/VOC/index.html
(accessed on 15 July 2022).
26. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [CrossRef]
27. Abdulla, W. Mask R-CNN for Object Detection and Instance Segmentation on Keras and TensorFlow. In GitHub Repository;
Github: San Francisco, CA, USA, 2017.
28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [CrossRef]
29. Kelleher, J.D.; Namee, B.M.; D’Arcy, A. Fundamentals of Machine Learning for Predictive Data Analytics, Second Edition: Algorithms,
Worked Examples, and Case Studies; MIT Press: Cambridge, MA, USA, 2020.
30. TensorBoard. TensorFlow. Available online: https://www.tensorflow.org/tensorboard (accessed on 25 July 2022).
31. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [CrossRef]
32. Termritthikun, C.; Jamtsho, Y.; Ieamsaard, J.; Muneesawang, P.; Lee, I. EEEA-Net: An Early Exit Evolutionary Neural Architecture
Search. Eng. Appl. Artif. Intell. 2021, 104, 104397. [CrossRef]
33. Cao, J.; Pang, Y.; Han, J.; Li, X. Hierarchical Shot Detector. In Proceedings of the 2019 IEEE/CVF International Conference on
Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9704–9713. [CrossRef]
34. Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Localize to Classify and Classify to Localize: Mutual Guidance in Object Detection.
arXiv 2020, arXiv:2009.14085.
35. Alsudais, A. Image Classification in Arabic: Exploring Direct English to Arabic Translations. IEEE Access 2019, 7, 122730–122739.
[CrossRef]
36. Spina, C. WCAG 2.1 and the Current State of Web Accessibility in Libraries. Weav. J. Libr. User Exp. 2019, 2. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
