
2019 IEEE International Conference on Robotics, Automation, Artificial Intelligence and Internet of Things (RAAICON)

29 November-1 December 2019, Dhaka, Bangladesh

Zebra Crosswalk Region Detection and Localization Based on Deep Convolutional Neural Network
Md. Masud Haider, Mohammad Rokibul Hoque, Md. Khaliluzzaman* and Mohammad Mahadi Hassan
Dept. of Computer Science and Engineering
International Islamic University Chittagong (IIUC)
Chittagong-4318, Bangladesh
mud65636@gmail.com, rokib2244@gmail.com, *khalides2@gmail.com and mahadi_cse@yahoo.com

978-1-7281-5852-5/19/$31.00 ©2019 IEEE

Abstract: It can be difficult for blind people and people with limited
visual capabilities to find street intersections containing a crosswalk,
along with their accurate location. In this paper, a solution to this
issue is proposed through a deep convolutional neural network (DCNN)
architecture that automatically organizes several characteristics of
zebra stripe crosswalks to support quick, accurate and reliable
identification and detection of a crosswalk in an image. The proposed
method uses Faster R-CNN Inception-v2 to identify and locate crosswalks,
which has sparse convolutions on the same layer to reduce computational
load while increasing accuracy. We focused on a single class, crosswalk,
training the network with images of our own dataset combined with
extracted image frames. To the best of our knowledge, the proposed
framework is the first to utilize deep architectures for crosswalk
detection and localization from the street-level view. It achieves an
accuracy of 97.50% and is compared to a previous method to show higher
detection accuracy than recent works.

Keywords: Zebra crosswalk; DCNN; Faster RCNN; Inception-V2

I. INTRODUCTION
Independent mobility for the visually impaired has been a constantly
active research topic in recent decades. Road traffic deaths have
increased in many countries in recent times, while no sign of decrease
was seen in low-income countries and only a minor decrease in middle-
and high-income countries. Zebra crosswalks are painted in alternating
black and white stripes that give pedestrians some rights over vehicles
while crossing the street. They are more visible to drivers than
two-line crosswalks, which can prevent serious injuries. Zebra crossing
detection is a constantly challenging task in computer vision because of
crosswalk type, alternating stripes that are very similar to other
structures, viewing angles, occlusion by other objects, diverse and
unpredictable lighting conditions and so on. Although blind or visually
impaired people can navigate in outdoor environments with white canes,
trained pets, auditory cues from moving vehicles and other assistance,
crossing busy street intersections is much more difficult without any
visual signs from these aids [2].

That is why recognizing and locating crosswalks is the most fundamental
and crucial step to navigate in outdoor street environments while
overcoming the aforementioned challenges. The proposed framework will
help visually impaired people in unknown street environments to cross
the street efficiently while minimizing the risks as much as possible.

There exists a lot of work that can detect crosswalks with different
successful approaches. However, not much work has been done with the
help of deep learning that can detect and localize the whole crosswalk
without many pre-processing steps. To be specific, a fully deep neural
network architecture that alone detects and localizes crosswalks is not
present in the computer vision literature. So, one of the key
contributions of this research work is to detect and recognize zebra
crossings with the help of a deep CNN architecture. Another key
contribution of this research is to create a robust model that rejects
false structures while correctly identifying crosswalks even in complex
illumination conditions.

II. LITERATURE REVIEW
Every zebra crossing has unique characteristics such as the number of
black and white stripes, completion of every horizontal stripe, presence
of a vertical line at the side of the crosswalk, the size of both the
crosswalk and the stripes, width, color intensity, lighting conditions,
impaired crosswalks and so on. Many researchers utilized these
characteristics to detect and recognize the zebra crosswalk in different
environmental conditions.
Many researchers used feature based, deep learning based, and smartphone
based algorithms to detect and localize the zebra crosswalk. For
example, in [1], line based algorithms along with color intensity
variation and pose estimation are utilized for detection of crosswalks.
In [4, 5] the authors utilize a mobile phone accelerometer to acquire
the horizontal orientation of the crosswalk, combined with their line
detection based library. The first portable phone camera based zebra
crosswalk detection system is introduced in [9]. The system uses edge
segments extracted from the crosswalk, followed by figure-ground
segmentation, to provide real-time orientation information.
Perpendicular or street view crosswalks are detected in [11] with the
help of a high-pass filter, crosswalk stripe counting and dimensions,
intensity histograms etc.

Some deep learning and satellite imagery based zebra crosswalk detection
methods are proposed in [2], [10] and [13]. Crosswalk detection from
aerial images, especially satellite images, has gained momentum due to
smartphone capabilities with GPS and online maps. In [2] the authors
utilize a deep architecture, i.e., MobileNetv2, in their smartphone app
to provide the midline of crosswalks and direction information within a
range. A wearable mobility aid is presented by [13] that uses CNNs in a
much reduced form for crosswalk classification only, where 3D RGBD depth
maps followed by an FPGA with pre-programmed 3D vision and a RANSAC
Authorized licensed use limited to: University College London. Downloaded on May 23 2020 at 09:21:39 UTC from IEEE Xplore. Restrictions apply
framework does the heavy work. A similar approach is used by [10] to
acquire images for processing, and features extracted with HOG and LBPH
are used to train an SVM classifier on crosswalk images.
In [6], the authors use online map services extensively to acquire and
annotate images of crosswalks and classify them accordingly. First,
crosswalk locations for user defined regions are retrieved through
OpenStreetMap. Then the Google Static Maps API downloads positive and
negative images, followed by annotation for known crosswalks. In the
final step, these annotated images are used for the training of three
different architectures, AlexNet, VGG and GoogleNet, to classify
crosswalks.
In [12], a smartphone based method uses a 360° image panorama to produce
an aerial view image that is finally matched to a downloaded template
for verification. In [7], on the other hand, cloud services determine
the user's location and process images while running the Pedestrian
Signal Detection algorithm on the server to determine whether it is safe
to cross the intersection.

III. THE PROPOSED FRAMEWORK

In this work, a zebra crosswalk detection and localization framework is
established based on a deep convolutional neural network architecture.
The architecture automatically extracts various features of the zebra
crosswalk to detect and localize the candidate region accurately in an
input image.

A. Train and Test Dataset

Sufficient image collection to form a dataset is the fundamental step in
training a DCNN. Our dataset comprises our own smartphone images along
with image frames extracted from street videos. The smartphone images
were taken by us with a 12 megapixel camera at various locations in
Chittagong, Bangladesh. These images were taken in diverse conditions,
including morning, daylight, low light, evening, shadow, rainfall and
night. We took almost all the images from the view of a pedestrian
standing at a street intersection, except for some that were taken
perpendicular to the street. Then street video frames were extracted and
combined with the smartphone images. We only considered zebra style
crosswalks and did not include two-line crosswalks. The dataset consists
of 726 images, of which 582 are train images and 144 are test images,
without any overlap. The number of images used in the different
environmental conditions is shown in Table I. As a rule of thumb for
Faster R-CNN [14], images were resized to a height of 600 and a width of
1024 pixels prior to labeling all the positive images manually with
LabelImg. All images without a crosswalk were considered negative, and
hence training includes no negative images. All the image label files
were converted to CSV to store every image's annotation coordinates
(top-left, bottom-right) in a single file, which was used to build the
TensorFlow TFRecord, where knowledge of the object class (Zebra Crossing
in our case) is provided. Finally, separate train and test records split
the dataset into two partitions. Some sample images of the dataset are
depicted in Fig. 1.

Fig. 1. Sample images of zebra crosswalk dataset.

TABLE I. Train and test images used in the different environmental
conditions

Environments   Train Images   Test Images   Total
Sunny          236            79            315
Low Light      234            37            271
Shadow         24             8             32
Night          73             17            90
Rain           15             3             18
Total          582            144           726

B. Proposed Framework
In this proposed framework, the deep CNN model that we utilize is the
Faster R-CNN Inception v2 model [3]. A CNN in Faster R-CNN [8][14]
extracts image features. Inception [15] has a layer-by-layer structure,
but unlike other deep architectures it goes wider rather than deeper.
The Inception module removes the need to choose a single filter size: to
compensate for object scale variation in the image, it applies branches
of varying filter sizes on the same layer. Also, batch normalization
(BN), introduced by [16] and applied to Inception v2, reduces internal
covariate shift and speeds up training.

Auxiliary classifiers handle vanishing gradients while calculating loss
during training. Inception v2 modules [3] are shown in Fig. 2. After
feature extraction by the CNN, the Region Proposal Network (RPN)
predicts likely regions containing crosswalks, with an objectness score
and a bounding box location, on the last layer feature map. Region of
interest (ROI) pooling normalizes the region proposals by pooling each
of them into a fixed-size feature map to increase the accuracy of
proposal prediction. Finally, the unified detection network with its two
sibling layers, softmax and bounding box regression, performs
classification and bounding box coordinate correction respectively to
improve the final predicted crosswalk outcome with a refined location.
All training and testing is performed on a CPU. The whole flow diagram
of the proposed framework is shown in Fig. 3.
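The annotation conversion described in Section III-A (label files to a single CSV of top-left and bottom-right coordinates) can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes LabelImg was used with its default Pascal VOC XML output, and the file name, box coordinates and CSV column order are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical Pascal VOC style annotation for one resized image
# (file name and coordinates are made up for illustration).
SAMPLE_XML = """<annotation>
  <filename>crosswalk_001.jpg</filename>
  <size><width>1024</width><height>600</height><depth>3</depth></size>
  <object>
    <name>Zebra Crossing</name>
    <bndbox><xmin>120</xmin><ymin>310</ymin><xmax>900</xmax><ymax>580</ymax></bndbox>
  </object>
</annotation>"""

# Assumed column layout; the paper does not list the exact CSV columns.
CSV_COLUMNS = ["filename", "width", "height", "class",
               "xmin", "ymin", "xmax", "ymax"]


def xml_to_rows(xml_text):
    """Return one row per labeled box: image size, class, and box corners."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append([
            filename, width, height, obj.findtext("name"),
            int(box.findtext("xmin")), int(box.findtext("ymin")),  # top-left
            int(box.findtext("xmax")), int(box.findtext("ymax")),  # bottom-right
        ])
    return rows


def rows_to_csv(rows):
    """Serialize the header and all annotation rows into one CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(CSV_COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

In practice the rows of all label files would be concatenated before writing, so that one CSV per partition (train, test) feeds the TFRecord generation step.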
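The saving obtained when a 5×5 convolution is replaced by two stacked 3×3 convolutions, as in the Inception module of Fig. 2 b), can be illustrated by counting weights. The sketch below is only an arithmetic illustration: the helper `conv_weights` and the channel count of 64 are hypothetical, bias terms are ignored, and the input and output channel counts are assumed equal.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one square convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # hypothetical channel count, same for input and output

# One 5x5 layer versus two stacked 3x3 layers with the same receptive field.
single_5x5 = conv_weights(5, channels, channels)
stacked_3x3 = 2 * conv_weights(3, channels, channels)

# Relative reduction in weights: 1 - 18/25.
saving = 1 - stacked_3x3 / single_5x5
print(single_5x5, stacked_3x3, f"{saving:.0%}")
```

The ratio 18/25 does not depend on the channel count chosen here, which is why the factorized module is cheaper at every layer where it is applied.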
Fig. 2. Architecture of Inception network: a) original Inception module,
b) Inception module after the 5×5 convolution is replaced by two 3×3
convolutions, and c) Inception module after n×n factorization.

Fig. 3. Flow diagram of crosswalk region detection and localization
using Faster R-CNN and Inception v2.

C. Experimental Results
We evaluated our model on the dataset that we built, using the
TensorFlow 1.12 library. The cross-entropy loss implementation provided
with the TensorFlow Faster R-CNN Inception v2 network architecture was
used to evaluate the training subset frequently. We scheduled our model
to run initially at a learning rate of 0.0002, and after 90k and 120k
iterations at learning rates of 0.00002 and 0.000002 respectively. We
applied early stopping when the model converged at about 45k iterations.
Batch size was one, i.e., one image was processed per iteration. The
loss curve of the training model is presented in Fig. 4.
The performance of the proposed framework has been tested in different
lighting conditions as well as on different types of zebra crossings to
evaluate the accuracy and efficiency of the detection and recognition of
zebra crossings.

Fig. 4. Loss curve of the training model.

To understand the efficiency and correctness of the proposed system,
some processing examples of different illumination conditions and
orientations of crosswalks are demonstrated. The detection accuracy
tables for varying crosswalks are given with the calculation of true
positive (TP), true negative (TN), false positive (FP), false negative
(FN), intersection over union (IoU), precision, recall, F1-score and
accuracy itself. The average accuracy over the orientations and lighting
conditions of the crosswalk samples is calculated too.
True positive refers to the portion that belongs to the original
crosswalk region and is correctly detected. True negative refers to the
portion that does not belong to the crosswalk region and thus is not
detected. False positive refers to the portion that is detected but is
not a part of the original crosswalk region. False negative refers to
the portion that is not detected but is a part of the original crosswalk
region. Since we know the coordinates of the original region, we can
calculate the true positive by considering the amount of area detected
correctly. Similarly, we can calculate the other measurements
accordingly.

$$\mathrm{IoU} = \frac{\text{Area of overlap between ground-truth and predicted crosswalk}}{\text{Area of union between ground-truth and predicted crosswalk}} \quad (1)$$

Precision, also known as positive predictive value, is the percentage of
correct positive predictions among all positive predictions. It is
calculated by Eq. (2).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)$$

Recall, also known as sensitivity, is the percentage of total positive
cases that the classifier catches correctly. It is calculated by Eq.
(3).

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (3)$$

The F1-score is defined as the harmonic mean of the recall and
precision. The F1-score is defined by Eq. (4).

$$\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (4)$$

The accuracy is calculated by Eq. (5).

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \quad (5)$$

Fig. 5 demonstrates the detection results of some example images of
crosswalks under different environment and lighting conditions. Sample 1
in Fig. 5 shows the detection results for a crosswalk image with normal
illumination on a sunny day. Sample 2 presents a low light condition for
the crosswalk image. The whole crosswalk has low illumination and the
surrounding area has little light. Sample 3 shows the experimental
results in an uneven illumination or shadow environment, where different
portions of the crosswalk have varying high and low illumination.
Samples 4 and 5 represent night and rainy weather respectively. Sample 4
has artificial lights on the crosswalk and very dim illumination, and
Sample 5 has surface reflection of light. The experimental result for a
crosswalk where a portion of the crosswalk is occluded by a human is
shown in Sample 6. Bounding box rectangles are shown with thick borders
for viewing purposes, although real detections have thin borders.
The machine learning metrics TP, TN, FP, FN and IoU for Samples 1 to 6
are shown in Table II. The crosswalk's ROI detection and localization
accuracy at different environmental conditions is shown in Table III.
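The metric definitions of Eqs. (2) to (5) can be checked numerically against the reported results. The sketch below is a minimal illustration rather than the authors' evaluation code: it takes the TP, TN, FP and FN percentages reported in Table II as input, takes the F1-score as the harmonic mean of precision and recall as stated in the text, and rounds to two decimals; the names `TABLE_II` and `metrics` are hypothetical.

```python
# (TP, TN, FP, FN) percentages per sample, as reported in Table II.
TABLE_II = {
    "Sample 1": (92.0, 100.0, 8.0, 0.0),
    "Sample 2": (98.0, 100.0, 2.0, 0.0),
    "Sample 3": (94.0, 100.0, 6.0, 0.0),
    "Sample 4": (96.0, 98.0, 2.0, 2.0),
    "Sample 5": (99.0, 100.0, 0.0, 0.0),
    "Sample 6": (90.0, 100.0, 10.0, 0.0),
}


def metrics(tp, tn, fp, fn):
    """Return (precision, recall, F1-score, accuracy) in percent."""
    precision = tp / (tp + fp)                           # Eq. (2)
    recall = tp / (tp + fn)                              # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (4), harmonic mean
    accuracy = (tp + tn) / (tp + fp + tn + fn)           # Eq. (5)
    return tuple(round(100 * m, 2) for m in (precision, recall, f1, accuracy))


results = {name: metrics(*counts) for name, counts in TABLE_II.items()}
for name, values in results.items():
    print(name, values)

# Mean of the per-sample accuracies.
average_accuracy = round(
    sum(values[3] for values in results.values()) / len(results), 2)
print("Average accuracy:", average_accuracy)
```

Under these assumptions the computed values agree with Table III; for example, Sample 4 gives 97.96, 97.96, 97.96 and 97.98, and the mean of the six accuracies rounds to 97.50.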

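The IoU definition (area of overlap divided by area of union of the ground-truth and predicted regions) can be sketched for axis-aligned bounding boxes as follows. This is an illustrative implementation under the assumption that both regions are rectangles given as (xmin, ymin, xmax, ymax); the two boxes used in the example are hypothetical and not taken from the dataset.

```python
def box_area(box):
    """Area of an (xmin, ymin, xmax, ymax) box; zero if the box is empty."""
    xmin, ymin, xmax, ymax = box
    return max(0, xmax - xmin) * max(0, ymax - ymin)


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    intersection = (
        max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]), min(box_a[3], box_b[3]),
    )
    overlap = box_area(intersection)
    union = box_area(box_a) + box_area(box_b) - overlap
    return overlap / union if union > 0 else 0.0


# Hypothetical ground-truth and predicted crosswalk boxes on a 1024x600 image.
ground_truth = (100, 300, 900, 580)
predicted = (150, 320, 950, 600)
print(round(iou(ground_truth, predicted), 3))
```

Identical boxes give an IoU of 1 and disjoint boxes give 0, which makes the measure convenient for judging how well a predicted region is localized.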
[Fig. 5 panels: Sample 1: Sunny; Sample 2: Low light; Sample 3: Shadow;
Sample 4: Night; Sample 5: Rainy; Sample 6: Occluded crosswalk]

Fig. 5. Crosswalk images in different environmental and lighting
conditions: a) original experimental image, b) ground-truth image,
c) predicted image, and d) IoU of ground-truth and predicted image.

TABLE II. TP, TN, FP, FN and IoU of crosswalk detected ROI at different
environmental conditions

Samples    TP (%)   TN (%)   FP (%)   FN (%)   IoU (%)
Sample 1   92.00    100      8.00     0.00     87.00
Sample 2   98.00    100      2.00     0.00     86.50
Sample 3   94.00    100      6.00     0.00     83.00
Sample 4   96.00    98.00    2.00     2.00     89.00
Sample 5   99.00    100.00   0.00     0.00     87.00
Sample 6   90.00    100      10.00    0.00     85.00

TABLE III. Crosswalk ROI detection and localization accuracy at
different environmental conditions

Samples    Precision (%)   Recall (%)   F1-score (%)   Accuracy (%)
Sample 1   92.00           100          95.83          96.00
Sample 2   98.00           100          98.99          99.00
Sample 3   94.00           100          96.91          97.00
Sample 4   97.96           97.96        97.96          97.98
Sample 5   100             100          100            100
Sample 6   90.00           100          94.74          95.00
Average    95.33           99.66        97.41          97.50

Average accuracy (%): 97.50

Fig. 6 demonstrates some examples of false detections predicted by our
proposed method. False positives happen due to structures that are very
similar to crosswalks, such as road markings with alternating patterns,
stairs viewed against the illumination, shadows similar to crosswalks
etc.

[Fig. 6: false detection examples, panels (a) to (d)]

D. Comparisons and Discussions

The efficiency of the proposed method is further demonstrated by
comparing it with the research work of [6], which uses deep
architectures such as AlexNet, GoogleNet, and the VGG network, and
obtains its best accuracy from VGG.

TABLE IV. Comparison of the proposed method

Method                Accuracy (%)
Proposed Framework    97.50
With VGG [6]          96.04
With AlexNet [6]      97.00
With GoogleNet [6]    96.69

Our method has the best accuracy, 97.50%, over different environmental
situations that cover the most diverse weather conditions. While [6]
only classifies crosswalks using CNNs, and many of its image acquisition
and annotation steps do not involve any learning, the proposed method
can both classify and detect crosswalks in diverse and complex scenarios
with the proposed network, as shown in Fig. 5. These images show the
robustness of the model to different viewing angles, scale variance of
crosswalks in both horizontal and vertical directions, occlusion etc.
Moreover, [6] achieves its highest accuracy with the very deep,
computationally expensive VGG network, while our proposed framework
surpasses it with a comparatively cheaper network.

IV. CONCLUSION

We present a method based on a DCNN model that can detect zebra
crosswalks in diverse weather and lighting conditions. Moreover, our
model can detect multiple crosswalks in different orientations without
the need for any extra processing, with an accuracy of 97.50%. We
trained the framework with our own images and with frames taken from
street videos. The proposed method uses Faster R-CNN and Inception v2,
where Inception v2 works by going wider to reduce bottlenecks without
hurting accuracy. There is certain scope for improvement in our model.
Future work will implement the system in real-time environments in
complex scenarios and improve the accuracy.
References
[1] S. Se, "Zebra-crossing detection for the partially sighted," in Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662), vol. 2, pp. 211-217, IEEE, 2000.
[2] S. Yu, H. Lee, and J. Kim, "LYTNet: A Convolutional Neural Network for Real-Time Pedestrian Traffic Lights and Zebra Crossing Recognition for the Visually Impaired," in International Conference on Computer Analysis of Images and Patterns, pp. 259-270, Springer, Cham, 2019.
[3] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818-2826, IEEE, 2016.
[4] D. Ahmetovic, C. Bernareggi, A. Gerino, and S. Mascetti, "Zebra Recognizer: Efficient and Precise Localization of Pedestrian Crossings," in 2014 22nd International Conference on Pattern Recognition, pp. 2566-2571, IEEE, 2014.
[5] D. Ahmetovic, C. Bernareggi, and S. Mascetti, "Zebralocalizer: identification and localization of pedestrian crossings," in Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services, pp. 275-284, ACM, 2011.
[6] R. F. Berriel, A. T. Lopes, A. F. d. Souza, and T. Oliveira-Santos, "Deep Learning-Based Large-Scale Automatic Satellite Crosswalk Classification," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 9, pp. 1513-1517, IEEE, 2017.
[7] P. A. a. L. D. Bharat K. Bhargava, "A Mobile-Cloud Pedestrian Crossing Guide for the Blind," in International Conference on Advances in Computing & Communication, 2011.
[8] R. Girshick, "Fast R-CNN," in 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440-1448, IEEE, 2015.
[9] V. Ivanchenko, J. Coughlan, and S. Huiying, "Detecting and locating crosswalks using a camera phone," in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1-8, IEEE, 2008.
[10] D. Koester, B. Lunt, and R. Stiefelhagen, "Zebra Crossing Detection from Aerial Imagery Across Countries," in International Conference on Computers Helping People with Special Needs, pp. 27-34, Springer, Cham, 2016.
[11] X. Liu, Y. Zhang, and Q. Li, "Automatic Pedestrian Crossing Detection and Impairment Analysis Based on Mobile Mapping System," ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. IV-2/W4, pp. 251-258, 2017.
[12] V. N. Murali and J. M. Coughlan, "Smartphone-based crosswalk detection and localization for visually impaired pedestrians," in 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1-7, IEEE, 2013.
[13] M. Poggi, L. Nanni, and S. Mattoccia, "Crosswalk Recognition Through Point-Cloud Processing and Deep-Learning Suited to a Wearable Mobility Aid for the Visually Impaired," in New Trends in Image Analysis and Processing - ICIAP 2015 Workshops, pp. 282-289, Springer International Publishing, Cham, 2015.
[14] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, IEEE, 2017.
[15] C. Szegedy et al., "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, IEEE, 2015.
[16] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," Journal of Machine Learning Research, vol. 37, pp. 448-456, 2015.
