Professional Documents
Culture Documents
Abstract— High-resolution satellite imagery has been increas- perspectives (from a person [1] or a car [2]) present a known
ingly used on remote sensing classification problems. One of the limitation: there is a maximum distance in which crosswalks
main factors is the availability of this kind of data. Despite can be detected. Moreover, these images are quite different
the high availability, very little effort has been placed on the
zebra crossing classification problem. In this letter, crowdsourcing from those used in this letter, i.e., satellite imagery. Nonethe-
systems are exploited in order to enable the automatic acquisition less, Ahmetovic et al. [3] presented a method combining both
and annotation of a large-scale satellite imagery database for perspectives. Their algorithm searches for zebra crossings in
crosswalks related tasks. Then, this data set is used to train deep- satellite images. Subsequently, the candidates are validated
learning-based models in order to accurately classify satellite against Google Street View images. One of the limitations
images that contain or not contain zebra crossings. A novel data
set with more than 240 000 images from 3 continents, 9 countries, is that the precision of the satellite detection step alone, as the
and more than 20 cities was used in the experiments. The one proposed in this letter, is ≈ 20% lower than the combined
experimental results showed that freely available crowdsourcing algorithm. In addition, their system takes 180 ms to process
data can be used to accurately (97.11%) train robust models to a single image. Banich [4] presented a neural-network-based
perform crosswalk classification on a global scale. model, in which the author used 11 features of stripelets,
Index Terms— Crosswalk classification, deep learning, i.e., pair of lines. The author states that the proposed model
large-scale satellite imagery, zebra crossing classification. cannot deal with crosswalks affected by shadows and can only
detect white crosswalks. The aforementioned methods [1]–[4]
I. I NTRODUCTION were evaluated on manually annotated data sets that were
usually local (i.e., within a single city or a country) and
Z EBRA crossing classification and detection are important
tasks for mobility autonomy. Although important, there
are few data available on where crosswalks are in the world.
small (from 30 to less than 700 crosswalks).
Another approach that has been increasingly investigated
The automatic annotation of crosswalks’ locations worldwide is the use of aerial imagery, specially satellite imagery.
can be very useful for online maps, GPS applications, and Herumurti et al. [5] employed a circle mask template match-
many others. In addition, the availability of zebra crossing ing and SURF method to detect zebra crossing on aerial
locations on these applications can be of great use to people images. Their method was validated on a single region in
with disabilities, to road management, and to autonomous Japan and the method that detects most crosswalks took
vehicles. However, automatically annotating this kind of data 739.2 s to detect 306 crosswalks. Ghilardi et al. [6] pre-
is a challenging task. They are often aging (painting fading sented a model to classify crosswalks in order to help the
away), occluded by vehicle and pedestrians, darkened by visually impaired. Their model is based on an SVM classifier
strong shadows, and many other factors. fed with manually annotated crosswalk regions. As most of
A common approach to tackle these problems is using the the other related works, the data set used in their work is
camera on the phones to help the visually impaired people. local and small (900 small patches, and 370 of crosswalks).
Ivanchenko et al. [1] developed a prototype of a cell phone Koester et al. [7] proposed an SVM based on HOG and LBPH
application that determines if any crosswalk is visible and features to detect zebra crossings in aerial imagery. Although
aligns the user to it. Their system requires some thresholds to very interesting, their data set (not publicly available) was
be tuned, which may decrease performance delivered off-the- gathered manually from satellite photographs, therefore it is
shelf. Another approach is a camera mounted on a car, usually relatively small (3119 zebra crossings and ≈12500 negative
aiming driver assistance systems. Haselhoff and Kummert [2] samples) and local. In addition, their method shows a low gen-
presented a strategy based on the detection of line segments. eralization capability, because a model trained in one region
As discussed by the authors, their method is constrained shows low recall when evaluated in another (known as cross-
to crosswalks perpendicular to the driving direction. Both based protocol), e.g., the recall goes from 95.7% to 38.4%.
In this letter, we present a system that is able to automat-
Manuscript received March 16, 2017; revised April 27, 2017 and ically acquire and annotate zebra crossings satellite imagery,
May 26, 2017; accepted June 16, 2017. This work was supported in part and train deep-learning-based models for crosswalk classifi-
by UFES, in part by CAPES for the scholarships, and in part by CNPq under
Grant 311120/2016-4. (Corresponding author: Rodrigo F. Berriel.) cation in large scale. The proposed system can be used to
The authors are with the Universidade Federal do Espirito Santo, Vitória automatically annotate crosswalks worldwide, helping systems
29075-910, Brazil (e-mail: rfberriel@inf.ufes.br). used by the visually impaired, autonomous driving technolo-
Color versions of one or more of the figures in this letter are available
online at http://ieeexplore.ieee.org. gies, and others. This system is the result of a comprehensive
Digital Object Identifier 10.1109/LGRS.2017.2719863 study. This letter assesses the quality of the available data,
1545-598X © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Fig. 1. System architecture. The input is a region (red dashed rectangle) or a set of regions of interest. First, known crosswalk locations (red markers) are
retrieved using the OSM. Second, using Google Maps Directions API, paths (blue dashed arrows) between the crosswalk locations are defined. Third, these
paths are decoded and the result locations are filtered (only locations within the green area are accepted) in order to decrease the amount of wrongly annotated
images. At this point, positive and negative samples can be downloaded from Google Static Maps API. Finally, this large-scale satellite imagery is used to
train ConvNets to perform zebra crossing classification.
the most suitable model and its performance in several imagery points: minimum latitude, minimum longitude, maximum lati-
levels (city, country, continent, and global). In fact, this tude, and maximum longitude (or South–West–North–East),
letter is performed on real-world satellite imagery (almost i.e., the bottom-left and top-right corners (e.g., 40.764498,
250 000 images) acquired and annotated automatically. Given −73.981447, 40.799976, and −73.949402 defines New York’s
the noisy nature of the data, results are also compared with Central Park, NY, USA). For each region of interest, zebra
human annotated data. The proposed system is able to train crossing locations are retrieved from the OSM using the
models that achieve 97.11% on a global scale. Overpass API.3 Regions larger than one-fourth degree in
either dimension are likely to be split into multiple regions,
II. P ROPOSED M ETHOD as OSM servers may reject these requests even though most
of the settled part of the cities around the world meets this
The system comprises two parts: automatic data acqui-
limitation (e.g., the whole city of Niterói, RJ, Brazil, fits into
sition and annotation, and model training and classifica-
a single region).
tion. An overview of the proposed method can be seen
To download the zebra crossings, the proposed system uses
in Fig. 1. First, the user defines the regions of interest (regions
the tag highway = crossing of the OSM, which is one
where he wants to download crosswalks). The region of
of the most reliable and most used tag for this purpose.
interest is given by the lower-left and the upper-right cor-
In possession of these crosswalk locations, the system needs to
ners. After that, crosswalk locations within these regions
automatically find locations without zebra crossing to serve as
are retrieved from the OpenStreetMap (OSM).1 Subsequently,
negative samples. As simple as it may look, negative samples
using the zebra crossing locations, positive and negative
are tricky to find because of several factors. One of these is the
images (i.e., images that contain and do not contain crosswalks
relatively low coverage of the zebra crossing around the world.
on it, respectively) are downloaded using the Google Static
Another is the fact that crosswalks are susceptible to changes
Maps API.2 As the location of the crosswalks are known,
over the years. They may completely fade away, they can be
the images are automatically annotated. Finally, these automat-
removed, and streets can change. Alongside these potential
ically acquired and annotate images are used to train a con-
problems, OSM is built by volunteers, therefore open to con-
volutional neural network (ConvNet) from the scratch to per-
tributions that may not be as accurate as expected. Altogether,
form classification. Each process is described in detail in the
these factors indicate how noisy the zebra crossing locations
following sections.
may be. Therefore, picking up no-crosswalk locations must
be done cautiously. The first step to acquire good locations
A. Automatic Data Acquisition and Annotation for the negative samples is to filter only regions that contain
To automatically create a model for zebra crossing clas- roads. For that, the system queries Google Maps Directions
sification, the first step is the image acquisition. Initially, API for directions from a crosswalk to another to ensure
regions of interest are defined by the user. These regions can that the points will be within a road. In order to lower the
either be defined manually or automatically (e.g., based on number of requests to the Google Maps Directions API, our
a given address, city, and so on). The region is rectangu- system adds 20 other crosswalks as waypoints between two
lar (with the True North up) and is defined by four coordinate crosswalk points (the API has a limit of up to 23 waypoints and
20 ensures this limit). The number of waypoints does not affect
1 http://www.openstreetmap.org
2 http://developers.google.com/maps/documentation/static-maps/ 3 http://wiki.openstreetmap.org/wiki/Overpass_API
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE I TABLE II
N UMBER OF I MAGES ON THE D ATA S ETS G ROUPED BY C ONTINENTS D IFFERENT A RCHITECTURES U SING I NTRA -BASED P ROTOCOL
C. Experiments
Several experiments were designed to evaluate the proposed
system. Initially, three well-known ConvNet architectures were
TABLE III
evaluated: AlexNet, VGG, and GoogLeNet. The images down-
I NTRA -BASED R ESULTS FOR THE VGG N ETWORK
loaded from the Google Static Maps API were upsampled
from 200 × 200 to 256 × 256 using bilinear interpolation.
As a data augmentation procedure, the images were randomly
cropped (224×224 to the VGG and GoogLeNet, and 227×227
to the AlexNet—as in the original networks), and the images
were randomly mirrored on-the-fly.
It is known that crosswalks vary across different cities,
countries, and continents. In this context, some experiments
were designed based on the expectation that crosswalks
belonging to the same locality (e.g., same city, same coun-
try, and so on) are more likely to present similar features.
Therefore, in these experiments, some models were trained
30 epochs and the learning rate was decreased three times by
and evaluated on the same locality, and they were named with
a factor of 10. The initial learning rate was set to 10−4 to all
the prefix “intra:” intracity, intracountry, intracontinent, and
three networks.
intraworld. At the smaller scales (e.g., city and country) only
Finally, the annotations from the OSM may not be as accu-
some of the data sets were used (assuming the performance
rate as required due to many factors (e.g., human error, zebra
may be generalized), at higher levels all available data sets
crossing was removed, and so on) even though we assume that
were used. These data sets were randomly chosen considering
the vast majority of them are correct and accurate. To validate
all cities with high annotation density. Ultimately, the intra-
that, an experiment using manually labeled data sets from the
world experiment was also performed.
three continents (America, Europe, and Asia—44 175 images
Besides these intra-based experiments, in which samples
in total) was performed. The experiment was designed to have
tend to have high correlation (i.e., present more similari-
three results: error of the automatic annotation; performance
ties), cross-based experiments were performed. The exper-
increase when switching from automatic labeled training data
iments named with the prefix “cross” are those in which
to manually labeled; and correlation between the error of the
the model was trained with a data set and evaluated in
automatic annotation and the error of the validation results.
another with the same level of locality, i.e., trained using
data from a city and tested in another city. Cross-based
experiments were performed on the three levels of locality: IV. R ESULTS
city, country, and continent. Another experiment performed Several experiments were performed and their accuracy and
was the cross-level, i.e., the model trained using an upper F1 score were reported. Initially, different architectures were
level imagery (e.g., the world model) was evaluated in lower evaluated on two levels of locality. As can be seen in Table II,
levels (e.g., continents). VGG achieved the best results. It is interesting to note that
All the experiments aforementioned share a common setup. AlexNet, a smaller model that can be loaded into smaller
Each data set was divided into train, validation, and test sets, GPUs, also achieved competitive results.
with 70%, 10%, and 20%, respectively. For all the experi- Regarding the intra-based experiments, the chosen model
ments, including the cross-based ones (cross-level included), showed to be very consistent across the different levels of
none of the images in the test set were seen by the models locality. The model achieved 96.9% of accuracy (on average)
during training or validation, i.e., train, validation, and test and the details of the results can be seen in Table III.
sets were exclusive. In fact, the cross experiments used only As expected, the cross-based models achieved a lower over-
the test set of the respective data set on which they were all accuracy when compared to the intrabased. Nevertheless,
evaluated. This procedure enables fairer comparisons and these models were still able to achieve high accuracies (94.8%
conclusions about the robustness of the models, even though on average, see Table IV). This can be partially explained by
the entire data sets could have been used on the cross-based inherent differences on the images between places far apart,
models. Regarding the training, all models were trained during i.e., different solar elevation angles cause notable differences
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.