
REIN the RobuTS: Robust DNN-Based Image Recognition in Autonomous Driving Systems

Fuxun Yu, Zhuwei Qin, Chenchen Liu, Di Wang, and Xiang Chen, Member, IEEE

Abstract—In recent years, neural networks (NNs) have shown great potential in the image recognition tasks of autonomous driving systems, such as traffic sign recognition and pedestrian detection. However, theoretically well-trained NNs often fail when facing real-world scenarios. For example, adverse real-world conditions, e.g., bad weather and lighting conditions, can introduce different physical variations and cause considerable accuracy degradation. As of now, the generalization capability of NNs remains one of the most critical challenges for autonomous driving systems. To facilitate robust image recognition tasks, in this work, we build the RobuTS dataset: a comprehensive Robust Traffic Sign recognition dataset, which includes images with different environmental variations, e.g., rain, fog, darkening, and blurring. Then, to enhance the NN's generalization capability, we propose two generalization-enhanced training schemes: 1) REIN, for robust training without data from adverse scenarios, and 2) Self-Teaching (ST), for robust training with unlabeled adverse data. The great advantage of these two training schemes is that they are data-free (REIN) and label-free (ST), effectively reducing the huge human effort/cost of on-road driving data collection, as well as expensive manual data annotation. We conduct extensive experiments to validate our methods' performance on both classification and detection tasks. For classification tasks, our proposed training algorithms consistently improve model performance by +15%–25% (REIN) and +16%–30% (ST) in all adverse scenarios of our RobuTS dataset. For detection tasks, our ST also improves the detector's performance by +10.1 mean average precision (mAP) on Foggy-Cityscapes, outperforming previous state-of-the-art works by +2.2 mAP.

Index Terms—Autonomous driving, deep neural network (NN), robust image recognition.

Manuscript received March 1, 2020; revised June 21, 2020; accepted September 28, 2020. Date of publication October 23, 2020; date of current version May 20, 2021. This work was supported in part by the NSF under Grant 1717775. (Corresponding author: Fuxun Yu.) Fuxun Yu, Zhuwei Qin, and Xiang Chen are with the School of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030 USA (e-mail: fyu2@gmu.edu). Chenchen Liu is with the Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250 USA. Di Wang is with Microsoft Cognition, Microsoft, Redmond, WA 98052 USA. Digital Object Identifier 10.1109/TCAD.2020.3033498

I. INTRODUCTION

IN RECENT years, neural networks (NNs) with exceptional performance have shown great potential in autonomous driving systems [1]–[3]. However, an NN with 100% accuracy in theoretical testing is still not ready to go on cars. Many practical factors affect NN performance, such as different weather, lighting conditions, camera/sensor discrepancy, etc. [4]–[6]. These factors may introduce unexpected variations to the real-world images, which are beyond the limited training data and can cause considerable accuracy degradation. As shown in Fig. 1(a), traffic sign examples under adverse weather conditions are wrongly classified by a well-trained model. Such situations can cause severe safety issues; e.g., several autonomous driving accidents due to self-driving system errors have already been reported during the last few years [7]–[10].

The current autonomous driving systems' inability to handle real-world variations demonstrates a functionality flaw of state-of-the-art NNs, i.e., insufficient generalization capability. Generalization describes a model's capability to extend its functionality from finite training cases to infinite unseen testing scenarios. To enhance the generalization ability of autonomous driving systems, many research works have emerged: Lim et al. [11] targeted traditional data augmentation techniques, i.e., trying to include as many fully annotated training examples under adverse weather conditions as possible, including rainy and cloudy images; Tian et al. [7] proposed an NN verification and testing system that, through large amounts of simulated scenarios, helps find corner cases in which autonomous driving systems fail to operate correctly; and several other works aim to recognize and remove the practical variations from the real-world captured image, thereby restoring theoretically clean testing scenarios [4], [5], [12], [13].

Although these works help alleviate the problem to some extent, they mainly focus on "compensating" the NN generalization ability with large amounts of auxiliary data and post-training effort. Rather than enhancing the generalization ability itself, most of these methods fall into constructing case-specific or scenario-specific networks, which cover limited scenarios only and may still fail under new unexpected conditions. By comparison, a common industry practice is to collect tremendous amounts of data through thousands of hours of on-road driving, hoping the training data will cover as many corner cases as possible to enable the model's generalization capability in the real world. However, such practice comes with not only huge human driving effort (in years) but also thousands of hours of raw data that demand huge human annotation cost.

In this work, we focus on improving the generalization performance of two subtasks in autonomous driving systems: 1) robust traffic sign classification and 2) object detection. Specifically, we make the following contributions.

Fig. 1. System overview. (a) A DNN model's accuracy can degrade significantly in adverse weather. We propose two training schemes to enhance model generalization in adverse weather, targeting two scenarios: (b) only clean training data available (REIN) and (c) extra unlabeled training data in adverse conditions available (ST).

1) First, we build a comprehensive RobuTS dataset to facilitate robust traffic sign recognition research. The dataset contains a large number of traffic sign images under four environmental conditions with varied variation intensities, e.g., rain, fog, day/night lighting, and blurring. The dataset will be open sourced for research purposes.
2) We benchmark a state-of-the-art NN-based traffic sign classifier on our RobuTS dataset, which effectively exposes the weaknesses and performance flaws of the well-trained classifier, giving valuable feedback on potential enhancement directions.
3) We analyze the influence of different practical variations and summarize these practical variation models into one unified model. Guided by the unified model, we propose REIN: a robust and efficient training approach that significantly improves model generalization without the need for extra data during training.
4) Considering the potentially large amounts of unlabeled data available, we also propose the ST algorithm, which combines labeled clean data and unlabeled data under adverse conditions to enhance the model's generalization capability during training.
5) Finally, we implement our training methods and evaluate the generalization enhancement on both our RobuTS dataset and the common Foggy-Cityscapes detection benchmark, demonstrating our methods' effectiveness and potential in autonomous driving systems.

Experiments show that on our RobuTS classification dataset (rain, fog, darkening, and bokeh/motion blurring), the performance of a well-trained model with 96% accuracy can dramatically degrade by 40%–60%. In such cases, both the REIN and ST training methods bring significant generalization improvement, achieving consistently better performance with +15%–25% and +16%–30% accuracy improvement, respectively, in all practical scenarios. Meanwhile, on object detection tasks, such as vehicle/pedestrian detection, our ST also brings a +10.1 mean average precision (mAP) improvement, outperforming recent state-of-the-art works by +2.2 mAP. Besides, REIN and ST require no extra training data and no extra annotations, respectively, in these scenarios, saving substantial on-road driving data collection cost and human annotation effort. To interpret the improvement of our training methods, we also conduct gradient analysis and NN visualization, showing that our generalization-enhanced model better captures the main features of the image content, which could be the underlying reason for the generalization improvement.

II. PRELIMINARY

In this section, we briefly review the background of deep NNs applied in autonomous driving systems and introduce several common practical variations of traffic sign images in on-road driving scenarios.

A. Neural Networks in Autonomous Driving

There are several major subtasks in autonomous driving systems, including image classification tasks, like traffic sign classification; object detection tasks, like vehicle/pedestrian detection [14]; lane detection tasks [15]; etc. Recently, with the fast development of deep learning, NN-based models have been widely utilized and achieve great performance in such tasks [16], [17].

In this work, we take both the image classification and object detection subtasks in autonomous driving systems as our target applications. Taking the classification task as an example, the convolutional NN is the most popular model due to its exceptional performance in image classification tasks [18]. Currently, in clean testing scenarios, a state-of-the-art traffic sign classification model [19] can usually achieve over 99% accuracy on the German Traffic Sign Recognition Benchmark (GTSRB) [20]. Such performance is considered the primary achievement of NNs in autonomous driving systems.


However, such strong models often show low accuracy when applied to practical cases with unexpected variations, and so does the state-of-the-art detection model.

B. Performance Degradation With Practical Variations

The input examples of real-world traffic signs usually contain various kinds of practical variations, such as rain [4], fog [5], different ambient light conditions [11], camera/sensor discrepancy [21], etc. However, these unexpected practical variations are usually beyond the clean training dataset and thus can easily corrupt the NN classification accuracy. For example, previous works have demonstrated thousands of erroneous behaviors across three top-performing DNNs in the Udacity self-driving car challenges due to such practical variations [7], [22]. Once such erroneous behaviors occur, the steering or speed modules can fail to work correctly, and the performance degradation would cause critical safety issues in real-world autonomous driving systems. Several autonomous driving accidents due to self-driving system errors have already been reported during the last few years [8]–[10], which has raised customers' great concern about autonomous driving security.

Based on previous research, several representative practical driving-variation scenarios can be categorized and are taken into consideration in our work.
1) Weather Conditions: Rain, fog/haze, snow, etc. [4], [5].
2) Ambient Light Variation: Day/night, cloudy, etc. [11].
3) Camera Discrepancy: Camera aging, blurring, etc. [21].
4) Others: Camera perspective, occlusion, etc. [21].

These environmental factors can cause different variations in the camera-captured street scene images. When the variation intensity is high, the NN-based classification or detection system is highly likely to mispredict the captured images. In the next section, we take the classification task as an example and choose four representative practical variations to demonstrate the current flaws of NN-based systems. We then introduce our RobuTS dataset, which helps evaluate and demonstrate the influence of such practical variations on the NN-based classification system.

III. ROBUTS: ROBUST TRAFFIC SIGN RECOGNITION DATASET

In this section, we introduce our RobuTS dataset for robust traffic sign recognition. We choose four representative driving scenarios, i.e., the commonly seen rain and fog as the weather cases, the darkening effect as the ambient light case, and bokeh/motion blurring as the camera discrepancy case. We use image synthesis to generate the new RobuTS dataset with seed images from the GTSRB dataset [20]. To do so, we model the four different practical variations in detail based on previous research and synthesize new traffic sign images with different variations at varying intensities. After synthesizing the new dataset, we benchmark a state-of-the-art NN-based classifier on the new RobuTS test sets and demonstrate the model's poor generalization capability, as it shows significant performance degradation.

A. Practical Variation Modeling With Traffic Signs

1) Rain Variation: Physically, raindrops are uniformly distributed in space, and the drop size follows the Marshall–Palmer distribution for a given rain rate [4], [12]. During the camera exposure time t_cam, a drop with speed v_rain covers a distance of length l_rain = t_cam × v_rain [12]. The color of raindrops is white with certain opacity α due to motion blur [12]. Therefore, the rain variation can be modeled as

X_rain = M(p) × X_org + α(1 − M(p)) × Rain (1)

where X_org and X_rain denote the original and the observed image, M(p) denotes the raindrop position matrix with pixel index p, and α is the raindrop opacity. ω, the number of nonzero positions in M(p), denotes the number of raindrops and is thus used to simulate different rainfall intensities. The synthesized rainfall images are shown in Fig. 2(a).

2) Fog Variation: Based on atmospheric optics [13], the observed color of a camera-captured image in the presence of fog/haze can be modeled as follows:

X_fog = (1 − t(p)) × X_org + t(p) × A (2)

where A = (A_r, A_g, A_b)^T is the global atmospheric light that represents the ambient light in the atmosphere. Here, t(p) grows with the scene depth and fog intensity, since fog and long transmission distances scatter and attenuate the light: the larger t(p) is, the vaguer the image becomes. Therefore, we use the parameter t(p) to control the fog intensity. The synthesized foggy images are shown in Fig. 2(b).

3) Darkening Effect: One important factor for driving is sufficient ambient light. In our simulation, we change the image brightness to simulate the darkening effect. In image processing, adjusting brightness is equivalent to adding a constant offset to each of the R, G, B channels of an image. Therefore, to get a darker image, we add a negative constant to every channel. The ambient light condition can be modeled as

X_light = X_org ± β × A (3)

where β controls the darkening intensity. The synthesized images with the darkening effect are shown in Fig. 2(c).

4) Bokeh/Motion Blurring: During camera exposure, bokeh blur is one common phenomenon; it is the aesthetic quality of the blur produced in the out-of-focus parts of an image produced by a lens [23]. Meanwhile, fast movement and car vibration also introduce blurring effects. In our simulation, we use Gaussian blur to simulate the blurring effect, which is generated by convolving an image with a Gaussian kernel. The blurring effect can be modeled as

X_blur = Gaussian(δ) ∗ X_org (4)

where Gaussian(δ) denotes the Gaussian kernel with zero mean and standard deviation δ, which controls the variation intensity, and ∗ denotes the convolution operation. The synthesized blurred images are shown in Fig. 2(d).

Authorized licensed use limited to: SASTRA. Downloaded on March 21,2023 at 06:35:19 UTC from IEEE Xplore. Restrictions apply.
YU et al.: REIN RobuTS: ROBUST DNN-BASED IMAGE RECOGNITION IN AUTONOMOUS DRIVING SYSTEMS 1261

Fig. 2. Practical variation modeling and synthesis effects in four scenarios: rain drops, fog variation, darkening effect, and bokeh/motion blurring. Even at the highest variation intensity (0 → 1), the traffic sign images are still relatively clear to human drivers, but the NN-based classifier's accuracy drops significantly, by 40%–60%, in all four scenarios.
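The four variation models above are simple pixel-level operations. The following NumPy sketch implements Eqs. (1)–(4); it is a minimal illustration rather than the paper's released synthesis code, and the white rain layer, atmospheric light A = 0.9, drop-density constant, and streak length are our own assumed parameter choices (images are float arrays in [0, 1]).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def add_rain(x_org, intensity, alpha=0.8, drop_len=4):
    # Eq. (1): blend white streaks into the image at raindrop positions M(p).
    h, w, _ = x_org.shape
    n_drops = int(intensity * 0.05 * h * w)      # omega: number of raindrops
    mask = np.ones((h, w, 1), dtype=np.float32)  # M(p): 1 keeps the original pixel
    ys = np.random.randint(0, h - drop_len, n_drops)
    xs = np.random.randint(0, w, n_drops)
    for yy, xx in zip(ys, xs):
        mask[yy:yy + drop_len, xx] = 0.0         # vertical streak ~ t_cam * v_rain
    rain = np.ones_like(x_org)                   # white rain layer with opacity alpha
    return mask * x_org + alpha * (1.0 - mask) * rain

def add_fog(x_org, t, A=(0.9, 0.9, 0.9)):
    # Eq. (2): blend toward the global atmospheric light A with fog weight t(p).
    return (1.0 - t) * x_org + t * np.asarray(A, dtype=np.float32)

def darken(x_org, beta, A=1.0):
    # Eq. (3): subtract a constant offset from every channel (minus sign = darker).
    return np.clip(x_org - beta * A, 0.0, 1.0)

def blur(x_org, delta):
    # Eq. (4): convolve with a zero-mean Gaussian kernel of standard deviation delta.
    return gaussian_filter(x_org, sigma=(delta, delta, 0))
```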

B. RobuTS Dataset Synthesis

After the modeling, we can generate our new robust traffic sign recognition dataset containing traffic sign images with the aforementioned variations. As demonstrated above, the intensity of each variation model can be controlled by a parameter with a specific physical meaning. Therefore, we synthesize images of varying intensities for each variation in order to analyze the full-spectrum environmental influence.

Synthesis Setup: Our seed traffic sign images are from the GTSRB dataset [20], containing 39 209 training and 12 630 test samples covering 43 classes of traffic sign images (32×32×3). For each seed image, we synthesize new images with the four types of variations. As for the intensity in each scenario, the upper-bound intensity is selected so that the traffic signs are still clearly visible to a human observer in the most intense setting. Then, for each seed image, we set 20 intervals for the intensity in that range and generate 20 images with different intensities to cover the full-spectrum variation change for our research analysis. The intensity is then normalized to the range of 0 to 1. For all generated traffic sign images with different physical variations, the labels are kept consistent with the original ones.

Dataset Overview: After synthesis, our RobuTS dataset contains one million traffic sign images with different synthesized physical variations (784 180 training and 252 600 testing) beyond the seed dataset GTSRB. For each traffic sign image, besides the original normal version, there are four other driving scenarios: 1) rain; 2) fog; 3) darkening; and 4) blurring, with 20 different variation levels for each scenario. The RobuTS dataset and synthesis code will be open sourced for public research upon publication.

Our RobuTS dataset is a comprehensive dataset for evaluating an NN's robust traffic sign recognition capability under various practical variations. For a traffic sign classification system, evaluation on RobuTS can help the developer find the system's potential weaknesses and performance flaws, giving valuable feedback on enhancement directions. As an example, in the next section, we benchmark a well-trained NN from the Udacity Self-Driving Challenge [22] on our RobuTS test set, demonstrating its non-negligible weakness under adverse practical conditions.

C. Benchmarking Neural Networks on the RobuTS Dataset

In this section, we benchmark the state-of-the-art NN classifier on our RobuTS dataset. By testing with the four different variations, we demonstrate such an NN-based classifier's potential weaknesses in practical driving scenarios.

Benchmarking Setup: We adopt the CNN-based classifier from the Udacity Self-Driving Challenge [22]. The model is trained on the original GTSRB training set and achieves 96.0% accuracy on the original clean test set. The model is then evaluated on our RobuTS test set within all four scenarios. For each variation scenario, 20 different variation intensity levels are tested, and the results are shown in Fig. 2(4).

Results and Analysis: Overall, the model's classification accuracies in all test scenarios demonstrate large degradation, e.g., from 96% to below 50% on average.
1) Rain Variation: Rain variation shows the largest influence on accuracy even in the very beginning stage. The testing accuracy drops to 56% (−40%) at half variation intensity, and finally arrives at 50% (−46%).
2) Fog Variation: Fog variation seems less influential than rain, but it still causes the model accuracy to drop to 60% (−36%) at full intensity.
3) Darkening Variation: The accuracy drop is similar to that of fog variation, but darkening causes a relatively larger drop, finally reaching 45% (−51%).
4) Blurring Variation: The blurring effect hurts the image edges most, and the accuracy finally drops to the lowest value of 20% (−76%) with full-intensity blurring.

Clearly, the NN-based classifier demonstrates significant performance drops when facing practical variations, although the traffic sign images are still highly distinguishable to human observers even at the highest variation intensity. We show several example images with three levels of intensity (0 → 0.5 → 1) in each scenario [Fig. 2(3)], where the traffic signs still appear relatively clear. This proves the NN-based system's susceptibility to practical variations and also implies the NN's limited generalization capability in complex real-world scenarios. To enhance the generalization capability of such NN-based traffic sign classifiers, in the next section, we analyze the underlying reasons and propose our REIN training algorithm.
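The benchmarking sweep itself is a small loop: for each scenario, apply the variation at 20 normalized intensity levels and record accuracy, as in Fig. 2. A sketch under the same assumptions as above (model.predict and the synthesis helpers are our hypothetical names, not the paper's released code):

```python
import numpy as np

def benchmark(model, seeds, labels, variation_fn, n_levels=20):
    # Accuracy versus normalized variation intensity (0 -> 1).
    accs = []
    for level in np.linspace(0.0, 1.0, n_levels):
        varied = np.stack([variation_fn(img, level) for img in seeds])
        accs.append(float((model.predict(varied) == labels).mean()))
    return accs
```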


IV. REIN: ROBUST AND EFFICIENT TRAINING WITHOUT COLLECTING ADVERSE DATA

In this section, we first analyze the underlying mechanism by which these variations influence the NN. We then define a unified model that uses the first-order gradient to evaluate the variation influence, as well as the NN's generalization capability. Guided by it, we formulate a new training loss function to enhance model generalization by regulating the first-order gradient magnitude, and we use the double-backpropagation technique [24] in the training process for the required second-order gradient calculation. Finally, we give an overview of our generalization-enhanced training algorithm.

A. Theoretical Generalization Problem Abstraction

Suppose we have a natural image input x, e.g., a 32×32×3 traffic sign image. The practical variation can be denoted by Δx. Note that Δx is usually small or moderate and will not influence the main pattern of the original image x. The NN can be seen as a large-scale nonlinear function F_θ(x) composed of massive volumes of neurons, where θ denotes the network's weights and biases. As a result, an NN classification failure case can be denoted as

F_θ(x + Δx) ≠ F_θ(x) (5)

which means that, with some variation Δx added, the classification result of the NN changes to a wrong label that differs from the original correct prediction.

According to the first-order Taylor expansion, we can linearly approximate the NN function F_θ(x + Δx) in the neighborhood of x as

F_θ(x + Δx) = F_θ(x) + (∂F_θ(x)/∂x) × Δx (6)

where ∂F_θ(x)/∂x is the first-order gradient of the function F_θ(x). Combining (5) and (6), we can then calculate the influence on the NN brought by Δx as

Effect(Δx) = F_θ(x + Δx) − F_θ(x) = (∂F_θ(x)/∂x) × Δx. (7)

From (7), we see that the influence of the variation Δx on the NN F_θ(x) is approximately linearly correlated with the gradient ∂F_θ(x)/∂x. In other words, gradients can be seen as the amplification coefficients of the small variation Δx: the larger the gradients are, the more influence the small variations will have.

Fig. 3 illustrates the relationship between gradients and the influence of Δx by showing two NNs' decision boundaries. A network with larger first-order gradients [Fig. 3(a)] forms a function surface with steeper slopes (gradients). For one natural traffic sign image of class Priority Road, if we add some small variation into the image, the network with larger gradients is more susceptible to misclassifying it: the rain variation's influence might push the output across the decision boundary (violet surface) and change the classification result to Yield. In contrast, an NN with a smoother decision boundary [Fig. 3(b)] is more resistant to such small variations, because all of its decision-surface neighbors remain in the same class.

To conclude, the network's resistance to a small variation Δx is inversely proportional to its gradient magnitude. That is to say, an NN with high generalization ability should have first-order gradients that are as small as possible. Therefore, in the next section, we introduce gradient regularization, i.e., regulating the network's gradient magnitude to be small in the training process for generalization enhancement.

B. Generalization-Enhanced Training Loss Formulation

As aforementioned, large first-order gradients can amplify small variations and then influence the classification results. Therefore, in the training process, we should enforce network gradients to be as small as possible while maintaining the same accuracy. As the loss function is used in the training process to update the network parameters to satisfy certain constraints, we introduce a gradient loss penalty L_grad into the network training loss function:

L_θ(x) = L_ce + c · L_grad, where L_grad = L_norm(∂Reg_θ(x)/∂x) (8)

where θ is the network parameters, L_ce is the normal cross-entropy loss, and L_grad is the gradient loss penalty we add. ∂Reg_θ(x)/∂x is the Jacobian matrix of our regularizer function Reg_θ(x), and L_norm can be the l1, l2, or l∞ norm. The coefficient c here adjusts the gradient loss penalty strength so that it does not harm the accuracy in the training process.

As for the regularizer function Reg_θ(x), it could be the softmax outputs or the logit outputs of the network function F_θ(x). Here, we choose the logit outputs since they preserve most of the useful information before the softmax operation. The regularizer function is defined as follows:

Reg_θ(x) = max{Z(x)_i, i ≠ t} − Z(x)_t (9)

where Z(·) is the logit output before the softmax layer, and t is the input x's correct label. This regularizer function Reg_θ(x) can be interpreted as the difference between the maximum wrong logit and the correct one. As long as we keep the gradients of this function small, the influence of the small variation Δx on this function is limited as much as possible. That is to say, the wrong logits will not easily exceed the correct ones and thus will not cause wrong classifications. Therefore, the network's resistance to small variations Δx can be improved to the maximum extent.

C. Gradient Descent With Double Backpropagation

The general network training process is usually done by the stochastic gradient descent algorithm. In normal gradient descent, every parameter is updated using the following equation:

θ' = θ − lr · (∂L_ce/∂θ) (10)

where lr is the learning rate. However, our defined training loss function includes the first-order gradient penalty L_grad.


Fig. 3. REIN: robust training with gradient regularization. An NN model should have a smooth decision boundary so that small variations (e.g., rain) cannot change its prediction easily. Through gradient regularization, i.e., regulating gradient magnitudes in the training loss, we can improve the NN's decision boundary smoothness, thus enhancing the NN's generalization capability.

As a result, different from normal gradient descent problems, introducing the gradient loss into the training loss requires solving a second-order gradient computation problem.

To compute the second-order gradients, we adopt the double-backpropagation technique [24]. In double backpropagation, we first compute the cross-entropy and gradient loss by forward propagation, with the gradients then calculated by backpropagation. Then, to minimize the gradient loss L_grad, we need to calculate the second-order derivative of L_grad. Therefore, a second backpropagation operation is performed to compute the second-order derivative of L_grad with respect to θ. After this, the weights of the NN are updated according to the gradient descent equation

θ' = θ − lr · (∂L_ce/∂θ) − lr · (∂[L_norm(∂Reg_θ(x)/∂x)]/∂θ) (11)

where −(∂L_ce/∂θ) is the first-order gradient that minimizes the cross-entropy, and −(∂[L_norm(∂Reg_θ(x)/∂x)]/∂θ) is the second-order partial derivative that minimizes the gradient loss.

D. Generalization-Enhanced Training Overview

In summary, our generalization-enhanced training method introduces a new gradient penalty loss into the normal training procedure to regulate the gradient magnitude to be as small as possible, and the required second-order gradients are computed with the double-backpropagation algorithm. As shown in Fig. 3, during REIN training, the model's large gradients are penalized throughout the whole training procedure, so the final decision boundary is smoother than that of the naturally trained model. Meanwhile, the model accuracy can be well preserved by controlling the penalty coefficient c. As a result, we can effectively improve the network's generalization ability across practical varied scenarios. Furthermore, one advantage of our algorithm is that it does not need scenario-specific training data. In autonomous driving systems, this saves thousands of miles of on-road driving data collection, which has great practical significance. As for the extra cost of our training algorithm compared to natural training, double backpropagation costs one more backpropagation per training iteration, but no other training overhead is introduced. Meanwhile, we maintain a convergence speed of the same order as natural training without sacrificing training efficiency, as we will show in the later experiments.
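The paper's implementation uses TensorFlow 1.6; as a minimal sketch of the same pattern, the PyTorch step below computes Reg_θ(x) from Eq. (9), penalizes its input gradient per Eq. (8), and relies on create_graph=True so that loss.backward() performs the second backpropagation of Eq. (11). The l2 choice for L_norm and the penalty weight c = 0.1 are illustrative; the paper leaves both configurable.

```python
import torch
import torch.nn.functional as F

def rein_step(model, x, y, optimizer, c=0.1):
    x = x.clone().requires_grad_(True)
    logits = model(x)                                   # Z(x)
    ce = F.cross_entropy(logits, y)                     # L_ce

    # Reg(x) = max wrong logit - correct logit, per sample (Eq. 9).
    correct = logits.gather(1, y[:, None]).squeeze(1)   # Z(x)_t
    wrong = logits.scatter(1, y[:, None], float('-inf')).max(dim=1).values
    reg = (wrong - correct).sum()

    # First backward pass: dReg/dx, kept differentiable for double backprop.
    grad_x, = torch.autograd.grad(reg, x, create_graph=True)
    l_grad = grad_x.norm(p=2)                           # L_norm of the Jacobian

    loss = ce + c * l_grad                              # Eq. (8)
    optimizer.zero_grad()
    loss.backward()                                     # second backward pass (Eq. 11)
    optimizer.step()
    return ce.item(), l_grad.item()
```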
V. SELF-TEACHING: ROBUST TRAINING WITH UNLABELED PRACTICAL DATA

The robust training algorithm REIN achieves general robustness against various variations without using any extra training data with practical variations. In many real cases, however, raw practical training data can be available through onboard recording cameras and sensors. For example, to facilitate autopilot system development, Tesla has announced a data sharing policy that includes collecting driving scene videos on customers' cars in order to improve self-driving performance [25]. With millions of customers driving on the road, collecting large amounts of raw data within all kinds of scenarios becomes simple. However, the labeling cost for these raw but complex driving data can be very high. This leaves a huge amount of unlabeled data from all practical scenarios unused, which has great potential to improve the self-driving system's performance.

Targeting this problem, in this section, we propose a novel training method to utilize such unlabeled data: ST. With our ST algorithm, we can further improve the NN's robustness against practical variations by utilizing unlabeled practical data.

A. Self-Teaching Overview

The ST algorithm originates from the semisupervised learning problem where the data labeling is incomplete, i.e., part of the data is labeled but the rest is unlabeled [26], [27]. The main idea is to first use the labeled data to train the learning system, and then use the trained system itself as a teacher to (pseudo-)annotate the unlabeled data. Finally, by combining data with real labels and data with pseudo-labels, the learning system can be retrained to achieve better performance. This process can be iterated multiple times until convergence.

Problem Setting: In our problem, for the labeled training data, we assume all the clean traffic sign images in GTSRB are available with labels, denoted as (X_s, Y_s).


Fig. 4. Overview of the ST algorithm. (a) In the initial step, we first use the labeled clean data to train an initial annotator model and predict initial pseudo-labels on the adverse data. (b) Then, both clean and adverse data are involved in ST. During the iterative ST, the model performance improves; therefore, the pseudo-labels are also relabeled by the better annotator to further improve the ST performance.

In contrast, all the traffic sign images with synthesized practical variations are assumed to be realistic data collected on-road but without annotations. This part of the data is denoted as (X_t, ?), where ? denotes unknown labels. Our major task is to train an effective and generalizable NN-based classifier F that performs classification well on both the regular data X_s and the data with variations X_t.

Algorithm Overview: Fig. 4 shows the overview of our ST algorithm, which is an iterative training process. In the first iteration, we use the labeled data (X_s, Y_s) to train the classifier with the regular cross-entropy loss, i.e.,

Minimize Σ_{i=1..N} [−y_s log F(x_s) − (1 − y_s) log(1 − F(x_s))] (12)

where N is the number of labeled images.

After the initial training process, we obtain a well-trained classifier F, which is then used to generate pseudo-labels Y_t on the unlabeled raw data X_t. However, directly applying the classifier F to the new data can generate erroneous pseudo-labels due to the intense practical variations. Without precautions, these wrong pseudo-labels can significantly hinder the following ST process. Therefore, we conduct confidence-based thresholding to choose the most confident pseudo-labels, as they are most likely to be correct. Specifically, the pseudo-label choosing criterion depends on whether the pseudo-label's largest confidence is higher than a predefined confidence threshold θ. If so, the sample is labeled as the class with the highest confidence; otherwise, it is not annotated and remains unlabeled in the following iteration:

y_t = { argmax_i(p_i),  if max(p_i) > θ
      { none,           otherwise (13)

where p_i is the predicted probability of each class i for image x_t, and y_t is the output pseudo-label.

After the pseudo-label annotation process, we obtain the extra pseudo-labeled data (X_t, Y_t) from all possible scenarios. Here, Y_t denotes the pseudo-labels generated by the previous annotating process. With these extra images with various practical variations, we can then conduct retraining by combining both data sources (X_s, Y_s) and (X_t, Y_t) to improve classifier F's performance on such new data:

Minimize Σ [−y_s log F(x_s) − (1 − y_s) log(1 − F(x_s))]
       + Σ [−y_t log F(x_t) − (1 − y_t) log(1 − F(x_t))]. (14)

Through the combined learning process, the classifier can extract knowledge from not only the regular data but also the practical data with various scenario variations, greatly enhancing its generalization capability and improving accuracy in these unlabeled practical scenarios.

The above process completes the first iteration of the ST algorithm. Since we obtain a better classifier F than the initial one, more accurate pseudo-labels can also be obtained by applying the new classifier to reannotate the data. Therefore, the ST algorithm reiterates the annotation process and reconducts the data-combined training to get increasingly better classifiers and pseudo-labels. The iteration stops when the upper-bound performance is achieved. The pseudocode of the ST algorithm is given in Algorithm 1.

As we can see, throughout the iterations, the classifier F acts as the teacher that generates pseudo-labels as its own supervision, while it also acts as the student that relearns from the real-labeled and pseudo-labeled supervision; hence the name self-teaching. ST can greatly boost the classifier performance by involving large amounts of realistic data with practical variations. Meanwhile, since the training data then covers as many practical scenarios as possible, the generalization capability of the classifier is also greatly enhanced, as we will show in the later experiments.
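One ST cycle can be sketched as follows: pseudo-label the adverse data with the thresholding of Eq. (13), then retrain on the union of real and pseudo labels as in Eq. (14). train_fn stands in for ordinary cross-entropy training; it and the other names are our placeholders, not an interface defined by the paper.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, ConcatDataset

@torch.no_grad()
def pseudo_label(model, x_t, thresh):
    # Eq. (13): keep only predictions whose max softmax confidence exceeds thresh.
    probs = F.softmax(model(x_t), dim=1)
    conf, y_hat = probs.max(dim=1)
    keep = conf > thresh
    return x_t[keep], y_hat[keep]

def self_teach(model, train_fn, x_s, y_s, x_t, thresh=0.9, n_iters=5):
    clean = TensorDataset(x_s, y_s)
    train_fn(model, clean)                           # initial annotator (Eq. 12)
    for _ in range(n_iters):
        x_p, y_p = pseudo_label(model, x_t, thresh)  # re-annotate adverse data
        union = ConcatDataset([clean, TensorDataset(x_p, y_p)])
        train_fn(model, union)                       # combined retraining (Eq. 14)
    return model
```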
as we will show in the later experiments.
B. Self-Teaching Optimizations

ST improves the classifier's performance mainly by training with new images and their generated pseudo-labels. However, the pseudo-labels generated by the classifier inevitably contain certain errors. Even with confidence-based thresholding, some erroneous pseudo-labels can still exist. When applying naive ST training, these wrong labels can cause error accumulation.


During the following training process, some wrong pseudo-labels are used as ground-truth labels, and the classifier F further reinforces its wrong predictions on such images. Even worse, more classification errors will appear on similar images due to the wrong supervision.

To avoid such error accumulation, we further propose three optimizations for ST: 1) curriculum learning; 2) imbalanced sampling; and 3) progressive confidence-based thresholding.

Algorithm 1 ST Algorithm
1: procedure INITIALIZATION (A)
2:   Input: Clean data [x_s, y_s], Adverse data [x_t, ?]
3:   Initialize model F
4:   while not converged do
5:     Train F using [x_s, y_s];
6:     Update F according to Eq. (12);
7:   end while
8:   Get pseudo-labels y_t using F(x_t) by Eq. (13)
9:   Return [x_t, y_t]
10: end procedure
11: procedure ITERATIVE SELF-TEACHING (B)
12:   Input: Clean data [x_s, y_s], Adverse data [x_t, y_t]
13:   Initialize model F'
14:   while not converged do
15:     Train F' using [x_s, y_s] and [x_t, y_t];
16:     Update F' according to Eq. (14);
17:   end while
18:   Update pseudo-labels y_t using F'(x_t) by Eq. (13)
19:   Retrain the model and iterate until convergence.
20:   Return model F'
21: end procedure

Curriculum Learning: To avoid the overwhelming effects of erroneous pseudo-labels, we first optimize the ST algorithm following the practice of curriculum learning. When combining the pseudo-labeled dataset into ST, we involve the pseudo-labeled training data in an easy-to-hard order. That is, we first introduce images with smaller-intensity variations, since their pseudo-label accuracy is higher than that of images with more intense variations. Once the model has learned to classify these images well, it can also generate more accurate pseudo-labels on images with higher-intensity variations. Therefore, we gradually introduce more images with higher-intensity variations, i.e., a curriculum ST practice. As we will show later, under scenarios where the pseudo-label accuracy is relatively low, curriculum learning is essential to ensure a significant improvement from the ST algorithm.

Imbalanced Sampling: After following the curriculum learning practice, we still have two data sources: 1) the data with accurate real labels and 2) the data with pseudo-labels, which may not be perfectly accurate. By default, real- and pseudo-labeled samples are drawn randomly from the two data sources. However, when the pseudo-labels contain too many errors, the gradient descent on one mini-batch can produce overwhelming wrong gradients that hurt the training performance. Therefore, when combining the two data sources during training, we conduct imbalanced sampling, i.e., we oversample the real labels and undersample the pseudo-labels in every mini-batch. Thus, we ensure that most image samples in every mini-batch are accurately labeled and produce useful gradient information. With the imbalanced sampling optimization, the ST process achieves better performance than naive random sampling, as we will show later.

Progressive Confidence Thresholds: The last optimization concerns the confidence threshold. The confidence threshold θ controls the pseudo-label quality: a higher θ produces more accurate pseudo-labels, and vice versa. But a higher θ also causes another issue: most data will remain unlabeled, since their pseudo-label confidence cannot reach the threshold in (13). To trade off pseudo-label quality and amount, we propose progressive confidence thresholds. In the beginning iterations, we set a high confidence threshold θ to ensure high label quality. With more training iterations, the classifier F progressively learns to classify more unlabeled data correctly. Thus, in later iterations, we can lower the threshold to include a larger amount of new training data and further enhance the model performance.
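The latter two optimizations map onto standard PyTorch data utilities, as sketched below. The 3:1 real-to-pseudo sampling weight and the linear 0.95 → 0.70 threshold decay are our illustrative choices; the paper does not fix these values.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

def make_imbalanced_loader(x_s, y_s, x_p, y_p, batch_size=128, real_weight=3.0):
    # Oversample real labels vs. pseudo-labels inside every mini-batch.
    x = torch.cat([x_s, x_p])
    y = torch.cat([y_s, y_p])
    weights = torch.cat([torch.full((len(x_s),), real_weight),
                         torch.ones(len(x_p))])
    sampler = WeightedRandomSampler(weights, num_samples=len(x), replacement=True)
    return DataLoader(TensorDataset(x, y), batch_size=batch_size, sampler=sampler)

def progressive_threshold(it, n_iters, hi=0.95, lo=0.70):
    # Start strict for label quality, relax later for label quantity.
    return hi - (hi - lo) * it / max(n_iters - 1, 1)
```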
Overall, REIN and ST are both robust training algorithms for enhancing the generalization capability of NN-based traffic sign classification systems. Compared to the REIN training algorithm, which does not require any extra training data in adverse conditions, self-teaching provides an alternative: utilizing the potentially available unlabeled training data to enhance generalization, which achieves higher performance while still avoiding the huge annotation cost. In the next two sections, we evaluate both algorithms and demonstrate their performance improvement in terms of generalization enhancement.

VI. EXPERIMENTS AND EVALUATION FOR REIN: ROBUST TRAINING WITHOUT PRACTICAL DATA

In this section, we evaluate the performance of our robust training method REIN on traffic sign classification tasks.

Experiment Setup: We use the GTSRB dataset as the source dataset for model training. Before training, data augmentation techniques, including scaling, random cropping, and rotating, are used. The base model is a convolutional NN similar to the one used in Section III. Two models are then trained, one by natural training and one by our generalization-enhanced training method REIN, using the Momentum optimizer with a 5e-3 learning rate in TensorFlow 1.6 [28]. All other training configurations are kept the same for a fair comparison. The two trained models, named the natural model and the generalization-enhanced model, achieve 95.8% and 94.4% accuracy, respectively. On clean testing images, the robust training method is slightly lower (−1.4%) than regular model training due to the gradient penalty, which we will discuss later. But under all practical scenarios with variations, we will demonstrate that our method greatly outperforms the natural model by +15%–25%.


Fig. 5. REIN effectiveness evaluation: natural model versus generalization-enhanced model by REIN. With four different kinds of practical variations on traffic
sign images, our generalization-enhanced model consistently outperforms the natural model by large margins, achieving +15%–25% accuracy improvement
at the highest variation intensity.

A. Generalization Accuracy Under Practical Variations

To evaluate the generalization improvement of the REIN algorithm, we compare the two models' (the natural model and the generalization-enhanced model) classification accuracies in four scenarios of our RobuTS dataset: 1) rain variation; 2) fog variation; 3) darkening effect; and 4) bokeh/motion blurring. For each scenario, we test all 20 levels of variation intensity from 0 → 1, and the final results are shown in Fig. 5. To better visualize the improvement of our REIN algorithm, we use the light blue line to show the absolute accuracy improvement of our generalization-enhanced model over the natural model, calculated as Acc_general − Acc_natural.

Results and Analysis: As Fig. 5 shows, our generalization-enhanced model outperforms the natural model in all four scenarios by an average of +15%–25% accuracy.
1) Rain Variation: Our model achieves higher accuracy under all rain variation intensity settings (at most +17%), as shown in Fig. 5(a). In detail, as we increase the rain variation intensity (i.e., the number of raindrops) in test images, both models' accuracies degrade, but our robust model clearly outperforms the natural one by large margins. The natural model's accuracy drops from 95.8% to 45%, while our model's accuracy only drops from 94.4% to 62%, showing much more resistance to the rain variations.
2) Fog Variation: As Fig. 5(b) shows, our model achieves at most +25% accuracy improvement over the original model, indicating a large extent of generalization enhancement. Also, as the upward trend of the accuracy improvement shows, our model may achieve an even higher accuracy gain if the fog intensity is further increased, demonstrating much higher tolerance and robustness than the natural model.
3) Darkening and Blurring: In all other test scenarios, our generalization-enhanced model also consistently shows better performance, achieving an average +15%–23% accuracy improvement, as shown in Fig. 5(c) and (d). But different from fog variations, the accuracy gain in these two scenarios seems to saturate at the highest variation intensity. We hypothesize this is because the highest-intensity variation in these cases has affected the original image appearance too much, which significantly hinders the original NN classification process, e.g., accuracies of only 20%–40%.

In such cases, our robust training cannot provide further performance gain, since the image content has been modified too much to recognize the type of the traffic sign.

Overall, our REIN robust training algorithm helps the model achieve consistently better performance in all tested scenarios. Also, it is important to note that the whole REIN training algorithm requires no extra data, i.e., no data from these tested scenarios are used during the training process. Therefore, the generalization enhancement brought by our training method is indeed model intrinsic and can potentially generalize to more scenarios than those tested here.

Fig. 6. Natural and adversarial examples (with adversarial noises) look very similar, but adversarial examples can easily mislead the NN into misclassification.

B. Extreme Corner Case Evaluation: Adversarial Noises

Recently, adversarial noises have been shown to influence traffic sign classification systems in the real world [29], and they can be considered extreme corner cases of generalization testing [30], [31], as shown in Fig. 6. To evaluate our algorithm's effectiveness in such cases, we use recently proposed adversarial attack algorithms to generate different intensities of adversarial noises and evaluate our models' performance under such variations as before.

1) Adversarial Variation Generation Mechanism: Adversarial variation generation includes a family of attacking algorithms, e.g., the C&W attack [32], which can mislead an NN using a very small variation. To be general, we test two state-of-the-art attacking algorithms: 1) the fast gradient sign method (FGSM) and 2) the basic iterative method (BIM) [33].


Specifically, the FGSM attack model is defined as

X_adv = X_org + ε × Sign(∂Loss_θ(x)/∂x) (15)

where Loss_θ(x) denotes the C&W loss [32] in our experiment, and ∂Loss_θ(x)/∂x is the gradient of the loss with respect to the input image itself. The Sign() function returns the sign matrix of the input matrix, and ε here controls the variation intensity (attack strength). Based on FGSM, BIM iteratively calculates the above equation and clips the perturbation to the given range (0 ∼ ε).
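For reference, both attacks are a few lines of PyTorch given Eq. (15). As a sketch, we substitute a generic loss_fn for the C&W loss used in the paper, and the 10-step schedule and l-infinity clipping in BIM are common conventions rather than settings reported by the authors.

```python
import torch

def fgsm(model, loss_fn, x, y, eps):
    # Eq. (15): one signed-gradient step of size eps.
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).detach()

def bim(model, loss_fn, x, y, eps, steps=10):
    # BIM: iterate small FGSM steps, keeping the total perturbation within eps.
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = fgsm(model, loss_fn, x_adv, y, eps / steps)
        x_adv = x + (x_adv - x).clamp(-eps, eps)
    return x_adv
```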
As shown in Fig. 6, the original images and the adversarially attacked images generated by FGSM and BIM are shown with their classification results. With some small or even indistinguishable distortions, these adversarial images successfully mislead the NN into misclassification. Thus, the small distortions found by attacking algorithms can be considered the worst-case but most influential variations. Therefore, we use these attack models to evaluate our generalization-enhanced training algorithm under extreme corner cases.

2) Generalization Under Extreme Corner Cases: We use the attack models in (15) and the GTSRB test set to generate new testing images with adversarial variations against both the natural model and our generalization-enhanced model. As before, we use different ε values to control the attacking strength, i.e., the variation intensity. The tested models are the natural model and the generalization-enhanced model trained by REIN. The final testing results are shown in Fig. 7.

Fig. 7. Classification accuracy in adversarial corner cases. (a) FGSM attack model. (b) BIM attack model.

Result and Analysis: As Fig. 7 shows, the adversarial noises affect the natural model's performance much more seriously than the four practical variations tested before. Even though the adversarial image with the highest intensity is still rather clear to a human (Fig. 6), the noises can cause the natural model's accuracy to drop to 0%–15%. Even at half intensity, e.g., under the BIM attack with ε = 0.4–0.5, the natural model's accuracy has already dropped to around 5%, which is nearly random guessing. In contrast, our generalization-enhanced model achieves much higher performance, i.e., 40%–50% accuracy. For the final results, our generalization-enhanced model outperforms the natural model by at most +30%–40% accuracy in both attacking scenarios. Under the worst attacking cases (the highest attacking strength), the REIN algorithm still provides a +23%–25% accuracy improvement over natural training, demonstrating its effectiveness in adverse scenarios.

Overall, under adverse conditions (including bad weather and even adversarial corner cases), our REIN algorithm grants the model nonnegligible generalization capability with high accuracy improvement. In the next part, we analyze the model's training gradients and Jacobian matrix to demonstrate why our REIN algorithm helps the model achieve such better performance than natural training.

C. Generalization Improvement Analysis and Discussion

Fig. 8. Gradients comparison: our proposed training method effectively regulates gradients to a small range, thus significantly improving the model's generalization capability.

1) Gradient Curve During Training Process: Fig. 8 shows the curve of the gradients' l2 norm in the natural training method and our proposed REIN training method. In the natural training process, the gradients keep increasing, and the gradients' l2 norm finally reaches 117.30. In contrast, in our proposed training method, the gradients' l2 norm is constrained to a much smaller range. The final value of ours is 0.100, roughly 1100× smaller than the natural training method, thanks to our gradient regularization loss design. According to our generalization analysis in Section IV, the gradient magnitude is inversely correlated with the generalization ability, i.e., the smaller the gradient is, the more robust the model will be to practical variations. It is thus reasonable that our model shows better generalization ability than the natural model, since our model's average gradient magnitude is much smaller.

2) Jacobian Matrix Interpretability: Besides the gradient theory, we also use visualization techniques to inspect the Jacobian matrix of both models. The Jacobian matrix of the NN is the gradient matrix with respect to the input image, which evaluates every pixel's contribution to the final prediction result. Therefore, it can be used to interpret which pixels the NN considers most important for its final decision.

In Fig. 9, we visualize some images' Jacobian matrices using the natural model and our generalization-enhanced model. As shown in Fig. 9(a) and (c), the naturally trained model cannot effectively capture the main feature, as its Jacobian matrix visualization results are totally meaningless. This implies that the final decision making of the natural model relies on nonstructural pixels spread widely across the whole image. As a result, when practical variations occur, the classification results can be easily influenced.

In contrast, our robust model successfully captures the major feature of the image content [as shown by the dashed line in Fig. 9(b)]. This is because we apply the gradient-norm regularization during training.


Therefore, the model is enforced to avoid the wide-spreading gradient pattern and instead focuses on only a few major features to make precise predictions (thus incurring the slight accuracy drop relative to the natural model noted earlier). After such a training process, the model automatically learns to extract the most distinguishing features of different traffic signs. These major features (e.g., the red triangular edge of the Yield sign) are hand-designed to have high contrast with backgrounds and are not easily affected by weather variations. Thus, our generalization-enhanced model is more capable of resisting variations than the natural model.

Fig. 9. Jacobian matrix visualization: the robust model focuses more on the main pattern of the input image and thus has better resistance against small variations in images. (a) Original image. (b) Jacobian matrix in our model. (c) Jacobian matrix in the natural model.
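Saliency maps like those in Fig. 9 come from one gradient call: backpropagate the correct-class logit to the input pixels and display the magnitude. A minimal sketch; the channel-wise max and [0, 1] normalization are our display choices:

```python
import torch

def jacobian_saliency(model, x, target):
    # Per-pixel |d logit_target / d x|: one row of the Jacobian matrix.
    x = x.clone().requires_grad_(True)
    logit_t = model(x)[:, target].sum()
    grad, = torch.autograd.grad(logit_t, x)
    sal = grad.abs().amax(dim=1)          # max over RGB channels -> (N, H, W) heat map
    return sal / (sal.amax() + 1e-8)      # normalize to [0, 1] for display
```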
3) Algorithm Overhead Analysis: We evaluate the running overhead of the REIN gradient-regularized training method, measured in wall-clock time. We measure the average training time over 1k training iterations for REIN and for the natural training method. On average, our per-iteration training time is 2.1× that of natural training, due to the double-backpropagation overhead. Since the training stage can be conducted offline on server-level GPUs, we believe the time overhead of REIN training is acceptable. For example, the whole 50k iterations needed to complete REIN training take only 2.5 h on a Titan V, which is quite normal for NN training. In turn, REIN brings a +15%–25% accuracy improvement under various adverse conditions and avoids the cost of collecting scenario-specific data, which can be much larger than the training cost on GPUs. Considering that autonomous driving is safety-critical and that the data collection effort of on-road driving can be huge, we believe the offline training overhead of the REIN algorithm is negligible compared to that cost.
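For reference, one REIN-style training iteration with input-gradient-norm regularization via double backpropagation [24] can be sketched as follows; the weight lam and all identifiers are illustrative and do not reproduce the exact REIN loss defined earlier:

```python
# Hedged sketch: gradient-norm-regularized step (double backpropagation).
import torch
import torch.nn.functional as F

def rein_style_step(model, optimizer, images, labels, lam=0.1):
    images = images.clone().requires_grad_(True)
    ce = F.cross_entropy(model(images), labels)
    # First backward pass: input gradient, kept inside the autograd graph.
    input_grad, = torch.autograd.grad(ce, images, create_graph=True)
    # Penalize wide-spreading input gradients (squared L2 norm per image).
    grad_penalty = input_grad.pow(2).sum(dim=(1, 2, 3)).mean()
    loss = ce + lam * grad_penalty
    optimizer.zero_grad()
    loss.backward()  # second backward pass, through the first gradient
    optimizer.step()
    return loss.item()
```

The second backward pass through the first gradient is what roughly doubles the per-iteration cost, consistent with the 2.1× figure reported above.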
VII. EXPERIMENTS AND EVALUATION FOR SELF-TEACHING WITH UNLABELED DATA

Experiment Setup: We adopt a stronger model, VGG16, to evaluate the ST algorithm's performance. The source dataset is the clean GTSRB dataset, as in the previous setting. Different from the previous setting without extra data, in the ST algorithm we assume that the training images in the RobuTS dataset are available while their labels remain unavailable. The ST algorithm introduces these extra unlabeled images with variations into the training process by predicting pseudo-labels on them. The aforementioned optimizations, such as imbalanced sampling and curriculum learning, are also used during the ST iterations.
As for the baseline, we use the same VGG16 structure but train it with the natural training method on the source dataset only, since natural training has no way to use the unlabeled data. The two models are named the natural model and the ST model; both achieve 98.5% accuracy on the clean GTSRB test set thanks to the stronger VGG model structure. We then evaluate the models on the RobuTS test sets, covering all four scenarios and variation intensities. Under the four practically varied scenarios, we will show that our ST model achieves about +16%–30% accuracy improvement over the natural model. Each ST optimization technique is also evaluated in an ablation study to demonstrate its necessity.

A. Generalization Accuracy Under Practical Variations

We first evaluate the generalization enhancement of our ST algorithm. As before, we compare the natural model's and our ST model's accuracies under all tested scenarios. The overall results are shown in Fig. 10. The light blue line shows the accuracy improvement of our ST model over the natural model, i.e., Acc_ST − Acc_natural.

Results and Analysis: With the stronger VGG model structure, our ST model still consistently outperforms the natural model in all four scenarios, by +16%–30% accuracy on average, as shown in Fig. 10.
1) Rain Variation: In the full-spectrum comparison on the rain scenario, our ST model achieves at most a +22% accuracy improvement over the baseline model. In detail, the baseline natural model's accuracy drops from 98.5% to 39.1%, whereas our ST model still maintains nearly 60% accuracy at the highest intensity of rain variation.
2) Fog Variation: Our ST model achieves a similar performance gain (at most +19%) in the fog scenario. Moreover, with small-intensity fog variations (e.g., intensity levels 0–0.5), the ST model maintains nearly no accuracy drop, while the natural model already suffers a −10% accuracy drop, which demonstrates the effectiveness of ST with pseudo-labels.
3) Darkening and Blurring: In the darkening and blurring scenarios, the ST model outperforms the natural-model baseline by at most +16%–30% accuracy. Notably, ST appears most effective in the blurring scenario: the ST model achieves over 80% accuracy at the highest blurring intensity, while the natural model's accuracy already drops to around 50%. Meanwhile, the trend of the accuracy improvement is consistently upward, implying that ST may achieve even higher accuracy gains at higher blurring intensities.

As the results show, ST grants the model higher generalization capability than natural training under most adverse scenarios by utilizing the unlabeled images.


Fig. 10. ST evaluation. During testing with four different kinds of practical variations on traffic sign images, our ST model consistently outperforms the natural model by large margins of +16%–30% at the highest variation intensity. (a) Rain variation. (b) Fog variation. (c) Darkening effect. (d) Bokeh/motion blur.

Fig. 11. Ablation study for curriculum optimization in ST. As the result shows, there is a 10% accuracy drop without curriculum optimization during ST.

Fig. 12. Ablation study for imbalanced sampling optimization in ST. As the result shows, there is an 8% accuracy drop without imbalanced sampling during ST.

B. Ablation Study for Design Components

1) Improvement of Curriculum Learning Optimization: To evaluate the effectiveness of the curriculum optimization, we conduct an ablation study on the rainy scenario. The results are shown in Fig. 11, where the yellow and blue lines show the ST model's accuracy improvement on top of the baseline with and without curriculum learning, respectively. Clearly, ST with curriculum optimization (yellow line) outperforms the naive version (blue line) by up to +10% accuracy, demonstrating the necessity of the curriculum optimization.
The reasons for this performance difference are as follows. Without curriculum learning, ST directly pseudo-labels all the unlabeled data and merges them into the training dataset. These data contain images with high-intensity rain variations, for which the pseudo-labeling accuracy is only around 30%–40%. As a result, these bad pseudo-labels greatly hinder the performance improvement of ST. In contrast, ST with curriculum learning gradually involves the data in an easy-to-hard order, so the ST model learns to predict these varied images more and more accurately, achieving better performance.
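The sketch below illustrates one possible easy-to-hard schedule, under the assumption that "easy" unlabeled samples are those the current model predicts with the highest confidence; the number of rounds and the pool growth rate are illustrative:

```python
# Hedged sketch: easy-to-hard curriculum over the unlabeled pool.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_by_confidence(model, images):
    """Indices sorted from most to least confident ('easy' first)."""
    conf = F.softmax(model(images), dim=1).max(dim=1).values
    return torch.argsort(conf, descending=True)

def curriculum_pools(model, images, num_rounds=4):
    """Yield a growing pseudo-label pool: 25%, 50%, 75%, then 100%."""
    order = rank_by_confidence(model, images)
    n = images.size(0)
    for r in range(1, num_rounds + 1):
        yield images[order[: n * r // num_rounds]]
        # In the full ST loop the model is retrained between rounds, so
        # the harder samples added later receive better pseudo-labels.
```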
2) Improvement of Imbalanced Sampling Optimization: We conduct a similar ablation study for the imbalanced sampling optimization; the results are shown in Fig. 12. As the results show, imbalanced sampling provides a further +8% accuracy improvement over the naive version.
Specifically, the imbalanced sampling is implemented as follows. Before combining the labeled and unlabeled data, we control their composition ratio in the training dataset, e.g., 2:1 (real-labeled data : pseudo-labeled data), as more clean data with accurate labels are needed to stabilize the gradients of each mini-batch. In addition, for all ST experiments we use a progressive confidence-thresholding strategy to control the pseudo-label quality, i.e., decreasing the threshold from 0.9 to 0.7 during the ST process, which yields somewhat better performance (+2%–3%) than constant thresholding.
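A minimal sketch of these two mechanisms, with illustrative batch sizes and schedule, is given below:

```python
# Hedged sketch: 2:1 mini-batch mixing and progressive thresholding.
import torch

def mixed_batch(labeled_iter, pseudo_iter):
    """Draw a 2:1 mix of real-labeled and pseudo-labeled samples, so the
    accurate labels dominate and stabilize the mini-batch gradients."""
    xl, yl = next(labeled_iter)    # e.g., 64 real-labeled samples
    xp, yp = next(pseudo_iter)     # e.g., 32 pseudo-labeled samples
    return torch.cat([xl, xp]), torch.cat([yl, yp])

def progressive_threshold(step, total_steps, start=0.9, end=0.7):
    """Linearly relax the pseudo-label confidence threshold during ST."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + frac * (end - start)
```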
3) Self-Teaching Algorithm Overhead Analysis: The ST overhead lies in the extra effort of training the model on the pseudo-labeled data. As we adopt the imbalanced sampling optimization, the sampling ratio between labeled and pseudo-labeled data also affects the training overhead. In our experiments, we adopt a 2:1 sampling ratio between labeled and unlabeled data in each iteration; the per-epoch training effort is therefore around three times that of the default setting. Since the model training can mostly be done offline, we consider such overhead acceptable in exchange for the classification accuracy improvement.

C. Detection Enhancement Under Practical Variations

The ST method can also improve the performance of detection tasks in autonomous driving. We apply our method to multiclass object detection to demonstrate the performance enhancement on two widely used detection datasets: 1) Cityscapes [39] and 2) Foggy-Cityscapes [40]. As the names suggest, the Cityscapes dataset contains images of street scenes and includes eight detection classes, such as Pedestrian, Rider, and Car. The Foggy-Cityscapes dataset includes similar street-scene images but with certain fog variations.
For implementation, we build our algorithm on top of CycleGAN for style translation [36] and then use the labeled Cityscapes images and the unlabeled Foggy-Cityscapes raw images as the training data for ST. The test data come from the Foggy-Cityscapes test set for performance evaluation in the adverse foggy condition.
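The sketch below illustrates the style-translation idea, assuming a pretrained clean-to-foggy CycleGAN generator G as in [36]; since the translation preserves the scene geometry, the original detection annotations stay valid for the translated images. G and the tensors are illustrative placeholders:

```python
# Hedged sketch: reuse source labels on style-translated images.
import torch

@torch.no_grad()
def translate_to_foggy(G, clean_images, targets):
    """Map labeled clean Cityscapes images into the foggy style; the
    boxes/labels in `targets` are unchanged by the translation."""
    foggy = [G(img.unsqueeze(0)).squeeze(0) for img in clean_images]
    return foggy, targets
```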


TABLE I
DETECTION PERFORMANCE ENHANCEMENT ON THE FOGGY-CITYSCAPES DATASET. INPUT IMAGES ARE RESIZED WITH 512 OR 600 PIXELS AS THE SHORTER SIDE FOR FAIR COMPARISON WITH DIFFERENT STATE-OF-THE-ART WORKS

We use the Faster R-CNN detector [41] and the same input-size settings as previous works [34], [35], [37], [38]. For the performance evaluation, we report the mAP over all classes, following the settings of the aforementioned works for a fair comparison. The oracle performance denotes the mAP of a model trained on the fully annotated Foggy-Cityscapes training set, which can be regarded as the upper-bound performance.
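For illustration, pseudo-labels on the unlabeled foggy images can be generated by the detector itself, as sketched below with torchvision's Faster R-CNN implementation [41]; the COCO-pretrained weights and the score threshold are illustrative stand-ins for our Cityscapes-trained detector:

```python
# Hedged sketch: detection pseudo-labels from a Faster R-CNN detector.
import torch
import torchvision

# COCO-pretrained stand-in for the Cityscapes-trained detector.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

@torch.no_grad()
def pseudo_boxes(images, score_thresh=0.8):
    """Keep only confident detections on unlabeled foggy images as
    pseudo ground truth for the next ST round. `images` is a list of
    3xHxW float tensors."""
    targets = []
    for out in detector(images):                 # one dict per image
        keep = out["scores"] >= score_thresh
        targets.append({"boxes": out["boxes"][keep],
                        "labels": out["labels"][keep]})
    return targets
```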
As Table I shows, our ST method improves the baseline detector's performance by +10.1 mAP at 512 × 1024 resolution. Compared with previous state-of-the-art works, our method also achieves better detection performance, e.g., +2.0 mAP and +2.2 mAP over [36] and [38] at 512 × 1024 and 600 × 1200 resolutions, respectively.
Overall, we show that our ST algorithm brings a +16%–30% accuracy improvement for classification tasks and +10.1 mAP for detection tasks under adverse conditions. In practice, the requirement for unlabeled data is also easy to fulfill through safe data-sharing policies without compromising user privacy [25]. Therefore, we believe the ST algorithm has great potential for enhancing NN models' generalization capability in autonomous driving systems.
VIII. CONCLUSION

In this work, we first build the comprehensive RobuTS dataset, which includes traffic sign images in four adverse weather conditions, e.g., rainy, foggy, etc. Based on it, we benchmark NN-based classifiers and demonstrate their low generalization ability under practical variations. We then propose two novel robust training schemes, REIN and Self-Teaching (ST), targeting model generalization in two scenarios: 1) without extra training data and 2) with unlabeled data from adverse conditions. Experiments show that our proposed methods greatly improve the model's intrinsic generalization in both classification and detection tasks.
REFERENCES

[1] D. Feng, L. Rosenbaum, and K. Dietmayer, "Towards safe autonomous driving: Capture uncertainty in the deep neural network for Lidar 3D vehicle detection," in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), 2018, pp. 3266–3273.
[2] B. Wu, A. Wan, F. Iandola, P. H. Jin, and K. Keutzer, "SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 446–454.
[3] Z. Chen and Z. Chen, "RBNet: A deep neural network for unified road and road boundary detection," in Proc. Int. Conf. Neural Inf. Process., 2017, pp. 677–687.
[4] D. Hospach, S. Mueller, W. Rosenstiel, and O. Bringmann, "Simulation of falling rain for robustness testing of video-based surround sensing systems," in Proc. Design Autom. Test Eur. Conf. Exhibit. (DATE), 2016, pp. 233–236.
[5] R. Gallen, A. Cord, N. Hautière, É. Dumont, and D. Aubert, "Nighttime visibility analysis and estimation method in the presence of dense fog," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 1, pp. 310–320, Feb. 2015.
[6] H. H. Aghdam and E. J. Heravi, Guide to Convolutional Neural Networks: A Practical Application to Traffic-Sign Detection and Classification. Cham, Switzerland: Springer, 2017.
[7] Y. Tian, K. Pei, S. Jana, and B. Ray, "DeepTest: Automated testing of deep-neural-network-driven autonomous cars," 2017. [Online]. Available: arXiv:1708.08559.
[8] A. Lubben. (2018). Self Driving Uber Killed a Pedestrian as Human Safety Driver Watched. [Online]. Available: https://www.vice.com/en-us/article/kzxq3y
[9] J. Horwitz and H. Timmons. (2019). There Are Some Scary Similarities Between Tesla's Deadly Crashes Linked to Autopilot. [Online]. Available: https://qz.com/783009/
[10] J. Green. (2018). Tesla: Autopilot Was on During Deadly Mountain View Crash. [Online]. Available: https://www.mercurynews.com/2018/03/30
[11] K. Lim, Y. Hong, Y. Choi, and H. Byun, "Real-time traffic sign recognition based on a general purpose GPU and deep-learning," PLoS ONE, vol. 12, no. 3, 2017, Art. no. e0173317.
[12] M. Nentwig and M. Stamminger, "Hardware-in-the-loop testing of computer vision based driver assistance systems," in Proc. Intell. Veh. Symp. (IV), 2011, pp. 339–344.
[13] J.-H. Kim, W.-D. Jang, J.-Y. Sim, and C.-S. Kim, "Optimized contrast enhancement for real-time image and video dehazing," J. Vis. Commun. Image Represent., vol. 24, no. 3, pp. 410–425, 2013.
[14] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: A benchmark," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 304–311.
[15] C. Lee and J.-H. Moon, "Robust lane detection and tracking for real-time applications," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 12, pp. 4043–4048, Dec. 2018.
[16] W. Liu, S. Liao, W. Ren, W. Hu, and Y. Yu, "High-level semantic feature detection: A new perspective for pedestrian detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5187–5196.
[17] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, "Learning lightweight lane detection CNNs by self attention distillation," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1013–1021.
[18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2009, pp. 248–255.


[19] P. Sermanet and Y. LeCun, "Traffic sign recognition with multi-scale convolutional networks," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2011, pp. 2809–2813.
[20] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "The German traffic sign recognition benchmark: A multi-class classification competition," in Proc. IEEE Int. Joint Conf. Neural Netw., 2011, pp. 1453–1460.
[21] M.-Y. Fu and Y.-S. Huang, "A survey of traffic sign recognition," in Proc. Int. Conf. Wavelet Anal. Pattern Recognit. (ICWAPR), 2010, pp. 119–124.
[22] (2017). Udacity Self-Driving-Car Challenge. [Online]. Available: https://github.com/udacity/self-driving-car/tree/master/
[23] J. Wu, C. Zheng, X. Hu, Y. Wang, and L. Zhang, "Realistic rendering of bokeh effect based on optical aberrations," Vis. Comput., vol. 26, nos. 6–8, pp. 555–563, 2010.
[24] H. Drucker and Y. Le Cun, "Double backpropagation increasing generalization performance," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), vol. 2, 1991, pp. 145–150.
[25] Tesla. (2019). Tesla Data Sharing Privacy Policy. [Online]. Available: https://www.tesla.com/about/legal
[26] F. Yu et al., "Unsupervised domain adaptation for object detection via cross-domain semi-supervised learning," 2019. [Online]. Available: arXiv:1911.07158.
[27] X. Zhu and A. B. Goldberg, "Introduction to semi-supervised learning," in Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 3. San Rafael, CA, USA: Morgan & Claypool, 2009, pp. 1–130.
[28] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," 2016. [Online]. Available: arXiv:1603.04467.
[29] I. Evtimov et al., "Robust physical-world attacks on deep learning models," 2017. [Online]. Available: arXiv:1707.08945.
[30] F. Yu, Z. Qin, C. Liu, L. Zhao, Y. Wang, and X. Chen, "Interpreting and evaluating neural network robustness," in Proc. 28th Int. Joint Conf. Artif. Intell. (IJCAI), 2019, pp. 4199–4205.
[31] F. Yu, C. Liu, Y. Wang, L. Zhao, and X. Chen, "Interpreting adversarial robustness: A view from decision surface in input space," 2018. [Online]. Available: arXiv:1810.00144.
[32] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," 2016. [Online]. Available: arXiv:1608.04644.
[33] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," 2016. [Online]. Available: arXiv:1607.02533.
[34] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, "Domain adaptive faster R-CNN for object detection in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3339–3348.
[35] X. Zhu, J. Pang, C. Yang, J. Shi, and D. Lin, "Adapting object detectors via selective cross-domain alignment," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 687–696.
[36] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2242–2251.
[37] K. Saito, Y. Ushiku, T. Harada, and K. Saenko, "Strong-weak distribution alignment for adaptive object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6956–6965.
[38] R. Xie, F. Yu, J. Wang, Y. Wang, and L. Zhang, "Multi-level domain adaptive learning for cross-domain detection," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2019, pp. 3213–3219.
[39] M. Cordts et al., "The cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3213–3223.
[40] C. Sakaridis, D. Dai, and L. Van Gool, "Semantic foggy scene understanding with synthetic data," Int. J. Comput. Vis., vol. 126, no. 9, pp. 973–992, 2018.
[41] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.

Fuxun Yu received the B.S. degree from the Harbin Institute of Technology, Harbin, China, in 2017. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA, USA, under the supervision of Prof. X. Chen. His current research interests include deep learning robustness, high-performance deep neural network computing and optimization, interpretability, and explainability of deep learning.

Zhuwei Qin received the B.S. degree from the Tianjin University of Science and Technology, Tianjin, China, in 2014, and the M.S. degree from Oregon State University, Corvallis, OR, USA, in 2017. He is currently pursuing the Ph.D. degree with the ECE Department, George Mason University, Fairfax, VA, USA, under the supervision of Prof. X. Chen. His current research directions include deep neural network compression and interpretable deep neural networks for mobile applications.

Chenchen Liu received the M.S. degree from Peking University, Beijing, China, in 2013, and the Ph.D. degree from the ECE Department, University of Pittsburgh, Pittsburgh, PA, USA, in 2017. In 2017, she joined the Department of Electrical and Computer Engineering, Clarkson University, Potsdam, NY, USA. She is currently an Assistant Professor with the Department of Computer Science and Electrical Engineering, University of Maryland at Baltimore County, Baltimore, MD, USA. Her current research interests include brain-inspired computing systems and security, machine learning, integrated circuit design, and emerging nonvolatile memory technologies.

Di Wang received the B.E. degree in computer science and technology from Zhejiang University, Hangzhou, China, in 2005, the M.S. degree in computer systems engineering from the Technical University of Denmark, Lyngby, Denmark, in 2008, and the Ph.D. degree in computer science and engineering from Pennsylvania State University, State College, PA, USA, in 2014. He is currently a Principal Research Lead with Microsoft, Redmond, WA, USA. He has authored more than 40 peer-reviewed papers. His research spans the areas of artificial intelligence, computer systems, computer architecture, and energy-efficient system design and management. Dr. Wang received five best paper awards and two best paper nominations.

Xiang Chen (Member, IEEE) received the M.S. and Ph.D. degrees from the ECE Department, University of Pittsburgh, Pittsburgh, PA, USA, in 2012 and 2016, respectively. He is currently an Assistant Professor with the Department of Computer Engineering, George Mason University, Fairfax, VA, USA, where he is the Founder of the Intelligence Fusion Laboratory. He also maintains close cooperation with academia, including Duke University, Durham, NC, USA; University of California at Santa Barbara, Santa Barbara, CA, USA; University of Pittsburgh, Pittsburgh, PA, USA; Syracuse University, Syracuse, NY, USA; Tsinghua University, Beijing, China; Hong Kong University of Science and Technology, Hong Kong; and City University of Hong Kong, Hong Kong, as well as with industry research labs, including HP, Palo Alto, CA, USA; Samsung, Suwon, South Korea; MSRA, Beijing, China; Marvell, Hamilton, Bermuda; Amazon, Seattle, WA, USA; and Apple, Cupertino, CA, USA. In the past years of research, he has published more than 30 papers in top international conferences and journals and received many best paper nominations and other awards. His research interests are in low-power mobile systems, high-performance mobile computing, machine learning, and secure computing systems.
