
Explaining Convolutional Neural Networks through Attribution-Based Input Sampling and Block-Wise Feature Aggregation - Technical Appendix

Sam Sattarzadeh,1 Mahesh Sudhakar,1 Anthony Lem,2 Shervin Mehryar,1 K. N. Plataniotis,1 Jongseong Jang,3 Hyunwoo Kim,3 Yeonjeong Jeong,3 Sangmin Lee,3 Kyunghoon Bae3

1 The Edward S. Rogers Sr. Department of Electrical & Computer Engineering, University of Toronto
2 Division of Engineering Science, University of Toronto
3 LG AI Research

{sam.sattarzadeh, mahesh.sudhakar}@mail.utoronto.ca; {j.jang, hwkim}@lgresearch.ai

Datasets

Experiments are conducted on three different datasets: MS COCO 2014 (Lin et al. 2014), PASCAL VOC 2007 (Everingham et al. 2007), and Severstal (PAO Severstal 2019). The first two are "natural image" object detection datasets, while the last is an "industrial" steel defect detection dataset. They are discussed in more detail in the following subsections.

MS COCO 2014 and PASCAL VOC 2007 Datasets

The MS COCO 2014 dataset features 80 object classes, each a common object. All experimental results are obtained on its validation set, which has 40,504 images. The PASCAL VOC 2007 dataset features 20 object classes, and all experimental results for this dataset are obtained on its test set, which has 4,952 images. Both datasets were created for object detection and segmentation purposes and contain images with multiple object classes as well as images with multiple object instances, making them challenging for XAI algorithms to perform well on.

Severstal Dataset

To extend the analysis of XAI algorithms beyond natural images, the Severstal steel defect detection dataset was chosen. It was originally hosted on Kaggle as a "detection" task, which we converted to a "classification" task. The original dataset has 12,568 training images under one normal class, labeled "0", and four defective classes, numbered 1 through 4. Each image may contain no defect, one defect, or two or more defects from different classes. The ground-truth annotations for the segments (masks) are provided in a CSV file, with a single row entry for each class of defect present within each image. The row entries give the locations of the defects, and some entries contain several non-contiguous defect locations.

The original images are long strips of steel sheet with dimensions 1600 × 256 pixels. To convert the dataset for our purpose, every training image was cropped (without any overlap), with an initial offset of 32 pixels, into 6 individual images of dimensions 256 × 256 pixels. The few empty (black) images that tended to be located along the sides of the original long-strip images were discarded, along with images that had multiple types of defects. This reformulation left a highly imbalanced dataset with 5 distinct classes: 0, 1, 2, 3, and 4. Class 0 contains images with no defects, whereas each of the other four classes contains images with only that specific defect group. Fig. 1 shows sample images from each class of the recast dataset, and the per-class image distribution is provided in Table 1. The training split is 70% of the data, and the test split is the remaining 30%. From the training data, 20% is used for validation. The experimental results and qualitative figures on the Severstal dataset are produced on a subset of the test set, using all of the images from classes 1, 2, and 4, and 500 images from class 3.

Figure 1: Sample images with dimension 256 × 256, from each class (0-4) of the recast Severstal dataset.

Severstal: Steel Defect Detection

Class | Training set | Test set | Total
0     | 16620        | 7124     | 23744
1     | 935          | 401      | 1336
2     | 147          | 63       | 210
3     | 8166         | 3500     | 11666
4     | 971          | 417      | 1388

Table 1: Data distribution on each class of the recast Severstal dataset, outlining the high data imbalance among them.
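For concreteness, the recasting procedure can be sketched as follows. This is a minimal illustration under the crop parameters stated above; the helper name recast_strip and the darkness threshold used to drop empty tiles are our assumptions, not from a released pipeline.

```python
import numpy as np

CROP = 256    # output tile size (pixels)
OFFSET = 32   # initial horizontal offset (pixels)
N_TILES = 6   # non-overlapping tiles per 1600 x 256 strip

def recast_strip(strip: np.ndarray) -> list:
    """Crop one 256 x 1600 steel-strip image (H x W) into six
    256 x 256 tiles, skipping the first 32 columns."""
    assert strip.shape[:2] == (256, 1600)
    tiles = []
    for i in range(N_TILES):
        x0 = OFFSET + i * CROP
        tile = strip[:, x0:x0 + CROP]
        # Drop (near-)black tiles found along the strip edges;
        # the exact threshold is an assumption, not from the paper.
        if tile.mean() > 1.0:
            tiles.append(tile)
    return tiles
```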
Model      Metric     | Grad-CAM | Grad-CAM++ | Extremal Perturbation | RISE  | Score-CAM | Integrated Gradient | FullGrad | SISE
VGG16      EBPG       | 23.77    | 18.11      | 25.71                 | 11.5  | 12.59     | 14.01               | 13.96    | 28.16
           mIoU       | 15.04    | 15.69      | 12.81                 | 14.94 | 15.52     | 7.13                | 14.25    | 15.57
           Bbox       | 28.98    | 20.48      | 24.93                 | 28.9  | 27.8      | 14.54               | 27.52    | 29.63
           Drop%      | 44.46    | 45.63      | 41.86                 | 38.69 | 33.73     | 52.73               | 52.39    | 32.9
           Increase%  | 40.28    | 38.33      | 41.30                 | 46.05 | 49.26     | 34.11               | 32.68    | 50.56
ResNet-50  EBPG       | 25.3     | 17.81      | 27.54                 | 11.35 | 12.6      | 14.41               | 14.39    | 29.43
           mIoU       | 17.89    | 15.8       | 13.61                 | 14.69 | 16.36     | 7.24                | 10.14    | 17.03
           Bbox       | 32.39    | 28.28      | 26.98                 | 29.43 | 29.27     | 14.54               | 19.32    | 33.34
           Drop%      | 33.42    | 41.71      | 36.24                 | 37.93 | 35.06     | 55.38               | 56.83    | 31.41
           Increase%  | 48.39    | 40.54      | 45.74                 | 45.44 | 47.25     | 32.18               | 29.59    | 49.76

Table 2: Results of ground truth-based and model truth-based metrics for state-of-the-art XAI methods along with SISE (proposed) on two networks (VGG16 and ResNet-50) trained on the MS COCO 2014 dataset. For each metric, the best result is shown in bold and the second-best is underlined. Except for Drop%, higher is better for all metrics.
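For reference, the model truth-based metrics reported here are typically computed as follows; a minimal sketch using the standard definitions of Average Drop% and Increase%, where the explanation map is used to mask the input (function and variable names are ours):

```python
import torch

def drop_increase(model, images, explanations, class_ids):
    """Average Drop% and Increase% over a set of images, following the
    usual definitions: compare the class confidence on the full image
    with the confidence on the image masked by its explanation map."""
    drops, increases = [], []
    for image, expl, c in zip(images, explanations, class_ids):
        with torch.no_grad():
            y = torch.softmax(model(image.unsqueeze(0)), dim=1)[0, c]
            masked = image * expl            # keep only attributed regions
            o = torch.softmax(model(masked.unsqueeze(0)), dim=1)[0, c]
        drops.append(torch.clamp(y - o, min=0) / y)
        increases.append(float(o > y))
    n = len(drops)
    return 100 * sum(drops) / n, 100 * sum(increases) / n
```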

Models

VGG16 and ResNet-50

The top-1 accuracies of the VGG16 and ResNet-50 models (loaded from the TorchRay library (Fong, Patrick, and Vedaldi 2019)) on the test set of the PASCAL VOC 2007 dataset were 56.56 percent and 57.08 percent respectively, out of a maximum top-1 accuracy of 64.88 percent, while the top-5 accuracies were 93.29 percent and 93.09 percent respectively, out of a maximum top-5 accuracy of 99.99 percent. The top-1 accuracies of the VGG16 and ResNet-50 models on the validation set of the MS COCO 2014 dataset were 29.62 percent and 30.25 percent respectively, out of a maximum top-1 accuracy of 34.43 percent, while the top-5 accuracies were 69.01 percent and 70.27 percent respectively, out of a maximum top-5 accuracy of 93.28 percent.
ResNet-101

A ResNet-101 model was trained on the recast Severstal dataset using a Stochastic Gradient Descent (SGD) optimizer along with a categorical cross-entropy loss function. The model was trained for 40 epochs with an initial learning rate of 0.1, which is halved every 5 epochs. Given the high data imbalance among the classes, the top-1 accuracy of the ResNet-101 model on the test set of the recast Severstal dataset was 86.58 percent, while the top-3 accuracy was 99.60 percent. Table 3 shows the normalized confusion matrix of this model.

                     Predicted Class
Actual Class    0       1       2       3       4
0               0.89    0.011   0.0056  0.077   0.012
1               0.27    0.59    0.02    0.12    0.0025
2               0.095   0.032   0.71    0.16    0
3               0.12    0.014   0.004   0.85    0.0086
4               0.15    0.0072  0.0024  0.16    0.67

Table 3: Normalized confusion matrix of the ResNet-101 model trained on the recast Severstal dataset.
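The training configuration above maps onto a standard PyTorch setup. A minimal sketch is given below, assuming a 5-class output head; the training-loop helper is omitted, and momentum/weight decay are unspecified in this appendix, so they are left at their defaults.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

model = resnet101(num_classes=5)         # classes 0-4 of the recast dataset
criterion = nn.CrossEntropyLoss()        # categorical cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Initial learning rate 0.1, halved every 5 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(40):
    # train_one_epoch(model, criterion, optimizer, train_loader)  # omitted
    scheduler.step()
```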
Evaluation

In addition to the quantitative evaluation results shared in the main paper, the results of both ground truth-based and model truth-based metrics on the MS COCO 2014 dataset are attached in Table 2. Similar to our earlier results, SISE outperforms other conventional XAI methods in most cases. The MS COCO 2014 dataset is more challenging for the explanation algorithms than the PASCAL VOC 2007 dataset because of:

• the higher number of object instances,
• the presence of more extra-small objects,
• the presence of more objects, either from the same or from different classes, in each image (on average), and
• the lower classification accuracy of the models trained on it (as provided in the TorchRay library).

However, the results depicted in Table 2 and Fig. 8 emphasize the superior ability of SISE in providing satisfying, high-resolution, and complete explanation maps that give a precise visual analysis of the model's predictions and perspective.

The benchmark results reported on the Pascal VOC 2007 and MS COCO 2014 datasets are calculated for all ground-truth labels in the test images. For example, if a chosen input image has both "dog" and "cat" object instances, then explanations are collected for both class ids and accounted for in the overall performance. SISE's ability to generate class-discriminative explanations is represented in this manner. As discussed in the main manuscript, SISE chooses pooling layers to collect feature maps, which are later combined in the fusion module. The experiments on the Severstal dataset were performed for only the ground-truth labels, as each test image has exactly one class id associated with it. A sketch of this per-label protocol is given below.
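This is a minimal sketch, with explain and score standing in for any attribution method (such as SISE) and any one of the metrics; both names are hypothetical.

```python
def evaluate_per_label(model, dataset, explain, score):
    """Average a metric over every (image, ground-truth label) pair,
    mirroring the multi-label protocol described above."""
    results = []
    for image, gt_class_ids in dataset:        # e.g. ids for "dog" and "cat"
        for class_id in gt_class_ids:          # one explanation per label
            explanation = explain(model, image, class_id)
            results.append(score(explanation, image, class_id))
    return sum(results) / len(results)
```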
A detailed qualitative analysis of SISE explanations compared with other state-of-the-art XAI algorithms, on the discussed models on the Pascal VOC 2007 and recast Severstal datasets, is shown in Figs. 5, 6, and 7 respectively. Figs. 8 and 9 show a similar comparative analysis on the MS COCO 2014 dataset.
Ablation Study

Metric       µ = 0   µ = 0.3   µ = 0.5   µ = 0.75
EBPG         66.08   66.54     65.84     62.5
mIoU         31.37   31.5      30.63     28.51
Bbox         61.59   61.45     59.83     56.53
Drop%        30.92   31.5      33.31     38.83
Increase%    40.22   40.05     38.36     36.09
Runtime (s)  9.21    2.18      0.65      0.38

Table 4: Performance and runtime results of SISE with respect to the parameter µ, on a ResNet-50 network trained on the PASCAL VOC 2007 dataset. Except for Drop% and runtime (in seconds), higher is better for all metrics.

Figure 2: Effect of varying SISE's µ on a ResNet-50 model trained on the Pascal VOC 2007 dataset. (Columns: input image and the SISE maps for each µ setting; rows: Person 0.9999, Car 0.5281, TV Monitor 0.0014, Motorbike 0.9978.)
As stated in the main manuscript, in the second phase of SISE, each set of feature maps is valuated by backpropagating the signal from the output of the model to the layer from which the feature maps are derived. In this stage, after normalizing the backpropagation-based scores, a threshold µ is applied to each set, so that the feature maps passing the threshold are converted to attribution masks and utilized in the next steps, while the others are discarded. Some of these feature maps do not contain signals that lead the model to make a firm prediction, since they represent attributions related to instances of other classes (rather than the class of interest). Such feature maps are expected to be identifiable by their zero or negative backpropagation-based scores. Discarding them by setting the threshold parameter µ to 0 (µ is defined in the main manuscript) improves our method, not only by increasing its speed but also by enabling us to analyze the model's decision-making process more precisely.

Increasing the threshold parameter µ trades performance for speed. When this parameter is slightly increased, SISE discards feature maps with low positive backpropagation-based scores, which is not expected to have a considerable impact on the output explanation map. The higher the parameter µ, though, the more decisive feature maps are discarded, causing greater degradation in SISE's performance.

To verify these interpretations, we have conducted an ablation analysis on the PASCAL VOC 2007 test set. As stated in the main manuscript, the model truth-based metrics (Drop% and Increase%) are the most important metrics revealing the sensitivity of SISE's performance with respect to its threshold parameter. According to our results, as depicted in Table 4 and Fig. 2, the ground truth-based results also follow approximately the same trend under µ variation. Consequently, our results show that by adjusting this hyper-parameter, a dramatic increase in SISE's speed is gained in exchange for a slight compromise in its explanation ability. Since the behavior of our method with respect to this hyper-parameter does not depend on the model or the dataset employed, it can be consistently fine-tuned based on the requirements of the end user.
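The filtering step that µ controls can be sketched as follows; this is a minimal illustration assuming the backpropagation-based scores for one layer's feature maps have already been computed (the function name and exact normalization are our assumptions, not the released implementation):

```python
import torch

def select_attribution_masks(feature_maps: torch.Tensor,
                             scores: torch.Tensor,
                             mu: float) -> torch.Tensor:
    """Keep only the feature maps whose normalized backpropagation-based
    score exceeds mu; the rest are discarded.

    feature_maps: (N, H, W) maps collected from one pooling layer.
    scores:       (N,) backpropagation-based scores for those maps.
    """
    normalized = scores / (scores.abs().max() + 1e-12)
    keep = normalized > mu     # mu = 0 already drops zero/negative scores
    return feature_maps[keep]  # survivors become attribution masks
```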
Sanity Check

In addition to the comprehensive quantitative experiments presented in the main manuscript and this appendix, we also verified the sensitivity of our explanation algorithm to the model's parameters, illustrating that our method adequately explains the relationship between the input and the output that the model reaches. As introduced by (Adebayo et al. 2018), sanity checks on explanation methods can be conducted either by randomizing the model's parameters or by retraining the model on the same training data but with random labels. In this work, we performed sanity checks on our method by randomizing the parameters of the model. To do so, we randomized the weight and bias parameters of the VGG16 trained on the PASCAL VOC 2007 dataset, provided by (Fong, Patrick, and Vedaldi 2019). Fig. 3 presents the results of sanity checks for some input images. The layers whose parameters are randomized are selected in a top-to-bottom manner, as specified in the figure. Each row shows the effect on the output explanation maps for an image as we perturb the parameters in more layers. As the figure shows, SISE's explanation maps change substantially when dealing with highly perturbed models. Hence, SISE passes our sanity check.

Figure 3: Sanity check experimentation of SISE as per (Adebayo et al. 2018) by randomizing a VGG16 model's (pre-trained on the Pascal VOC 2007 dataset) parameters. Weights are randomized cascadingly from top to bottom layers (columns: Image, SISE, Logit, Conv28, Conv21, Conv14, Conv7, Conv2; rows: Dog, Bird, Train, Car).
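A minimal sketch of the cascading randomization protocol follows, assuming a generic explain(model, image, class_id) callable as a hypothetical stand-in for SISE; the re-initialization scheme is illustrative, not the one used in our experiments.

```python
import torch

def cascading_randomization(model, image, class_id, explain):
    """Randomize weights and biases layer by layer from the logits down,
    recomputing the explanation map after each step (Adebayo et al. 2018)."""
    maps = [explain(model, image, class_id)]   # original, unperturbed model
    layers = [m for m in model.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    for layer in reversed(layers):             # top-to-bottom order
        torch.nn.init.normal_(layer.weight, std=0.01)
        if layer.bias is not None:
            torch.nn.init.zeros_(layer.bias)
        maps.append(explain(model, image, class_id))
    return maps
```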

To assess SISE's explanations beyond a few evaluation metrics, another sanity check was performed. Fig. 4 shows such an experiment, where an untrained VGG16 model was directly compared with our VGG16 model trained on the Pascal VOC 2007 dataset. SISE does not generate quality explanations from the untrained model, indicating that our method does not merely highlight "featured regions" obtained through convolutional operations, but depicts the actual "attributed regions" affecting the model's decision.

Figure 4: SISE results from a VGG16 model trained on the Pascal VOC 2007 dataset alongside an untrained VGG16 model (rows: Bus, Cow, Person).

Complexity Evaluation

A runtime test was conducted to compare the complexity of the different XAI methods with SISE, timing how long each algorithm took to generate an explanation map. It was performed on a Tesla T4 GPU with 16GB of memory, on both a VGG16 and a ResNet-50 model, and the results are attached as Table 5.

XAI Method             Runtime on VGG16 (s)   Runtime on ResNet-50 (s)
Grad-CAM               0.006                  0.019
Grad-CAM++             0.006                  0.020
Extremal Perturbation  87.42                  78.37
RISE                   64.28                  26.08
Score-CAM              5.90                   18.17
Integrated Gradient    0.68                   0.52
FullGrad               18.69                  34.03
SISE                   5.90                   9.21

Table 5: Results of runtime evaluation of SISE along with other algorithms on a Tesla T4 GPU with 16GB of memory.

Reported runtimes were averaged over 100 trials, using a random image from the PASCAL VOC 2007 test set for each trial. Grad-CAM and Grad-CAM++ are the fastest methods on both models. This is expected, as they require only one main forward pass and one backward pass. Our method, SISE, is not the fastest; the main bottleneck in its runtime is the number of feature maps extracted from the CNN and used. This is addressed by adjusting µ, as discussed in the 'Ablation Study' section.
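The timing protocol can be reproduced with a sketch along these lines; explain is again a hypothetical stand-in for the method under test, and the CUDA synchronizations keep GPU timings honest.

```python
import random
import time
import torch

def average_runtime(explain, model, images, trials: int = 100) -> float:
    """Mean wall-clock seconds to generate one explanation map,
    using a random test image for each of `trials` runs."""
    total = 0.0
    for _ in range(trials):
        image = random.choice(images)
        torch.cuda.synchronize()
        start = time.perf_counter()
        explain(model, image)
        torch.cuda.synchronize()
        total += time.perf_counter() - start
    return total / trials
```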

Figure 5: Qualitative comparison of SISE with other state-of-the-art XAI methods with a ResNet-50 model on the Pascal VOC 2007 dataset. (Columns: Input Image, Grad-CAM, Grad-CAM++, Extremal Perturbation, Score-CAM, Integrated Gradient, RISE, SISE; rows: Train 1.000, Person 0.9959, Dog 0.9408, Person 0.9889, Cat 0.9999, Person 0.0027, Horse 0.9962.)

References

Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; and Kim, B. 2018. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, 9505–9515.

Everingham, M.; Van Gool, L.; Williams, C. K. I.; Winn, J.; and Zisserman, A. 2007. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. URL http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

Fong, R.; Patrick, M.; and Vedaldi, A. 2019. Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision, 2950–2958.

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755. Springer.

PAO Severstal. 2019. Severstal: Steel Defect Detection on Kaggle Challenge. URL https://www.kaggle.com/c/severstal-steel-defect-detection.
Figure 6: Comparison of SISE explanations generated with a VGG16 model on the Pascal VOC 2007 dataset. (Columns: Input Image, Grad-CAM, Grad-CAM++, Extremal Perturbation, Score-CAM, Integrated Gradient, RISE, SISE; rows: Cat 1.000, Chair 9.65e-06, Person 0.999, Person 1.24e-04, Car 0.999.)

Figure 7: Qualitative results of SISE and other XAI algorithms from the ResNet-101 model trained on the recast Severstal dataset. (Columns: Input Image, Grad-CAM, Grad-CAM++, Score-CAM, Integrated Gradient, RISE, SISE; rows: Class 1 0.8513, Class 2 0.92, Class 3 0.9994, Class 4 0.9983.)
Figure 8: Explanations of SISE along with other conventional methods from a VGG16 model on the MS COCO 2014 dataset. (Columns: Input Image, Grad-CAM, Grad-CAM++, Extremal Perturbation, Score-CAM, Integrated Gradient, RISE, SISE; rows: Elephant 0.1291, Toilet 0.9962, Tennis Racket 0.0031, Person 0.9999, Truck 0.8803.)
Figure 9: Qualitative results of SISE and other XAI algorithms from the ResNet-50 model trained on the MS COCO 2014 dataset. (Columns: Input Image, Grad-CAM, Grad-CAM++, Extremal Perturbation, Score-CAM, Integrated Gradient, RISE, SISE; rows: Fire Hydrant 0.9542, Pizza 0.0597, Handbag 0.0012, Donut 0.9786, Cup 0.0203, Person 0.9999, Bicycle 6.13e-07.)
