
2021 IEEE International Conference on Artificial Intelligence Testing (AITest)

Model-based Data-Complexity Estimator for Deep Learning Systems

Yuta Ojima, NTT DATA Corporation, Tokyo, Japan, Yuta.Ojima@nttdata.com
Shingo Horiuchi, NTT DATA Corporation, Tokyo, Japan, Shingo.Horiuchi@nttdata.com
Fuyuki Ishikawa, National Institute of Informatics, Tokyo, Japan, f-ishikawa@nii.ac.jp

Abstract—Verification of deep learning (DL) systems in various situations is growing in importance as they are being used in wider areas than ever before. To prevent unintentional behavior of a DL system, it is important to feed it unfamiliar inputs to simulate situations in which the model in the DL system can behave abnormally. Such inputs are different from outliers, since outlier detection considers only statistical characteristics and does not take the response of the model into account. This paper proposes a method of evaluating the complexity of an input, without a label, for a given trained model, both for analyzing a dataset to find inputs unfamiliar to the model and for estimating the similarity between two datasets. Specifically, we use the outputs of neurons in the model as an embedded representation of the data. We regard typical patterns in the representations of the data in a training dataset as the features the model obtained from that dataset. We then calculate the complexity of an input as the distance between the representation of the input and a combination of the features. The obtained complexity can be applied for multiple purposes, such as test guiding by extracting inputs that are complex for the model from a dataset, analyzing a training dataset to find inappropriate inputs, and measuring the similarity between two datasets using complexity distributions. Experimental evaluations showed the effectiveness of the complexity calculated using the proposed method in analyzing a dataset.

Index Terms—Feature Extraction, Dataset Evaluation, Test Guiding, Unsupervised, Deep Learning Systems

I. INTRODUCTION

Deep learning (DL) systems have been adopted in various areas. Many attempts have been made to use DL systems in more safety-critical domains such as autonomous vehicles, medical image diagnosis, and biologics [1]–[3]. While DL systems have high potential, which is still growing rapidly, it is difficult for them to treat unfamiliar samples (e.g., a classifier that recognizes gender is 12 times more likely to make a wrong prediction, even with high confidence, if the age of the subject is far from that of the people in the training dataset [4]). Such a shortcoming has caused serious accidents in safety-critical areas [5]. Therefore, to develop safe DL systems, we should find as many situations (i.e., combinations of multiple concepts such as age, race, hair style, etc. for a gender-recognition system) in which a DL model can behave abnormally as possible and evaluate its behavior in advance. Here, concepts can be interpreted as any properties of an input, including its label.

There are two major approaches to determining whether a DL model is sufficiently trained and tested. One approach is to evaluate how many situations are trained. In this approach, we evaluate the ratio of the situations in a training dataset to all possible situations on the basis of data properties, such as the distribution of numerical data, the distribution of unstructured data in a feature space, and conditions heuristically drawn from business requirements [6], [7]. It is therefore effective to use this approach to check a model when the situations we want the model to cope with are explicit and limited. However, it is difficult to apply this approach when we cannot limit the situations in advance, which is often the case. Moreover, heuristic conditions are not always important for a task, and vice versa (i.e., some conditions we ignore might have a significant impact on model performance). Another approach is to evaluate how thoroughly the behavior of a model is inspected. In this approach, we use internal states of the model (i.e., outputs of neurons) as the behavior of the model [8]–[11] and try to cover as many internal states as possible with a test dataset. It is therefore ideal to use this approach when we want to prevent the model from behaving undesirably. However, it is not realistic to apply this approach to large models, since we would need a tremendously large test dataset to cover all the internal states. We therefore need a more efficient approach to find the inputs that are likely to cause abnormal behavior by referring to what the model learned.

We propose a novel method of evaluating the complexity of an input, without a label, for a given trained model, for analyzing a dataset to find inputs unfamiliar to the model efficiently and for estimating the similarity between two datasets (Fig. 1). We treat the activation values of neurons in a DL model as an embedded representation of the input for the model, as in most previous studies [8]–[11]. We call such a representation an activation trace of the input for the model, as defined in [11]. We then model the activation trace as the combination of typical patterns that many activation traces of inputs in the training dataset have in common and a pattern unique to the individual activation trace (Fig. 2), since concepts, which are labels and components of a situation, should be represented as patterns in the activation traces when the situation of each input is represented as an activation trace. We call these typical patterns common features in this paper. We first estimate the common features in the activation traces of the inputs in a training dataset and then calculate the pattern unique to each individual input, which cannot be represented using only the common features.

Fig. 1. Overview of the proposed method. Our method consists of four parts: data encoding, feature extraction, complexity estimation, and similarity estimation.

Fig. 2. Definition and relationship of the terms used in this paper.

The larger the size of the unique pattern, the more difficult it is to make an inference on the corresponding input, since its activation trace is far from the common features. We therefore define the norm of the unique pattern of an input as the complexity of the input for the model. Once we calculate the complexity of each input in a dataset, we can interpret the dataset as a set of complexities, i.e., a numerical set. This enables us to evaluate the similarity between datasets quantitatively using methods for measuring the similarity between two numerical sets.

The main contribution of this paper is proposing and evaluating a method that uses a new concept, complexity, for analyzing a dataset and estimating the similarity between two datasets. The complexity indicates the uniqueness of an individual input by referring to the situations of the inputs in the training dataset. Since label information is not required in calculating complexity, our method is applicable to datasets without labels. The proposed method can be applied for the following tasks:

• Data classification: to classify inputs by common features
• Test guiding: to find weak points of the model efficiently by referring to the complexity of an input
• Training data analysis: to detect a suspicious input in a training dataset
• Dataset similarity estimation: to estimate the similarity between datasets using their distributions of complexities

II. RELATED WORK

This section reviews related work on testing methods for DL systems. Similar to other studies, we aim to reveal how a model works for given inputs. We also introduce current methods of analyzing datasets from various aspects, as this is also a main goal of ours. Our method differs from these methods in that it can handle various tasks in a unified manner.

A. Testing Deep Learning Systems

The necessity of revealing the behavior of a DL system before its use has been recognized; therefore, many methods of thoroughly testing DL systems have been proposed. Pei et al. [8] introduced the concept of neuron coverage (NC) as the ratio of activated neurons to all neurons, inspired by the concept of code coverage used in tests for traditional software systems. They also proposed an approach to generate test cases that increase NC to determine unintentional behavior of a DL system. Ma et al. [9] improved upon NC to compare the behavior of a DL system when test data are given with that when training data are given. This improved index is called DeepGauge and is interpreted as the extent to which a DL system is tested considering a given training dataset.

In contrast to the above methods, with which model behavior toward a test dataset is evaluated all at once, Kim et al. [11] proposed a method that focuses on individual inputs in a test dataset to evaluate a model and obtain more insights from testing. They used the activation of neurons, without binarization, to evaluate individual inputs in a test dataset. They regard all training data as the concepts the model obtained and evaluate the variation among inputs in a test dataset to estimate the importance of each input in testing. Their work is close to our method in the sense that it also evaluates each individual input, but it differs in that our method further decomposes the activation of neurons into common features and a unique pattern.

B. Dataset Analysis

There have been many studies on analyzing datasets for various purposes.

Branchaud-Charron et al. [12] proposed a method of estimating the complexity of datasets for a classification task using the similarity among classes. With their method, datasets are evaluated without model information, since only the feature values of the data in a lower-dimensional space are taken into account in calculating similarity. In contrast, methods of analyzing datasets by taking model properties into account have also been proposed. Regarding training datasets, Swayamdipta et al. [13] proposed a method of analyzing annotated datasets for a classification task to efficiently extract data that make a trained model more robust. Regarding test datasets, on the other hand, Mani et al. [14] proposed a method of measuring the quality of a test dataset using the activation values of the last layer of a model. They propose four heuristic metrics to evaluate the quality of a test dataset. Moreover, other studies [15]–[17] focus on data incoming after the deployment of systems. These studies attempted to detect out-of-distribution data by referring to the output of the model, such as its confidence.

Our work follows the directions in Sec. II-A, that is, analyzing the model behavior and patterns of neuron activations.

III. PROPOSED METHOD

This section explains the proposed method of estimating the complexity of an input x for a model trained on a training dataset X consisting of T inputs, [x_1, ..., x_T], and of estimating the similarity between two datasets on the basis of the estimated complexities. Our method consists of four steps: data encoding, feature extraction, complexity estimation, and similarity estimation (Fig. 1). In the data-encoding step, we encode the training data in accordance with the trained model by using the output of its activation layers. In the feature-extraction step, we estimate typical patterns in the activation traces of the inputs in the training dataset as the common features the model obtained from the training dataset. We use these common features instead of individual activation traces to evaluate the complexity of an input. In the complexity-estimation step, we measure the distance between a combination of common features and the individual activation trace of an input. Using sets of estimated complexities, we then evaluate the similarity between two datasets in the similarity-estimation step.

A. Data Encoding

We use the output of activation layers to represent how the trained model treats an input x, which is the concept of an activation trace introduced in previous work [11]. Formally, we encode a training dataset to obtain the activation traces A_l := [a_{l,1}, ..., a_{l,T}]^T ∈ R_+^{T×N_l}, where [...] represents a matrix whose columns correspond to the vectors between the brackets, ^T represents the transpose of a matrix, and N_l represents the number of neurons in the l-th activation layer. Hereafter, we omit the notation l when the target layer is obvious.
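As a concrete illustration of this data-encoding step, the following Python sketch collects the flattened output of one activation layer for every input in a dataset. It assumes a Keras model with named activation layers (e.g., activation_1, as in Fig. 3) and a NumPy array of inputs; it is our reading of the step, not the authors' released implementation.

```python
import numpy as np
import tensorflow as tf

def activation_traces(model, layer_name, inputs, batch_size=256):
    """Encode a dataset as activation traces: one row per input, one column per
    neuron of the chosen activation layer (a matrix in R_+^{T x N_l})."""
    sub_model = tf.keras.Model(inputs=model.input,
                               outputs=model.get_layer(layer_name).output)
    acts = sub_model.predict(inputs, batch_size=batch_size)
    return acts.reshape(len(inputs), -1)  # flatten any spatial dimensions into N_l neurons

# Hypothetical usage: traces of the training set at a shallow ReLU layer.
# A_train = activation_traces(trained_model, "activation_1", x_train)
```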
B. Feature Extraction

We regard an activation trace a_t as a composite of the common features essential for the assigned task, since a DL model is trained to extract such features by adjusting the parameters of its non-linear transformations. We model the activation trace, focusing on the non-negative property of the output of common activation layers, namely rectified linear unit layers, as follows:

a_t = Σ_i w_i f_i + ε_t,   (1)

where f_i ∈ R_+^N are the weights of neurons representing the i-th common feature, w_i is the weight of the i-th common feature in a_t, and ε_t ∈ R_+^N is the adjustment term unique to the individual input that cannot be represented by the obtained common features. We infer the latent variables w_i and f_i by non-negative matrix factorization (NMF). This optimization problem is formulated as follows:

minimize Σ_{1≤t≤T} d(a_t, w_t^T F)   subject to   w_t, F ≥ 0,   (2)

where d(·, ·) is the distance between two vectors, w_t ∈ R_+^Φ is the vector of weights of the common features in the t-th input, F := [f_1, ..., f_Φ]^T ∈ R_+^{Φ×N} contains the common features in the entire training dataset, and Φ is a hyperparameter that represents the number of common features to be estimated. As the results of this step, we obtain the feature matrix F and the weight matrix W := [w_1, ..., w_T]^T ∈ R_+^{T×Φ}, which represents the weights of the common features in each individual input in the training dataset.
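One way to realize this feature-extraction step is scikit-learn's NMF with the Kullback-Leibler divergence of Eq. (6) as the distance d. The solver, initialization, and iteration budget below are assumptions, since the paper does not specify them, and A_train is the activation-trace matrix from the previous sketch.

```python
import numpy as np
from sklearn.decomposition import NMF

# A_train: non-negative activation traces of the training set, shape (T, N).
# Phi = 50 common features, KL divergence as the distance d (Eqs. (2) and (6)).
nmf = NMF(n_components=50, beta_loss="kullback-leibler", solver="mu",
          init="nndsvda", max_iter=500, random_state=0)
W_train = nmf.fit_transform(A_train)  # weights of common features per input, (T, Phi)
F = nmf.components_                   # common features, (Phi, N)

# For a test set, F stays fixed and only the weights are re-estimated (Sec. III-C):
# W_test = nmf.transform(B_test)
```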
C. Complexity Estimation

Since the activation trace of an input whose concept differs from the concepts of the other inputs cannot be represented by the common features in the training dataset, such an input has a greater error in the approximation using F mentioned in Sec. III-B. We therefore define the complexity of each input as a reconstruction loss calculated as

e_t := (1/Z_t) ||ε_t|| = (1/Z_t) √( (1/N) Σ_n ( A_{t,n} − Σ_i W_{t,i} F_{i,n} )² ),   (3)

where Z_t := max_n(A_{t,n}) is a normalization factor. We denote by E_training := {e_t | 1 ≤ t ≤ T} the set of complexities of the inputs in the training dataset. An input with a high complexity in the training dataset can be regarded as a suspicious input.

Moreover, we can obtain activation traces

B := [b_1, ..., b_{T′}]^T ∈ R_+^{T′×N_l}   (4)

for a test dataset X′ := [x′_1, ..., x′_{T′}], as mentioned in Sec. III-A. We also obtain a weight matrix for the test dataset, W′, by solving the following optimization problem:

minimize Σ_{1≤t≤T′} d(b_t, w′_t^T F)   subject to   w′_t ≥ 0.

Note that F is fixed this time, since we are interested in how far the activation trace of an input in the test dataset is, not from the common features of the test dataset, but from those of the training dataset. We then obtain the set of complexities for the test dataset, E_test := {e_t | 1 ≤ t ≤ T′}, in the same manner as for the training dataset. An input with a high complexity in the test dataset can be regarded as an input unfamiliar to the trained model.
D. Similarity Estimation

E_test and E_training are expected to be similar when the situations of the data in the training dataset and those in the test dataset are similar, since both datasets should then be represented by the same common features. We therefore assess the similarity between E_test and E_training using the following two approaches. The first approach is to conduct Welch's t-test to test the hypothesis that E_test and E_training have equal means. The second approach is to calculate the Bhattacharyya distance between the histograms of E_test and E_training. The Bhattacharyya distance is calculated as

BD(H_training, H_test) = √( 1 − Σ_I √( H_training(I) · H_test(I) ) ),   (5)

where H(I) is the density of the histogram for a bin I.
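The two comparisons can be sketched as follows, using SciPy's Welch's t-test and a histogram-based Bhattacharyya distance. The 100 bins match Sec. V-B; the shared bin range and the outer square root of Eq. (5) follow our reconstruction of the formula and should be treated as assumptions.

```python
import numpy as np
from scipy import stats

def compare_complexity_sets(E_train, E_test, bins=100):
    """Welch's t-test p-value and Bhattacharyya distance (Eq. (5)) between
    two sets of complexities given as 1-D NumPy arrays."""
    _, p_value = stats.ttest_ind(E_train, E_test, equal_var=False)  # Welch's t-test
    lo = min(E_train.min(), E_test.min())
    hi = max(E_train.max(), E_test.max())
    h1, _ = np.histogram(E_train, bins=bins, range=(lo, hi))
    h2, _ = np.histogram(E_test, bins=bins, range=(lo, hi))
    h1 = h1 / h1.sum()                             # bin densities H_training(I)
    h2 = h2 / h2.sum()                             # bin densities H_test(I)
    bc = np.sum(np.sqrt(h1 * h2))                  # Bhattacharyya coefficient
    return p_value, np.sqrt(max(1.0 - bc, 0.0))

# p, bd = compare_complexity_sets(E_training, E_test)
```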
IV. RESEARCH QUESTIONS

We prepared the following four research questions to determine the validity of the proposed method. We designed the evaluation experiments on an image-classification task to answer these questions.

RQ1. Feature Extraction: Do the common features describe the concepts?

To answer RQ1, we sorted the images in the training dataset by w_{*i}, namely the weights of the i-th common feature f_i, and determined what kind of concept is obtained as the i-th common feature by observing the images in which the weight of the i-th common feature is relatively large. We expected such images to show a similar concept (e.g., a ship in the upper half of the image regardless of its color). We further varied the layer used to obtain the common features to see how the concept described by each common feature changes. We expected a common feature obtained from a deeper layer to describe a higher-level concept, as Bengio et al. reported [18].

RQ2. Complexities of Test Dataset: Is it difficult to make an inference on an input with a high complexity?

To answer RQ2, we observed how the prediction accuracy changes when we make an inference on a dataset of increasing size. We first calculated the complexities of all inputs in the original test dataset. We then made a subset of the test dataset by selecting the inputs with the lowest (highest) complexities and calculated the prediction accuracy for the subset. We gradually added the remaining inputs to the subset in ascending (descending) order of complexity to see how the accuracy changes. We expected the accuracy to gradually decrease (increase) as the number of test inputs increases in ascending (descending) order of complexity.

We further confirmed that an unfamiliar input, i.e., an input with a concept that is not included in the training dataset, also has a higher complexity. To do this, we excluded certain labels from the training dataset, trained a model, and tested it with the test dataset including the labels excluded in the training phase to see how the complexities behave for inputs with the excluded labels. In particular, we evaluated the ratio of such labels in the top 10% of test inputs sorted by complexity. We expected the ratio of the excluded labels to be high in the top 10% of the data.
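The accuracy-curve procedure of RQ2 can be sketched as follows. The paper does not state the granularity at which inputs are added, so the sketch grows the subset one input at a time; the array names are hypothetical.

```python
import numpy as np

def cumulative_accuracy_curve(E_test, y_true, y_pred, ascending=True):
    """Accuracy of growing test subsets ordered by complexity: start from the
    lowest- (or highest-) complexity inputs and add the remaining ones."""
    order = np.argsort(E_test)
    if not ascending:
        order = order[::-1]
    correct = (y_true[order] == y_pred[order]).astype(float)
    return np.cumsum(correct) / np.arange(1, len(correct) + 1)

# curve_ascending = cumulative_accuracy_curve(E_test, y_test, predictions, ascending=True)
```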
RQ3. Complexities of Training Dataset: Is an input with a high complexity in the training dataset suspicious?

To answer RQ3, we prepared a subset of the training dataset consisting of the top 10 inputs in complexity for each label. We then determined whether the percentage of suspicious inputs is higher in this subset than in the original dataset. As a proxy for the entire dataset, we randomly sampled 50 inputs from the training dataset for each label, since the number of inputs in the entire dataset is so large that we cannot determine in advance whether they are suspicious. One of the authors made a judgement on whether an input is suspicious, as described in Sec. VI. We expect the percentage of suspicious inputs to be higher among the inputs with the highest complexities than among the randomly sampled inputs.

RQ4. Similarity Estimation of Datasets: Can we estimate the similarity between two datasets on the basis of complexity distributions?

To answer RQ4, we prepared several datasets that differ in their properties, as described in Sec. V. We first determined whether the means of the complexities of the inputs in the training dataset and those in the test dataset match using Welch's t-test. We expect the hypothesis "two populations have equal means" to be rejected when the properties of the training dataset and the test dataset differ.

We then attempted to quantify the difference between two datasets using the Bhattacharyya distance between their two histograms of complexities. The calculated distance is expected to be greater when the properties of the two datasets differ than when they are similar.

V. EXPERIMENTAL SETUP

This section describes the detailed configurations of the datasets we prepared for evaluation, the target model, and the parameter settings.¹

¹ The code used to calculate complexities is available at https://github.com/nttdata-rdh/complexity

A. Datasets and Model

TABLE I
NUMBER OF IMAGES IN TRAINING (TEST) DATASETS USED IN THE EXPERIMENTS

Dataset | Original      | Modified: Brightness | Modified: Contrast | Modified: Saturation
(a)     | 50000 (10000) |                      |                    |
(b)     | 40000 (8000)  |                      |                    |
(c)     | 16666 (10000) | 33334 (20000)        |                    |
(d)     | 16666 (10000) |                      | 33334 (20000)      |
(e)     | 16666 (10000) |                      |                    | 33334 (20000)

We prepared multiple datasets based on CIFAR10 [19], which is widely used for verifying image-classification models (Table I). CIFAR10 consists of 60000 images of 32 × 32 pixels (50000 as training data and 10000 as test data), each representing one of ten labels (e.g., automobile). We created datasets with various properties by extending or reducing the training data in the original dataset.
Specifically, we prepared the following four extra datasets from the original training data (dataset (a)): a dataset without images labeled as cat or truck (dataset (b)),² one with bright and dark images (dataset (c)), one with high- and low-contrast images (dataset (d)), and one with high- and low-saturation images (dataset (e)). We also prepared four extra datasets from the original test data (dataset (a′)) in the same manner (denoted by datasets (b′)–(e′)). We processed some of the original images using the opencv library [20] in Python to prepare the additional images required to create datasets (c)–(e) and (c′)–(e′). Specifically, we added a constant in RGB space to change the brightness, multiplied the RGB values by a constant to change the contrast, and multiplied the saturation by a constant in HSV space to change the saturation. Note that an obtained additional image replaces the corresponding original image when we extend the training dataset, to prevent a dataset from containing almost identical images.

² These labels were selected in accordance with their ease of classification. In a preliminary experiment, the classification accuracy was the lowest for cat and highest for truck.
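A sketch of the image modifications described above, assuming RGB images stored as float32 values in [0, 1] (the paper does not state the pixel representation); the parameter values correspond to those listed in Sec. V-B.

```python
import cv2
import numpy as np

def modify_image(img, brightness=0.0, contrast=1.0, saturation=1.0):
    """Apply the brightness, contrast, or saturation edit of Sec. V-A to one RGB
    image given as float32 in [0, 1]; neutral defaults leave that property untouched."""
    out = np.clip(img.astype(np.float32) + brightness, 0.0, 1.0)  # constant added in RGB space
    out = np.clip(out * contrast, 0.0, 1.0)                       # RGB values scaled by a constant
    hsv = cv2.cvtColor(out, cv2.COLOR_RGB2HSV)                    # float32: H in [0, 360], S and V in [0, 1]
    hsv[..., 1] = np.clip(hsv[..., 1] * saturation, 0.0, 1.0)     # saturation scaled in HSV space
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

# e.g., a bright variant for dataset (c): bright_img = modify_image(x, brightness=0.4)
```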
We used a ConvNet with 12 layers for the experiment, as in a previous study [11]. The first six layers are convolutional layers, and the others are dropout or dense layers (Fig. 3).

Fig. 3. Model architecture used in the experiments.

B. Parameters

We set the number of bases corresponding to the number of features, Φ, and the number of histogram bins used in the experiments for RQ4 to 50 and 100, respectively. We used the Kullback-Leibler divergence

d(P, Q) := Σ_i P(i) log( P(i) / Q(i) )   (6)

as the distance d(·, ·) in Eq. (2). We set the parameters for increasing (decreasing) the brightness, contrast, and saturation in the preparation of the extra datasets to +0.4 (−0.4), 2.0 (0.5), and 3.0 (0), respectively.

VI. RESULTS AND DISCUSSION

We evaluated the proposed method to answer the RQs mentioned in Sec. IV on the basis of the configurations and settings described in Sec. V. This section describes the experimental results and provides discussion.³

³ All results can be found at https://nttdata-rdh.github.io/complexity.

A. Feature Extraction (RQ1)

Fig. 4. Example of training data with large weights for some common features obtained in a dense layer (i.e., a deep layer).

Fig. 5. Example of training data with large weights for some common features obtained in a convolution layer (i.e., a shallow layer).

We confirmed that each common feature corresponds to a certain concept by visualizing the images in which that feature has a large weight. Fig. 4 shows the results for common features obtained in layer activation_6. Each row corresponds to one feature. While all of these features appear in the activation traces of images labeled as horse, each feature represents a different concept: feature #19 corresponds to a full-body shot, feature #21 corresponds to a head image, and feature #46 corresponds to a horse being ridden.

We also confirmed that different concepts are obtained in the convolution layers. As shown in Fig. 5, each feature obtained in activation_1 represents a global characteristic of the image: feature #4 corresponds to images with black frames, and features #25, #26, and #37 correspond to blue, red, and green images, respectively, regardless of their labels. Features obtained from the other convolution layers show a similar tendency.

We therefore answer RQ1 as follows: Yes. The common features describe concepts of the data that we can distinguish, and the concepts depend on the depth of the layer from which the features are obtained. While the model obtained human-understandable features in this experiment, the features obtained in much more complex tasks or models can be complex and difficult to interpret even though they are important for the task.

B. Complexities of Test Dataset (RQ2)

Fig. 6 shows the change in prediction accuracy for the test datasets. The three lines correspond to how the test data were added: the red, blue, and gray lines show ascending, descending, and random order of complexities, respectively. The final accuracy was the same (78.92%) regardless of how the test data were added, since the entire test dataset is evaluated at that point. We therefore examined the differences in the process of reaching that point. From the accuracy curves, the accuracy gradually decreases if we start with a low complexity and add test data with higher complexities, and vice versa (i.e., the accuracy gradually increases if we start with a high complexity and add test data with lower complexities).

Fig. 6. Relationship between prediction accuracy and complexities when a
model is trained on dataset (a). The blue (red) line shows the accuracy change
as the test data are added in ascending (descending) order of complexities.

Fig. 8. Example of training data with high complexities. Images with red frames indicate suspicious data determined by one of the authors.

Fig. 7. Proportion of labels in test data with top 10% complexities: green and
orange bars indicate the results using a model trained on the dataset (a) and
the dataset (b), respectively. Labels shown in red are excluded in the dataset
(b).
Fig. 9. Examples of suspicious inputs.

We therefore conclude that it is more difficult to make an inference on data with higher complexities.

Fig. 7 shows the proportion of labels in the top 10% of test inputs sorted by complexity, calculated using the model trained on the dataset including all labels (dataset (a)) and the model trained on the dataset excluding some labels (dataset (b)). The labels excluded in dataset (b) are shown in red in this figure. Labels with a high proportion can be regarded as difficult labels, since it is difficult to make an inference on an input with a high complexity, as mentioned in the previous paragraph. The proportions of the excluded labels were obviously higher when the model was trained on dataset (b) than when it was trained on dataset (a). Therefore, we can say that an input unfamiliar to the model has a higher complexity and that it is effective to test the model on inputs with a high complexity to detect weak points of the model.

We therefore answer RQ2 as follows: Yes. It is difficult to make an inference on data with high complexities; therefore, unfamiliar data are detectable among the data with relatively higher complexities. Since a complexity can be calculated without any labels, we can use unlabeled datasets to determine weak points of the model.

C. Complexities of a Training Dataset (RQ3)

Fig. 8 shows the ten inputs with the highest complexities in the training dataset for each label. Images with red frames indicate manually annotated suspicious data. For comparison, we randomly chose 50 inputs from the training dataset for each label and manually annotated suspicious data among them as well³. Examples of suspicious images are shown in Fig. 9. The ratio of suspicious images was 14% (14 out of 100 images) among the images with the highest complexities, while it was 1.8% (9 out of 500 images) among the randomly chosen images. This result indicates that suspicious images can be found more often among images with higher complexities than among randomly chosen images. Note that not every image with a high complexity is suspicious, since normal images are also included among the images with higher complexities.

We therefore answer RQ3 as follows: Partly yes. Inputs with high complexities in a training dataset are likely to be suspicious, although such inputs are not always obviously suspicious, at least from the human viewpoint.

D. Similarity Estimation of Datasets (RQ4)

We tested whether two populations, i.e., the complexities of the inputs in a training dataset and those in a test dataset, have equal means, using various pairs of training and test datasets.

Fig. 10. Histograms of complexities calculated for various inclusion relations between training dataset and test datasets. (a) Model trained on dataset (a). (b) Model trained on dataset (b).

TABLE II
p-VALUES OBTAINED FROM WELCH'S t-TEST

                     Training = Test                    Training ⊂ Test             Training ⊃ Test
Training dataset:  (a)    (b)    (c)    (d)    (e)    (b)    (a)    (a)    (a)    (a)    (c)    (d)    (e)
Test dataset:      (a′)   (b′)   (c′)   (d′)   (e′)   (a′)   (c′)   (d′)   (e′)   (b′)   (a′)   (a′)   (a′)
activation_1       0.914  0.868  0.478  0.654  0.595  0      0      0      0      0      0      0      0
activation_3       0.723  0.023  0.258  0.107  0.423  0.001  0      0      0      0      0.009  0.002  0.531
activation_5       0      0      0      0      0      0      0      0      0      0      0      0.003  0
activation_7       0      0      0      0      0      0      0      0      0      0      0      0      0

Table II shows the p-values obtained from Welch's t-test between these two populations. Histograms of the complexities calculated for test datasets (a)–(c), based on the model trained on training dataset (a) or (b), are shown in Fig. 10. The hypothesis "two populations have equal means" is rejected at a significance level of 0.05 when the complexities are calculated using shallow layers (activation_1 and activation_3) if the situations of the two datasets differ, with a few exceptions. These results indicate the effectiveness of complexities for detecting the difference between two datasets when they are calculated on shallow layers. However, this hypothesis is always rejected for the complexities calculated using deep layers (activation_5 and activation_7). We assume that complexities take higher values as a whole in deep layers because the typical activation patterns cannot be fully captured by a limited number of features (Φ = 50 in this experiment). In fact, the density of inputs with higher complexities in the test datasets was always higher than that in the training datasets (Fig. 10).

We show the Bhattacharyya distance between the two histograms of complexities in the bottom right of each histogram in Fig. 10; it indicates that the Bhattacharyya distance tends to be large when the situations of the two datasets differ. However, this distance is also large on deep layers even when the situations of the two datasets are similar, as is the case with Welch's t-test mentioned above.

Considering these results, we answer RQ4 as follows: Partly yes. We can determine whether two datasets are similar using complexities obtained on shallow layers, but it is difficult to do so using complexities obtained on deep layers.

VII. THREATS TO VALIDITY

The primary threat to the external validity of this study is the type of input data and task. We evaluated the proposed method only on images and only on a classification task. Our method is, however, designed not only for images, since we focus on how the model treats data regardless of the data type. Therefore, we believe this method is applicable to other types of data and tasks as long as the model takes an input and returns any type of prediction, e.g., regression or detection. The obtained features might be difficult to interpret if we use input data other than images. Thus, there is room for inspecting the features obtained from such data with the proposed method. Regarding the dataset, we only used CIFAR-10, which is a relatively small dataset, and did not conduct experiments on larger datasets. If we want to apply the proposed method to a much larger training dataset, we should consider alternatives for feature extraction, since the computational cost of NMF for extracting features increases as the number of inputs in the training dataset increases. As the training dataset is used only to extract typical features, one possible solution is to sample some inputs from the entire training dataset to create a sub-dataset of tractable size and use it in the feature-extraction step. On the other hand, we can apply this method to larger test datasets, since the complexity can be calculated independently for each input in a test dataset once the features are extracted from the training dataset.
The other threat to external validity is the variation of situations. We should consider concepts other than the difference in label or in image features such as brightness, contrast, and saturation. We plan to extend the situations, referring to [21], to include, e.g., blur, fog, and rain, to determine whether the proposed method is also useful in detecting such concepts.

The threat to internal validity is, on the other hand, the experimental design and parameter settings. Regarding the experimental design, we randomly chose 250 inputs in the experiment for RQ3, since it is difficult to determine suspicious data from the entire training dataset. Moreover, suspicious data were marked by one of the authors, which can result in a willful interpretation of the data. We therefore disclosed the images used in this experiment and the images marked as suspicious on our website³ to reduce such threats and maintain verifiability. We plan to ask more participants other than the authors to determine suspicious data to reduce sampling bias and preconception. Regarding the parameter settings, it is difficult to estimate the appropriate number of features. We used 50 as the number of features, Φ, for the experiments since the task was classification between ten classes and we expected five subclasses in each class. We plan to extend the proposed method to set this parameter automatically using Gamma-Process NMF [22], which is capable of estimating the number of features and the features themselves simultaneously on the basis of the observed data in the feature-extraction step.

VIII. CONCLUSION

We proposed a method of evaluating the complexity of an input, without a label, for a given trained model, for analyzing a dataset to find inputs unfamiliar to the model efficiently and for estimating the similarity between datasets. We first extract typical activation patterns in a training dataset as common features on the basis of the output of activation layers and then estimate the complexity of an input as the gap between a combination of the common features and the activation trace of the input. We discussed the procedure for determining whether two datasets are similar on the basis of the complexity distributions calculated on the inputs in each dataset. Evaluation experiments showed the effectiveness of the method in detecting a suspicious input in a training dataset and an input on which it is difficult to make an inference. We also showed that the similarity of datasets can be measured using the complexities calculated on shallow layers. Since the complexity can be calculated for any input as long as a training dataset and a trained model are available, we can make use of any dataset, even without labels, to obtain additional insights into the model. This leads to more effective detection of weak points of a model using datasets that have been difficult to use so far due to the lack of labels. We plan to evaluate the effectiveness of the proposed method on other data types and tasks as future work. We are also interested in how its performance changes if we use a different number of features.

REFERENCES

[1] S. Kuutti, R. Bowden, Y. Jin, P. Barber, and S. Fallah, "A survey of deep learning applications to autonomous vehicle control," IEEE Trans. Intell. Transp. Syst., 2020.
[2] J. Zhang, Y. Xie, Y. Li, C. Shen, and Y. Xia, "Covid-19 screening on chest x-ray images using deep learning based anomaly detection," arXiv preprint arXiv:2003.12338, 2020.
[3] A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. Nelson, A. Bridgland et al., "Protein structure prediction using multiple deep neural networks in the 13th critical assessment of protein structure prediction (CASP13)," Proteins: Structure, Function, and Bioinformatics, vol. 87, no. 12, pp. 1141–1148, 2019.
[4] D. Hendrycks and K. Gimpel, "A baseline for detecting misclassified and out-of-distribution examples in neural networks," in Proc. of the 5th International Conference on Learning Representations, 2017.
[5] A. Singhvi and K. Russell. (2016) Inside the self-driving Tesla fatal accident. [Online]. Available: https://www.nytimes.com/interactive/2016/07/01/business/inside-tesla-accident.html
[6] C.-H. Cheng, C.-H. Huang, and H. Yasuoka, "Quantitative projection coverage for testing ML-enabled autonomous systems," in Proc. of the 16th Automated Technology for Verification and Analysis. Springer, 2018, pp. 126–142.
[7] R. Ashmore and M. Hill, "Boxing clever: Practical techniques for gaining insights into training data and monitoring distribution shift," in 37th International Conference, SAFECOMP. Springer, 2018, pp. 393–405.
[8] K. Pei, Y. Cao, J. Yang, and S. Jana, "DeepXplore: Automated whitebox testing of deep learning systems," in Proc. of the 26th Symposium on Operating Systems Principles, 2017, pp. 1–18.
[9] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu et al., "DeepGauge: Multi-granularity testing criteria for deep learning systems," in Proc. of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 120–131.
[10] Y. Sun, X. Huang, D. Kroening, J. Sharp, M. Hill, and R. Ashmore, "Structural test coverage criteria for deep neural networks," ACM Transactions on Embedded Computing Systems, vol. 18, no. 5s, pp. 1–23, 2019.
[11] J. Kim, R. Feldt, and S. Yoo, "Guiding deep learning system testing using surprise adequacy," in Proc. of the 41st International Conference on Software Engineering, 2019, pp. 1039–1049.
[12] F. Branchaud-Charron, A. Achkar, and P.-M. Jodoin, "Spectral metric for dataset complexity assessment," in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3215–3224.
[13] S. Swayamdipta, R. Schwartz, N. Lourie, Y. Wang, H. Hajishirzi, N. A. Smith, and Y. Choi, "Dataset cartography: Mapping and diagnosing datasets with training dynamics," in Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 9275–9293.
[14] S. Mani, A. Sankaran, S. Tamilselvam, and A. Sethi, "Coverage testing of deep learning models using dataset characterization," arXiv preprint arXiv:1911.07309, 2019.
[15] S. Liang, Y. Li, and R. Srikant, "Enhancing the reliability of out-of-distribution image detection in neural networks," in Proc. of the 6th International Conference on Learning Representations, 2018.
[16] Z. Li and D. Hoiem, "Improving confidence estimates for unfamiliar examples," in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2686–2695.
[17] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in Proc. of the 33rd International Conference on Machine Learning, 2016, pp. 1050–1059.
[18] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, "Better mixing via deep representations," in Proc. of the 30th International Conference on Machine Learning. PMLR, 2013, pp. 552–560.
[19] A. Krizhevsky, "Learning multiple layers of features from tiny images," Tech. Rep., 2014.
[20] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[21] Y. Tian, K. Pei, S. Jana, and B. Ray, "DeepTest: Automated testing of deep-neural-network-driven autonomous cars," in Proc. of the 40th International Conference on Software Engineering, 2018, pp. 303–314.
[22] M. D. Hoffman, D. M. Blei, and P. R. Cook, "Bayesian nonparametric matrix factorization for recorded music," in Proc. of the 27th International Conference on Machine Learning, 2010, pp. 439–446.

