

Detection of Iterative Adversarial Attacks via Counter Attack

Matthias Rottmann∗, Mathis Peyron†, Nataša Krejić‡ and Hanno Gottschalk∗

∗ University of Wuppertal, School of Mathematics and Natural Sciences, {rottmann,hgottsch}@uni-wuppertal.de
† ENSEEIHT Toulouse, Department of HPC and Big Data, mathis.peyron@etu.enseeiht.fr
‡ University of Novi Sad, Faculty of Sciences, natasa@dmi.uns.ac.rs

Preprint, September 2020

Abstract

Deep neural networks (DNNs) have proven to be powerful tools for processing unstructured data. However, for high-dimensional data like images, they are inherently vulnerable to adversarial attacks. Small, almost invisible perturbations added to the input can be used to fool DNNs. Various attacks, hardening methods and detection methods have been introduced in recent years. Notoriously, Carlini-Wagner (CW) type attacks computed by iterative minimization belong to those that are most difficult to detect. In this work, we demonstrate that such iterative minimization attacks can be used as detectors themselves. Thus, in some sense we show that one can fight fire with fire. This work also outlines a mathematical proof that under certain assumptions this detector provides asymptotically optimal separation of original and attacked images. In numerical experiments, we obtain AUROC values of up to 99.73% for our detection method. This distinctly surpasses state-of-the-art detection rates for CW attacks from the literature. We also give numerical evidence that our method is robust against the attacker's choice of the method of attack.

1 Introduction

For many applications, deep learning has been shown to outperform competing machine learning approaches by far [18, 36, 16]. Especially when working with high-dimensional input data like images, deep neural networks (DNNs) show impressive results. As discovered by Szegedy et al. [38], this remarkable performance comes with a downside. Very small (noise-like) perturbations added to an input image can result in incorrect predictions with high confidence [38, 12, 28, 3, 27]. Such adversarial attacks are usually crafted by performing a constrained gradient-descent procedure with respect to the input data in order to change the class predicted by the DNN and at the same time modify the image by the least possible measured amount. The new class can be either a class of choice (targeted attack) or an arbitrary but different one (untargeted attack). Many other types of attacks such as the fast signed gradient method [12] and DeepFool [28] have been introduced, but these methods do not fool DNNs reliably.

Carlini & Wagner (CW) extended [38] with a method that reliably attacks deep neural networks while controlling important features of the attack like sparsity and maximum size over all pixels, see [3]. This method aims at finding a targeted or an untargeted attack where the distance between the attacked image and the original one is minimal with respect to a chosen ℓp distance. Distances of choice within CW-type frameworks are mostly ℓp with p = 0, 2, ∞. Mixtures of p = 1, 2 have also been proposed by some authors [5]. Except for the ℓ0 distance, these attacks are perceptually extremely hard to detect, cf. figures 1 and 2. Minimization of the ℓ0 distance minimizes the number of pixels changed, but it also changes these pixels maximally, and the resulting spikes are easy to detect. This changes for p > 0.

Typically, one distinguishes between three different attack scenarios:

• white box attack: the attacker has access to the DNN's parameters / the whole framework including defense strategies.
• black box attack: the attacker does not know the DNN's parameters / the whole framework including defense strategies.
• gray box attack: in-between white and black box, e.g. the attacker might know the framework but not the parameters used.

The CW attack currently is one of the most efficient white-box attacks.

Defense Methods. Several defense mechanisms have been proposed to either harden neural networks or to detect adversarial attacks. One such hardening method is the so-called defensive distillation [29]. With a single retraining step this method provides strong security, however not against CW attacks. Training for robustness via adversarial training is another popular defense approach, see e.g. [12, 25, 41, 19, 26]. See also [31] for an overview regarding defense methods.

Most of these methods are not able to deal with attacks based on iterative minimization like CW attacks. In a white box setting the hardened networks can still be subject to fooling.
Detection Methods. There are numerous detection methods for adversarial attacks. In many works, it has been observed and exploited that adversarial attacks are less robust to random noise than non-attacked clean samples, see e.g. [1, 42, 39, 32]. This robustness issue can be utilized by detection and defense methods. In [32], a statistical test is proposed for a white-box setup. Statistics are obtained under random corruptions of the inputs and observing the corresponding softmax probability distributions. In [39], noise injection is combined with a cryptography component by adding key-defined randomization to an ensemble of networks at both training and test time. The authors assume in a gray-box setup that the networks, attacks as well as the whole framework are known to the attacker, however the keys are not. Another popular approach to detecting adversarial examples is based on filtering certain frequencies or detecting them right in the input image. An adaptive noise reduction approach for detection of adversarial examples is presented in [21]. JPEG compression ([30, 23]), similar input transformations ([15]) and other filtering techniques ([20]) have been demonstrated to filter out many types of adversarial attacks as well. In [10], adversarial images are detected based on Bayesian uncertainty. It has also been demonstrated that image transformations such as rotations can be used [40] for filtering adversarial attacks. In [22], detection is performed from a steganalysis point of view by calculating filter statistics; the dependence between pixels is then modeled via higher order Markov chains. In [46] it was discovered that eigenvectors corresponding to large eigenvalues of the Fisher information metric yield adversarial attacks and that adversarial attacks in turn can be detected by means of their spectral properties. Apart from that, also saliency maps can be used to detect adversarial examples, see [44].

Some approaches also work with several images or sequences of images to detect adversarial images, see e.g. [6, 14]. Approaches that not only take input or output layers into account, but also hidden layer statistics, are introduced in [4, 47]. Auxiliary models trained for robustness and equipped for detecting adversarial examples are presented in [45].

Recently, GANs have been demonstrated to defend DNNs against adversarial attacks very well, see [33]. This approach, called defense-GAN, iteratively filters out the adversarial perturbation. The filtered result is then presented to the original classifier.

Semantic concepts in image classification tasks preselect image data that only cover a tiny fraction of the expressive power of RGB images. Therefore training data sets can be embedded in lower dimensional manifolds; DNNs are only trained with data within the manifolds and behave arbitrarily in perpendicular directions. These are also the directions used by adversarial perturbations. Thus, the manifold distance of adversarial examples can be used as a criterion for detection, as in [17].

As opposed to many of the adversarial training and hardening approaches, most of the works mentioned in this paragraph are able to detect CW attacks with AUROC values up to 99% [24]. For an overview we refer to [43].

Our contribution. In this paper we present a novel approach that uses Carlini & Wagner's attack framework [3] not only to generate attacks, but also as a very efficient detector for CW white box attacks. This statement holds for any CW-type attack generated by an iterative minimization process. In this sense, we fight fire with fire.

A CW attack aims at finding the smallest perturbation in a given ℓp norm such that a given image is pushed across the closest decision boundary. Applying another CW attack to the perturbed image moves this image back across the same decision boundary. Importantly, this second attack oftentimes generates much smaller perturbations. It is therefore precisely the optimality of the CW attack that leaves a treacherous footprint in the attacked data: the more optimal the perturbation is, the easier it is to detect.

For this mechanism, we outline a mathematical proof (for the untargeted ℓ2 case) that in the limit where the number of iterations of the original CW attack tends to infinity, the counter attack method separates attacked and non-attacked images with an area under the receiver operator characteristic curve (AUROC) tending to one. We therefore provide, for the first time, a detection method for the CW ℓ2 attack that is provably efficient. This result is based on the mathematical characterization of the stationary points and convergence of the CW attack, which is new by itself. We also demonstrate that this (asymptotic) mathematical statement passes the numerical tests to a high extent. To this end, we threshold on the ℓp norms of the computed perturbations of the first and second attack and find that statistical separation is possible with AUROC values well above those of many competing detection methods. This finding holds for both targeted and untargeted attacks. We show results for CIFAR10 and ImageNet and obtain AUROC values of up to 99.64% for CIFAR10 and 99.73% for ImageNet when using the ℓ2 norm. Our experiments also show that attacks based on different ℓp norms can be detected very well, thus we can reliably detect iterative minimization attacks without knowing the actual parameter p of the original attack.
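To fix ideas before the formal treatment in section 2, the following minimal Python sketch spells out the detection rule evaluated later: run one more CW-type attack on an incoming image and threshold on the norm of the perturbation it finds. The function cw_attack is a placeholder for any iterative minimization attack (for instance the framework of [3]), and the threshold is assumed to be calibrated on held-out clean images; neither name is part of the original implementation.

    import numpy as np

    def counter_attack_detect(x, cw_attack, threshold, p=2):
        """Flag x as 'attacked' if a CW-type counter attack finds a very small
        perturbation, i.e. if x already lies suspiciously close to a decision
        boundary.

        cw_attack(x, p) is a hypothetical wrapper around an iterative
        minimization attack returning an adversarial example for x; the
        threshold is assumed to be calibrated on held-out clean images.
        """
        x_adv = cw_attack(x, p)                                    # second (counter) attack
        perturbation = (np.asarray(x_adv) - np.asarray(x)).ravel()
        return np.linalg.norm(perturbation, ord=p) < threshold     # small norm -> likely attacked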
Figure 1: An illustration of attacked CIFAR10 images (rows: ℓ0, ℓ2, ℓ∞). (left): the input image, (center): the added noise and (right): the resulting adversarial image.

Figure 2: An illustration of attacked ImageNet images (rows: ℓ0, ℓ2, ℓ∞). (left): the input image, (center): the added noise and (right): the resulting adversarial image.

Related work. Unlike in [1, 42, 39, 32, 40], we do not apply any additional perturbation or transformation to the inputs of DNNs to detect adversarial examples. We also do not filter noisy signals in images like [21]. We detect adversarial attacks by measuring the norm of an adversarial perturbation obtained by a second attack and afterwards thresholding on it. Thus, the works closest to ours can be considered those that are based on distance measures, see [4, 47, 17]. However, [4, 47] work on the level of feature maps in hidden layers. All three works do not use a second adversarial attack to detect an attack. In spirit, [17] can be regarded as closest to our approach. The authors measure the distance from a manifold containing the data; therefore a model of the manifold is learned. Our method can be interpreted as thresholding on the distance that is required to find the closest decision boundary outside the given manifold that contains the image data. Unlike in [17], our approach does not require a model of the manifold and is rather based on the intrinsic properties of CW attacks and other iterative minimization attacks. During the finalization of this paper, we found a work [48] that also pursues a counter attack idea. However, in this work distance is considered in terms of the Kullback-Leibler divergence of softmax probabilities, while we consider different norms on the input space. The authors also trained a model with adversarial examples, whereas we only threshold on distance measures. Furthermore, we provide a general theoretical statement that our method is provably efficient. We also conducted tests with high definition ImageNet2012 data; the numerical results for CIFAR10 are in the same range.

Organization of this work. The remainder of this work is structured as follows: In section 2 we introduce our detection method and provide a theoretical statement that under certain assumptions a second identical attack performed on an already attacked image is an ideal detector for a given CW attack with norm ℓp. In section 3 we demonstrate that this finding also applies to numerical experiments. We demonstrate that once and twice attacked images can be well separated by measuring the distances between the attack's input and output in the ℓp norm. For our evaluation we use the CIFAR10 test set as well as the ImageNet test set. We also show that any CW attack with a norm ℓp can be detected reliably by performing another CW attack with a different norm ℓp′. We also show that our approach works well for both targeted and untargeted attacks.

2 Detection by Counter Attack

Let C = {1, . . . , c} denote the set of 2 ≤ c ∈ N distinct classes and let I = [0, 1] denote the closed unit interval. An image is an element x ∈ I^n. Let ϕ : I^n → I^c be a continuous function given by a DNN, almost everywhere differentiable, that maps x to a probability distribution y = ϕ(x) ∈ I^c with Σ_{i=1}^c y_i = 1 and y_i ≥ 0. Furthermore, κ : I^n → C ∪ {0} denotes the map that yields the corresponding class index, i.e.,

    κ(x) = i if y_i > y_j for all j ∈ C \ {i}, and κ(x) = 0 else.   (1)

This is a slight modification of the arg max function: K_i = {x ∈ I^n : κ(x) = i} for i > 0 gives the set of all x ∈ I^n predicted by ϕ to be a member of class i, and for i = 0 we obtain the set of all class boundaries with respect to ϕ.

Given x ∈ R^n, the ℓp "norm" is defined as follows:
• for p ∈ N: ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^(1/p),
• for p = 0: ‖x‖_0 = |{i : x_i ≠ 0}|,
• for p = ∞: ‖x‖_∞ = max_i |x_i|.
The corresponding ℓp distance measure is given by dist_p(x, x′) = ‖x − x′‖_p and the n-dimensional open ℓp-neighborhood with radius ε and center point x_0 ∈ R^n is defined as B_p(x_0, ε) = {x ∈ R^n : dist_p(x, x_0) < ε}.

The CW Attack. For any image x_0 ∈ I^n the Carlini & Wagner (CW) attack introduced in [3] can be formulated as the following optimization problem:

    x_k = argmin_{x ∈ R^n} dist_p(x_0, x)   s.th.   κ(x) ≠ κ(x_0),   x ∈ I^n.   (2)

In order to obtain κ(x_0) ≠ κ(x), relaxations given by a choice of differentiable terms f(x) are introduced. The box constraint x ∈ I^n is enforced by a simple projection, i.e., clipping the gradient. Both loss functions are jointly minimized in a scalarized setting, i.e.,

    F(x) := dist_p(x_0, x) + a f(x) → min.   (3)

In their experiments, Carlini & Wagner perform a binary search for the constant a such that κ(x) ≠ κ(x_0) is almost always satisfied. For further algorithmic details we refer to [3].
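For reference, the quantities defined above translate directly into code. The sketch below is our own illustration, not the authors' implementation: it implements the modified arg max κ of equation (1) and the ℓp distances, assuming probs is the softmax output ϕ(x) given as a NumPy vector.

    import numpy as np

    def kappa(probs, tol=0.0):
        """Modified arg max of eq. (1): returns the class index i (1-based, as in
        C = {1, ..., c}) if it is a strict maximizer, and 0 on class boundaries."""
        i = int(np.argmax(probs))
        others = np.delete(probs, i)
        return i + 1 if np.all(probs[i] > others + tol) else 0

    def dist_p(x, x_prime, p):
        """ℓp distance dist_p(x, x') for p in {0, 2, inf} (and any integer p >= 1)."""
        d = (np.asarray(x) - np.asarray(x_prime)).ravel()
        if p == 0:
            return float(np.count_nonzero(d))       # number of changed entries
        if p == np.inf:
            return float(np.max(np.abs(d)))
        return float(np.sum(np.abs(d) ** p) ** (1.0 / p))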
Two illustrations of attacked images are given in figures 1 and 2. In particular for the ImageNet dataset in figure 2, even the perturbations themselves are imperceptible.

The CW Counter Attack. The goal of the CW attack is to estimate the distance of x_0 to the closest class boundary via dist_p(x_0, x_k). The more eagerly the attacker minimizes the perturbation x_k − x_0 in a chosen p-norm, the more likely it is that x_k is close to a class boundary. Hence, when estimating the distance of x_k to the closest class boundary by performing another CW attack which yields an x_{k,j}, it is likely that

    dist_p(x_k, x_{k,j}) ≪ dist_p(x_0, x_k),   (4)

cf. figure 3. This motivates our claim that the CW attack itself is a good detector for CW attacks. An image x_0 ∈ I^n that has not been attacked is likely to have a greater distance to the closest class boundary than an x_k ∈ I^n which has already been subject to a CW attack. In practice, it cannot be guaranteed that the CW attack finds a point x_k which is close to the closest boundary. However, an interesting statement can be made, on which we elaborate in the following paragraph.

Figure 3: Sketch of the action of the counter attack: x∗ is the stationary point of the first attack, x_k the final iterate of the first attack and x_{k,j} the final iterate of the second attack.

Theoretical Considerations. In this paragraph we outline the proof of the statement that asymptotically, i.e., for very eager attackers, the counter attack tends to detect all attacked samples with probability 1. However, a clear and formal proof of all aspects regarding this statement is beyond the scope of this paper and therefore left for future work.

We assume that p = 2 and fix a choice for f which is

    f(x) = max{ Z_t(x) − max_{i≠t} Z_i(x), 0 }   (5)

where Z denotes the neural network ϕ but without the final softmax activation function. Note that equation (5) is a construction for an untargeted attack. Choosing a penalty term for a targeted attack does not affect the arguments provided in this section. Furthermore, let

• t := κ(x_orig) denote the original class,
• F := {x ∈ I^n : κ(x) ≠ t} the feasible region,
• NF := {x ∈ I^n : κ(x) = t} the infeasible region,
• ∂F = ∂NF the class boundary between F and NF.

It is common knowledge that I^n can be decomposed into a finite number of polytopes Q := {Q_j : j = 1, . . . , s} such that Z is affine linear on each Q_j, see [8, 35]. In [8], each of the polytopes Q_j is expressed by a set of linear constraints, therefore being convex.

For a locally Lipschitz function g, the Clarke generalized gradient is defined by

    ∇_C g(x) := conv{ lim_{i→∞} ∇g(x_i) : x_i → x and ∇g(x_i) exists }.   (6)

Let S be the stationary set of problem (2),

    S = {x ∈ I^n : 0 ∈ ∇_C F(x) + N_{I^n}(x)}   (7)

where ∇_C F(x) is the set of all generalized gradients and N_{I^n}(x) is the normal cone to the set I^n at point x. In general, computing the set of generalized gradients is not an easy task, but given the special structure of f, which is piecewise linear, it can be done relatively easily in this case. Namely, the set ∇_C f(x) is a convex hull of the gradients of the linear functions that are active at the point x, [34]. Therefore, by [7, Corollary 2, p. 39], the generalized gradient G(x) of F(x) has the form

    G(x) = {a g(x) + 2(x − x_orig) : g(x) ∈ ∇_C f(x)}.   (8)

The projected generalized gradient method [37] is defined as follows. Let P(x) denote the orthogonal projection of x onto I^n. Given the learning schedule {α_k} such that

    lim_{k→∞} α_k = 0   and   Σ_{k=1}^∞ α_k = ∞,   (9)

the iterative sequence is generated as

    x_{k+1} = P(x_k − α_k G_k),   G_k ∈ ∇_C F(x_k),   k = 0, 1, . . .   (10)

In this paper we add the following condition for the learning rate,

    Σ_{k=1}^∞ α_k² < ∞,   (11)

which strengthens the conditions (9) and does not alter the statements from [37].
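As an illustration of the iteration (10) under the schedule conditions (9) and (11), the following sketch performs projected gradient steps on the box I^n. The gradient oracle grad_F is an assumption standing in for one element of ∇_C F(x_k); the schedule α_k = α_0 / (k + 1) satisfies both (9) and (11).

    import numpy as np

    def projected_generalized_gradient_descent(x0, grad_F, num_steps=1000, alpha0=0.01):
        """Iteration (10): x_{k+1} = P(x_k - alpha_k G_k), with P the orthogonal
        projection onto I^n = [0, 1]^n.

        grad_F(x) is assumed to return one generalized gradient G_k in ∇_C F(x);
        alpha_k = alpha0 / (k + 1) fulfills sum alpha_k = inf and sum alpha_k^2 < inf,
        i.e. conditions (9) and (11).
        """
        x = np.clip(np.asarray(x0, dtype=float), 0.0, 1.0)
        for k in range(num_steps):
            alpha_k = alpha0 / (k + 1)
            G_k = grad_F(x)
            x = np.clip(x - alpha_k * G_k, 0.0, 1.0)   # projection P onto the box I^n
        return x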
It can be shown that every accumulation point of the sequence {x_k} generated by (10) converges to a stationary point of F. The proof of this theorem can be found in [37] and only requires that the set S contains only isolated points from I^n, which can be shown with moderate effort by exploiting the piecewise linear structure of f.

Now, considering an iterate x_k ∈ F, it is also clear that reducing F(x_k) means decreasing the distance of x_k to x_0, provided α_{k+1} is sufficiently small. Therefore, x_k ∈ F cannot be a stationary point. Assume that the CW attack converges to x∗ and is successful, i.e., x∗ ∉ NF. This implies x∗ ∈ ∂F. Summarizing this, any successful CW attack converges to a stationary point on the boundary ∂F of the feasible set.

Let us now consider the counter attack and therefore let F^(2) ⊂ NF^(1) := NF denote the counter attack's feasible region and NF^(2) ⊂ F^(1) := F the infeasible region. We seek to minimize the functional

    F^(2)(x) := dist_p(x_k, x)^p + b f^(2)(x)   (12)

where x∗ is the stationary point that we are approaching in the first iterative attack. For now, let p = 2. Let {Q_1, . . . , Q_s} be the set of all polytopes that fulfill x∗ ∈ Q_i, i = 1, . . . , s. We assume that the first attack is successful and that there exists an iterate x_k of the first attack such that dist(x∗, x_k) < ε and furthermore we assume that B(x∗, 3ε) ⊂ ∪_{i=1}^s Q_i.

Theorem 1. Under the preceding assumptions, and further assuming that the counter attack starts at x_k ∈ NF^(2), stops when reaching F^(2) and uses a sufficiently small step size α, the counter attack iteration minimizing F^(2) never leaves B(x∗, 3ε).

Figure 4: Situation present in the counter attack. By bounding cos(θ) from below we show that x reduces its distance to x∗ when performing a gradient descent step under the given assumptions. Therefore, x never leaves B(x∗, 3ε).

Proof. Let x ∈ B(x∗, 3ε) \ B(x∗, 2ε) and let G^(2)(x) ∈ ∇_C F^(2)(x) be any gradient with corresponding g^(2)(x) ∈ ∇_C f^(2)(x), which is fixed for x being a member of a chosen polytope Q_i. We can assume that x ∈ NF^(2)° since we are done otherwise. Furthermore, let v := x_k − x, w := x∗ − x and cos(θ) = v^T w / (‖v‖ ‖w‖), where ‖·‖ = ‖·‖_2. This implies x∗ − x_k = w − v and

    cos(θ) = (‖v‖² + ‖w‖² − ‖w − v‖²) / (2 ‖v‖ ‖w‖) ≥ (4ε² + ε² − ε²) / (2 · 3ε · 4ε) = 1/6.   (13)

Since G^(2)(x) = 2(x − x_k) + b g^(2)(x), we obtain

    G^(2)(x)^T (x − x∗) = 2(x − x_k)^T (x − x∗) + b g^(2)(x)^T (x − x∗) > 2(x − x_k)^T (x − x∗) ≥ 2ε² cos(θ) = ε²/3,   (14)

where b g^(2)(x)^T (x − x∗) > 0. Note that g^(2)(x)^T (x − x∗) > 0 follows from x − x∗ being an ascent direction. This holds due to x ∈ Q_i ∩ NF^(2)°, which implies f^(2)(x) > 0, and the fact that x∗ ∈ Q_i as well with f^(2)(x∗) = 0. Hence, the difference in distance to x∗ when performing a gradient descent step is

    ‖x − x∗‖² − ‖x − x∗ − α G^(2)(x)‖² = 2α G^(2)(x)^T (x − x∗) − α² ‖G^(2)(x)‖² ≥ 2α · ε²/3 − α² C(b),   (15)

where ‖G^(2)(x)‖² ≤ C(b). The latter expression is greater than zero iff the step size α is small enough, i.e., α < 2ε² / (3 C(b)). Hence, the distance to x∗ decreases.

We now take into account the stochastic effects that stem from choosing an arbitrary initial image x ∈ I^n, represented by a random variable X. Let k be the number of iterates of the original CW attack and let X_k ∈ F^(1) be the final iterate after k ≥ 1 iterations, starting at X_0 = X. For any random variables Y, Z ∈ I^n let

    D(Y) = dist(Y, F) = inf_{y ∈ F} dist(Y, y)   and   D^(0,k) = dist(X_0, X_k) ≥ D(X_0),   (16)

provided X_k ∈ F, which means that the k-th iterate is a successful attack. For δ ∈ R, we define the distribution function corresponding to a random variable D (representing a random distance) by F_D(δ) = P(D ≤ δ) = D_*P((−∞, δ]), where the push forward measure is D_*P(B) = P(D^{-1}(B)) for all B in the Borel σ-algebra. The area under the receiver operator characteristic curve of D(Y) and D(Z) is defined by

    AUROC(D(Y), D(Z)) = ∫_R F_{D(Y)}(δ) dD(Z)_*P(δ).   (17)
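In the experiments below, (17) is estimated from finite samples of distances. The following sketch computes such an empirical estimate as the fraction of pairs in which the attacked sample has the smaller distance (ties counted as 1/2, the usual AUROC convention); it is our own illustration and not the evaluation code used for the paper.

    import numpy as np

    def empirical_auroc(d_attacked, d_clean):
        """Empirical version of eq. (17): fraction of pairs (a, c) with a < c,
        counting ties as 1/2, where a is a counter-attack distance of an attacked
        image and c the corresponding distance of a clean image."""
        a = np.asarray(d_attacked, dtype=float)[:, None]
        c = np.asarray(d_clean, dtype=float)[None, :]
        return float(np.mean((a < c) + 0.5 * (a == c)))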
Obviously, it holds that D(Y), D^(0,k) ≥ 0. The following lemma formalizes under realistic assumptions that we obtain perfect separability of D(X) and D(X_k) as we keep iterating the initial CW attack, i.e., k → ∞.

Lemma 2. Let D(X) ≥ 0 with P(D(X) = 0) = 0 and D(X_k) → 0 for k → ∞ weakly by law. Then,

    AUROC(D(X_k), D(X)) → 1   for k → ∞.   (18)

Proof. Let δ_0 be the Dirac measure in 0 with distribution function

    F_{δ_0}(z) = 1 if z ≥ 0, and 0 else.   (19)

By the characterization of weak convergence in law by the Helly-Bray lemma [11, Theorem 3], F_{D(X_k)}(δ) → F_{δ_0}(δ) for all δ where F_{δ_0}(δ) is continuous. This is the case for all δ > 0. As P(D(X) = 0) = 0, this implies F_{D(X_k)} → F_{δ_0}  D(X)_*P-almost surely. Furthermore, it holds that |F_{D(X_k)}(δ)| ≤ 1. Consequently, by Lebesgue's theorem of dominated convergence,

    AUROC(D(X_k), D(X)) = ∫_R F_{D(X_k)}(δ) dD(X)_*P(δ) → ∫_R F_{δ_0}(δ) dD(X)_*P(δ) = ∫_{R_+} dD(X)_*P(δ) = 1   (20)

for k → ∞, since D(X) ≥ 0 by assumption, which concludes the proof.

Let now X be a random input image. As X ∈ NF^(1), it holds that D(X) > 0 almost surely, such that indeed P(D(X) = 0) = 0. Let furthermore X_k be the k-th iterate inside F^(1) of the CW attack starting with X. Given that the attack is successful, we obtain that X_k → X∗ ∈ ∂F for k → ∞, thus

    dist(X_k, X∗) → 0   for k → ∞   (21)

almost surely. Now there exists a learning rate schedule α_{k,j} and a multiplier a such that for all steps X_{k,j} of the counter attack originating at X_k, the distance

    D̄^(k,j) = dist(X_{k,j}, X_k) ≤ 8 dist(X_k, X∗) → 0   for k → ∞   (22)

almost surely. Note that the latter inequality holds since we can choose ε in theorem 1 such that dist(x_k, x∗) > ε/2 and then obtain the constant 8 by using the triangle inequality. Therefore, let X_{k,j∗} be the first iterate of the CW counter attack in F^(2). We consider D̄^(k,j∗) = dist(X_k, X_{k,j∗}) ≤ 8 dist(X_k, X∗) → 0 almost surely for k → ∞. As this trivially implies D̄^(k,j∗) → 0, we obtain the following theorem by application of Lemma 2:

Theorem 3. Under the assumptions outlined above, we obtain the perfect separation of the distribution of the distance metric D̄^(k,j∗) of the CW counter attack from the distribution of the distance metric D^(0,k) of the original CW attack, i.e.,

    AUROC(D^(0,k), D̄^(k,j∗)) → 1   for k → ∞.   (23)

3 Numerical Experiments

We now demonstrate how our theoretical considerations apply to numerical experiments. Therefore we aim at separating the distributions D^(k) and D̄^(k,j). For our tests we used the framework [2] (provided with [3]). For the CIFAR10 dataset, consisting of tiny 32 × 32 RGB images with concepts from 10 classes and containing 50k training and 10k test images, we used a network with 4 convolutional layers and two dense ones, as included and used per default in [2]. For the ImageNet2012 high resolution RGB images we used a pre-trained Inception network [13] (trained on all 1000 classes). For all tests we used default parameters.

For each of the two datasets, we randomly split the test set into two equally sized portions X and X̄. For one portion we compute the values D^(k) and for the other one also D̄^(k,j), such that they refer to two distinct sets of original images. Since CW attacks can be computationally demanding, we chose the sample sizes for X and X̄ as stated in table 1.

          CIFAR10   ImageNet2012
    ℓ0    5000      500
    ℓ2    5000      500
    ℓ∞    1000      500

Table 1: Numbers of images in X and X̄, respectively.

Since we discuss the distance measures D^(k) and D̄^(k,j) for different ℓp norms, we now alter our notation. Let γ_p : I^n → I^n be the function that maps x ∈ X to its ℓp attacked counterpart for p = 0, 2, ∞. In order to demonstrate the separability of X and γ_p(X̄) under a second ℓq attack, we compute the two scalar sets

    D_q = {dist_q(x, γ_q(x)) : x ∈ X}   and   D̄_{p,q} = {dist_q(γ_p(x̄), γ_q(γ_p(x̄))) : x̄ ∈ X̄}   (24)

for q = 0, 2, ∞.
dist(Xk , Xk,j ∗ ) ≤ 8 dist(Xk , X ∗ ) −→ 0 almost surely. for q = 0, 2, ∞.
k→∞

∗ k→∞
As this trivially implies D̄(k,j ) −→ 0, we obtain the
`p Counter Attacks. For now we focus on the case
following theorem by application of Lemma 2:
p = q where the first and the second attack are per-
Theorem 3. Under the assumptions outlined above, we formed by means of the same `p norm as this matches
obtain the perfect separation of the distribution of the dis- with our theoretical findings from section 2. First the

tance metric D̄(k,j ) of the CW counter attack from the state success rates of γp (X̄ ), i.e., the percentages where
distribution on the distance metric D(0,k) of the original κ(x̄) 6= κ(γp (x̄)) for x̄ ∈ X̄ , see table 2. For all attacks
CW attack, i.e., γp , p = 0, 2, ∞, we obtain ideal success rates.
  The separability of original data X and attacked data
(0,k) (k,j ∗ ) k→∞ γ (X̄ ) is visualized by violin plots for Dp and D̄p,p in sec-
AUROC D , D̄ −→ 1 . (23) p
tion 3 for CIFAR10 and in section 3 for ImageNet2012.
For both datasets we observe a similar behavior. In the
3 Numerical Experiments first three panels of each figure, i.e., for p = 0, 2, we ob-
serve well-separable distributions. For p = ∞ this separa-
We now demonstrate how our theoretical considerations bility seems rather moderate. When looking at absolute
apply to numerical experiments. Therefore we aim at `∞ distances, we observe that both distributions peak for
separating the distributions D(k) and D̄(k,j) . For our tests an `∞ distance of roughly 0.1. For `2 , the absolute dis-
we used the framework [2] (provided with [3]). For the tances are much higher. This is clear, since the `∞ norm
CIFAR10 dataset consisting of tiny 32 × 32 RGB images is independent of the image dimensions whereas the other
with concepts from 10 classes, containing 50k training and norms suffer from the curse of dimensionality. This can
10k test images, we used a network with 4 convolutional be omitted by re-normalizing the shrinkage coefficient for
layers and two dense ones as included and used per default `p distance minimization. However it also underlines the
in [2]. For the ImageNet2012 high resolution RGB images punchline of our method, the more efficient the iterative
we used a pre-trained Inception network [13] (trained on minimization process, the better our detection method –
all 1000 classes). For all tests we used default parameters. the efficiency becomes treacherous. Less efficient attacks

6
Figure 5: Violin plots displaying the distributions of distances in D_p (top) and D̄_{p,p} (bottom) from equation (24) for the CIFAR10 dataset, p = 0, 2, ∞ in ascending order from left to right (legend: thresholds, D_p (original), D̄_{p,p} (attacked); horizontal axes: normalized distances).

Figure 6: Violin plots displaying the distributions of distances in D_p (top) and D̄_{p,p} (bottom) from equation (24) for the ImageNet2012 dataset, p = 0, 2, ∞ in ascending order from left to right (legend: thresholds, D_p (original), D̄_{p,p} (attacked); horizontal axes: normalized distances).

Next, we study the performance in terms of detection accuracy, i.e., the relative frequency of images correctly classified as attacked / original for two different thresholds, as well as in terms of the area under the receiver operator characteristic curve (AUROC, [9]), which is a threshold independent performance measure.

Figure 7: ROC curves and AUROC values for the different ℓp norms, p = 0, 2, ∞ (AUROC values: ℓ0: 0.9144, ℓ2: 0.9912, ℓ∞: 0.9629).
In all further discussions, true positives are correctly classified attacked images and true negatives are correctly classified original (non-attacked) images. More precisely, we fix the following notations:

• threshold: the value t used for discriminating whether d ∈ D_q or d ∈ D̄_{p,q},
• true positive: TP = |{d ∈ D̄_{p,q} : d < t}|,
• false negative: FN = |{d ∈ D̄_{p,q} : d ≥ t}|,
• true negative: TN = |{d ∈ D_q : d ≥ t}|,
• false positive: FP = |{d ∈ D_q : d < t}|,
• accuracy: (TP + TN) / (TP + FN + TN + FP),
• precision: TP / (TP + FP),
• recall: TP / (TP + FN).

For this paragraph, let p = q.
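For a given threshold t, these quantities can be evaluated as in the following sketch, which also selects a threshold by a simple grid search over the observed distances; it mirrors the definitions above rather than the authors' exact evaluation scripts.

    import numpy as np

    def detection_metrics(d_attacked, d_clean, t):
        """TP, FN, TN, FP and derived metrics for a threshold t (flag 'attacked' iff d < t)."""
        d_attacked = np.asarray(d_attacked, dtype=float)
        d_clean = np.asarray(d_clean, dtype=float)
        tp = int(np.sum(d_attacked < t))
        fn = int(np.sum(d_attacked >= t))
        tn = int(np.sum(d_clean >= t))
        fp = int(np.sum(d_clean < t))
        return {
            "TP": tp, "FN": fn, "TN": tn, "FP": fp,
            "accuracy": (tp + tn) / (tp + fn + tn + fp),
            "precision": tp / max(tp + fp, 1),   # guard against division by zero
            "recall": tp / max(tp + fn, 1),
        }

    def select_threshold(d_attacked, d_clean, criterion):
        """Grid search over the observed distances; criterion maps a metrics dict to a
        score, e.g. choice no. 1: 0.5*recall + 0.25*accuracy + 0.25*precision,
        choice no. 2: accuracy."""
        candidates = np.unique(np.concatenate([d_attacked, d_clean]))
        return max(candidates,
                   key=lambda t: criterion(detection_metrics(d_attacked, d_clean, t)))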
In table 3 we report several metrics for two choices of threshold computation. Choice no. 1 is obtained by maximizing 0.5 × recall + 0.25 × accuracy + 0.25 × precision, choice no. 2 is obtained by maximizing accuracy. While choice no. 1 rather aims at detecting many adversarial attacks, therefore accepting to produce additional false positive detections, choice no. 2 attributes the same importance to the reduction of false positives and false negatives.

In accordance with our findings in section 3, we observe high accuracy, recall and precision values. In particular, the ℓ2 attack shows very strong results and can be detected best by a repeated attack. Please recall that we reduced the number of images for our tests with the ℓ∞ attack to 1,000 original and 1,000 attacked images as these tests were computationally intensive. Noteworthily, out of 10,000 images we only obtain 16 false positive detections, i.e., images that are falsely accused to be attacked ones, and 56 false negatives, i.e., attacked images that are not detected, when using threshold choice no. 1. For the ℓ∞ attack we observe an overproduction of false positives. However, most of the 1,000 attacked images were still detected.
Sensitivity & Specificity values
threshold choice no. 1 threshold choice no. 2
`0 `2 `∞ `0 `2 `∞
True Positive 4,723 4,817 984 4,723 4,803 982
False Negative 277 183 16 277 197 18
True Negative 4,239 4,792 837 4,239 4,809 840
False Positive 761 208 163 761 191 160
Accuracy 0.8962 0.9609 0.9105 0.8962 0.9612 0.9110
Recall 0.9446 0.9634 0.9840 0.9446 0.9606 0.9820
Precision 0.8612 0.9586 0.8579 0.8612 0.9617 0.8599
Threshold 5.0126 0.00891 0.0324 5.0126 0.008225 0.0321

Table 3: Sensitivity & Specificity values for the CIFAR10. Choice no. 1 of thresholds is obtained by maximizing 0.5 × recall + 0.25 ×
accuracy + 0.25 × precision, choice no. 2 is obtained by simply maximizing accuracy. True positives is the number of correctly classified
attacked images, true negatives is the number of correctly classified original images.

Returning to the Original Class. As motivated in section 2, the second attack theoretically is supposed to make γ_p(x̄) for x̄ ∈ X̄ return across the decision boundary that it passed via the first attack, i.e.,

    κ(γ_p(γ_p(x̄))) = κ(x̄).   (25)

We present results for the number of returnees, i.e., R_p = |{x̄ ∈ X̄ : κ(γ_p(γ_p(x̄))) = κ(x̄)}|, and the corresponding percentages (return rates) given by R_p / |X̄| in table 4. For both datasets we observe strong return rates; in particular for ℓ2, only 3 out of 5000 samples x̄ ∈ X̄ do not fulfill κ(γ_p(γ_p(x̄))) = κ(x̄).

                              ℓ0       ℓ2       ℓ∞
    Returnees CIFAR10         4994     4997     995
    Percentage CIFAR10        99.88%   99.94%   99.5%
    Returnees ImageNet2012    500      486      488
    Percentage ImageNet2012   100%     97.20%   97.6%

Table 4: Number of returnees as well as return rates.
Table 4: Number of returnees as well as return rates. 0 2
Cross attacks. As the ℓp norms used in our tests are equivalent except for ℓ0, we expect that cross attacks, i.e., the case p ≠ q, are also supposed to yield a good separation of D_q and D̄_{p,q} (cf. equation (24)). Figures 8 and 9 show results for cross attacks. Each column shows the detection performance of a norm ℓq.

Figure 8: ROC curves and AUROC values for the different cross attacks performed on the CIFAR10 dataset. The task is to separate D_q and D̄_{p,q}; the ℓq norm is the one used for detection, q = 0, 2, ∞ is the row index and p = 0, 2, ∞ the column index. (AUROC values, row-wise: 0.9186, 0.9962, 0.9964; 0.9125, 0.9913, 0.9917; 0.8923, 0.9681, 0.9629.)

In both cases, for CIFAR10 and ImageNet2012, when comparing the different columns of both plots we observe a superiority of the ℓ2 norm. In our tests we observe that the ℓ2 norm also requires the lowest computational effort, thus the ℓ2 norm might be favorable from both perspectives. Noteworthily, there is also only a minor performance degradation when going from CIFAR10 to ImageNet2012, even though the perturbations introduced by the ℓp attacks, in particular for p = 2, are almost imperceptible, cf. also figure 2. For a better overview, the presented results are summarized in table 5.

Figure 9: ROC curves and AUROC values for the different cross attacks performed on the ImageNet2012 dataset. The task is to separate D_q and D̄_{p,q}; the ℓq norm is the one used for detection, q = 0, 2, ∞ is the row index and p = 0, 2, ∞ the column index. (AUROC values, row-wise: 0.9873, 0.9930, 0.9973; 0.9582, 0.9604, 0.9675; 0.9631, 0.9455, 0.9561.)

Targeted Attacks. So far, all results presented have been computed for untargeted attacks. In principle, a targeted attack γ_p only increases the distances dist_q(γ_p(x̄), x̄), while the distance measures dist_q(γ_p(x̄), γ_q(γ_p(x̄))) corresponding to another untargeted attack γ_q are in principle supposed to remain unaffected. Thus, targeted attacks should be even easier to detect. However, it might happen that we sometimes lose the property that after a second attack the predicted class returns to the one of the original image, i.e., κ(γ_q(γ_p(x̄))) = κ(x̄) does not need to hold. For instance, it might happen that the shortest straight ℓp path crosses another class before reaching its desired destination. In figure 10 we indeed observe that a targeted ℓ2 attack yields a higher AUROC value than an untargeted one.

Figure 10: ROC curves and AUROC values for ℓ2 detections on CIFAR10 data where the first attack was once targeted and once untargeted (untargeted ℓ2: 0.9912, targeted ℓ2: 0.9940).
            CIFAR10                     ImageNet2012
            ℓ0       ℓ2       ℓ∞        ℓ0       ℓ2       ℓ∞
    ℓ0      91.86%   99.62%   99.64%    98.73%   99.30%   99.73%
    ℓ2      91.25%   99.13%   99.17%    95.82%   96.04%   96.75%
    ℓ∞      89.23%   96.81%   96.29%    96.31%   94.55%   95.61%

Table 5: Summarizing table for the cross attacks (AUROC values). Rows indicate the ℓp norm of the first attack used to craft the data, columns the ℓq norm performed for detection. We remind that the CIFAR10 data sets shuffle 1000 regular images and 1000 adversarial images while the ImageNet2012 data sets hold 500 clean images and 500 perturbed images.
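Putting the previous sketches together, the cross-attack summary of table 5 corresponds to one AUROC value per pair (p, q). The following sketch assembles such a matrix; gamma, dist_fn and auroc_fn again denote the hypothetical attack, distance and empirical AUROC helpers introduced earlier, not the authors' code.

    import numpy as np

    def cross_attack_auroc_matrix(X_clean, X_bar, gamma, dist_fn, auroc_fn):
        """3x3 matrix of AUROC values; row index: norm of the first (crafting) attack,
        column index: norm of the detecting counter attack, both over (0, 2, inf)."""
        norms = [0, 2, np.inf]
        matrix = np.zeros((len(norms), len(norms)))
        for i, p in enumerate(norms):
            attacked = [gamma(x_bar, p) for x_bar in X_bar]        # first attack with norm ℓp
            for j, q in enumerate(norms):
                d_clean = [dist_fn(x, gamma(x, q), q) for x in X_clean]
                d_attacked = [dist_fn(a, gamma(a, q), q) for a in attacked]
                matrix[i, j] = auroc_fn(d_attacked, d_clean)       # e.g. the empirical AUROC sketch
        return matrix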

4 Conclusion and Outlook

We outlined a mathematical proof for asymptotically optimal detection of CW attacks via counter attacks for the ℓ2 norm and demonstrated in numerical experiments that our findings hold to a high extent in practice for different ℓp norms. We obtained AUROC values of up to 99.64% for CIFAR10 and 99.73% for ImageNet2012 and demonstrated that also cross attacks based on different norms ℓp and ℓq yield high detection performance.

Our results are in the range of state-of-the-art results. DBA [48] shows a superior detection accuracy for CIFAR10; however, the underlying neural network is also much stronger, which makes a clear comparison difficult, and AUROC values are not reported. Our results for the ℓ2 attack with CIFAR10 outperform those of defense-GAN [33] for Fashion-MNIST by far (which is also difficult to compare). The same statement holds for the CIFAR10 results presented for I-defender [47] when detecting an iterative ℓ2 attack; I-defender was successful for roughly 90% of the test data.

We expect that our results can be further improved by proceeding similarly to [14] and investigating statistics obtained from sets of attacked and original images instead of single samples (the latter is how we proceed so far). All networks used in our tests have not been hardened / trained for robustness. We anticipate that our method might benefit from using hardened networks as the ℓp distance of the first attack might even increase. Furthermore, works that study the nature of adversarial attacks found that most attacks are located close to decision boundaries, see e.g. [43]. In accordance with the findings in [48], this suggests that our counter attack framework might transfer also to attacks other than the CW attack. Studies on these aspects as well as an extension of our mathematical analysis remain future work.

Acknowledgement. H. G., M. P. and M. R. acknowledge interesting discussions with T. Hantschmann.

References

[1] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. CoRR, abs/1707.07397, 2017.
[2] N. Carlini. Robust evasion attacks against neural network to find adversarial examples. https://github.com/carlini/nn_robust_attacks.
[3] N. Carlini and D. A. Wagner. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016.
[4] F. Carrara, R. Becarelli, R. Caldelli, F. Falchi, and G. Amato. Adversarial examples detection in features distance spaces. In The European Conference on Computer Vision (ECCV) Workshops, September 2018.
[5] P. Chen, Y. Sharma, H. Zhang, J. Yi, and C.-J. Hsieh. EAD: Elastic-net attacks to deep neural networks via adversarial examples. ArXiv, abs/1709.04114, 2017.
[6] S. Chen, N. Carlini, and D. A. Wagner. Stateful detection of black-box adversarial attacks. CoRR, abs/1907.05587, 2019.
[7] F. Clarke. Optimization and Nonsmooth Analysis. SIAM, 1990.
[8] F. Croce, M. Andriushchenko, and M. Hein. Provable robustness of ReLU networks via maximization of linear regions. CoRR, abs/1810.07481, 2018.
[9] J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 233–240, 2006.
[10] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner. Detecting adversarial samples from artifacts. ArXiv, abs/1703.00410, 2017.
[11] T. Ferguson. A Course in Large Sample Theory. Springer US, 1996.
[12] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014.
[13] Google. Tensorflow Inception network. http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz.
[14] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. D. McDaniel. On the (statistical) detection of adversarial examples. CoRR, abs/1702.06280, 2017.
[15] C. Guo, M. Rana, M. Cissé, and L. van der Maaten. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[17] S. Jha, U. Jang, S. Jha, and B. Jalaian. Detecting adversarial examples using data manifolds. In MILCOM 2018 - 2018 IEEE Military Communications Conference (MILCOM), pages 547–552, Oct 2018.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[19] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial machine learning at scale. CoRR, abs/1611.01236, 2016.
[20] S. Lee, S. Park, and J. Lee. Defensive denoising methods against adversarial attack. In 24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2018.
[21] B. Liang, H. Li, M. Su, X. Li, W. Shi, and X. Wang. Detecting adversarial examples in deep networks with adaptive noise reduction. CoRR, abs/1705.08378, 2017.
[22] J. Liu, W. Zhang, Y. Zhang, D. Hou, Y. Liu, and N. Yu. Detecting adversarial examples based on steganalysis. CoRR, abs/1806.09186, 2018.
[23] Z. Liu, Q. Liu, T. Liu, Y. Wang, and W. Wen. Feature distillation: DNN-oriented JPEG compression against adversarial examples. CoRR, abs/1803.05787, 2018.
[24] J. Lu, T. Issaranon, and D. A. Forsyth. Safetynet: Detecting and rejecting adversarial examples robustly. CoRR, abs/1704.00103, 2017.
[25] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ArXiv, abs/1706.06083, 2017.
[26] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. ArXiv, abs/1702.04267, 2017.
[27] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. CoRR, abs/1610.08401, 2016.
[28] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. CoRR, abs/1511.04599, 2015.
[29] N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. CoRR, abs/1511.04508, 2015.
[30] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. A. Storer. Protecting JPEG images against adversarial attacks. CoRR, abs/1803.00940, 2018.
[31] K. Ren, T. Zheng, Z. Qin, and X. Liu. Adversarial attacks and defenses in deep learning. Engineering, 6(3):346–360, 2020.
[32] K. Roth, Y. Kilcher, and T. Hofmann. The odds are odd: A statistical test for detecting adversarial examples. CoRR, abs/1902.04818, 2019.
[33] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. CoRR, abs/1805.06605, 2018.
[34] S. Scholtes. Introduction to Piecewise Differentiable Equations. SpringerBriefs in Optimization. Springer New York, 2012.
[35] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, 2014.
[36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[37] M. V. Solodov and S. K. Zavriev. Error stability properties of generalized gradient-type algorithms. JOTA, 1998.
[38] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
[39] O. Taran, S. Rezaeifar, T. Holotyak, and S. Voloshynovskiy. Defending against adversarial attacks by randomized diversification. CoRR, abs/1904.00689, 2019.
[40] S. Tian, G. Yang, and Y. Cai. Detecting adversarial examples through image transformation. AAAI Conference on Artificial Intelligence, 2018.
[41] Y. Wang, X. Ma, J. Bailey, J. Yi, B. Zhou, and Q. Gu. On the convergence and robustness of adversarial training. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6586–6595, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
[42] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. L. Yuille. Mitigating adversarial effects through randomization. CoRR, abs/1711.01991, 2017.
[43] H. Xu, Y. Ma, H. Liu, D. Deb, H. Liu, J. Tang, and A. K. Jain. Adversarial attacks and defenses in images, graphs and text: A review. Int. J. Autom. Comput., 17(2):151–178, 2020.
[44] D. A. Yap, J. Xu, and V. U. Prabhu. On detecting adversarial inputs with entropy of saliency maps. In CV-COPS, IEEE CVPR, 2019.
[45] X. Yin, S. Kolouri, and G. K. Rohde. Divide-and-conquer adversarial detection. CoRR, abs/1905.11475, 2019.
[46] C. Zhao, P. T. Fletcher, M. Yu, Y. Peng, G. Zhang, and C. Shen. The adversarial attack and detection under the Fisher information metric. CoRR, abs/1810.03806, 2018.
[47] Z. Zheng and P. Hong. Robust detection of adversarial attacks by modeling the intrinsic properties of deep neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 7924–7933, USA, 2018. Curran Associates Inc.
[48] Q. Zhou, R. Zhang, B. Wu, W. Li, and T. Mo. Detection by attack: Detecting adversarial samples by undercover attack. In L. Chen, N. Li, K. Liang, and S. Schneider, editors, Computer Security – ESORICS 2020, pages 146–164, Cham, 2020. Springer International Publishing.