

ROWBACK: RObust Watermarking for neural networks using BACKdoors


Nandish Chattopadhyay and Anupam Chattopadhyay

School of Computer Science and Engineering, Nanyang Technological University, Singapore

Abstract—Claiming ownership of trained neural networks is critical for stakeholders investing heavily in high-performance neural networks. There is an associated cost for the entire pipeline, starting from data curation to the high-performance computing infrastructure for neural architecture search and for training the model. Watermarking neural networks is a potential solution to the problem, but standard techniques suffer from vulnerabilities demonstrated by attackers. In this paper, we propose a robust watermarking mechanism for neural architectures. Our proposed method ROWBACK turns two properties of neural networks, the presence of adversarial examples and the ability to trap backdoors in the network while training, into a scheme that guarantees strong proofs of ownership. We redesign the Trigger Set for watermarking using adversarial examples of the model which needs to be watermarked, and assign specific labels based on adversarial behaviour. We also mark every layer separately during training, in order to ensure that removing the watermarks requires complete retraining. We have tested ROWBACK for the key indicative properties expected of a reliable watermarking scheme (it generates accuracies within 1-2% of the actual model, and a complete 100% match on the Trigger Set for verification), whilst being robust against state-of-the-art watermark removal attacks [1] (removal requires re-training of all layers with at least 60% of the samples and for more than 45% of the epochs of the actual training).

Index Terms—watermarking neural networks, robustness, backdooring, adversarial samples

I. INTRODUCTION

The growth in research and development in the field of machine learning has seen an unprecedented rise over the years, and a lot of it can be attributed to the proliferation of high-performing neural networks [2]. The applications of deep learning touch upon a plethora of daily-used things, ranging from mobile apps to autonomous driving. The investments in data centres and repositories of big data can be put to good use by harvesting valuable insights from them using machine learning algorithms. This has the potential to significantly boost productivity and gradually reduce dependence on human supervision. The interest has been widespread in industry and academia alike. A key milestone in the flourishing growth of AI came in 2012, when a trained neural network model could outperform average human performance in an image classification task [3]. For years, techniques in statistical learning depended on laborious pre-processing of data, involving feature engineering and extraction through hand-crafted methods. Such approaches lacked generalization and typically saturated in performance, potentially due to the limited quality of the extracted features. The paradigm of neural networks ushered in a new era of end-to-end systems that were designed to perform feature extraction and the machine learning task, like classification, together in a cascaded fashion. The introduction of backpropagation ensured optimization of parameters throughout the neural network in an efficient way. This led to the growing popularity and usage of many AI applications [4], [5].

Neural networks have existed for over six decades, being first introduced as the perceptron algorithm in 1958 [6]. However, for many years they did not incite the kind of enthusiasm among researchers that we see today, due to the unavailability of the hardware infrastructure necessary for harvesting the real potential of these learning systems. Deep learning research has significantly benefited from some exceptional properties of neural networks, like being highly parallelizable in operations, primarily due to the inherent linear nature of the matrix multiplications involved. This was greatly put to use on GPU-based training infrastructures.

A. Motivation

Three components are essential towards the success of a data-driven learning system. Firstly, there is a requirement of a large volume of data, specifically curated to be rich in information relevant to the context of the problem being addressed. Secondly, there is a neural architecture that comprises the computational graph and weight matrices. Finally, one needs the hardware infrastructure to train the network, by tuning the weights or parameters using the available data. The cost associated with training a neural network ranges from a few thousand dollars (for a model with about a million parameters) to over a million dollars (for a model with over a billion parameters) [7]. Overall, for solving any task using machine learning, the corresponding stakeholders have to invest in:
• gathering the relevant data, curating it and labelling it, particularly for supervised machine learning problems
• designing the most suited neural architecture for the task
• procuring highly capable hardware infrastructure for training the neural network with the data to obtain the learned model

It is natural for those investing in any one or multiple of the components mentioned above to demand a method to establish ownership of the trained neural network. Claiming authority by proving Intellectual Property rights on such models is a necessity to incentivise the stakeholders.

As a result, there has been a steady growth of research into various watermarking schemes for neural architectures [8]–[11], as well as efforts to break those watermarking models [1], [12]–[15] by adversaries, which are discussed in detail in the following Section II. In this line of research, we contribute by proposing a novel watermarking scheme that shows robustness against all relevant attacks without any perceptible degradation in model efficiency.

B. Contribution

In this paper, we propose ROWBACK, a robust watermarking scheme for claiming ownership of trained neural networks based on backdooring. We study the existing literature in this domain and identify the key reasons that contribute towards the vulnerability of such watermarking techniques to extraction attacks that try to steal the neural networks. In essence, we turn two properties of neural networks, the existence of adversarial examples upon introducing structured perturbation and the ability to embed backdoors while training, to our advantage in establishing ownership. ROWBACK guarantees robustness through the following:
• A re-designed Trigger set for marking the neural networks. We make use of adversarial examples of the model and associate Trigger labels to them, to customize a Trigger set that preserves functionality, leaves strong embedded watermarks, and is extremely difficult for adversaries to replicate (since there exist infinitely many adversarial examples).
• Uniform distribution of the embedded watermarks throughout the model, by explicitly marking every layer with imprints of the backdoors, so as to prevent model modification attacks by making such attempts to steal the network computationally equivalent to training a fresh network from scratch.
The ROWBACK framework has been benchmarked with standard datasets like MNIST [16] and CIFAR-10 [17] against extraction attacks [1].

C. Organisation

In this paper, Section II introduces and describes the fundamentals of watermarking using backdooring, including its principles, properties and vulnerabilities, and points out the reasoning behind the failure of existing techniques. Section III addresses the notion of Robustness, which we propose to achieve through the Trigger set design and the distribution of embedded watermarks. Section IV describes the implementation details of ROWBACK, including the critical components of Key Generation, Marking and Verification. Thereafter, we present the results of our experimental analysis as a demonstration of the proposition made. Finally, we mention some concluding remarks and touch upon important scopes of future work.

II. WATERMARKING AND BACKDOORING

Watermarking has long been studied in different research communities, and has seen great progress in the image processing domain [18]. Being an effective way to prove ownership, the idea has been leveraged into other domains as well, most notably in the machine learning community. Backdoors have been found to be a possible way of trapping watermarks within large trained neural networks.

A. Principles

There are three essential components of any functional watermarking scheme. Under the basic assumption that there is a training dataset and a trained neural network model M, we need three things to work. First, one needs to devise a way to create a secret marking key mk, which will be used to embed the watermarks, and an associated verification key vk, which will be used for verifying the presence of the watermarks, thereby establishing ownership. Second, there is a need for an algorithm to embed the watermarks within the asset, which is the neural network model in this case. Third, one needs an algorithm utilizing both the marking key mk and the verification key vk for verification. These algorithms can be formally expressed as:
• KeyGeneration(): provides the pair of marking and corresponding verification keys (mk, vk)
• Watermarking(M, mk): takes as input a trained model M and a secret marking key mk, and returns a watermarked model M̂
• Verification(mk, vk, M̂): takes as input the marking and verification key pair (mk, vk) and the watermarked model M̂, and returns an output bit b ∈ {0, 1}

The functioning of the watermarking scheme is strongly dependent on the correct working of all three algorithms (KeyGeneration, Watermarking, Verification) together. In this particular context, correctness can be formally described as:

$\Pr_{(M, \hat{M}, m_k, v_k) \leftarrow WM()}\left[\,Verification(m_k, v_k, \hat{M}) = 1\,\right] = 1 \qquad (1)$

B. Properties

Any watermarking scheme should satisfy a set of criteria for being useful. They are:
• Functionality-preserving: The introduction of the watermarks in the model does not affect its performance with respect to the machine learning task.
• Non-trivial ownership: The secrecy of the key pair is not compromised to an adversary, even upon knowledge of the watermarking algorithm.
• Un-removability of watermarks: The watermarks cannot be removed from the model by the adversary, even with knowledge about the watermarking algorithm.
• Un-forgeability of watermarks: The establishment of ownership through verification requires more than just the availability of the verification key.

Typically, commitment schemes are used to ensure that the secrecy of the private key is compromised under no circumstances.

C. Watermarking methods

The fundamental principles behind watermarking schemes can be divided into two types. One is where the watermarks are embedded through construction, and are present within the design of the neural architecture, as observable in multiple works [8]–[11]. The other way is to add explicitly designed samples to the training data that leave a mark within the trained weights, like data poisoning methods. There are a few such watermarking techniques that embed watermarks during training [19]–[22].

D. Vulnerabilities

While there have been multiple propositions of mechanisms to embed watermarks, they have had their fair share of failures upon the introduction of attacks. An attacker is interested in using the trained neural network without owning it, and is therefore keen on breaking the watermarking scheme. In particular, there are a couple of different types of attacks that break such a scheme and leave the attacker free to use the model.

Attacks: The attacks on watermarking schemes can be classified on the basis of how much information the attacker has about the model; there are therefore black-box attacks and white-box attacks. To simplify, these attacks can be grouped into Evasion attacks [23], [24] and Model Modification attacks [1], [12]–[15].

E. Backdoors as Watermarks

A backdooring algorithm takes an input neural network model and provides a model with backdoors trapped in it. The backdoors ensure that the model behaves in a specific way when given specific input samples. Essentially, the embedded backdoors make sure that the accuracy of the model is high on the Trigger set. It may be noted here that a clean model would perform poorly on the Trigger set, as the Trigger labels are not the naturally occurring labels of the Trigger samples, and are set explicitly to induce such behaviour. Backdoors, therefore, are a good choice to serve as embedded watermarks, as demonstrated in the literature [19].

Figure 1: Schematic diagram of watermarking using backdooring [19].

F. Failure of existing mechanisms

Reliable use of any watermarking scheme depends on a study of its strengths and weaknesses. One needs to pay attention to the vulnerabilities of the mechanisms, to understand the potential flaws and correct them. As mentioned earlier, the primary aspects of importance for any such watermarking scheme include:
• Watermark embedding within the trained model (backdoors, for example)
• Secrecy of the key (which is the Trigger set)
• Reliability of the verification mechanism for claiming ownership

An attacker with malicious intent works to negate one or more of the aforementioned requirements. It may be noted that the failure of any one or more of the above may jeopardise the entire watermarking scheme. In a situation where the adversary is interested in stealing the trained neural network, thereby successfully denying the owner his/her claim of ownership, the following must hold good:
• The extracted model must generate comparable accuracy on the specific machine learning task as the watermarked model
• The verification process must fail, i.e., the extracted model should perform poorly on the Trigger set

If an attacker is able to fulfil these criteria, then the stakeholders invested in curating the model potentially lose their right to claim IP. In the specific context of watermarking through backdooring, attacks based on targeted re-training of the model using synthesized samples have proved to be a legitimate threat [1]. These attacks expose the underlying vulnerability of the watermarking scheme.

Figure 2: Model Modification attack using synthesis [1].

These attacks put together pose a challenge to the reliability of watermarking schemes. We wish to address this problem by learning from the vulnerabilities reported in the existing literature and utilizing some of the available techniques, coupled with generic properties of neural networks, and thereby propose ROWBACK.
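To make the threat concrete, the sketch below emulates the kind of Model Modification attack summarised in Section II-F: re-training only the densely connected (classifier) layers of a watermarked PyTorch model on substitute data, which suffices when the backdoor lives mainly in those layers. The layer-selection rule, the data source and the hyper-parameters are assumptions for illustration and do not reproduce the exact procedure of [1].

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def dense_layer_attack(model: nn.Module, substitute_data: DataLoader,
                       epochs: int = 5, lr: float = 1e-3) -> nn.Module:
    """Attempt watermark removal by re-training only the fully connected layers
    on substitute (e.g. synthesized) samples, leaving the rest frozen."""
    for p in model.parameters():
        p.requires_grad = False
    dense_params = []
    for module in model.modules():
        if isinstance(module, nn.Linear):   # the densely connected layers
            for p in module.parameters():
                p.requires_grad = True
                dense_params.append(p)

    opt = torch.optim.SGD(dense_params, lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in substitute_data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```

Section III is designed precisely so that this kind of partial re-training is no longer sufficient.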

III. GUARANTEES OF ROBUSTNESS

The primary goal of this paper is to propose a watermarking scheme which borrows the fundamental idea behind watermarking neural networks using backdooring, but addresses its deficiencies by adding counter-measures for the attacks that have been carried out on it. The resulting watermarking scheme is therefore efficient as well as robust against prominent attacks available in the literature. Robustness is achieved by addressing two key areas: the way the Trigger samples are designed, and the method in which the watermarks are embedded. Before going into the details of the mechanism of ROWBACK, we would like to mention that this watermarking scheme is built on two key pillars, each of which was originally discovered as a defect of neural networks. Their backgrounds are briefly touched upon here.
• Adversarial examples: Researchers found in 2015 that neural networks have a particular vulnerability to adversarial attacks [25]. High-performing models can be fooled by adversarial examples [26]–[28]. The adversarial samples are created by introducing minute structured perturbation to clean test samples, which is unobservable to the human eye.
• Backdoors: They were originally observed as flaws in trained neural networks, where backdooring is a specific technique to train the model in such a way that it predicts erroneous outputs for a particular set of inputs.

A. Trigger Set Design

The vulnerabilities exploited by the aforementioned attacks stem significantly from the fact that the choice of Trigger samples often makes the model modification attack easier. This is particularly true when the Trigger samples are Out-Of-Distribution (OOD), which was proposed in the original work demonstrating watermarking using backdooring [19].

1) Adversarial Examples as Trigger Samples: The notion of utilizing adversarial examples as Trigger samples stems from the fact that adversarial examples are perturbed train/test samples belonging to the distribution of the training data, and are therefore in essence quite close to the overall distribution that the model has seen during training.

2) Labelling Trigger Samples: The mechanism of associating labels to the Trigger samples is critical to the integrity of the watermarking scheme. It is important to note that the "true" class labels of the samples, or their primary adversarial labels, cannot be used as the labels of the Trigger samples, in order to ensure that the non-trivial ownership property holds. The adversary, keen on stealing the trained neural network, should not be able to regenerate the Trigger Set, which consists of the Trigger samples and their corresponding labels. This is why we have used adversarial samples as the Trigger samples and associated with each a class label which is neither its "true" class label nor its primary adversarial class label. Reverse-engineering this kind of Trigger set would therefore involve considering all adversarial samples of a model (which are infinite in number) and mapping them to all but two classes of the dataset.
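The label-assignment rule described above can be stated compactly. The sketch below (hypothetical helper name, not the authors' code) draws, for each Trigger sample, a label uniformly at random from all classes except the sample's true class and its primary adversarial class.

```python
import random

def assign_trigger_label(true_label: int, adversarial_label: int,
                         num_classes: int, rng: random.Random) -> int:
    """Pick a Trigger label that is neither the true class nor the primary
    adversarial class, uniformly from the remaining classes."""
    candidates = [c for c in range(num_classes)
                  if c not in (true_label, adversarial_label)]
    return rng.choice(candidates)

# Example: a CIFAR-10 sample with true class 3 whose adversarial prediction
# is class 5 receives one of the remaining eight class labels.
rng = random.Random(0)
trigger_label = assign_trigger_label(true_label=3, adversarial_label=5,
                                     num_classes=10, rng=rng)
```

With this rule, even an adversary who can generate adversarial examples for the stolen model still has to guess which of the remaining labels was attached to each Trigger sample.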

B. Watermark Distribution

The principle of using backdooring to embed watermarks in a trained neural network has a drawback that has been demonstrated in multiple attacks available in the literature. It is typically observed that the backdoors, in the form of weights within the trained weight matrices of the network, are generally present in the densely connected layers. This facilitates the vulnerability of these networks: by partially re-training the densely connected layers, the watermarks are removed. We take note of this flaw in design and address it by forcefully ensuring uniform distribution of the embedded watermarks in every layer of the network. We make sure that, while the marking process is carried out, each of the layers is individually marked, which means that a model modification attack aimed at removing the watermarks will need to retrain the entire model, which is equivalent, in time and effort, to training a fresh model from scratch.

C. Guarantees

In order for us to claim the robustness of the watermarking scheme, we need to establish that the proposed mechanism satisfies the necessary criteria for being effective, as mentioned in the earlier section. Here, we study how each of the properties is satisfied by construction of the watermarking scheme. The following claims have also been verified through the experimental results in Section V.

1) Functionality Preserving: The functionality-preserving property requires that a watermarked model should be as accurate as a model that has not been watermarked. Naturally, different models have different metrics of performance, but for the machine learning task that we consider here, the metric of choice is the test accuracy of the model on the test set. Since adversarial examples are a naturally occurring phenomenon of neural networks in general, using them as specific Trigger samples does not hinder the overall performance of the model. In fact, the approach used here is much like adversarial training, which is used to create robust models. The functionality-preserving claim has been substantiated through experimental results.

2) Non-Trivial Ownership: The property of Non-Trivial Ownership requires that an attacker who has knowledge about the watermarking algorithm will still not be able to claim ownership of the model. It has to be noted that the process of claiming ownership of the model involves demonstrating the accuracy of the model on the Trigger set, which is available only to the creator of the watermarked model. Therefore, the non-trivial ownership aspect is taken care of by design, in the construction of the Trigger set. The Trigger set consists of randomly selected adversarial samples generated by introducing structured perturbation to clean train samples. Since there are infinitely many such samples producible, it is impossible to reverse engineer the exact set without any other knowledge. The random sampling ensures that there is a lack of correlation among the samples, which also takes care of the scenario where accidental revealing of a part of the Trigger set will not hamper the ownership verification process.

3) Un-removability: The un-removability property requires that an attacker who has knowledge of the watermarking algorithm and also has the watermarked model at hand will not be able to detect and remove the embedded watermarks. In essence, this property requires the watermarking scheme to be robust against model-modification attacks. In our proposition, we take care of this property by paying particular attention to the distribution of the embedded watermarks. The watermarking scheme described here ensures that the embedded watermarks are present in each layer, and every layer will have to be re-trained to get rid of them completely. The intuition here is that, should the attacker require as much effort and resources (time and training samples) to remove the watermarks as is needed to train the model from scratch, then in theory we will have satisfied the un-removability property.

4) Un-forgeability: The un-forgeability property requires that partial information about the Trigger set (which in this case consists of the Marking and Verification keys, i.e., the Trigger samples and labels respectively) will not provide the attacker any advantage in establishing ownership of the watermarked model.

IV. ROWBACK IMPLEMENTATION

The implementation details are discussed in this section, covering each component of the watermarking pipeline. In broad terms, the key components are the model itself, the training and test datasets, and a specific Trigger dataset for watermarking.

Figure 3: Schematic diagram of ROWBACK.

A. The ML task and model

In this particular work, we consider the computer vision task of image classification. This is particularly relevant as it extends the use-cases available in the literature and is the setting most widely in use. The overall framework, however, is absolutely task agnostic and can be used for any machine learning model with underlying neural networks, for example transformer-based neural networks for neural machine translation. For our purpose of claiming ownership, we use a ResNet [29] model M and demonstrate the use of our proposed framework on it. The model M is trained on training samples Strain and is tested thereafter on a test set Stest. The performance of the model M on the test set Stest is denoted as Sacc. The performance of the model M on the Trigger set T is denoted as Tacc. The process of embedding watermarks is described hereafter.

B. Key Generation

As mentioned in Section III, the generation of the Trigger set T, which is the key to this watermarking scheme, involves adversarial examples. These samples are created by first considering a random stratified sample of ts train images with equal representation of each class. Then, each of these samples is subjected to structured perturbation introduced by an adversarial attack. In this work, we have made use of the Fast Gradient Sign Method (FGSM), first introduced by Goodfellow et al. [30] in 2014.

Once the ts adversarial samples are generated, we obtain one part of the Trigger Set, the Trigger samples TS. We take note of the original class label to which each sample belonged, as well as its new class label, as detected by the classifier upon the sample being converted into an adversarial example.

The Trigger label TL associated with each of the Trigger samples is chosen randomly from the rest of the class labels, excepting these two class labels: the naturally occurring true label and the adversarial label. The reasoning behind this construction has been explained earlier, in the discussion about the criteria for effective watermarking schemes. The quasi-random choice of class labels for each of the Trigger samples ensures exclusivity of the Trigger Set, and makes its replication very difficult.
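Assuming FGSM in its standard form, x_adv = x + ε · sign(∇_x L(f(x), y)), the Trigger samples TS could be produced along the following lines; the function name, the clamping to the [0, 1] image range and the batch handling are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

def fgsm_trigger_samples(model: nn.Module, images: torch.Tensor,
                         true_labels: torch.Tensor, epsilon: float = 0.04):
    """FGSM-perturb a stratified batch of train images and record the class
    each perturbed image is pushed into (its primary adversarial label)."""
    model.eval()
    images = images.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(images), true_labels)
    loss.backward()

    # One signed gradient step; inputs assumed to be scaled to [0, 1].
    adversarial = (images + epsilon * images.grad.sign()).clamp(0.0, 1.0).detach()

    with torch.no_grad():
        adversarial_labels = model(adversarial).argmax(dim=1)
    return adversarial, adversarial_labels
```

Each returned pair of (true label, adversarial label) then feeds the exclusion rule of Section III-A to draw the corresponding Trigger label in TL.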

C. Marking

In order to mark the neural network with the watermarks, we make use of the Trigger Set T and a pre-trained model M. The neural network consists of multiple layers, in this case convolutional layers and fully connected layers. For the process of embedding backdoor watermarks, we make use of Transfer Learning [31] and fine-tune the parameters. The fine-tuning works in the following way:
1) In every epoch epk, where k ∈ {1, . . . , n}, we freeze all but one layer in the network (starting from the fully connected layers and ending with the convolutional layers), and fine-tune that layer with the Trigger set T by updating the parameters therein.
2) For each epoch epk, where k ∈ {1, . . . , n}, we note the corresponding accuracies of the model M: Sacc on the test set Stest and Tacc on the Trigger set T.
3) We repeat the combination of steps 1 and 2 n times, where n is a hyper-parameter determined by cross-validation. The cross-validation is carried out by observing Sacc and Tacc, and an intuitive thumb rule is to stop the epochs when either or both of the following occur:
• Tacc, the performance of the model M on the Trigger set T, starts to saturate after increasing over the earlier epochs
• Sacc, the performance of the model M on the test set Stest, begins to drop significantly
4) After n epochs, we make a note of the Trigger accuracy Tacc. This is of critical importance for the verification of watermarks, which is the key to claiming ownership of the network.

The aforementioned process of marking the model with backdoors makes the model ready for deployment in the public space, as the stakeholder is guaranteed of having the provisions of proving ownership of it, should the requirement arise.
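A minimal PyTorch sketch of the layer-wise marking loop of steps 1-4 is given below; how "one layer" is delimited (here, each direct child block of the network), the optimizer and the learning rate are assumptions made for illustration, not the exact training code of ROWBACK.

```python
import torch
import torch.nn as nn

def mark_layerwise(model: nn.Module, trigger_x: torch.Tensor,
                   trigger_y: torch.Tensor, n_epochs: int,
                   lr: float = 1e-3) -> nn.Module:
    """Fine-tune exactly one block per epoch on the Trigger set, cycling from
    the classifier head back towards the convolutional stem."""
    loss_fn = nn.CrossEntropyLoss()
    # Treat each direct child block of the network as one "layer" to mark.
    blocks = list(model.children())[::-1]

    model.train()
    for epoch in range(n_epochs):
        target = blocks[epoch % len(blocks)]
        params = list(target.parameters())
        if not params:                      # skip parameter-free blocks (ReLU, pooling)
            continue
        # Freeze everything except the block being marked in this epoch.
        for p in model.parameters():
            p.requires_grad = False
        for p in params:
            p.requires_grad = True

        opt = torch.optim.SGD(params, lr=lr)
        opt.zero_grad()
        loss_fn(model(trigger_x), trigger_y).backward()
        opt.step()
    return model
```

After each epoch, Sacc and Tacc would be recorded exactly as in steps 2-4 to drive the cross-validated stopping rule.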
D. Verification

The explicit criterion for verification of any watermarking scheme is expressed in Equation (1), which states that the probability of the Verification function, taking as parameters the Marking key and the Verification key, returning True is unity. In the context of watermarking neural networks, this translates to the following:
• The Verification function has two parts. One part is the model M itself, which returns the class probabilities of the classifier, the highest probability being allotted to the class to which the sample in question most likely belongs. The second part matches these outputs to the expected labels.
• The Marking key is the set of Trigger samples TS.
• The Verification key is the set of Trigger labels TL.
• The Verification function takes the Marking key (Trigger samples TS) and generates the predictions first. Then it compares them to the Verification key (Trigger labels TL) and generates a score.
• In theory, as per Equation (1), this score should be 100%. In practice, we allow a tolerance limit in our framework, which is determined by the Tacc obtained after n epochs of marking.

It may be noted here that in the following section on experimental results, we are able to achieve a full score of 100% match for verification. The tolerance limit is still part of the framework, making the approach more generic.
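Operationally, the two-part Verification function above reduces to a match score over the Trigger set, compared against the tolerance limit. A hedged sketch follows; the batching, the function names and the way the tolerance is supplied are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def verification_score(model: nn.Module, trigger_samples: torch.Tensor,
                       trigger_labels: torch.Tensor, batch_size: int = 32) -> float:
    """Fraction of Trigger samples whose prediction matches the Trigger label."""
    loader = DataLoader(TensorDataset(trigger_samples, trigger_labels),
                        batch_size=batch_size)
    model.eval()
    matches, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            preds = model(x).argmax(dim=1)        # part one: the model's predictions
            matches += (preds == y).sum().item()  # part two: match against the labels
            total += y.numel()
    return matches / total

def verify_ownership(model, trigger_samples, trigger_labels, tolerance: float) -> bool:
    # Ownership is claimed when the score reaches the tolerance limit,
    # e.g. the Trigger accuracy recorded after the n marking epochs.
    return verification_score(model, trigger_samples, trigger_labels) >= tolerance
```

With the 100-sample Trigger set used in Section V, the score is simply the count of matching predictions divided by 100.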
V. EXPERIMENTAL OBSERVATIONS

Every claim regarding the performance of ROWBACK and its robustness against standard attacks has been validated through thorough experimentation. In this section, we discuss in detail the overall experimental setup and illustrate the functionalities of the scheme through results. As mentioned earlier, we have demonstrated ROWBACK's performance on image classification tasks for relevance with the existing literature. The entire pipeline of experiments, from building the model to watermarking it, verifying the watermarks and testing for robustness, has been implemented in PyTorch [32] on a 7th generation Intel Core i7 processor with an additional NVIDIA GeForce GTX 1080 Ti GPU.

A. Experiment Design

The neural network model of choice is the ResNet [29] architecture, a convolutional neural network with eighteen layers that makes use of deep residual learning. The datasets for the image classification tasks are MNIST [16] and CIFAR-10 [17]. The Trigger Set T, comprising the Trigger samples TS and Trigger labels TL, is generated through FGSM-based adversarial attacks on the test images and a quasi-random allocation of classes to them. The watermarks are embedded through targeted fine-tuning using Transfer Learning as described earlier. The model is finally tested for robustness against Evasion attacks and Model Modification attacks.

B. Results

We have looked at three key target areas. The datasets CIFAR-10 and MNIST were split 90%-10% for the train-test split, and the size of the Trigger set T was 100 samples, created using the FGSM attack with ε = 0.04. The ResNet model has been trained for 80 epochs, with a learning rate α that is halved every 20 epochs. Since our proposed scheme ROWBACK involves watermarking through embedding backdoors, we have compared the performance of ROWBACK with the relevant existing scheme which also uses backdooring [19].

1) Test of Preserving Functionality: The goal of these experiments is to check whether ROWBACK is able to embed watermarks through backdoors without creating any hindrance to the overall machine learning task, which in this case is image classification. This is necessary for any functional watermarking mechanism to be deployed in practice.

The experimental setup presents a study of how ROWBACK behaves with respect to the performance on the Test set, measured by Test accuracy. We compare the performance of a clean model without any watermarking, a standard watermarking model with backdooring [19], and ROWBACK. Based on the observations tabulated in Table I, in the Test Set Accuracy columns, we can conclude that there is no significant degradation of performance for introducing the embedded watermarks in a robust fashion, and the generated model is almost as accurate (accuracy differs by 1-2%) as either a clean model without watermarks or a standard watermarked model through backdooring that uses Out-Of-Distribution samples in its Trigger set.

2) Test of Watermarking Verification: The goal of these experiments is to test whether the presence of the watermarks can be verified at will, which is in fact the proof of ownership that the stakeholders may use to claim their investments and prevent theft or unauthorised usage.

The experimental setup studies how ROWBACK works on the Trigger set T. In the case of the clean model, there is no pre-defined Trigger set, and we have used our own Trigger set comprising adversarial samples to check the performance.

For the standard watermarking model, the Trigger Set is constructed with Out-Of-Distribution abstract images [19]. The performances of the standard watermarking scheme and the one proposed in this paper are compared. The observed results are tabulated in Table I, in the Trigger Set Accuracy columns.

Table I: Checking preservation of functionality through performance on the Test Set, and checking verification through performance on the Trigger Set.

Model Description                   Test Set Accuracy        Trigger Set Accuracy
                                    MNIST      CIFAR-10      MNIST      CIFAR-10
Clean Model w/o Watermarking        98.6%      92.1%         4%         4%
Watermarking w/ Backdooring [19]    97.7%      91.8%         98%        96%
ROWBACK scheme                      97.9%      91.2%         100%       100%

Based on the observations, we can assert that the verification can be used to claim ownership, as a model which is not watermarked would generate poor scores on the Trigger set, as opposed to a watermarked model. Since the size of the Trigger sets in these experiments is 100 samples, the accuracy noted in Table I is the count of matches that the model is able to predict. It may be noted that the accuracy of the clean model on the Trigger set is just the adversarial accuracy of the model.

3) Test of Robustness: The goal of these experiments is to study the robustness of ROWBACK against attacks. We look at all kinds of attacks discussed earlier, Evasion attacks and Model Modification attacks.

ROWBACK is robust against Evasion attacks like ensemble attacks [23] by design. The Trigger set is comprised of adversarial samples. The watermarked model will predict the Trigger labels TL for the Trigger samples TS. The models that are not watermarked will predict the adversarial labels for the Trigger samples, which are still not the true labels, and therefore the premise of Ensemble attacks does not hold good. Since the watermarks embedded within the weight matrices of the network are not disturbed in any way, the robustness comes from construction.

Model modification attacks are the most pertinent threat to these watermarking schemes, as they are able to remove watermarks, thereby eliminating any trace of proof of ownership. In particular, we look at the removal of watermarks through synthesis [1], which specifically attacks the mechanism of watermarking using backdooring. The attack emulates training samples using GAN-based synthesis and uses them to re-train targeted parts of the model. As it turns out, re-training just the feature-rich layer (Mode 1 of the attack) or the entire set of densely connected layers (Mode 2 of the attack) is sufficient to remove the traces of watermarks.

The experimental setup is designed as follows. We first observe the impact of the ways/modes of attack on the standard watermarking scheme (shown in Table II). Then we compare those results with the same analysis of ROWBACK (shown in Table III). Finally, we extend the attack to see how much re-training is necessary for completely removing the watermarks from ROWBACK's model.

Table II: Checking the weakness of the standard watermarking scheme [19] using the Model Extraction attack [1].

Model Description                   Test Set Accuracy        Trigger Set Accuracy
                                    MNIST      CIFAR-10      MNIST      CIFAR-10
Clean Model w/o Watermarking        98.6%      92.1%         4%         4%
Watermarking w/ Backdooring [19]    97.7%      91.8%         98%        96%
Extracted Model [1] (Mode 1)        93.4%      88.5%         15%        18%
Extracted Model [1] (Mode 2)        96.2%      89.2%         11%        26%

The results are as expected, in agreement with the effectiveness of the extraction attack [1], and set the basis for the requirement of a more robust mechanism. As observable, the Trigger accuracy drops significantly upon the targeted re-training, which is a key weakness of this model.

We therefore repeat the same experiments on ROWBACK to check what impact the modes of model extraction have on its Trigger accuracy.

Table III: Checking robustness of the ROWBACK scheme using the Model Extraction attack [1].

Model Description                   Test Set Accuracy        Trigger Set Accuracy
                                    MNIST      CIFAR-10      MNIST      CIFAR-10
Clean Model w/o Watermarking        98.6%      92.1%         4%         4%
ROWBACK scheme                      97.9%      91.2%         100%       100%
Extracted Model [1] (Mode 1)        94.1%      87.2%         99%        96%
Extracted Model [1] (Mode 2)        95.3%      87.8%         95%        94%

ROWBACK ensures explicit fine-tuning of each layer, with the aim that the traces of embedded watermarks will be well distributed. We have an indication of that from the results in Table III, where the two ways/modes of model extraction are unable to bring down the Trigger accuracy significantly, at worst by 5-6%. We can set the tolerance limit for verification to accommodate the same and ensure reliable usage.

As a natural follow-up analysis, we studied how much re-training is required to eliminate the watermarks significantly, i.e., below 50% accuracy on the Trigger set, whilst maintaining the performance on the test set within 2-3% of the clean model. In this experiment, we progressively re-trained all the layers for as many as 80 epochs, which is the number of epochs for which the original ResNet model was trained.

We have observed that it takes about 35 epochs of re-training with a sample size of 60% of the actual training samples to obtain an extracted model which is functional and without watermarks. This effort, of using that many training samples and running them through all layers for about 45% of the iterations, is similar to training a new model from scratch.
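The analysis above amounts to sweeping the re-training budget (fraction of training samples and number of epochs) and recording when the Trigger accuracy collapses. A sketch of such a sweep is shown below; attack_finetune_all_layers, trigger_accuracy and test_accuracy are placeholder callables standing in for the extraction attack of [1] and the evaluation routines, not their actual implementations.

```python
import copy
import torch.nn as nn

def retraining_sweep(watermarked_model: nn.Module, train_subsets: dict,
                     epoch_budgets: list, attack_finetune_all_layers,
                     trigger_accuracy, test_accuracy) -> list:
    """Grid over (sample fraction, epochs): how much all-layer re-training is
    needed to push Trigger accuracy below 50% at acceptable test accuracy."""
    results = []
    for fraction, subset in train_subsets.items():   # e.g. {0.2: loader, 0.6: loader}
        for epochs in epoch_budgets:                  # e.g. [10, 20, 35, 80]
            candidate = attack_finetune_all_layers(
                copy.deepcopy(watermarked_model), subset, epochs)
            results.append({
                "sample_fraction": fraction,
                "epochs": epochs,
                "trigger_acc": trigger_accuracy(candidate),
                "test_acc": test_accuracy(candidate),
            })
    return results
```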
VI. CONCLUSIONS

Robust watermarking schemes are necessary for the proper, verifiable claiming of ownership of IP rights of trained neural network models. The stakeholders need a strong assurance that adversaries would fail to steal their models and use them without authorisation.

In this paper, we propose a robust watermarking scheme called ROWBACK, which combines two properties of neural networks: the existence of adversarial examples and the ability to trap backdoors within a network during training. Specifically, we have redesigned the Trigger set making use of adversarial examples, and modified the marking mechanism to ensure thorough distribution of the embedded watermarks. We have tested ROWBACK with the most relevant state-of-the-art attacks to demonstrate its robustness. In future, we would like to study the watermarking scheme formally, to prove that it can withstand all known attacks. We are also exploring its vulnerability and robustness against other types of attacks which are not directly relevant to its context, as they violate the essential setting upon which ROWBACK is built, which includes maintaining a secret Trigger Set and free access to verification of the watermarks when required. Such attacks are those which prevent verification by screening samples before they are fed to the network [15], or side-channel attacks.

REFERENCES

[1] N. Chattopadhyay, C. S. Y. Viroy, and A. Chattopadhyay, "Re-markable: Stealing watermarked neural networks through synthesis," in International Conference on Security, Privacy, and Applied Cryptography Engineering. Springer, 2020, pp. 46–65.
[2] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning. MIT Press, Cambridge, 2016, vol. 1.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[4] N. Stephenson, E. Shane, J. Chase, J. Rowland, D. Ries, N. Justice, J. Zhang, L. Chan, and R. Cao, "Survey of machine learning techniques in drug discovery," Current Drug Metabolism, vol. 20, no. 3, pp. 185–193, 2019.
[5] S. Mohseni, N. Zarei, and E. D. Ragan, "A survey of evaluation methods and measures for interpretable machine learning," arXiv preprint arXiv:1811.11839, 2018.
[6] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, p. 386, 1958.
[7] O. Sharir, B. Peleg, and Y. Shoham, "The cost of training nlp models: A concise overview," arXiv preprint arXiv:2004.08900, 2020.
[8] H. Chen, B. D. Rohani, and F. Koushanfar, "Deepmarks: a digital fingerprinting framework for deep neural networks," arXiv preprint arXiv:1804.03648, 2018.
[9] B. D. Rouhani, H. Chen, and F. Koushanfar, "Deepsigns: A generic watermarking framework for protecting the ownership of deep learning models."
[10] S. Szyller, B. G. Atli, S. Marchal, and N. Asokan, "Dawn: Dynamic adversarial watermarking of neural networks," arXiv preprint arXiv:1906.00830, 2019.
[11] Y. Uchida, Y. Nagai, S. Sakazawa, and S. Satoh, "Embedding watermarks into deep neural networks," in Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, 2017, pp. 269–277.
[12] H. Chen, C. Fu, J. Zhao, and F. Koushanfar, "Deepinspect: A black-box trojan detection and mitigation framework for deep neural networks," in IJCAI, 2019, pp. 4658–4664.
[13] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, "Stealing machine learning models via prediction apis," in 25th USENIX Security Symposium (USENIX Security 16), 2016, pp. 601–618.
[14] X. Chen, W. Wang, C. Bender, Y. Ding, R. Jia, B. Li, and D. Song, "Refit: a unified watermark removal framework for deep learning systems with limited data," arXiv preprint arXiv:1911.07205, 2019.
[15] S. Guo, T. Zhang, H. Qiu, Y. Zeng, T. Xiang, and Y. Liu, "The hidden vulnerability of watermarking for deep neural networks," arXiv preprint arXiv:2009.08697, 2020.
[16] Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database," AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010.
[17] A. Krizhevsky, V. Nair, and G. Hinton, "The CIFAR-10 dataset," online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.
[18] C. Rey and J.-L. Dugelay, "A survey of watermarking algorithms for image authentication," EURASIP Journal on Advances in Signal Processing, vol. 2002, no. 6, pp. 1–9, 2002.
[19] Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet, "Turning your weakness into a strength: Watermarking deep neural networks by backdooring," in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1615–1631.
[20] E. Le Merrer, P. Perez, and G. Trédan, "Adversarial frontier stitching for remote neural network watermarking," Neural Computing and Applications, vol. 32, no. 13, pp. 9233–9244, 2020.
[21] R. Namba and J. Sakuma, "Robust watermarking of neural network with exponential weighting," in Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, 2019, pp. 228–240.
[22] J. Zhang, Z. Gu, J. Jang, H. Wu, M. P. Stoecklin, H. Huang, and I. Molloy, "Protecting intellectual property of deep neural networks with watermarking," in Proceedings of the 2018 Asia Conference on Computer and Communications Security, 2018, pp. 159–172.
[23] D. Hitaj, B. Hitaj, and L. V. Mancini, "Evasion attacks against watermarking techniques found in MLaaS systems," in 2019 Sixth International Conference on Software Defined Systems (SDS). IEEE, 2019, pp. 55–63.
[24] ——, "Evasion attacks against watermarking techniques found in MLaaS systems," in 2019 Sixth International Conference on Software Defined Systems (SDS). IEEE, 2019, pp. 55–63.
[25] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[26] I. Goodfellow, P. McDaniel, and N. Papernot, "Making machine learning robust against adversarial inputs," Communications of the ACM, vol. 61, no. 7, pp. 56–66, 2018.
[27] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay, "Adversarial attacks and defences: A survey," CoRR, vol. abs/1810.00069, 2018. [Online]. Available: http://arxiv.org/abs/1810.00069
[28] N. Papernot, P. McDaniel, and I. Goodfellow, "Transferability in machine learning: from phenomena to black-box attacks using adversarial samples," arXiv preprint arXiv:1605.07277, 2016.
[29] S. Targ, D. Almeida, and K. Lyman, "Resnet in resnet: Generalizing residual architectures," arXiv preprint arXiv:1603.08029, 2016.
[30] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial machine learning at scale," arXiv preprint arXiv:1611.01236, 2016.
[31] L. Torrey and J. Shavlik, "Transfer learning," in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, pp. 242–264.
[32] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "Pytorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
