
Received: 2 January 2022 | Revised: 20 October 2022 | Accepted: 22 November 2022

DOI: 10.1049/ise2.12102

ORIGINAL RESEARCH

IET Information Security

On the performance of non‐profiled side channel attacks based on deep learning techniques

Ngoc‐Tuan Do1 | Van‐Phuc Hoang1 | Van Sang Doan2 | Cong‐Kha Pham3

1 Institute of System Integration, Le Quy Don Technical University, Hanoi, Vietnam
2 Vietnam Naval Academy, Nha Trang, Vietnam
3 The University of Electro‐Communications (UEC), Tokyo, Japan

Correspondence
Van‐Phuc Hoang, Institute of System Integration, Le Quy Don Technical University, no. 236 Hoang Quoc Viet Str., Hanoi, 100000, Vietnam.
Email: phuchv@lqdtu.edu.vn

Funding information
National Foundation for Science and Technology Development, Grant/Award Number: 102.02‐2020.14

Abstract
In modern embedded systems, security issues including side‐channel attacks (SCAs) are becoming of paramount importance, since embedded devices are ubiquitous in many categories of consumer electronics. Recently, deep learning (DL) has been introduced as a promising new approach for profiled and non‐profiled SCAs. This paper proposes and evaluates the application of different DL techniques, including convolutional neural network (CNN) and multilayer perceptron (MLP) models, for non‐profiled attacks on an AES‐128 encryption implementation. In particular, the proposed networks are fine‐tuned with different numbers of hidden layers, labelling techniques, and activation functions. Along with the designed models, a dataset reconstruction and labelling technique for the proposed models has also been developed to solve the high‐dimension data and imbalanced dataset problems. As a result, the DL‐based SCA with our reconstructed dataset has achieved higher non‐profiled attack performance on different targets: the ASCAD database, a RISC‐V microcontroller, and ChipWhisperer boards. Specifically, investigations were performed to evaluate the efficiency of the proposed techniques against different SCA countermeasures, such as masking and hiding. In addition, the effect of the activation function on the proposed DL models was investigated. The experimental results clarify that the exponential linear unit function is better than the rectified linear unit in fighting against the noise generation‐based hiding countermeasure.

KEYWORDS
computer network security, cryptography, embedded systems, security

This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
© 2022 The Authors. IET Information Security published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

1 | INTRODUCTION

Embedded systems have been applied to many categories of consumer electronics, including home security systems, home appliances, printers, smart watches, etc. These embedded systems often store, access, or communicate private/sensitive data that cause a serious security concern during their operation [1]. Therefore, cryptographic algorithms are widely used as security solutions for embedded systems. Although brute force attacks cannot break these mathematically secure algorithms, numerous accounts' confidential keys have been broken by exploiting side‐channel information such as timing information, power consumption, or electromagnetic (EM) leaks collected from cryptographic devices. In general, side‐channel attacks (SCAs) can be classified into two approaches: profiled and non‐profiled attacks.

Profiled attacks use a reference device, which is identical (or very close) to the target device, to build a database stocking power consumption information dedicated to a type of device [2]. This class of attack was initially proposed in Ref. [3] and then developed under the well‐known name of the Template attack [4]. In particular, profiled attacks take place in two stages: the profiling stage and the key extraction stage. In the profiling stage, a large number of power traces recorded from the reference device are used to build a template for each hypothesis key based on the multivariate Gaussian distribution. Then, the key extraction stage uses a small number of power traces (very small compared to the profiling stage) collected from the target device to find the correct key based on the template, which has been pre‐built in the profiling stage. Such a detection method is based on the maximum likelihood metric.



Even though profiled attacks are considered the most powerful among SCAs, the conditions of this method are sometimes difficult to satisfy in practice. Indeed, for closed products like smart cards running banking applications, the attacker does not have control of the keys and is usually limited by a transaction counter. In such a case, profiled attacks cannot be performed. However, the device is still threatened by a method called the non‐profiled SCA attack.

Non‐profiled attacks are based on the relationship between the power consumption model and the real power consumption. By computing the correlation between the ground‐truth models and the power traces recorded from the target device, the non‐profiled SCA can recover the secret key. Therefore, non‐profiled attacks do not require any reference device. If the power consumption models are incorrect, it is impossible to detect the right key. According to the authors in Ref. [5], Hamming weight (HW) and Hamming distance (HD) are commonly used as power consumption models. The HW model is useful if a pre‐charged bus is used. It means that the HW model is particularly well applicable to software implementations. On the other hand, the HD model is appropriate to describe the power consumption of buses and registers in hardware implementations. In order to determine the correct key, the attacker uses a discriminator like correlation power analysis (CPA) to determine the most probable key among a set of keys.

Recently, the hardware security research community has focussed its attention on deep learning (DL). This is a promising technique for implementing powerful attacks against cryptographic devices. However, most previous publications focussed on DL‐based profiled attacks. Similar to template attacks, mounting a DL‐based profiled attack requires access to a reference device, which is sometimes challenging to satisfy on closed devices and limits the usage of the DL technique. Based on the advantages of non‐profiled attacks, Timon introduced the first DL‐based non‐profiled approach, called the Differential Deep Learning Analysis (DDLA) technique, at the Cryptographic Hardware and Embedded Systems Conference 2019 (TCHES 2019) [6]. Accordingly, DDLA can implement the DL algorithm to reveal the correct key without any copy of the target device. The efficiency of DDLA has been clarified on both non‐protected and protected devices using hiding (non‐synchronised power traces) or masking techniques.

1.1 | Motivation

Despite being effective on both unprotected and protected (masking) SCA data, the performance of DDLA has not been investigated in other scenarios, such as high dimension data input, labelling techniques, or additional SCA countermeasures (hiding with a noise generator). In addition, one drawback of DDLA is that it is necessary to perform a DL training for each key guess (i.e., 256 trainings in the case of AES‐128) [6]; the higher the data dimension, the more complex the network architecture. Therefore, high dimension data is a serious problem for DDLA, which motivated us to extend the previous work [7] with a comprehensive analysis to assess the efficiency of DDLA techniques in a more complicated context. It is expected that a new DL architecture combined with the HW labelling technique can tackle the high dimension data issue. On the other hand, to deal with the imbalanced dataset problem, a new simple labelling technique based on HW values is proposed. Finally, we verify how an increase in dataset size helps mitigate the effect of hiding countermeasures like noise generators.

1.2 | Previous works

In this subsection, some previous works related to the topic of this paper are surveyed and discussed. Firstly, regarding the data preparation technique, Picek et al. [8] used correlation analysis to extract the most relevant samples. This technique exploits the correlation between power traces and a power consumption model. Another technique, namely principal component analysis (PCA) [9], which is usually used in DL to reduce the data dimension, has also been employed for attacking the secret key. However, the main drawback of PCA is its computation time, which increases quadratically with the number of samples on a power trace. Despite being efficient at reducing the data dimension, the above‐mentioned techniques have been applied in the profiled context only.

Secondly, in terms of labelling techniques, two typical labelling techniques, HW and Binary, have been applied in DL‐based non‐profiled SCA [6]. The efficiency of DDLA using the Binary labelling method is proven in many works [6, 10, 11]. In contrast, no report of using the HW labelling method has been published in the non‐profiled DL context. Additionally, the authors in Ref. [12] have indicated that the HW model causes the imbalanced dataset problem in the DL‐based SCA (DLSCA) scenario. Indeed, by observing Table 1, it is obvious that the distribution of intermediate values over the HW classes is imbalanced and symmetric about HW4 in the case of AES‐128. To address this problem, the authors in Ref. [12] proposed a method based on a data re‐sampling technique. Accordingly, they used a random oversampling method, the so‐called synthetic minority over‐sampling technique (SMOTE), to oversample each class. In practice, SMOTE can be considered a general case of the data augmentation (AU) technique proposed in Ref. [13].

TABLE 1 Probability distribution of the Hamming weight (HW) of a uniformly distributed 8‐bit value

HW          | 0     | 1     | 2      | 3      | 4      | 5      | 6      | 7     | 8
Probability | 1/256 | 8/256 | 28/256 | 56/256 | 70/256 | 56/256 | 28/256 | 8/256 | 1/256
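The distribution in Table 1 is simply the binomial law: exactly C(8, w) of the 256 byte values have HW equal to w. The following short Python check (our illustration, not part of the original work) reproduces the table:

```python
# Reproduce Table 1: the HW of a uniformly distributed 8-bit value is
# binomially distributed, since HW(v) = w for exactly C(8, w) of the 256 values.
from math import comb

counts = [0] * 9
for v in range(256):
    counts[bin(v).count("1")] += 1          # tally the Hamming weight of each byte

for w in range(9):
    assert counts[w] == comb(8, w)          # matches the binomial coefficient
    print(f"HW={w}: {counts[w]}/256")
```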

Finally, regarding the impact of noise, several works have investigated the effect of noise addition on DL‐based SCA [14, 15]. Kim et al. have demonstrated in Ref. [14] that adding artificial noise to the input signal can actually improve the performance of neural networks. In Ref. [15], Maghrebi has proved that noise addition can avoid over‐fitting in DL‐based SCA techniques. Both works have indicated that adding noise is beneficial when implementing DL‐based profiled SCA. In the non‐profiled scenario, we have performed a preliminary investigation of Gaussian noise on the DL‐based SCA attack [7]. In contrast to DL‐based profiled SCA, the results in Ref. [7] show that Gaussian noise has a negative effect on DDLA techniques.

1.3 | Our contributions

Our work has the following main contributions:

‐ Firstly, we introduce a method of data preparation using correlation in non‐profiled scenarios. In Section 2, we assume that the attackers only have a basic knowledge of the advanced encryption standard (AES) implementation. They cannot specify the samples corresponding to attack positions like the Sbox function. Therefore, a huge number of samples for each power trace must be recorded. In this case, we collect power traces that contain the first round of the AES‐128 algorithm on ChipWhisperer (CW) (10,000 samples) or RISC‐V processors (9919 samples). Related to the imbalanced dataset problem, we propose a labelling technique of three significant HWs, which is more resistant to noise than that of 256 key classes or nine HW classes. In comparison with TCHES 2019, our experimental results show that the proposed models provide a reliable technique to reveal the correct key in first‐order non‐profiled attacks.
‐ Secondly, new convolutional neural network (CNN) and multilayer perceptron (MLP) instances in the non‐profiled context are introduced. In Section 3, we propose DL architectures, including MLP and CNN, working on the dataset of three HW labels. We have demonstrated that our proposed models perform well on various platforms such as CW, the ANSSI SCA Database (ASCAD), and a RISC‐V processor implemented on field‐programmable gate array (FPGA) technology.
‐ Next, we extend our previous work in Ref. [7] to investigate the efficiency of non‐profiled DL‐based SCA in different contexts, such as an unprotected AES implementation and protected AES using the masking and hiding techniques. In Section 4, we perform various experiments with different levels of additive noise. In this case, we consider a noise generator as the main hiding technique. We demonstrate in Section 4 that DL‐based non‐profiled SCAs are sensitive to additive noise.
‐ Finally, we compare the results of using two different activation functions, the exponential linear unit (ELU) and the rectified linear unit (ReLU), and verify how an increase in dataset size helps mitigate the effect of hiding countermeasures like noise generators. We aim to point out the efficiency of the ELU activation function in providing more consistent results and distinguishing the correct key better than ReLU in the presence of hiding countermeasures. However, due to the exponential function, the ELU activation function makes the correct key detection slower than the ReLU one.

1.4 | Paper outline

The rest of this paper is organised as follows. In Section 2, we describe data preparation, including test platforms and dataset reconstruction. Section 3 presents our proposed architectures, including MLP and CNN, which are used to deal with the reconstructed dataset in the non‐profiled scenario. In the next section, we discuss the specific results from various experiments implemented on raw power traces, which are collected from the CW, a RISC‐V microcontroller (MCU), or the ASCAD database [16]. Simultaneously, the effect of noise is investigated in this section using CW data. Section 5 discusses some limitations of this work and suggests future developments. Finally, we conclude the paper in Section 6.

2 | DATA PREPARATION

2.1 | Experimental platforms

In this work, both first‐order and second‐order leakage data are used for evaluating the performance. In the case of first‐order data, we use power traces of AES‐128 software implementations, such as the unmasked ASCAD and RISC‐V‐based data. Considering a higher level of security, we used the ASCAD database and noise‐added CW data for investigating the masking and hiding countermeasures, respectively.

2.1.1 | Unmasked ASCAD dataset

First, we chose a public dataset, namely ASCAD, which was introduced by Prouff et al. in Ref. [16]. The dataset provides side‐channel power traces of an 8‐bit ATMega8515 board with a first‐order protected software AES implementation. ASCAD is composed of two sets of traces: a profiling set of 50,000 traces to train DL networks and an attack set of 10,000 traces to test the efficiency of the trained neural networks. It is worth noting that the 700 samples correspond to the output of the third Sbox processing during the first round. In the unmasked setting, the mask is known, and thus the data quickly turn into an unprotected scenario. Accordingly, the leakage model is computed as follows:

$$H_{p_3,k} = \mathrm{HW}\Big(\mathrm{Sbox}\big(p_3 \oplus k^*\big) \oplus \underbrace{m_3}_{\text{known mask}}\Big) \qquad (1)$$
to point out the efficiency of the ELU activation function in

where $p_3$, $k^*$, and $m_3$ are the third byte of the plaintext, the key, and the third byte of the known mask, respectively.

2.1.2 | RISC‐V dataset

Next, an experimental system is set up to collect power traces automatically. As depicted in Figure 1, the experimental system consists of a Keysight DSOX6004A oscilloscope, a monitoring personal computer (PC), and a Sakura‐G FPGA board, in which the Sakura‐G FPGA board is the test platform. It is worth noting that the target device is a 32‐bit Murax RISC‐V MCU operating at 48 MHz. This MCU is implemented on the Sakura‐G board and then programmed with an AES‐128 software implementation in C language. In our experiments, the secret key is fixed, and the plaintexts are chosen randomly.

The Keysight DSOX6004A oscilloscope is employed to measure side‐channel data when the target RISC‐V SoC executes the AES‐128 encryption. The oscilloscope is configured with four analogue channels, 6 GHz maximum bandwidth, and up to 20 GSa/s sample rate. Two passive probes are used to gather power traces from the target device in the experimental workplace. One probe is applied to acquire the analogue signal from the core VDD node of the Spartan‐6 FPGA at a 125 MSa/s sample rate. The second one detects the trigger signal provided by the RISC‐V target through a general‐purpose input/output pin. The oscilloscope is remotely controlled by a monitoring PC through Python software and a Virtual Instrument Systems Architecture (VISA) COM library.

The monitoring PC is utilised to operate the whole auto‐measuring system. It communicates with the oscilloscope through the local area network port and with the target device through the universal serial bus (USB) port. It repeatedly sends plaintexts to the target device and commands the oscilloscope to capture the power traces when the target SoC executes each encryption. After each encryption, the monitoring PC receives the measured data from the oscilloscope and the corresponding ciphertext from the target SoC. The Python software also verifies the ciphertext to ensure that the target device works correctly. The power traces and the corresponding plaintexts and ciphertexts are saved to NumPy files for later analysis.

FIGURE 1 Test platform: RISC‐V power trace acquisition on the Sakura‐G board.

As a result, 10,000 power traces of the Sakura‐G board are collected. Each power trace contains 9919 samples. On this platform, we chose the third byte for investigating the proposed techniques. The leakage model on RISC‐V power traces is similar to (1) but without the mask value, and is written as follows:

$$H_{p_3,k} = \mathrm{HW}\big(\mathrm{Sbox}(p_3 \oplus k^*)\big) \qquad (2)$$

2.1.3 | ChipWhisperer

Similar to the RISC‐V target, we set up an automatic system using the CW to acquire power traces. ChipWhisperer is an all‐in‐one platform, including a 10‐bit 105 MSa/s analog‐to‐digital converter (ADC) chip, a Spartan‐6 FPGA for controlling, and an Atmel XMEGA chip that serves as the target device. ChipWhisperer is controlled by a personal computer using Python software, which repeatedly sends the plaintexts to the CW board and receives the side channel data along with the corresponding ciphertexts via the USB port. All capturing processes are done automatically by the Spartan‐6 FPGA and the ADC chip. Accordingly, 5000 power traces have been collected from the CW board. Each power trace contains 10,000 samples, as depicted in Figure 2.

FIGURE 2 A raw power trace taken from the ChipWhisperer (CW) board.

2.2 | Points of interest for non‐profiled attack

A power trace, in principle, is a vector of voltage values recorded by a digital oscilloscope. The measured voltages are proportional to the power consumption of a cryptographic device because the oscilloscope connects to that device through an appropriate measurement circuit or an EM probe [17]. Accordingly, the power trace of the CW board illustrated in Figure 2 shows that a power trace contains many different sample points. However, some of these points do not contain any representative information for predicting cryptographic keys. Hence, using valuable points of interest (POI) of the power trace could improve the performance of SCAs and reduce the time required for revealing the correct key. For this reason, a method of extracting these POIs is proposed in this study in order to enhance the efficiency of SCAs in terms of accuracy and time cost.

Several techniques have been proposed for improving the performance of non‐profiled attacks by extracting the POI. For instance, Abdellatif et al. [18] exploited pattern recognition methods to filter interesting points of the power trace for obtaining a successful attack. In another work, Alipour et al. [10] used a simple method to reduce the sampling points of each power trace by 50%. Recently, in our previous work [19], an improved correlation method was presented for taking the most relevant samples in the power trace by computing the correlation between the real traces and their model. This method is suitable for a large number of samples because attackers do not need to know details about the AES implementation. In the remainder of this section, the dataset reconstruction based on correlation coefficients for the DL model in the non‐profiled attack scenario is presented.

2.3 | Dataset reconstruction

As mentioned in the previous section, non‐profiled attacks do not need a profiling phase as profiled attacks do; therefore, the datasets that are used for DL in non‐profiled attacks are different. Indeed, in Ref. [6], Timon indicated that it is possible to use the relationship between the power consumption and its model for classification by using neural networks. It means that we can use power trace values as input for training neural networks, which can learn to classify the power traces into the group that corresponds to the power consumption model. However, a power trace often contains thousands of samples, whereas only a part of them serves for key prediction. Therefore, using all of these samples as the input features of the DL model leads to an increase in DL complexity and time consumption [8, 20, 21]. In order to handle this issue, we use the correlation characteristic to find the most useful samples on a power trace to feed the neural network. Accordingly, the samples which have high correlation values with the power model will be selected as strong features for training a neural network.

To select the useful features, the Pearson correlation coefficient formula is applied:

$$\rho_{k,i} = \frac{\sum_{n=1}^{N_1} \big(h_{n,k} - \bar{h}_k\big)\big(t_{n,i} - \bar{t}_i\big)}{\sqrt{\sum_{n=1}^{N_1} \big(h_{n,k} - \bar{h}_k\big)^2 \sum_{n=1}^{N_1} \big(t_{n,i} - \bar{t}_i\big)^2}} \qquad (3)$$

where $\bar{h}_k$ and $\bar{t}_i$ are the average values of the power consumption model and the real power at instant i, respectively. Let us consider N random plaintexts corresponding to N power traces, in which each power trace has L samples. It is noted that $t_{i,j}$ is the value of the jth sample in the ith trace (1 ≤ j ≤ L, 1 ≤ i ≤ N), and $d_{i,B}$ is the byte value of byte B (B ∈ [1; 16]) in the ith plaintext. According to the authors in Ref. [19], for collecting the useful features from the power traces, half of the power traces are used, denoted as $N_1$ ($N_1 = \frac{1}{2}N$). Afterwards, formula (3) is applied to select the samples of high correlation for all guess keys (Key = [0; 255]).

We provide an intuitive example on the CW platform to illustrate the POI extraction method for determining the positions of high correlation on a dataset that contains 5000 power traces of 10,000 samples each. Firstly, the correlation of the real power traces with the HW model is calculated on 2500 power traces. As a result, a matrix of correlation coefficients with a size of 256 × 10,000 is produced, in which each row corresponds to a hypothesis key. Figure 3 illustrates the correlation of three rows (Key = 43, 44, 45) in the correlation matrix. Then, the positions of the 50 highest correlation values are located. Hence, 50 relevant sample points of all power traces are extracted, as depicted in Figure 3. Consequently, the size of the new dataset is reduced 200‐fold compared to that of the original one, since the correlation‐based extraction reduces the size of each power trace from 10,000 to 50 samples.

FIGURE 3 The positions on the sample axis corresponding to the highest correlation values over all hypothesis keys are taken (the red markers): (a) Correlation values of key 43; (b) Correlation values of key 44; (c) Correlation values of key 45.

On the one hand, these POI samples strongly correlate with the intermediate values created from the correct key. On the other hand, however, these leakage samples also correlate with incorrect keys. This leads to some 'ghost peaks' in SCA attacks (i.e., correlation power analysis attacks). In the case of DDLA attacks, the correlation of leakage samples with intermediate values created from an incorrect key will cause high accuracy for both correct and incorrect key guesses. Therefore, in this work, we divide the power traces into different sub‐datasets corresponding to each key hypothesis. Concretely, a smaller dataset of power traces is generated and reconstructed following the hypothesis keys and HW values, in which the HW plays the role of the label. The power traces correlated to the correct key are assigned to a separate sub‐dataset and fed to the network only once.

The authors in Ref. [12] indicated that the commonly used HW model leads to imbalanced training datasets. Indeed, by observing Table 1, it is obvious that the distribution of power traces on each HW is imbalanced and symmetric about HW4. Therefore, instead of labelling with all nine HW values, we propose to designate the dataset with three dominant HW values, namely HW3, HW4, and HW5, as shown in Figure 4. Specifically, each reconstructed dataset corresponds to a key hypothesis with a 'difference in the size' and a 'difference in the data.' The difference in the data of each sub‐dataset arises because, with each key guess, different plaintexts are combined to create the intermediate values whose HW equals three, four, or five. We expected that training on different data corresponding to each key guess would provide a better discrimination of correct and incorrect keys compared to using the same data for all networks. In this way, the numerous power traces corresponding to the remaining HW values can be ignored; therefore, the number of measurements is further reduced (by about one‐third). In addition, the fewer the output classes, the faster and more accurately the neural network can classify. As a result, we obtain 16 folders corresponding to the 16 sub‐bytes of a secret key. Each folder contains 256 sub‐folders corresponding to the 256 values of the potential guessed key. In each sub‐folder, three folders, namely HW3, HW4, and HW5, are used for the three labels of the CNN. Finally, for each label folder corresponding to the intermediate value (HW = 3, 4, 5), the power traces were partitioned as illustrated in Figure 4.

FIGURE 4 Structure of the new datasets: There are 16 folders (red) corresponding to the 16 bytes of the secret key; each folder contains 256 sub‐folders (blue) which correspond to the 256 hypothesis keys used. The N original power traces (L samples/trace) are processed to form N(i,1), N(i,2), N(i,3) new traces corresponding to three labelled folders (HW = {3, 4, 5}), where i denotes the value of the key guess (0 ≤ i ≤ 255). Each new trace contains the 50 samples with the highest correlation values.

Regarding the 'difference in the size,' the number of power traces in each labelled folder selected from N follows the distribution in Table 1. For example, with N = 5000 plaintexts, the number of traces assigned to each label is 1017, 1416, and 1105 (corresponding to HW3, HW4, and HW5). Although the number of power traces is not equal for all labels, it still ensures that the classification model works properly. In this work, four new datasets are reconstructed from the original ASCAD, RISC‐V, and CW data, as presented in Table 2. Each dataset is divided into two parts: training and validation data, corresponding to 80% and 20% of the created dataset, respectively.

TABLE 2 The details of the reconstructed datasets

Dataset  | ASCAD traces      | ASCAD dim | RISC‐V traces | RISC‐V dim | CW traces | CW dim | Labelling technique
Original | 20,000 (masked)   | 700       | 10,000        | 9919       | 5000      | 10,000 | ‐
Dataset1 | 20,000 (unmasked) | 700       | ‐             | ‐          | ‐         | ‐      | LSB
Dataset2 | ≈7000 (unmasked)  | 50        | ‐             | ‐          | ‐         | ‐      | 3‐HW
Dataset3 | ‐                 | ‐         | ≈7000         | 50         | ‐         | ‐      | 3‐HW
Dataset4 | ‐                 | ‐         | ‐             | ‐          | ≈3000     | 50     | 3‐HW
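A minimal sketch of this reconstruction for one key guess is given below, reusing HW and SBOX from the earlier leakage‐model sketch. The function name, the 0‐based byte index, and the contiguous 80/20 split are our assumptions:

```python
import numpy as np

def build_subdataset(traces, plaintexts, poi_idx, key_guess, byte_pos=2):
    """3-HW sub-dataset for one key guess (byte_pos = 2 is the third byte).

    traces: (N, L) raw traces; plaintexts: (N, 16) bytes;
    poi_idx: POI indices for this guess, e.g. one row of select_poi() above.
    """
    hw = HW[SBOX[plaintexts[:, byte_pos] ^ key_guess]]  # hypothetical HW per trace
    keep = (hw >= 3) & (hw <= 5)                        # keep the three dominant classes
    rows = keep.nonzero()[0]
    x = traces[np.ix_(rows, poi_idx)]                   # 50-sample POI traces
    y = hw[keep] - 3                                    # labels 0, 1, 2 for HW3..HW5
    n_train = int(0.8 * len(y))                         # 80% training / 20% validation
    return (x[:n_train], y[:n_train]), (x[n_train:], y[n_train:])
```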



3 | PROPOSED DEEP LEARNING ARCHITECTURES

Deep learning is a branch of neural network‐based machine learning that uses deep neural networks. It has been successfully applied to various fields, like image classification, speech recognition, and, recently, side channel analysis. While the original DDLA proposal [6] uses CNN and MLP as building blocks, one could use advanced DL techniques like recurrent neural networks (RNNs) or long short‐term memory (LSTM), especially in the case of sequence‐based data like SCA data. We, however, continue to use the MLP and CNN architectures for two distinct reasons. Firstly, RNN and LSTM are not currently well studied for SCA use cases. Secondly, the wide variety of results available for the use of CNN or MLP with SCA datasets helps us to benchmark our results. To this end, we provide new instances of MLP and CNN architectures in non‐profiled scenarios.

3.1 | Multilayer perceptron

Our proposed network comprises an input layer, an output layer, and six hidden layers. As presented in Section 2, the reconstructed dataset consists of 50 samples of each power trace. The number of nodes in the input layer is assigned according to the number of samples in a power trace. Therefore, the input layer of the proposed MLP has 50 nodes corresponding to the 50 features of the extracted power trace. As depicted in Figure 5, all arrows represent the weights. Prior to the training phase, the values of the weights and biases are randomly chosen from a normal distribution by using the Xavier scheme.

FIGURE 5 The proposed multilayer perceptron architecture.

In order to obtain the output, a procedure called forward propagation is performed. Forward propagation can be implemented as illustrated in Figure 5, and the weighted sum is calculated as follows:

$$y_i^{(l)} = b_i + \sum_{j=1}^{n^{(l-1)}} w_{ji}^{(l)} \, a_j^{(l-1)} \qquad (4)$$

where $b_i$ is the bias of the ith node, $w_{ji}^{(l)}$ is the weight which connects the jth node of layer l − 1 to the ith node of layer l, and $a_j^{(l-1)}$ is the output of the activation function F(y) at the jth node of layer l − 1, calculated as follows:

$$a_j^{(l)} = F\big(y_j^{(l)}\big) \qquad (5)$$

It is worth noting that formula (5) does not apply to the input layer. The output $a_j^{(l)}$ is then used as the input of the neurons in the next layer. This procedure is performed from the input layer to the output layer.

In the DL‐based SCA context, the popular activation functions used in hidden layers are ELU and ReLU, which are computed as in formulas (6) and (7), respectively. Our proposed model uses ELU instead of ReLU to avoid the vanishing gradient problem and to produce negative outputs for the nodes in the hidden layers.

$$\mathrm{ReLU}: \; F(y) = \begin{cases} y, & y > 0 \\ 0, & y \le 0 \end{cases} \qquad (6)$$

$$\mathrm{ELU}: \; F(y) = \begin{cases} y, & y > 0 \\ \alpha \, (e^{y} - 1), & y \le 0 \end{cases} \qquad (7)$$

For classification, the Softmax function is used in the output layer for calculating the probability of each HW label. This function is calculated as

$$\mathrm{SoftMax}: \; z(y)[i] = \frac{e^{y[i]}}{\sum_{j=1}^{K} e^{y[j]}} \qquad (8)$$

where K is the number of classes; in our case, K = 3 since our proposed model uses HW labels.

Finally, backward propagation is implemented in order to update the weights to obtain the expected results. Since we have three labels, the categorical cross‐entropy loss between the ground‐truth and prediction labels is computed as follows:

$$L_X(w) = -\sum_{j=1}^{3} y_{\mathrm{true}} \ln(z) \qquad (9)$$

where $y_{\mathrm{true}}$ is the ground‐truth value of the HW classes.

Then, we use the stochastic gradient descent with momentum (SGDM) optimiser to find the optimum minimising the loss function. Deep learning performs a series of iterations t, and in each iteration, the gradient of the loss function $\nabla L_X(w)$ is computed. After that, w is updated by using the following formula:

$$v_t = \gamma v_{t-1} + \eta \nabla_w L_X(w), \qquad w_{t+1} = w_t - v_t \qquad (10)$$

where γ is the momentum value. In our case, γ is chosen equal to 0.9, and the learning rate η is chosen equal to 0.01.

When the correct hypothesis key k* is used, the series of intermediate results will be correctly computed. Consequently, the partition and the labels used for our model will be consistent with the corresponding traces. In contrast, for all the incorrect key guesses, the labels used for the training will be incompatible with the traces. By analysing and optimising the model, we decided to use six hidden layers with the number of nodes as depicted in Figure 5. As a result, our model provides better results, with lower loss and higher accuracy, than the other candidates. Therefore, the correct key can be obtained.
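The experiments in this paper were run in MATLAB, so the following PyTorch sketch is only our transcription of MLPproposed4 as specified by Figure 5 and Table 4 (50 inputs, six ELU hidden layers, a three‐class output, Xavier initialisation, and the SGDM update of Eq. (10) with γ = 0.9 and η = 0.01):

```python
import torch
import torch.nn as nn

# MLPproposed4 of Table 4: 50 -> 150 -> 300 -> 600 -> 300 -> 100 -> 25 -> 3.
mlp = nn.Sequential(
    nn.Linear(50, 150), nn.ELU(),
    nn.Linear(150, 300), nn.ELU(),
    nn.Linear(300, 600), nn.ELU(),
    nn.Linear(600, 300), nn.ELU(),
    nn.Linear(300, 100), nn.ELU(),
    nn.Linear(100, 25), nn.ELU(),
    nn.Linear(25, 3),  # three HW classes; CrossEntropyLoss applies the softmax of Eq. (8)
)

# Xavier initialisation as stated in Table 4; zero biases are our simplification.
for m in mlp:
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

# SGD with momentum 0.9 and learning rate 0.01 implements the update of Eq. (10),
# and CrossEntropyLoss is the categorical cross-entropy of Eq. (9).
optimiser = torch.optim.SGD(mlp.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
```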

3.2 | Convolutional neural network

As demonstrated in Ref. [6], the CNN is an efficient architecture to perform DDLA on protected power traces. In this work, we propose a new CNN model for the non‐profiled SCA attack. Our novelty is to apply the HW labelling technique. The details of the proposed architecture are described as follows.

Our proposed architecture is composed of an input layer and three convolutional (Conv) blocks in the middle, followed by the fully connected (FC) and classification (output) layers. Each Conv block is formed by a Conv layer directly followed by a batch‐normalisation layer and an activation layer. In order to select the informative features and downsample the feature maps, a pooling layer is applied after the activation layer in each Conv block, as described in Figure 6.

FIGURE 6 The proposed convolutional neural network (CNN) architecture.

Similar to the MLP, the proposed CNN performs forward propagation to calculate the output (3 nodes) from the input layer (50 nodes) and backward propagation to update the learning metrics. However, unlike in MLPs, where each neuron has a separate weight vector, neurons in CNNs share weights. Neurons perform convolutions on the data, with the convolution filter being formed by the weights. In our case, the three Conv layers corresponding to the three blocks are designed with 16, 24, and 32 filters, respectively.

According to the authors in Ref. [6], on an unprotected AES implementation, DDLA uses only one main leakage area to classify the data. It means that the CNN model tries to find the most informative feature in the training phase. Therefore, the filter of the proposed model has a size of [1 3] and a stride of [1 1] to extract the strongest features. The weighted sum of each filter is calculated as follows:

$$a_{1,j} = \sum_{n=0}^{2} w_{1,n} \, I_{1,j+n} + b \qquad (11)$$

where I is the input data (or the output of the previous layer), b is the bias, and $w_{1,n}$ stands for an element of the filter (the convolutional kernel weights). Before being fed to a non‐linear activation function, the output of (11) is normalised. Compared to TCHES 2019, our proposal uses the ELU activation for the same reason as in the proposed MLP model. Finally, the FC layer along with the Softmax function plays the role of a classifier, which has three output classes corresponding to the three HW labels. In the backward propagation stage, we use the SGDM optimiser to find the optimum minimising the loss function. The details of the proposed CNN architecture are described in Table 3.

TABLE 3 Developed convolutional neural network (CNN) hyper‐parameters

Layer                   | Weight shape | Stride | Activation
Convolutional (1)       | 1 × 3 × 16   | ‐      | ‐
Batch‐Normalisation (1) | ‐            | ‐      | ELU
MaxPooling (1)          | ‐            | [1 2]  | ‐
Convolutional (2)       | 1 × 3 × 24   | ‐      | ‐
Batch‐Normalisation (2) | ‐            | ‐      | ELU
MaxPooling (2)          | ‐            | [1 2]  | ‐
Convolutional (3)       | 1 × 3 × 32   | ‐      | ‐
Batch‐Normalisation (3) | ‐            | ‐      | ELU
Average‐pooling (1)     | ‐            | [1 2]  | ‐
FC‐output               | ‐            | ‐      | Softmax

4 | EXPERIMENTAL RESULTS

In our experiments, the reconstructed datasets from Section 2.3 are used to perform training on the proposed models. We use different models to obtain the results. Firstly, with the unprotected dataset, we use a re‐implementation of the MLP architecture from TCHES 2019, called MLPTCHES2019, and the fine‐tuned proposed MLP model. In the case of a protected dataset, we reuse the MLPTCHES2019 model; however, different labelling techniques are used to train the masked data. Regarding hiding countermeasures, we consider noise generators as the main hiding technique [22]. For convenience, we simply add different levels of Gaussian noise to simulate this SCA countermeasure. We then use the fine‐tuned proposed MLP model and the proposed CNN for this dataset. All experiments are performed in MATLAB software running on a personal computer with an Intel Core i5‐9500 CPU and 24 gigabytes of DDR4 memory.

4.1 | Taking the correct key and fine‐tuning the model

Unlike profiled DLSCA, non‐profiled DLSCA attacks use the trend of training metrics, such as loss and accuracy, to detect the correct key. In this subsection, we indicate how to use accuracy for determining the correct key in a non‐profiled context; the same approach can be applied to the loss metric as well. Apart from taking the correct key, we aim to achieve a better performance by choosing the right number of hidden layers and the size of each hidden layer. Specifically, we propose four different models based on the MLP architecture, called MLPproposed1, MLPproposed2, MLPproposed3, and MLPproposed4, corresponding to four different numbers of hidden layers. In the case of the CNN‐based model, we reuse the CNN model described in Table 3, which was optimised in Ref. [19]. The details of the proposed models are presented in Table 4.

TABLE 4 Hyper‐parameters of the multilayer perceptron (MLP) models used in the experiments

Hyperparameter | MLPTCHES2019 | MLPproposed1   | MLPproposed2         | MLPproposed3               | MLPproposed4
Input          | 700          | 50             | 50                   | 50                         | 50
Hidden layers  | 2            | 3              | 4                    | 5                          | 6
Neurons        | 20 × 10      | 150 × 100 × 25 | 150 × 300 × 100 × 25 | 150 × 300 × 300 × 100 × 25 | 150 × 300 × 600 × 300 × 100 × 25
Output         | 2            | 3              | 3                    | 3                          | 3, 9
Label          | LSB          | 3‐HW           | 3‐HW                 | 3‐HW                       | 3‐HW, 9‐HW
Optimiser      | Adam         | SGDM           | SGDM                 | SGDM                       | SGDM
Activation     | ReLU         | ELU            | ELU                  | ELU                        | ELU, ReLU
Learning rate  | 0.001        | 0.01           | 0.01                 | 0.01                       | 0.01
Batch size     | 1000         | ‐              | ‐                    | ‐                          | ‐
Initialising   | Xavier initialisation (all models)

Firstly, we perform training processes on unprotected data (Dataset4) using the proposed MLP models. As an illustration, we present in Figure 7a the validation accuracy obtained when performing MLP‐based SCA attacks using the CW dataset with ne = 30 epochs per key guess. In the graph in Figure 7a, the horizontal axis presents the number of training epochs, and the vertical axis shows the validation accuracy values. We obtain a total of 256 curves corresponding to 256 accuracy metrics. It means that these curves are the results of 256 trainings using the 256 sub‐datasets (blue) shown in Figure 4. It can be seen that only one red curve has the highest value (around 65% from the 10th epoch to the last). In contrast, all the other blue curves are lower and fluctuate around 40%. This graph demonstrates that our proposed model can learn the dataset formed by the correct key and cannot learn the incorrect key datasets. Consequently, the correct key is found in our attack setup.

FIGURE 7 Results of validation accuracy using the proposed multilayer perceptron (MLP) model with different numbers of layers: (a) MLPproposed1; (b) MLPproposed2; (c) MLPproposed3; (d) MLPproposed4.
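The procedure just described (one independent training per key guess, with guesses ranked by their validation‐accuracy curves) condenses into the following sketch. It is our own Python/PyTorch rendering: `make_model` builds a fresh MLP or CNN, and `datasets` holds the 256 reconstructed sub‐datasets of Section 2.3 as tensors.

```python
import numpy as np
import torch

def ddla_attack(make_model, datasets, n_epochs=30):
    """datasets[k] = ((x_tr, y_tr), (x_va, y_va)) tensors for key guess k."""
    acc = np.zeros((256, n_epochs))
    for k in range(256):
        (x_tr, y_tr), (x_va, y_va) = datasets[k]
        model = make_model()                       # fresh weights for every guess
        opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        loss_fn = torch.nn.CrossEntropyLoss()
        for e in range(n_epochs):
            opt.zero_grad()
            loss = loss_fn(model(x_tr), y_tr)      # full-batch step, for brevity only
            loss.backward()
            opt.step()
            with torch.no_grad():                  # validation accuracy for this epoch
                pred = model(x_va).argmax(dim=1)
                acc[k, e] = (pred == y_va).float().mean().item()
    # The correct key should be the guess whose accuracy curve rises above the rest.
    return int(acc.max(axis=1).argmax()), acc
```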

As mentioned above, besides finding the correct key, we aim to find the most effective model among the different MLP architectures presented in Table 4. Furthermore, these experimental results clarify the efficiency of 3‐HW labelling compared to other techniques. In this case, the number of epochs and the value of accuracy are used as the metrics for determining the best model, which should require the smallest number of epochs and achieve the highest accuracy value to distinguish the correct key from the incorrect ones. As depicted in Figure 7, all the proposed MLP models provide good results in taking the correct key. There are no big differences between the results for different numbers of hidden layers. However, it can be seen that the most significant difference is in the first 10 epochs. With regard to accuracy, MLPproposed1 and MLPproposed2 are lower than the rest. These results show that using more hidden layers (MLPproposed3, MLPproposed4) could achieve higher accuracy in the case of unprotected data like CW. Switching to the number of epochs, it is clear that MLPproposed4 can distinguish the correct key with only five epochs. In addition, the accuracy at the fifth epoch of this model is the highest. Therefore, MLPproposed4 is selected for the next experiment.

Next, using MLPproposed4, we perform further experiments to investigate the efficiency of the three‐class labelling technique compared to others. However, only the HW labelling and LSB labelling techniques are investigated, since identity labelling will naturally lead to similar DL metrics for all the key candidates, as indicated in Ref. [6]. The Hamming weight labelling technique is commonly known as an effective power model, which contains nine classes (9‐HW). In this experiment, we perform a DLSCA training using MLPproposed4 with 9‐HW and compare it to the result of 3‐HW from the previous experiment. As illustrated in Figure 8, both labelling techniques achieve good results in taking the correct key. However, it is clear that the 9‐HW technique has a lower accuracy and requires at least seven epochs to discriminate the correct key. More importantly, the number of power traces needed for the 3‐HW technique is approximately 2100, compared to 3000 power traces for 9‐HW (approximately 30% fewer).

FIGURE 8 Experimental results of the proposed model using (a) the 9‐HW and (b) the 3‐HW labelling technique.

In summary, in the case of unprotected data like CW, MLPproposed4 provides stable results and requires fewer epochs than the others. Furthermore, the model using 3‐HW achieves better results and reduces the number of power traces needed for DLSCA. Therefore, for the rest of the paper, we choose MLPproposed4 using the 3‐HW label as the main MLP‐based model for all experiments on the unprotected dataset.

4.2 | Unprotected data

To demonstrate the efficiency of our proposals on different types of SCA data, we perform DLSCA on both EM and power consumption data. It is worth noting that all datasets in this experiment are first‐order leakage data. Therefore, MLPproposed4 and CNNproposed are selected to perform the attacks.

4.2.1 | Unmasked ASCAD

To investigate the efficiency of the proposed techniques on EM data, we use the unmasked ASCAD dataset reconstructed in Section 2. Simultaneously, we compare the efficiency of the two labelling techniques, 3‐HW‐based and LSB‐based, in the case of unprotected data.

Firstly, we perform DL training using the MLPTCHES model on Dataset1 with ne = 35 epochs per key guess. It is important to note that the dimension of the data input of MLPTCHES is the same as the original one. However, we run MLPTCHES in the MATLAB framework instead of PyTorch as in the original work. Secondly, we use MLPproposed4 and CNNproposed with the 3‐HW‐based labelling technique, and these models are trained on Dataset2 with ne = 32 epochs per key guess. The attack results on the third byte are illustrated in Figure 9, where we can see that by using the 3‐HW labels, the model has a lower validation accuracy than TCHES 2019, because the latter uses only two labels, which leads to classification results of at least 50%. Indeed, the result of MLPTCHES in Figure 9a shows that the validation accuracy of the correct key (red curve) increases gradually from 50% to approximately 68%, whereas those of the incorrect keys (blue curves) fluctuate around 50%. Obviously, MLPTCHES can discriminate the correct key after 10 epochs.

Switching to our proposed models, as can be clearly seen in Figure 9b, the validation accuracy of the correct key guess (red curve) increases in a zig‐zag way from 30% to 55.6% and then remains consistent. In contrast, the incorrect key guesses (blue curves) update their values only in the first 10 epochs and then remain unchanged below 40%. These results demonstrate that our proposed model using 3‐HW has the ability to reveal the correct key. Despite the lower accuracy, our technique gives a higher probability of discriminating the correct key than the model in Ref. [6]. In addition, the number of epochs needed for determining the correct key is only seven, compared to 10 for MLPTCHES (red dashed line).

Similar to MLPproposed4, CNNproposed also gives a good result, as illustrated in Figure 9c. The accuracy of the correct key is even higher than that of MLPproposed4 (about 60%). The incorrect keys behave nearly the same as for MLPproposed4 and keep values below 40%. More interestingly, CNNproposed needs only four epochs to discriminate the correct key. However, the results also point out the effect of optimisers on DDLA. In the original work, the authors used the Adam optimiser and obtained a smoother curve than our model using SGDM. However, even with the fluctuating curves obtained, the 3‐HW labelling‐based models using SGDM have better performance than those using Adam in all experiments.

FIGURE 9 Results on the ASCAD database for different models using the 3‐HW and LSB labelling techniques: (a) TCHES 2019 and LSB; (b) The proposed multilayer perceptron (MLP) and 3‐HW; (c) The proposed convolutional neural network (CNN) and 3‐HW.

4.2.2 | RISC‐V data

Regarding power consumption data, we perform the same experiment as above on SCA data collected from a 32‐bit RISC‐V processor, as described in Section 2. Once again, we use the MLPTCHES model for comparison with MLPproposed4 and CNNproposed. The difference here is that we use the reconstructed Dataset3 for both MLPTCHES and our proposed models, with ne = 32 epochs per key guess. We investigate the adaptation of those models to the new dataset. The labelling technique is kept the same as in the prior experiments. As shown in Figure 10, it is clear that our proposed CNN and MLP models perform well in discriminating the correct key, needing 7 and 10 epochs, respectively, whereas the TCHES 2019 model needs 13 epochs.

FIGURE 10 Results on the RISC‐V target with different models using 3‐HW: (a) TCHES 2019; (b) Proposed MLP; (c) Proposed CNN.

4.3 | Protected data

According to the authors in Ref. [6], the MLP‐based model is suitable for masked data since it can combine second‐order leakage samples automatically without any pre‐processing technique. On the other hand, the CNN‐based model is capable of fighting against hiding countermeasures. Therefore, in this experiment, an MLP‐based model is used to investigate the efficiency of the proposed technique on a masked dataset. Regarding hiding countermeasures, both MLP‐based and CNN‐based models are selected to evaluate the performance. However, unlike the hiding method in Ref. [6], we use the noise generation‐based hiding technique for all experiments.

4.3.1 | Masking

As presented in Section 2, the POI‐selected dataset is only suitable for first‐order leakage data. Therefore, in this experiment, we only investigate the efficiency of the proposed labelling technique; the reconstructed dataset method and the proposed MLP architectures are out of the scope of this experiment. A new dataset called ASCAD3HW is created based on the ASCAD data, labelled using 3‐HW instead of the LSB method. For convenience, we use the MLPTCHES model to evaluate both the LSB labelling and 3‐HW labelling techniques.

We perform two DLSCA attacks in this experiment. The first attack is launched with the MLPTCHES model, which is trained on the original ASCAD dataset with two outputs (LSB label). The second attack is implemented with the MLPTCHES model using the ASCAD3HW dataset with three outputs (3‐HW label). The experimental results are shown in Figure 11. It can be seen that the LSB labelling provides a clear distinction between the correct and incorrect keys, whereas there is only a small gap between the correct and incorrect keys in the case of the 3‐HW labelling method. It means that the correlation between the combined second‐order leakage samples and the HW model is lower than with the LSB model. However, this result indicates that 3‐HW labelling is able to break the masking countermeasure. In addition, the proposed HW‐based technique helps reduce the number of power traces by approximately 30% compared to the LSB labelling technique.

FIGURE 11 Experimental results on masked advanced encryption standard (AES) data using the LSB and 3‐HW labelling techniques: (a) LSB label; (b) 3‐HW label.

4.3.2 | Noise generation‐based hiding countermeasure

In this subsection, we evaluate the ability of non‐profiled DLSCA to defeat the noise generation‐based hiding countermeasure. In order to simulate the noise generator countermeasure, we simply add different levels of Gaussian noise to the power traces. As pointed out in Section 1, in non‐profiled contexts, we utilise the relationship between the real power consumption and the power consumption model. In addition, in Chapter 6 of Ref. [17], the authors showed that the additive noise contributes to the value of the signal‐to‐noise ratio as follows:

$$\mathrm{SNR} = \frac{\mathrm{Var}(P_{\mathrm{exp}})}{\mathrm{Var}(P_{\mathrm{sw.noise}} + P_{\mathrm{el.noise}})} \qquad (12)$$

where $P_{\mathrm{el.noise}}$ is the electronic noise, $P_{\mathrm{sw.noise}}$ is the switching noise, and $P_{\mathrm{exp}}$ denotes the exploitable power consumption. In our case, the additive noise plays the role of $P_{\mathrm{el.noise}}$. Furthermore, in Chapter 8 of Ref. [17], the authors indicated that the correlation is proportional to $\sqrt{\mathrm{SNR}}$. Therefore, when Gaussian noise is introduced, the relationship between the data input and the output (label) of DDLA weakens. As a result, it is more difficult for the DL model to learn the data correctly.
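In code, the simulated countermeasure amounts to nothing more than the following sketch (the names are ours); the closing comment restates how the added variance enters Eq. (12):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_hiding_noise(traces, sigma):
    """Simulated noise generator: zero-mean Gaussian noise added to raw traces."""
    return traces + rng.normal(0.0, sigma, size=traces.shape)

# In Eq. (12) the added noise acts as extra electronic noise, so the SNR becomes
# roughly Var(P_exp) / (Var(P_sw.noise + P_el.noise) + sigma**2), and the usable
# correlation shrinks with sqrt(SNR).
```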

In this experiment, the CW platform is chosen because the raw power traces from the CW platform have low noise, and we assume that they are not affected by the measurement equipment. Our method for adding noise is different from those in Refs. [14, 15]: the Gaussian noise is added directly to the power traces. It means that we try to simulate the real scenario in which an attacker performs the power measurements. We investigate our assumption on both CNN and MLP‐based DDLA. Three different datasets corresponding to three different levels of Gaussian noise were employed, as described in Table 5.

TABLE 5 The details of the reconstructed datasets from the noise‐added ChipWhisperer (CW) power traces

Dataset     | σ = 0.025 traces | Dim    | σ = 0.05 traces | Dim    | σ = 0.075 traces | Dim    | Labelling technique
CWdatanoise | 5000             | 10,000 | 5000            | 10,000 | 5000             | 10,000 | ‐
Dataset41   | ≈3000            | 50     | ‐               | ‐      | ‐                | ‐      | 3‐HW
Dataset42   | ‐                | ‐      | ≈3000           | 50     | ‐                | ‐      | 3‐HW
Dataset43   | ‐                | ‐      | ‐               | ‐      | ≈3000            | 50     | 3‐HW

First, we use CNNproposed to perform the training. The experimental results are shown in Figure 12. It is clearly shown that our proposed CNN models still provide good performance with a small level of noise, as shown in Figure 12a. However, the validation accuracy decreases considerably when the deviation of the noise increases. This decrease shows that additive noise has a negative effect on DDLA and proves our assumptions to be correct. Especially in the case of a high level of noise, the correct key is hidden among the remaining key byte hypotheses, as illustrated in Figure 12c.

Next, to analyse the effect of activation functions on the DL model, we conducted in turn two experiments with the CNN using the ELU and ReLU functions. We use the same three datasets with the noise‐added power traces. The results are depicted in Figure 12d–f, corresponding to the three levels of noise. We also provide the comparison between the results of the two activation functions in Figure 12g–i. In the case of a low level of noise (σ = 0.025), it is clear that the performance of the ELU activation is nearly the same as that of ReLU. However, the ELU activation function provides more consistent and higher accuracy for the proposed CNN (around 47% compared to around 45% from the 15th epoch to the last) when using power traces with a noise level of 0.05. At the next level of noise (σ = 0.075), the validation accuracies using ELU and ReLU do not show a clear difference. In summary, there is no big difference between using the ELU or ReLU activation in CNN‐based DDLA.

Switching to MLP‐based DDLA, we repeat the same experiments as done with the CNN model. The results are depicted in Figure 13, where the top row and middle row illustrate the results of MLPproposed4 using ELU and ReLU at different levels of noise, and the bottom row shows the comparison of the ELU and ReLU activations. Overall, MLPproposed4 provides good results in the non‐profiled context. However, MLP‐based DDLA needs more epochs to discriminate the correct key than CNN‐based DDLA. This matches our observations on the ASCAD and RISC‐V data.

In terms of the effect of noise, similar to CNNproposed, the accuracy of the correct key curve of MLPproposed4 goes down when the standard deviation of the Gaussian noise increases (from 0.025 to 0.075), as illustrated in Figure 13a–c. However, we can see a significant difference in performance between the ELU and ReLU activations, as depicted in Figure 13g–i.
FIGURE 12 Attack results of the Convolutional Neural Network (CNN) model fighting against three levels of the noise generation-based hiding countermeasure. Left column: 0.025; centre column: 0.05; right column: 0.075. Panels (a)–(c) are results of the CNN using the exponential linear unit (ELU) activation function; panels (d)–(f) are results of the CNN using the rectified linear unit (ReLU) activation function; panels (g)–(i) compare the validation accuracy of the CNN when using ELU and ReLU.
However, we can see a significant difference in performance between the ELU and ReLU activations, as depicted in Figure 13g–i. In the first case (σ = 0.025), it is clearly seen that the correct key accuracy of MLPproposed4 using the ELU activation (red curve) increases drastically and reaches a peak of about 60%, whereas the ReLU-based model (blue curve) increases gradually from 36% to 49%. This means that the ELU-based model provides greater learning ability than the ReLU one. In addition, the number of epochs needed to discriminate the correct key is only 5 (ELU, red dashed line) compared to 7 (ReLU, blue dashed line). More interestingly, in the case of higher noise (σ = 0.05), MLPproposed4 using ELU needs only 7 epochs compared to the 19 epochs of MLPproposed4 using ReLU to determine the correct key. Especially, the results presented in Figure 13i show that both models, using ELU and ReLU, fail to recover the key due to the presence of a high level of noise (σ = 0.075). However, the red curve is still higher than the blue curve. These results clarify that MLPproposed4 using ELU outperforms ReLU in fighting against the noise generation-based hiding countermeasure.

Finally, by increasing the size of the attack dataset, we investigate the ability of DLSCA to break the noise generator if enough power traces are collected. To simulate this scenario, we chose σ = 0.025 and σ = 0.055 to form three new datasets from the CW data, as presented in Table 6. In this experiment, all attacks are performed using MLPproposed4. The results of non-profiled DLSCA on the low noise dataset (Dataset51) are illustrated in Figure 14a. Evidently, our model provides good performance in the presence of a small level of generated noise (σ = 0.025). However, when the noise is increased (σ = 0.055) while keeping the same dataset size (Dataset52), the noise generator countermeasure successfully defeats our MLP-based model, as depicted in Figure 14b.
FIGURE 13 Attack results of the multilayer perceptron (MLP) model fighting against three levels of the noise generation-based hiding countermeasure. Left column: 0.025; centre column: 0.05; right column: 0.075. Panels (a)–(c) are results of the MLP using the exponential linear unit (ELU) activation function; panels (d)–(f) are results of the MLP using the rectified linear unit (ReLU) activation function; panels (g)–(i) compare the validation accuracy of the MLP using ELU and ReLU.
TABLE 6 The details of reconstructed datasets for evaluating the noise generation-based hiding countermeasure

Dataset      No. of traces (original)   No. of traces (reconstructed)   Dimension   Std of noise (σ)   Labelling technique
Dataset51    3000                       2100                            50          0.025              3-HW
CPAdata      3000                       ‐                               10,000      0.055              ‐
Dataset52    3000                       2100                            50          0.055              3-HW
Dataset53    4000                       2800                            50          0.055              3-HW
Dataset54    5000                       3500                            50          0.055              3-HW
For comparison, we also perform a first-order CPA attack on noisy data similar to Dataset52. It is noted that we use the same number of power traces (3000) as was used for creating Dataset52 in this CPA attack. The results shown in Figure 14c indicate that the CPA attack outperforms non-profiled DLSCA in the case of an applied noise generator countermeasure.

Next, we keep the level of noise (σ = 0.055) and increase the size of the dataset to approximately 2800 and 3500 power traces (Dataset53 and Dataset54). Figure 15 depicts the attack results. Interestingly, by increasing the size of the attack dataset, our model achieves better results
and breaks the noise generator countermeasure successfully, as shown in Figure 15c. These results clarify that increasing the attack set size can help mitigate the hiding countermeasure in the non-profiled scenario.
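For completeness, a generic first-order CPA distinguisher of the kind used in this comparison can be sketched as follows. This is our illustration, not the exact code behind Figure 14c; SBOX is assumed to be the standard AES S-box as a 256-entry NumPy array.

```python
import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming weight of each byte value

def cpa_attack(traces, plaintexts):
    # Centre the traces once; Pearson correlation is computed per time sample.
    t = traces - traces.mean(axis=0)
    scores = np.zeros(256)
    for k in range(256):
        h = HW[SBOX[plaintexts ^ k]].astype(np.float64)
        h -= h.mean()
        num = h @ t                                    # covariance numerator per sample
        den = np.sqrt((h @ h) * (t * t).sum(axis=0))   # product of standard deviations
        scores[k] = np.abs(num / den).max()            # best correlation over all samples
    return int(np.argmax(scores))                      # key-byte guess
```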
4.4 | Complexity

In this section, we analyse the complexity of the proposed methods compared to the previous work [6]. One drawback of DDLA is that it is necessary to perform DL training for each key guess. In the previous subsection, we used the number of epochs as the metric to quantify the efficiency of an attack. However, in practice, the time complexity could be reduced when the model uses the Early Stopping technique to finish the attacks. This technique is out of the scope of this study and will be considered in our future work. To complete the performance analysis of DDLA attacks, we provide the execution time comparison for a one-key-byte attack. Firstly, we recorded the execution times of the proposed model using the 3-HW and LSB labelling techniques. It is worth noting that all experiments were performed in MATLAB on a personal computer without a Graphics Processing Unit (GPU). This means that the execution time could be reduced dramatically when utilising parallel computing (a multi-core CPU or GPU). The results are summarised in Table 7.

Firstly, the efficiency of the dimensionality reduction method and the 3-HW labelling is demonstrated on the CW dataset. The execution time of DDLA attacks using the 3-HW label on a reconstructed dataset is significantly shorter than on the original dataset using the LSB label (approximately 1 h 40 min compared to 4 h 32 min), since the dimensionality of the data input is reduced from 10,000 to 50. The same trend in attack time can be seen in the results of the ASCAD unmasked dataset. However, the attack time decreases only slightly, from 6 h 53 min to 6 h, because the input dimension is reduced from 700 to 50. Next, we consider the case of the ASCAD masked dataset. In this context, the architecture, number of traces, and input dimensions are chosen to be the same as in Timon's work. Since the 3-HW dataset discards approximately 30% of the power traces, the execution time is reduced from 20 h 28 min to 19 h 32 min compared to the model using the LSB labelling method.
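The reported speed-up comes almost entirely from shrinking the input dimension before training. As a rough, hedged illustration of such a reduction, a 10,000-sample trace can be compressed to 50 points as below; our actual reconstruction method targets the first-order leakage points, so this window-averaging version is only a stand-in.

```python
import numpy as np

def compress_traces(traces, out_dim=50):
    # Illustrative stand-in for the dataset reconstruction step:
    # average non-overlapping windows so each trace shrinks to out_dim points.
    n, d = traces.shape
    w = d // out_dim                      # window width, e.g. 10,000 // 50 = 200
    return traces[:, : w * out_dim].reshape(n, out_dim, w).mean(axis=2)
```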
FIGURE 14 Non-profiled DLSCA and correlation power analysis (CPA) attack results against the hiding countermeasure. (a) DLSCA, ≈2100 power traces, σ = 0.025; (b) DLSCA, ≈2100 power traces, σ = 0.055; (c) CPA, 3000 power traces, σ = 0.055.

FIGURE 15 Experimental results of non-profiled DLSCA using different sizes of dataset. (a) ≈2100 power traces, σ = 0.055; (b) ≈2800 power traces, σ = 0.055; (c) ≈3500 power traces, σ = 0.055.
TABLE 7 The execution time of differential deep learning analysis (DDLA) attacks using different labelling techniques and architectures

Target            Model          Label   No. of traces   No. of trace samples   No. of epochs   Execution time
CW no mask        MLPexp [6]     LSB     3000            10,000                 30              4 h 31 min 59 s
CW no mask        MLPproposed4   3-HW    3000            50                     30              1 h 39 min 30 s
ASCAD unmasked    MLPexp [6]     LSB     10,000          700                    30              6 h 53 min 44 s
ASCAD unmasked    MLPproposed4   3-HW    10,000          50                     30              6 h 0 min 41 s
ASCAD masked      MLPexp [6]     LSB     20,000          700                    35              20 h 28 min 48 s
ASCAD masked      MLPexp         3-HW    20,000          700                    35              19 h 32 min 17 s
In summary, the dimensionality of the data input strongly impacts the performance of DDLA. Our proposed dimensionality reduction method has demonstrated its efficiency in the case of first-order leakage compared to Timon's work. In addition, the size of the data can be determined by the labelling technique used. In our scenario, the models using 3-HW achieved better results than those using the LSB labelling method in terms of execution time. It is noted that the neural networks used for the experiments in this study were purposely kept small to limit the complexity of the attacks. The architectures used are surely not optimal, and other hyper-parameters of the network might lead to better results.

5 | FURTHER DISCUSSION

The experimental results described in the previous sections show that the proposed architectures and dataset reconstruction method work well in non-profiled contexts. However, some limitations exist in our work.

First, the dataset reconstruction method focuses only on the first-order leakage information and depends on the power consumption model. This limits our DDLA attacks in scenarios where a protection technique such as masking is applied. In addition, for different platforms, we must choose a suitable power consumption model. For example, the HW model will be used in the case of AES software implemented on the CW platform or the RISC-V processor on the Sakura-G board, whereas the HD model will be chosen for hardware implementations such as a high-speed AES implementation on an application-specific integrated circuit or an FPGA.

Second, for each time sample t, the operation on all power traces is usually the same. Therefore, the variance of the operation-dependent power consumption Var(P_operation) = 0, and Var(P_exp) depends only on Var(P_data). Consequently, the signal-to-noise ratio (SNR) value in Equation (12) can be further simplified as follows:

    SNR = Var(P_data) / Var(P_sw.noise + P_el.noise)    (13)

As presented in Section 2.3, our proposed labelling technique uses the values of HW as labels (HW = 3, 4, 5). In addition, from Table 1, it is clear that HW values with a large deviation from the mean appear with low probability. Therefore, using only HW = 3, 4, 5 decreases the deviation of the data, Var(P_data). Consequently, the SNR decreases when only the power traces corresponding to HW = 3, 4, 5 are used. As mentioned in the previous section, the correlation is proportional to √SNR. As a result, the higher the level of additive noise, the harder it is to attack.
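The 3-HW reconstruction itself is compact to express. Below is a minimal sketch of the idea, not the exact implementation from the paper: for a given key-byte hypothesis, traces whose hypothetical S-box output has a Hamming weight outside {3, 4, 5} are discarded, and the rest are labelled with the three remaining HW classes. SBOX is assumed as before.

```python
import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming weight of each byte value

def reconstruct_3hw(traces, plaintexts, k_hyp):
    # Hamming weight of the hypothetical S-box output under key guess k_hyp.
    hw_vals = HW[SBOX[plaintexts ^ k_hyp]]
    keep = (hw_vals >= 3) & (hw_vals <= 5)   # drop the rare, high-deviation HW classes
    labels = hw_vals[keep] - 3               # three roughly balanced classes: 0, 1, 2
    return traces[keep], labels
```

Since HW values 3–5 cover 182 of the 256 possible byte values, roughly 30% of the traces are discarded on average, which is consistent with the reduction mentioned in Section 4.4.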
In terms of activation functions, we have clarified that MLPproposed using ELU outperforms the ReLU-based case. However, the exponential computation in ELU (e^x) requires a longer processing time. We take the MLPproposed training time on CW data as an example. For one hypothesis key (with the same dataset and the same number of epochs, 30), the average time for training MLPproposed using ELU is 39.03 s, whereas the average time using ReLU is 36.93 s. For recovering all 16 bytes of the secret key, MLPproposed using ELU therefore incurs an extra delay of approximately 16 × 256 × (39.03 − 36.93) = 8601.6 s (2.39 h).

From the disadvantages mentioned above, some open problems can be considered. First, for DDLA countermeasures, a noise generator is a good solution. The authors in Ref. [10] have indicated that noise generators can help to harden cryptographic devices against DDLA attacks. Second, for the reduced-dimension data, a new data reconstruction method and DL architecture need to be considered to decrease the number of trainings needed for recovering the correct key.

6 | CONCLUSION

In this study, we have shown the practical issues that occur in DL-based SCA for aligned power traces with a large number of samples. In addition, by utilising the HW model, the imbalanced dataset problem was identified. In order to tackle the issues mentioned above, we have introduced a data preparation method which reduces the data dimension (from 10,000 (CW) or 9919 (RISC-V) samples to 50). More interestingly, we have shown a novel labelling technique based on the HW model to deal with the imbalanced dataset problem. The experimental results have clarified that our data preparation technique is capable of extracting useful features for non-profiled SCA based on neural networks. Our proposed model architecture was fine-tuned with different numbers of hidden layers, labelling techniques, and activation functions. Especially, the proposed model using 3-HW labelling and the ELU function provides reliable results for non-profiled attacks on various datasets such as ASCAD, RISC-V MCU, and CW. Significantly, the results have shown that a noise generator is a promising SCA countermeasure for preventing non-profiled DLSCA. We have also experimentally demonstrated that the proposed MLP model using the ELU activation function provides better performance than ReLU in fighting against the noise generation-based hiding countermeasure. In contrast, there is no clear difference between using ELU and ReLU activation in the CNN model. To complete the performance analysis, the time complexity of DDLA attacks was investigated.
The experimental results have illustrated the relationship between the time complexity and the dimensionality of the data input. By using the dimensionality reduction method on first-order leakage data, we achieved better results than the original work of Timon et al. Furthermore, it is also clarified that a suitable labelling technique should be selected to reduce the data complexity of DDLA attacks. In future work, we aim to increase the performance of neural networks for non-profiled attacks by investigating other common DL techniques such as Early Stopping or parallel architectures.

AUTHOR CONTRIBUTIONS
Ngoc-Tuan Do: Resources; visualization; writing – original draft. Van-Phuc Hoang: Conceptualization; funding acquisition; investigation; methodology; project administration; writing – review and editing. Van Sang Doan: Methodology; software. Cong-Kha Pham: Writing – review and editing.

ACKNOWLEDGEMENT
This research is funded by Vietnam NAFOSTED under grant number 102.02-2020.14.

CONFLICT OF INTEREST
The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.

DATA AVAILABILITY STATEMENT
No data is available.

PERMISSION TO REPRODUCE MATERIALS FROM OTHER SOURCES
None.

ORCID
Ngoc-Tuan Do https://orcid.org/0000-0002-9195-658X
Van-Phuc Hoang https://orcid.org/0000-0003-0944-8701
Van Sang Doan https://orcid.org/0000-0001-9048-4341
Cong-Kha Pham https://orcid.org/0000-0001-5255-4919

REFERENCES
1. Yamauchi, M., et al.: Anomaly detection in smart home operation from user behaviors and home conditions. IEEE Trans. Consum. Electron. 66(2), 183–192 (2020). https://doi.org/10.1109/tce.2020.2981636
2. Le, T.H., Canovas, C., Clédière, J.: An overview of side channel analysis attacks, pp. 33–43 (2008)
3. Fahn, P.N., Pearson, P.K.: IPA: a new class of power attacks. In: Koç, Ç.K., Paar, C. (eds.) Cryptographic Hardware and Embedded Systems, pp. 173–186. Springer Berlin Heidelberg, Berlin (1999)
4. Chari, S., Rao, J.R., Rohatgi, P.: Template attacks. In: Kaliski, B.S., Koç, Ç.K., Paar, C. (eds.) Cryptographic Hardware and Embedded Systems – CHES 2002, pp. 13–28. Springer Berlin Heidelberg, Berlin (2003)
5. Ouladj, M., Guilley, S.: Foundations of Side-Channel Attacks, pp. 9–20. Springer International Publishing, Cham (2021)
6. Timon, B.: Non-profiled deep learning-based side-channel attacks with sensitivity analysis. IACR Transactions on Cryptographic Hardware and Embedded Systems, 107–131 (2019). https://doi.org/10.46586/tches.v2019.i2.107-131
7. Do, N.T., Hoang, V.P., Doan, V.S.: Performance analysis of non-profiled side channel attacks based on convolutional neural networks. In: 2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 66–69 (2020)
8. Picek, S., et al.: On the performance of convolutional neural networks for side-channel analysis. In: Chattopadhyay, A., Rebeiro, C., Yarom, Y. (eds.) Security, Privacy, and Applied Cryptography Engineering, pp. 157–176. Springer International Publishing, Cham (2018)
9. Jolliffe, I.: Principal Component Analysis, pp. 1094–1096. Springer Berlin Heidelberg, Berlin (2011)
10. Alipour, A., et al.: On the performance of non-profiled differential deep learning attacks against an AES encryption algorithm protected using a correlated noise generation based hiding countermeasure. In: 2020 Design, Automation and Test in Europe Conference and Exhibition (DATE), pp. 614–617 (2020)
11. Kuroda, K., et al.: Practical aspects on non-profiled deep-learning side-channel attacks against AES software implementation with two types of masking countermeasures including RSM. In: Proceedings of the 5th Workshop on Attacks and Solutions in Hardware Security (ASHES '21), pp. 29–40. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3474376.3487285
12. Picek, S., et al.: The curse of class imbalance and conflicting metrics with machine learning for side-channel evaluations. IACR Transactions on Cryptographic Hardware and Embedded Systems 2019 (2018)
13. Cagli, E., Dumas, C., Prouff, E.: Convolutional neural networks with data augmentation against jitter-based countermeasures. In: Fischer, W., Homma, N. (eds.) Cryptographic Hardware and Embedded Systems – CHES 2017, pp. 45–68. Springer International Publishing, Cham (2017)
14. Kim, J., et al.: Make some noise. Unleashing the power of convolutional neural networks for profiled side-channel analysis. IACR Transactions on Cryptographic Hardware and Embedded Systems, 148–179 (2019). https://doi.org/10.46586/tches.v2019.i3.148-179
15. Maghrebi, H.: Deep learning based side channel attacks in practice. IACR Cryptol. ePrint Arch. 2019, 578 (2019)
16. Prouff, E., et al.: Study of deep learning techniques for side-channel analysis and introduction to ASCAD database. IACR Cryptol. ePrint Arch. 2018, 53 (2018)
17. Mangard, S., Oswald, E., Popp, T.: Power Analysis Attacks – Revealing the Secrets of Smart Cards. Springer (2007)
18. Moussa Ali Abdellatif, K., et al.: Filtering-based CPA: a successful side-channel attack against desynchronization countermeasures. In: Fourth Workshop on Cryptography and Security in Computing Systems (CS2 '17), pp. 29–32, Stockholm (2017). https://hal-emse.ccsd.cnrs.fr/emse-01490735
19. Do, N.-T., Hoang, V.-P.: An efficient side channel attack technique with improved correlation power analysis. In: Vo, N.-S., Hoang, V.-P. (eds.) Industrial Networks and Intelligent Systems, pp. 291–300. Springer International Publishing, Cham (2020)
20. Lerman, L., Bontempi, G., Markowitch, O.: A machine learning approach against a masked AES. Journal of Cryptographic Engineering 5(2), 123–139 (2014). https://doi.org/10.1007/s13389-014-0089-3
21. Gilmore, R., Hanley, N., O'Neill, M.: Neural network based attack on a masked implementation of AES. In: 2015 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), pp. 106–111 (2015)
22. Kamoun, N., Bossuet, L., Ghazel, A.: Correlated power noise generator as a low cost DPA countermeasure to secure hardware AES cipher. In: 2009 3rd International Conference on Signals, Circuits and Systems (SCS), pp. 1–6 (2009)

How to cite this article: Do, N.-T., et al.: On the performance of non-profiled side channel attacks based on deep learning techniques. IET Inf. Secur. 1–17 (2022). https://doi.org/10.1049/ise2.12102