2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications / 12th IEEE International Conference On Big Data Science And Engineering

Cache-Collision Attacks on GPU-based AES Implementation with Electro-Magnetic Leakages
Yiwen GAO (1,2), Wei CHENG (1), Hailong ZHANG (1), Yongbin ZHOU (1,2)
(1) State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
(2) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{gaoyiwen, chengwei, zhanghailong, zhouyongbin}@[Link]

Abstract: For computationally intensive tasks such as cryptographic applications, the GPU is considered an ideal platform owing to its parallel computing power. However, several GPU vulnerabilities have been published, exploited by overflow attacks, covert-channel attacks and side-channel attacks. In this work, for the first time, we investigate cache-collision attacks on a GPU-based AES implementation utilizing Electro-Magnetic (EM) leakages. We construct a highly efficient leakage model based on generalized simultaneous cache-collision in multi-thread scenarios, and we mount a key-recovery attack with Differential Electro-Magnetic Analysis (DEMA). Our evaluation results show that the 16-byte secret key of the GPU-based AES implementation can be recovered with only 5,000 EM traces, and that 600 EM traces are enough when assisted by an appropriate key enumeration algorithm (KEA). This work suggests that cache-collision on GPU does give rise to leakages via EM side-channels and that it should be considered in the design of secure GPU-based cryptographic implementations.

Keywords: Cache-Collision Attacks, Side-Channel Attacks, Electro-Magnetic Attacks, Physical Security

2324-9013/18/31.00 ©2018 IEEE. DOI 10.1109/TrustCom/BigDataSE.2018.00053

I. INTRODUCTION

With the advent of programmable shader cores and the support of programming frameworks, the GPU has evolved from a special-purpose device for graphics rendering into a general-purpose platform for high-performance computing. As a highly parallel computing platform, the GPU is well suited to computationally intensive tasks such as cryptographic applications, which have been widely deployed to provide data protection and security. Several works have studied the implementation of cryptographic applications on GPUs in order to exploit their computing power [1][2][3]. However, every new technology has its advantages and disadvantages. So far, many attacks against GPUs, such as overflow attacks [4], covert-channel attacks [5] and side-channel attacks [6], have been successful, of which side-channel attacks pose a great threat to GPU-based applications.

Compared with client-side devices, server-side devices such as GPUs working in a cloud computing environment are usually not accessible to general-purpose users, but this does not mean that server-side devices are free of side-channel attacks like Electro-Magnetic Analysis (EMA). Instead, they are more likely to suffer from side-channel attacks because of powerful insider attacks, which are performed by malicious employees inside an organization [7]. Potential malicious insiders in the cloud might have access to an unprecedented amount of information as well as hardware resources, which makes them much more powerful than attackers targeting client-side devices. What is worse, for many cryptographic applications such as VPN and disk encryption, secret keys are not updated within a short time period, which gives malicious insiders sufficient time to launch attacks.

So far, some progress has been made on side-channel attacks against GPUs. Luo et al. proposed the first power analysis attack on a GPU-based AES implementation [6]. They built a simplified leakage model to avoid the synchronization of power traces in the time domain. Finally, they employed Correlation Power Analysis (CPA) to recover the 16-byte secret key of the GPU-based AES implementation with 160,000 power traces. The attack is conducted in a chosen-plaintext scenario, because it requires the adversary to be capable of encrypting the same plaintexts for all block threads. In fact, it is rather difficult to perform side-channel attacks successfully in known-plaintext and highly-occupied scenarios against GPU-based cryptographic implementations. After that, Jiang et al. published two cache-timing attacks against a GPU-based AES implementation, based on time differences induced by L1 cache line access serialization and shared memory bank conflict [8][9]. They recovered the 16-byte secret key of the GPU-based AES implementation by Correlation Timing Analysis (CTA) and Differential Timing Analysis (DTA), respectively. Cache-collision attacks have been studied in different settings, and most of them belong to time-driven cache attacks [10]. Lauradoux proposed a power-based side-channel attack against an AES implementation on a general processor by combining collision attacks and cache attacks [11], which inspired our research.

In this work, for the first time, we investigate a cache-collision attack on a GPU-based AES implementation based on its EM leakages. Specifically, we propose a novel EM leakage model based on the architectural features of CUDA-enabled GPUs in multi-thread scenarios. The leakage model is highly efficient in detecting the internal collisions induced by cache access serialization. We evaluate the effectiveness and the efficiency of our leakage model with Differential Electro-Magnetic Analysis (DEMA) against an AES implementation on an NVIDIA Fermi GPU, and the evaluation results show that the 16-byte secret key of the GPU-based AES implementation can be recovered with only 5,000 EM traces. What is more,
600 EM traces are sufficient when combined with an appropriate key enumeration algorithm (KEA), at the expense of less than 100 milliseconds for exhaustive key search. This is much more efficient than previous styles of power analysis attack.

The rest of this paper is organized as follows. In Section II, we give a brief introduction to CUDA-enabled GPUs and GPU-based AES implementation. In Section III, we provide details about our leakage acquisition and leakage detection. In Section IV, we propose our leakage model in three chosen-plaintext scenarios. In Section V, we evaluate the effectiveness and the efficiency of the proposed leakage model with DEMA. Finally, conclusions are given in Section VI.

II. PRELIMINARY

A. CUDA-enabled GPUs

A CUDA-enabled GPU is composed of M Streaming Multiprocessors (SMs) and a global memory. Each SM has N Scalar Processors (SPs), a shared memory, several 32-bit registers, and a shared instruction unit. Warps are the basic unit of execution in an SM. When a grid of thread blocks is launched, the thread blocks in the grid are distributed among SMs. Once a thread block is scheduled to an SM, threads in the thread block are further partitioned into warps. A warp consists of 32 consecutive threads, and all threads in a warp are executed in Single Instruction Multiple Thread (SIMT) fashion; that is, all threads execute the same instruction, and each thread carries out that operation on its own private data.

Global memory resides in device memory and is accessible via 32-byte, 64-byte, or 128-byte memory transactions. When a warp performs a memory load/store, the number of transactions required to satisfy that request typically depends on two factors: the distribution of memory addresses across the threads of the warp, and the alignment of memory addresses per transaction. There is one L1 cache per SM and one L2 cache shared by all SMs. Both L1 and L2 caches are used to store data in local and global memory, including register spills. On Fermi (compute capability 2.x) GPUs, CUDA allows configuring whether reads are cached in both L1 and L2, or only in L2. All accesses to global memory go through the L2 cache. Many accesses also pass through the L1 cache, depending on the type of access. If both L1 and L2 caches are used, a memory access is serviced by a 128-byte memory transaction. If only the L2 cache is used, a memory access is serviced by a 32-byte memory transaction. On architectures that allow the L1 cache to be used for global memory caching, the L1 cache can be explicitly enabled or disabled at compile time.

B. AES Implementation on GPU

Advanced Encryption Standard (AES) is based on a design principle known as a substitution-permutation network (SPN). It is a variant of Rijndael which has a fixed block size of 128 bits, and a key size of 128, 192, or 256 bits. AES operates on a 4×4 column-major order matrix of bytes, termed the state. AES uses 10, 12, or 14 encryption rounds for key sizes of 128, 192, or 256 bits, respectively. Taking the 128-bit key version (AES-128) as an example, each of the 10 encryption rounds is composed of four operations named SubByte, ShiftRow, MixCol and AddRoundKey, except the 10th round, which does not include MixCol. Many implementations of AES on general processors have been published, of which the S-Box look-up table (LUT) based AES is the most original version. Since the GPU is a SIMT device, the simplest way to implement a parallel AES on GPUs is to assign block threads independent AES encryption/decryption tasks, which is also referred to as task-level parallelism.

The L1 cache on the GPU chip is designed to accelerate global memory accesses. With an L1 cache line size of 128 bytes on the Fermi device, the S-Box LUT of 256 bytes is loaded into two cache lines. When the 32 block threads in a warp access the same cache line, a generalized simultaneous cache-collision (Def. 2) happens; otherwise, it does not.

C. Definitions and Notations

Definition 1. For a warp of block threads processing 32 plaintext blocks $P_1, P_2, \dots, P_{32}$, respectively, if $P_1 = P_2 = \dots = P_{32/H}$, $P_{32/H+1} = P_{32/H+2} = \dots = P_{64/H}$, ..., and $P_{32-32/H+1} = P_{32-32/H+2} = \dots = P_{32}$, then it is called H-group encryption.

Definition 2. For a warp of block threads accessing cache lines by index $I_1, I_2, \dots, I_{32}$, respectively, if $[I_1]_{\log_2 M} = [I_2]_{\log_2 M} = \dots = [I_{32}]_{\log_2 M}$, then we call it generalized simultaneous cache-collision, where M is the number of cache lines and $[x]_n$ stands for the n most significant bits (MSBs) of x.

We introduce the following notations across the paper.
$P_l / C_l$: denotes the l-th byte of the 16-byte plaintext/ciphertext, where $l \in [0, 15] \cap \mathbb{Z}$.
$P_l^{(h)} / C_l^{(h)}$: denotes any byte of plaintext/ciphertext from the h-th group of threads in the H-group plaintexts encryption scenario, where $h \in \{1, 2, 3, \dots, H\}$.
$\overrightarrow{P_l^{(h)}} / \overrightarrow{C_l^{(h)}}$: denotes the multi-dimensional column vector of $P_l^{(h)} / C_l^{(h)}$ corresponding to multiple samples.
$T_m$: denotes the m-th sample point of any single trace T in the time domain.
$\overrightarrow{T_m}$: denotes the multi-dimensional column vector of $T_m$ corresponding to multiple samples.
$\{\overrightarrow{X}\}^{N}_{i}$: denotes the i-th scalar of an N-dimensional column vector or row vector $\overrightarrow{X}$.
$\{X\}^{N \times M}_{i\cdot} / \{X\}^{N \times M}_{\cdot j}$: denotes the i-th row vector or the j-th column vector of an N × M matrix X.

III. LEAKAGE ACQUISITION AND DETECTION

A. Set-ups

In this work, we investigate the side-channel vulnerabilities of NVIDIA's GPUs towards cache-collision attacks on an AES implementation. Specifically, we target a GeForce GT 620 GPU connected to a host computer with a PCIe bus. The device has one streaming multiprocessor of 48 SPs, an L2 cache of 64KiB, and it is equipped with an off-chip device memory
of 454MiB. Although the GPU is of Fermi architecture, it is enough to show the vulnerability of GPUs.

The original AES implementation is ported to our Fermi GPU from a well-known open source library [12]. To make our attack more convincing, we do not change any code except for some CUDA-specific operations. The GPU-based implementation follows the most general procedure of a CUDA program. First, the 32 plaintext blocks to be encrypted, the S-Box LUT of 256 bytes and the 11 subkeys of 176 bytes in total are transferred from the host memory to the GPU device memory with the memcpy(·) function before a kernel launch. Second, a kernel is launched with 1 block and 32 threads per block. Third, the 32 ciphertext blocks are copied from the GPU device memory back to the host memory with the memcpy(·) function after the kernel finishes. Note that we do not use any type of memory except global memory in the target GPU-based AES implementation.

B. Leakage Acquisition

Electro-Magnetic emanation always accompanies electronic devices, so it can be captured without difficulty. However, it is not so easy to measure useful signals from electro-magnetic emanation in practical scenarios. In this work, the useful signals are the informative leakages of AES encryption on the GPU, which are crucial to a successful EM attack.

Compared with power analysis, EM analysis enables us to take advantage of localization effects, which makes our attack more efficient than its power-based counterpart. In this work, we use a small magnetic probe, Rohde Schwarz RF B 3-2, instead of a larger one so as to probe localized leakages from near-field emanation [13]. Theoretically, the region located less than 1/(2π) of a wavelength away from the source is called the near field. All our probings in this work are conducted in this region. Specifically, the procedures are as follows:

To locate. A printed circuit board (PCB) like a GPU card is generally composed of hundreds of electronic parts and components including chips, capacitors, resistors, inductors and so on, but it is unnecessary for our experiments to check all of these elements. Generally speaking, only the area right above the GPU chip and the capacitors on the back of the GPU chip need to be considered, because these components or positions are more likely to produce informative leakages. This was confirmed afterwards in our experiment. We run our AES encryption program in a loop and adjust the EM probe over the candidate components within their near-field zone, until we find a position at which the oscilloscope (Agilent KeySight DSO9104A) captures a periodic signal. If some patterns in the periodic signal repeat 9 to 10 times when zooming in, a leakage position is found.

To capture. Although we have identified the target signal, it is still not easy to capture it without external triggers. In fact, it is impractical to provide an external trigger in real scenarios, so we have to exploit special patterns within the target signal as an internal trigger. We design a delicate trigger by using the minimal voltage (B in Fig.1) within the target signal. Specifically, the oscilloscope is configured in Window Exit trigger mode with Voltage High=-0.11 and Voltage Low=-0.13. Then, almost aligned traces are acquired.

To align. Although we have captured almost aligned EM traces with our delicate trigger, this is still not enough to perform a successful attack. More accurate alignment techniques are needed. First, we observe the special patterns on the trace, and find a two-peak pattern (A in Fig.1) that is shared by all traces, so it is likely an ideal reference for aligning all traces in the time domain. Second, we match the pattern among several traces and find that the patterns in different traces are strongly correlated (Pearson Correlation Coefficient, PCC > 0.9). Third, for all traces, we search for the pattern by fixing one trace and sliding the others within a small range to find the position at which the pattern holds the maximum PCC with the pattern in the fixed trace. We exclude traces whose maximum PCC is less than 0.90. Then, all traces with a maximum PCC of no less than 0.90 can be aligned properly.

[Figure 1: voltage (V) vs. time (sec); markers A (two-peak pattern) and B (minimal voltage); annotated region "The Final Round Encryption"]
Fig. 1. EM trace of GPU-based AES implementation.

C. Leakage Detection

Cache-collisions are ubiquitous for any processor with a cached-memory architecture, and they usually cause distinctive power consumption. For generalized simultaneous cache-collision on GPU, we assume that block threads with collisions consume different power from those without collisions. Thus the internal information within algorithms is leaked in the process of execution.

In this work, Welch's t-test [14] is employed to detect leakages induced by generalized simultaneous cache-collisions on GPU. The aim of the t-test is to provide a quantitative value as a probability that the means of two sets are different. In other words, a t-test gives a probability to examine the validity of the null hypothesis that the samples in both sets were drawn from the same population, i.e., the two sets are not distinguishable.

In our leakage detection, N 2-group plaintexts are encrypted:

$P = [\overrightarrow{P_0^{(1)}}, \overrightarrow{P_1^{(1)}}, \dots, \overrightarrow{P_{15}^{(1)}}, \overrightarrow{P_0^{(2)}}, \overrightarrow{P_1^{(2)}}, \dots, \overrightarrow{P_{15}^{(2)}}]$ (1)

where $\overrightarrow{P_\cdot^{(\cdot)}}$ is an N-dimensional column vector. At the same time, we obtain N 2-group ciphertexts and N EM traces:

$C = [\overrightarrow{C_0^{(1)}}, \overrightarrow{C_1^{(1)}}, \dots, \overrightarrow{C_{15}^{(1)}}, \overrightarrow{C_0^{(2)}}, \overrightarrow{C_1^{(2)}}, \dots, \overrightarrow{C_{15}^{(2)}}]$ (2)
$T = [\overrightarrow{T_1}, \overrightarrow{T_2}, \dots, \overrightarrow{T_M}]$ (3)

where $\overrightarrow{C_\cdot^{(\cdot)}}$ and $\overrightarrow{T_\cdot}$ are also N-dimensional column vectors, and M is the number of sample points in the time domain.

We take the first intermediates $r_0^{(1)}, r_0^{(2)}$ of the final round of AES encryption as an example, and detect their leakage in three steps:

First, the N EM traces are partitioned into two groups $G_0$ and $G_1$ with respect to the intermediates $r_0^{(1)}, r_0^{(2)}$ (precisely, the MSBs of the intermediates). $G_0$ contains the EM traces for which the S-Box accesses by $r_0^{(1)}$ and $r_0^{(2)}$ collide, while $G_1$ contains the rest of the EM traces. More formally,

$G_0 = \{ \{T\}^{N \times M}_{i\cdot} \mid [\{\overrightarrow{r_0^{(1)}}\}^{N}_{i}]_1 = [\{\overrightarrow{r_0^{(2)}}\}^{N}_{i}]_1 \}$
$G_1 = \{ \{T\}^{N \times M}_{i\cdot} \mid [\{\overrightarrow{r_0^{(1)}}\}^{N}_{i}]_1 \neq [\{\overrightarrow{r_0^{(2)}}\}^{N}_{i}]_1 \}$ (4)

where $[x]_1$ stands for the most significant bit (MSB) of x. So the t-statistic $\overrightarrow{t}$ between $G_0$ and $G_1$ is:

$\overrightarrow{t} = \dfrac{\overrightarrow{\mu_0} - \overrightarrow{\mu_1}}{\sqrt{\dfrac{\overrightarrow{s_0}^2}{\overrightarrow{n_0}} + \dfrac{\overrightarrow{s_1}^2}{\overrightarrow{n_1}}}}$ (5)

where $\overrightarrow{\mu_0}$, $\overrightarrow{\mu_1}$, $\overrightarrow{s_0}$, $\overrightarrow{s_1}$, $\overrightarrow{n_0}$, $\overrightarrow{n_1}$ and $\overrightarrow{t}$ are M-dimensional row vectors; $\overrightarrow{\mu_0}$, $\overrightarrow{\mu_1}$ are the means of $G_0$ and $G_1$ by columns, $\overrightarrow{s_0}$, $\overrightarrow{s_1}$ are the standard deviations of $G_0$ and $G_1$ by columns, and $\overrightarrow{n_0}$ and $\overrightarrow{n_1}$ are the cardinalities of $G_0$ and $G_1$. In addition, all operations including $\cdot + \cdot$, $\cdot - \cdot$, $\cdot / \cdot$, $(\cdot)^2$ and $\sqrt{\cdot}$ in Equ.5 are component-wise.

Second, for comparison, the N EM traces are randomly partitioned into two groups (of equal size here, although this is not required), specifically,

$G_0 = \{ \{T\}^{N \times M}_{i\cdot} \mid i \in \{1, 2, \dots, \frac{N}{2}\} \}$
$G_1 = \{ \{T\}^{N \times M}_{i\cdot} \mid i \in \{\frac{N}{2}+1, \frac{N}{2}+2, \dots, N\} \}$ (6)

and the t-statistic $\overrightarrow{t_R}$ between $G_0$ and $G_1$ is computed as above.

Third, an appropriate threshold is needed to decide the ACCEPT/REJECT status of the above t-test. Generally speaking, two groups of samples are assumed to be from different populations if |t| > 4.5 [15]. We evaluate $|t|_{max} = \max_{i \in \{1,2,\dots,M\}} \{|\overrightarrow{t}|\}^{M}_{i}$ with respect to the number of samples N (Fig.2). $|t|_{max}$ in the second setting (Equ.6) is below the threshold for any number of samples in $[50, 3000] \cap \mathbb{Z}$, while $|t|_{max}$ in the first setting (Equ.4) is significantly above the threshold and keeps increasing with more samples. So it can be concluded that generalized simultaneous cache-collision does cause leakages.

[Figure 2: maximal |t| vs. number of traces (500 to 3000); curves "grouped by MSBs" and "grouped at random"; threshold line at t = 4.5]
Fig. 2. Maximal |t| in the time domain vs. the number of sample traces.

Comparing $\overrightarrow{t}$ with $\overrightarrow{t_R}$ on 1,000 samples, 2,000 samples and 3,000 samples (Fig.3), the three plots on the left show that leakages caused by generalized simultaneous cache-collisions occur at many time points and increase remarkably with more samples, while the three plots on the right resemble each other and their t-statistics stay entirely within the threshold, as expected.

The same conclusions are also reached after performing t-tests on sets grouped by MSBs or at random based on the other 15 intermediate bytes. So it is feasible to recover the 16-byte secret key of AES with DEMA based on the generalized simultaneous cache-collision leakage model.

IV. CACHE-COLLISION ATTACKS ON GPUS

Since the warp is the basic unit of execution on GPU, we carry out our attacks on a single warp with chosen plaintexts. As mentioned above, generalized simultaneous cache-collision among threads happens when all threads in a warp read from the same cache line. There are 160 S-Box LUT READ operations in total for the 10 encryption rounds, and for every READ operation the threads of a warp may collide. Due to the diffusion effect in every single encryption round, collisions and non-collisions are randomly distributed over the S-Box accesses. Our leakage model is based on the basic hypothesis that the power consumption, which is directly correlated with electro-magnetic emanation, differs greatly between S-Box look-ups (cache accesses) with collisions and S-Box look-ups without collisions. Since our EM traces are aligned from the back, we make a trial on the 16 S-Box look-ups in the final round of AES encryption. The final round of encryption is:

$c_{l'} \leftarrow SBox(r_l) \oplus k_{l'}, \quad (l, l' \in \{0, 1, \dots, 15\})$ (7)

where $SBox(\cdot)$ is the S-Box LUT, $k_{l'}$ and $c_{l'}$ are the $l'$-th byte of the final round key and the $l'$-th byte of the ciphertext, respectively, and $r_l$ is the intermediate value, which is used to model the predicted EM leakage. $l'$ is not necessarily equal to $l$ due to the ShiftRow operation. Additionally, the order of the 16 S-Box look-ups in the final round of encryption is determined by the compiler.

For simplicity, we start with the 2-group case, in which at most two different intermediate values occur among the 32 values. The two intermediate values, named $r_l^{(1)}$ and $r_l^{(2)}$, can be easily calculated using the corresponding ciphertext bytes $c_{l'}^{(1)}$, $c_{l'}^{(2)}$ and a guessed key byte $k_{guess}$ as:

$r_l^{(1)} = SBox^{-1}(c_{l'}^{(1)} \oplus k_{guess})$
$r_l^{(2)} = SBox^{-1}(c_{l'}^{(2)} \oplus k_{guess})$ (8)
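The computation of the two final-round intermediates from ciphertext bytes and a key guess, and the cache line each look-up would touch, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the S-Box is generated from the standard AES definition rather than copied from the target library, and the 256-byte table is assumed to start on a 128-byte cache-line boundary, so that the MSB of the index selects the cache line.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        high_bit = a & 0x80
        a = (a << 1) & 0xFF
        if high_bit:
            a ^= 0x1B
        b >>= 1
    return result


def gf_inv(x):
    """Multiplicative inverse in GF(2^8), computed as x^254 (0 maps to 0)."""
    result, base, exponent = 1, x, 254
    while exponent:
        if exponent & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exponent >>= 1
    return result


def rotl8(value, shift):
    """Rotate a byte left by `shift` bits."""
    return ((value << shift) | (value >> (8 - shift))) & 0xFF


def build_sbox():
    """Build the AES S-Box: field inversion followed by the affine transform."""
    sbox = []
    for x in range(256):
        b = gf_inv(x)
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for index, value in enumerate(SBOX):
    INV_SBOX[value] = index

CACHE_LINE_BYTES = 128  # Fermi L1 cache line size stated in Section II-B


def final_round_intermediates(c1, c2, k_guess):
    """Equ.8: intermediates of the two plaintext groups for one key byte guess."""
    return INV_SBOX[c1 ^ k_guess], INV_SBOX[c2 ^ k_guess]


def cache_line(index):
    """Cache line hit by an S-Box look-up (assumes a line-aligned table)."""
    return index // CACHE_LINE_BYTES  # same as index >> 7, i.e. the MSB


def collides(r1, r2):
    """True if both look-ups fall into the same cache line."""
    return cache_line(r1) == cache_line(r2)
```

With the correct key guess, `collides` predicts whether the two thread groups really shared a cache line; with a wrong guess the prediction is unrelated to what the hardware did, which is what the attack exploits.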
Fig. 3. t statistic in the time domain.
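The t-statistics shown in Fig. 3 follow Equ.5, evaluated independently at every sample point. A minimal pure-Python sketch is given below; the function names are hypothetical, each group is assumed to be a list of equally long traces, and the unbiased sample variance (division by n − 1) is an assumption, since the paper only speaks of standard deviations.

```python
import math


def column_stats(group):
    """Per-sample-point mean and unbiased variance of a list of traces."""
    n = len(group)
    m = len(group[0])
    means = [sum(trace[j] for trace in group) / n for j in range(m)]
    variances = [
        sum((trace[j] - means[j]) ** 2 for trace in group) / (n - 1)
        for j in range(m)
    ]
    return means, variances


def welch_t(group0, group1):
    """Component-wise Welch t-statistic between two groups of traces (Equ.5)."""
    n0, n1 = len(group0), len(group1)
    mu0, var0 = column_stats(group0)
    mu1, var1 = column_stats(group1)
    t = []
    for j in range(len(mu0)):
        denom = math.sqrt(var0[j] / n0 + var1[j] / n1)
        # A zero denominator means both groups are constant at this point;
        # report 0.0 there instead of dividing by zero.
        t.append((mu0[j] - mu1[j]) / denom if denom > 0 else 0.0)
    return t


def leakage_detected(group0, group1, threshold=4.5):
    """Apply the |t| > 4.5 decision rule to the maximal |t| over all points."""
    return max(abs(value) for value in welch_t(group0, group1)) > threshold
```

Passing the groups of Equ.4 gives the "grouped by MSBs" statistic, while passing the two halves of Equ.6 gives the random-grouping baseline.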

In Equ.8, $SBox^{-1}(\cdot)$ is the inverse S-Box LUT. Since $c_{l'}^{(1)}$ and $c_{l'}^{(2)}$ are known, $r_l^{(1)}$ and $r_l^{(2)}$ are determined by the key byte guess, and are denoted $r_l^{(1)}(k)$ and $r_l^{(2)}(k)$, respectively. Our leakage model is defined as:

$E_l = E(r_l^{(1)}(k), r_l^{(2)}(k)) + B$ (9)

where $E_l$ is the predicted EM leakage, B is assumed to be Gaussian noise, and

$E(x, y) = \begin{cases} E_0, & [x]_1 = [y]_1 \\ E_1, & [x]_1 \neq [y]_1 \end{cases}$ (10)

$E_0$ and $E_1$ are assumed to be Gaussian variables with a significant difference-of-mean (DoM). In fact, the significant DoM between them has been verified in our leakage detection. For N EM traces, we group them into two sets (Equ.12). One set $G_0$ is composed of the EM traces for which generalized simultaneous cache-collisions happen, and the rest of the EM traces belong to the other set $G_1$. With the correct key guess $k = k_{correct}$ and a sufficient number of EM traces, the following quantity will be significantly greater than that with incorrect key guesses:

$\overrightarrow{\Delta(k)} = \left| \dfrac{1}{n_0} \sum_{\overrightarrow{g} \in G_0(k)} \overrightarrow{g} - \dfrac{1}{n_1} \sum_{\overrightarrow{g} \in G_1(k)} \overrightarrow{g} \right|$ (11)

where $\overrightarrow{g}$ is an M-dimensional row vector, $n_0$ and $n_1$ are the cardinalities of $G_0(k)$ and $G_1(k)$, respectively, and

$G_0(k) = \{ \{T\}^{N \times M}_{i\cdot} \mid [\{\overrightarrow{r_l^{(1)}(k)}\}^{N}_{i}]_1 = [\{\overrightarrow{r_l^{(2)}(k)}\}^{N}_{i}]_1 \}$
$G_1(k) = \{ \{T\}^{N \times M}_{i\cdot} \mid [\{\overrightarrow{r_l^{(1)}(k)}\}^{N}_{i}]_1 \neq [\{\overrightarrow{r_l^{(2)}(k)}\}^{N}_{i}]_1 \}$ (12)

So the correct key byte can be calculated by:

$k_{correct} = \arg\max_{k \in \{0,1,\dots,255\}} \max_{i \in \{1,2,\dots,M\}} \{\overrightarrow{\Delta(k)}\}^{M}_{i}$ (13)

We recover all 16 key bytes in a divide-and-conquer manner (Algorithm 1) by assigning l = 0, 1, 2, ..., 15 in Equ.12, respectively.

In the 2-group case, for any two random intermediate values $r^{(1)}$ and $r^{(2)}$, the occurrence probability of generalized simultaneous cache-collision is 1/2, but it drops to $1/2^3$, $1/2^7$, $1/2^{15}$ and $1/2^{31}$ for the 4-group, 8-group, 16-group and 32-group cases, respectively. Generalized simultaneous cache-collision therefore scarcely happens in the 16-group and 32-group cases, so we do not attack in these cases. For the 4-group case, only N/8 out of N EM traces collide in the final round of encryption. In 4-group case, we compute the DoM
between the following two groups of EM traces:

$G_0(k) = \{ \{T\}^{N \times M}_{i\cdot} \mid \sum_{j \in [1,4] \cap \mathbb{Z}} [\{\overrightarrow{r_l^{(j)}(k)}\}^{N}_{i}]_1 = 0 \text{ or } 4 \}$
$G_1(k) = T - G_0(k)$ (14)

Similarly, in the 8-group case,

$H_0(k) = \{ \{T\}^{N \times M}_{i\cdot} \mid \sum_{j \in [1,8] \cap \mathbb{Z}} [\{\overrightarrow{r_l^{(j)}(k)}\}^{N}_{i}]_1 = 0 \text{ or } 8 \}$
$H_1(k) = T - H_0(k)$ (15)

where all notations are defined as in Equ.12.

Algorithm 1 DEMA with Cache-Collision Leakage Model
Input: N EM traces: $T = [\overrightarrow{T_1}, \overrightarrow{T_2}, \dots, \overrightarrow{T_M}]$, where T is an N × M matrix, and $\overrightarrow{T_1}, \overrightarrow{T_2}, \dots, \overrightarrow{T_M}$ are N-dimensional column vectors;
  N of the 1st ciphertexts: $C^{(1)} = [\overrightarrow{C_0^{(1)}}, \overrightarrow{C_1^{(1)}}, \dots, \overrightarrow{C_{15}^{(1)}}]$, where $C^{(1)}$ is an N × 16 matrix, and $\overrightarrow{C_0^{(1)}}, \overrightarrow{C_1^{(1)}}, \dots, \overrightarrow{C_{15}^{(1)}}$ are N-dimensional column vectors;
  N of the 2nd ciphertexts: $C^{(2)} = [\overrightarrow{C_0^{(2)}}, \overrightarrow{C_1^{(2)}}, \dots, \overrightarrow{C_{15}^{(2)}}]$, where $C^{(2)}$ is an N × 16 matrix, and $\overrightarrow{C_0^{(2)}}, \overrightarrow{C_1^{(2)}}, \dots, \overrightarrow{C_{15}^{(2)}}$ are N-dimensional column vectors.
Output: $K = [k_0, k_1, \dots, k_{15}]$: 16-byte correct secret key.
 1: D = {0, 13, 10, 7, 4, 1, 14, 11, 8, 5, 2, 15, 12, 9, 6, 3}
 2: for l ← 0 to 15 do
 3:   for $k_{guess}$ ← 0 to 255 do
 4:     $\overrightarrow{R^{(1)}} \leftarrow SBox^{-1}(\overrightarrow{C^{(1)}_{D[l]}} \oplus k_{guess})$
 5:     $\overrightarrow{R^{(2)}} \leftarrow SBox^{-1}(\overrightarrow{C^{(2)}_{D[l]}} \oplus k_{guess})$
 6:     $\overrightarrow{R^{(1)}} \leftarrow \overrightarrow{R^{(1)}} >> 7$
 7:     $\overrightarrow{R^{(2)}} \leftarrow \overrightarrow{R^{(2)}} >> 7$
 8:     $\overrightarrow{R} \leftarrow \overrightarrow{R^{(1)}} \oplus \overrightarrow{R^{(2)}}$
 9:     $\overrightarrow{S} \leftarrow 1 - \overrightarrow{R}$
10:     $r_{sum} \leftarrow \sum_{i=1}^{N} \{\overrightarrow{R}\}_i$
11:     $s_{sum} \leftarrow \sum_{i=1}^{N} \{\overrightarrow{S}\}_i$
12:     for m ← 1 to M do
13:       $\overrightarrow{U} \leftarrow \overrightarrow{T_m} \cdot \overrightarrow{R}$   /* component-wise multiply */
14:       $\overrightarrow{V} \leftarrow \overrightarrow{T_m} \cdot \overrightarrow{S}$   /* component-wise multiply */
15:       $u_{mean} \leftarrow \frac{1}{r_{sum}} \sum_{i=1}^{N} \{\overrightarrow{U}\}_i$
16:       $v_{mean} \leftarrow \frac{1}{s_{sum}} \sum_{i=1}^{N} \{\overrightarrow{V}\}_i$
17:       $\Delta_m \leftarrow |u_{mean} - v_{mean}|$
18:     $W_{k_{guess}} \leftarrow \max\{\Delta_1, \Delta_2, \dots, \Delta_M\}$
19:   $W_{max} \leftarrow \max\{W_0, W_1, \dots, W_{255}\}$
20:   $k_{D[l]} \leftarrow \arg\max_k (W_{max})$
21: return $K = [k_0, k_1, \dots, k_{15}]$

V. EVALUATION RESULTS

In our experiment, we set up a chosen-plaintext attack scenario, in which attackers are capable of encrypting any plaintexts and of obtaining the corresponding ciphertexts as well as EM traces. We mount DEMA attacks on the GPU-based AES implementation in three cases, and analyze the effectiveness and efficiency of the proposed leakage model in these cases.

First, the relations between DoMs and the number of EM traces are evaluated for all possible key byte candidates in all cases. We find that DEMA on the 6th key byte in the 2-group case performs the worst among all key bytes in all three cases. As shown in Fig.4, the secret key byte stands out from the 256 key byte candidates with approximately 8,000 EM traces. Since this is the worst case, recovering the other key bytes with DEMA consumes far fewer EM traces.

Second, the efficiency of DEMA with the proposed leakage model is evaluated. Specifically, we investigate the global success rate (GSR) versus the number of EM traces in the 2/4/8-group cases (Fig.7). The evaluation results show that DEMA in the 4-group case outperforms DEMA in the other cases. As shown in Fig.5, the 16 correct key bytes are clearly visible in circles when attacked with DEMA in the 4-group case. That is to say, the 16-byte secret key of the GPU-based AES implementation can be recovered with 5,000 EM traces in chosen-plaintext scenarios. We also investigate the number of recovered key bytes when sampling different numbers of EM traces (Fig.6) in the 2/4/8-group cases. Most of the 16 key bytes are recovered efficiently except for a few special ones, so we combine KEA (Key Enumeration Algorithm) with DEMA in our key-recovery attacks to improve performance. When assisted with KEA, 600 EM traces suffice at the expense of less than 100 milliseconds for exhaustive key search.

VI. CONCLUSION

This paper presents a cache-collision attack on a GPU-based AES implementation with its electro-magnetic side-channel leakages. We propose a novel leakage model based on generalized simultaneous cache-collisions and mount a complete key-recovery attack with a KEA-assisted DEMA. Our attack needs far fewer traces than previous styles of power analysis attack.

As the first study of cache-collision attacks with electro-magnetic leakages against a GPU-based AES implementation, this work suggests that generalized simultaneous cache-collision within GPUs does cause leakages via electro-magnetic side-channels. So cache-collision attacks should be considered in the design of secure GPU-based cryptographic implementations.

VII. ACKNOWLEDGMENT

This work is supported in part by the National Natural Science Foundation of China (No. 61472416, 61632020 and 61602468), and the Fundamental Theory and Cutting Edge
Technology Research Program of Institute of Information Engineering, Chinese Academy of Sciences (No. Y7Z0401102). We would like to acknowledge their supports.

[Figure 4: "The 6th key byte guess in 2-group case"; DoM (×10^-3) vs. number of traces (2000 to 14000); the correct value is marked]
Fig. 4. The number of traces vs. DoM for all possible key candidates.

[Figure 5: 16 panels, S-Box 1 to S-Box 16, each plotting DoM against key guess (0 to 255). KG: key guess, DoM: difference of mean.]
Fig. 5. Evaluation results of DEMA in 4-group case.

[Figure 6: number of recovered key bytes (0 to 15) vs. number of traces (0 to 11000); curves: DEMA in 2-group case, DEMA in 4-group case, DEMA in 8-group case]
Fig. 6. The number of recovered bytes vs. the number of traces.

[Figure 7: Global Success Rate (GSR, 0 to 1) vs. number of traces (0 to 11000); curves: DEMA in 2-group case, DEMA in 4-group case, DEMA in 8-group case, DEMA+KEA in 4-group case]
Fig. 7. GSR vs. the number of traces.

REFERENCES

[1] Y. Yang, Z. Guan, H. Sun, and Z. Chen, “Accelerating RSA with fine-grained parallelism using GPU,” in Information Security Practice and Experience - 11th International Conference, ISPEC 2015, Beijing,
[2] Q. Li, C. Zhong, K. Zhao, X. Mei, and X. Chu, “Implementation and analysis of AES encryption on GPU,” in 14th IEEE International Conference on High Performance Computing and Communication & 9th IEEE International Conference on Embedded Software and Systems, HPCC-ICESS 2012, Liverpool, United Kingdom, June 25-27, 2012, 2012, pp. 843–848.
[3] N. Nishikawa, H. Amano, and K. Iwai, “Implementation of bitsliced AES encryption on cuda-enabled GPU,” in Network and System Security - 11th International Conference, NSS 2017, Helsinki, Finland, August 21-23, 2017, Proceedings, 2017, pp. 273–287.
[4] A. Miele, “Buffer overflow vulnerabilities in CUDA: a preliminary analysis,” J. Computer Virology and Hacking Techniques, vol. 12, no. 2, pp. 113–120, 2016.
[5] H. Naghibijouybari, K. N. Khasawneh, and N. B. Abu-Ghazaleh, “Constructing and characterizing covert channels on gpgpus,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2017, Cambridge, MA, USA, October 14-18, 2017, 2017, pp. 354–366.
[6] C. Luo, Y. Fei, P. Luo, S. Mukherjee, and D. R. Kaeli, “Side-channel power analysis of a GPU AES implementation,” in 33rd IEEE International Conference on Computer Design, ICCD 2015, New York City, NY, USA, October 18-21, 2015, 2015, pp. 281–288.
[7] A. J. Duncan, S. Creese, and M. Goldsmith, “Insider attacks in cloud computing,” in 11th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2012, Liverpool, United Kingdom, June 25-27, 2012, 2012, pp. 857–862.
[8] Z. H. Jiang, Y. Fei, and D. R. Kaeli, “A complete key recovery timing attack on a GPU,” in 2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 12-16, 2016, 2016, pp. 394–405.
[9] Z. H. Jiang, Y. Fei, and D. R. Kaeli, “A novel side-channel timing attack on GPUs,” in Proceedings of the on Great Lakes Symposium on VLSI 2017, Banff, AB, Canada, May 10-12, 2017, 2017, pp. 167–172.
[10] A. Bogdanov, T. Eisenbarth, C. Paar, and M. Wienecke, “Differential cache-collision timing attacks on AES with applications to embedded cpus,” in Topics in Cryptology - CT-RSA 2010, The Cryptographers’ Track at the RSA Conference 2010, San Francisco, CA, USA, March 1-5, 2010. Proceedings, 2010, pp. 235–251.
[11] C. Lauradoux, “Collision attacks on processors with cache and countermeasures,” in WEWoRC 2005 - Western European Workshop on Research in Cryptology, July 5-7, 2005, Leuven, Belgium, 2005, pp. 76–85.
[12] PolarSSL, “An open source SSL library licensed by ARM limited,” https://[Link].
[13] D. Agrawal, B. Archambeault, J. R. Rao, and P. Rohatgi, “The EM side-channel(s),” in Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 13-15, 2002, Revised Papers, 2002, pp. 29–45.
[14] J. G. Goodwill, J. Jaffe, and P. Rohatgi, “A testing methodology for side-channel resistance validation,” in NIST non-invasive attack testing workshop, 2011, 2011.
[15] T. Schneider and A. Moradi, “Leakage assessment methodology - A clear roadmap for side-channel evaluations,” in Cryptographic Hardware and Embedded Systems - CHES 2015 - 17th International Workshop, Saint-Malo, France, September 13-16, 2015, Proceedings, 2015, pp.
China, May 5-8, 2015. Proceedings, 2015, pp. 454–468. 495–513.

