2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications / 12th IEEE International Conference On Big Data Science And Engineering

Cache-Collision Attacks on GPU-based AES Implementation with Electro-Magnetic Leakages

Yiwen GAO (1,2), Wei CHENG (1), Hailong ZHANG (1), Yongbin ZHOU (1,2)
(1) State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
(2) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{gaoyiwen, chengwei, zhanghailong, zhouyongbin}@[Link]

Abstract: For computationally intensive tasks such as cryptographic applications, the GPU is considered an ideal platform because of its parallel computing power. However, several GPU vulnerabilities have been published, including overflow attacks, covert-channel attacks and side-channel attacks. In this work, for the first time, we investigate cache-collision attacks on a GPU-based AES implementation utilizing Electro-Magnetic (EM) leakages. We construct a highly efficient leakage model based on generalized simultaneous cache-collision in multi-thread scenarios, and we mount a key-recovery attack with Differential Electro-Magnetic Analysis (DEMA). Our evaluation results show that the 16-byte secret key of the GPU-based AES implementation can be recovered with only 5,000 EM traces, and that 600 EM traces are enough when assisted by an appropriate key enumeration algorithm (KEA). This work suggests that cache-collision on GPUs does give rise to leakages via EM side-channels and that it should be considered in the design of secure GPU-based cryptographic implementations.

Keywords: Cache-Collision Attacks, Side-Channel Attacks, Electro-Magnetic Attacks, Physical Security

I. INTRODUCTION

With the advent of programmable shader cores and the support of programming frameworks, the GPU has evolved from a special-purpose device for graphics rendering into a general-purpose platform for high-performance computing. As a highly parallel computing platform, the GPU is well suited to computationally intensive tasks such as cryptographic applications, which have been widely deployed to provide data protection and security. Several works have studied the implementation of cryptographic applications on GPUs in order to exploit their computing power [1][2][3]. However, every new technology has its advantages and disadvantages. So far, many attacks against GPUs, such as overflow attacks [4], covert-channel attacks [5] and side-channel attacks [6], have been successful, of which side-channel attacks pose a great threat to GPU-based applications. Compared with client-side devices, server-side devices such as GPUs working in a cloud computing environment are usually not accessible to general-purpose users, but this does not mean that server-side devices are free of side-channel attacks like Electro-Magnetic Analysis (EMA). Instead, they are more likely to suffer from side-channel attacks because of powerful insider attacks, which are performed by malicious employees inside an organization [7]. The potential malicious insiders in the cloud might have access to an unprecedented amount of information as well as hardware resources, which makes them much more powerful than attackers targeting client-side devices. What is worse, for many cryptographic applications such as VPN and disk encryption, secret keys are not updated within a short time period, which gives malicious insiders sufficient time to launch attacks.

So far, some progress has been made on side-channel attacks against GPUs. Luo et al. proposed the first power analysis attack on a GPU-based AES implementation [6]. They built a simplified leakage model to avoid the synchronization of power traces in the time domain, and then employed Correlation Power Analysis (CPA) to recover the 16-byte secret key of the GPU-based AES implementation with 160,000 power traces. The attack is conducted in a chosen-plaintext scenario, because it requires the adversary to be capable of encrypting the same plaintexts for all block threads. In fact, it is rather difficult to perform side-channel attacks successfully against GPU-based cryptographic implementations in known-plaintext and highly-occupied scenarios. After that, Jiang et al. published two cache-timing attacks against a GPU-based AES implementation based on time differences induced by L1 cache line access serialization and shared memory bank conflicts [8][9]. They recovered the 16-byte secret key of the GPU-based AES implementation by Correlation Timing Analysis (CTA) and Differential Timing Analysis (DTA), respectively. Cache-collision attacks have been studied in different settings, and most of them belong to time-driven cache attacks [10]. Lauradoux proposed a power-based side-channel attack against an AES implementation on a general processor by combining collision attacks and cache attacks [11], which inspired our research.

In this work, for the first time, we investigate a cache-collision attack on a GPU-based AES implementation based on its EM leakages. Specifically, we propose a novel EM leakage model based on the architectural features of CUDA-enabled GPUs in multi-thread scenarios. The leakage model is highly efficient in detecting the internal collisions induced by cache access serialization. We evaluate the effectiveness and the efficiency of our leakage model with Differential Electro-Magnetic Analysis (DEMA) against an AES implementation on an NVIDIA Fermi GPU, and the evaluation results show that the 16-byte secret key of the GPU-based AES implementation can be recovered with only 5,000 EM traces. What is more,

2324-9013/18/31.00 ©2018 IEEE
DOI 10.1109/TrustCom/BigDataSE.2018.00053
600 EM traces are sufficient when combined with an appropriate key enumeration algorithm (KEA), at the expense of less than 100 milliseconds for exhaustive key search. This is much more efficient than previous styles of power analysis attack.

The rest of this paper is organized as follows. In section II, we give a brief introduction to CUDA-enabled GPUs and the GPU-based AES implementation. In section III, we provide details about our leakage acquisition and leakage detection. In section IV, we propose our leakage model in three chosen-plaintext scenarios. In section V, we evaluate the effectiveness and the efficiency of the proposed leakage model with DEMA. Finally, conclusions are given in section VI.

II. PRELIMINARY

A. CUDA-enabled GPUs

A CUDA-enabled GPU is composed of M Streaming Multiprocessors (SMs) and a global memory. Each SM has N Scalar Processors (SPs), a shared memory, several 32-bit registers, and a shared instruction unit. Warps are the basic unit of execution in an SM. When a grid of thread blocks is launched, the thread blocks in the grid are distributed among SMs. Once a thread block is scheduled to an SM, the threads in the thread block are further partitioned into warps. A warp consists of 32 consecutive threads, and all threads in a warp are executed in Single Instruction Multiple Thread (SIMT) fashion; that is, all threads execute the same instruction, and each thread carries out that operation on its own private data.

Global memory resides in device memory and is accessible via 32-byte, 64-byte, or 128-byte memory transactions. When a warp performs a memory load/store, the number of transactions required to satisfy that request typically depends on how the memory addresses are distributed across the threads of the warp and how they are aligned. There is one L1 cache per SM and one L2 cache shared by all SMs. Both L1 and L2 caches are used to store data in local and global memory, including register spills. On Fermi (compute capability 2.x) GPUs, CUDA allows one to configure whether reads are cached in both L1 and L2, or only in L2. All accesses to global memory go through the L2 cache. Many accesses also pass through the L1 cache, depending on the type of access. If both L1 and L2 caches are used, a memory access is serviced by a 128-byte memory transaction. If only the L2 cache is used, a memory access is serviced by a 32-byte memory transaction. On architectures that allow the L1 cache to be used for global memory caching, the L1 cache can be explicitly enabled or disabled at compile time.

B. AES Implementation on GPU

The Advanced Encryption Standard (AES) is based on a design principle known as a substitution-permutation network (SPN). It is a variant of Rijndael with a fixed block size of 128 bits and a key size of 128, 192, or 256 bits. AES operates on a 4×4 column-major order matrix of bytes, termed the state. There are 10, 12, or 14 encryption rounds for AES with a 128-bit, 192-bit, or 256-bit key, respectively. Taking AES with a 128-bit key (AES-128) as an example, each of the 10 encryption rounds is composed of four operations named SubByte, ShiftRow, MixCol and AddRoundKey, except the 10th round, which omits MixCol. Many implementations of AES on general processors have been published, of which the S-Box Look-Up Table (LUT) based AES is the most original version. Since the GPU is a SIMT device, the simplest way to implement a parallel AES on GPUs is to assign block threads independent AES encryption/decryption tasks, which is also referred to as task-level parallelism.

The L1 cache on the GPU chip is designed to accelerate global memory accesses. With an L1 cache line size of 128 bytes on a Fermi device, the S-Box LUT of 256 bytes is loaded into two cache lines. When the 32 block threads in a warp access the same cache line, a generalized simultaneous cache-collision (Def. 2) happens; otherwise, it does not happen.

C. Definitions and Notations

Definition 1. For a warp of block threads processing 32 plaintext blocks $P_1, P_2, ..., P_{32}$, respectively, if $P_1 = P_2 = ... = P_{\frac{32}{H}}$, $P_{\frac{32}{H}+1} = P_{\frac{32}{H}+2} = ... = P_{\frac{64}{H}}$, ..., and $P_{32-\frac{32}{H}+1} = P_{32-\frac{32}{H}+2} = ... = P_{32}$, then it is called H-group encryption.

Definition 2. For a warp of block threads accessing cache lines by index $I_1, I_2, ..., I_{32}$, respectively, if $[I_1]_{\log_2 M} = [I_2]_{\log_2 M} = ... = [I_{32}]_{\log_2 M}$, then we call it generalized simultaneous cache-collision, where $M$ is the number of cache lines and $[x]_n$ stands for the $n$ most significant bits (MSBs) of $x$.

We introduce the following notations across the paper.

$P_l$ / $C_l$: denotes the $l$-th byte of the 16-byte plaintext/ciphertext, where $l \in [0, 15] \cap \mathbb{Z}$.

$P_l^{(h)}$ / $C_l^{(h)}$: denotes any byte of plaintext/ciphertext from the $h$-th group of threads in the H-group plaintexts encryption scenario, where $h \in \{1, 2, 3, ..., H\}$.

$\overrightarrow{P_l^{(h)}}$ / $\overrightarrow{C_l^{(h)}}$: denotes the multi-dimensional column vector of $P_l^{(h)}$ / $C_l^{(h)}$ corresponding to multiple samples.

$T_m$: denotes the $m$-th sample point of any single trace $T$ in the time domain.

$\overrightarrow{T_m}$: denotes the multi-dimensional column vector of $T_m$ corresponding to multiple samples.

$\{\overrightarrow{X}\}_i^N$: denotes the $i$-th scalar of an $N$-dimensional column vector or row vector $\overrightarrow{X}$.

$\{X\}_{i\cdot}^{N \times M}$ / $\{X\}_{\cdot j}^{N \times M}$: denotes the $i$-th row vector or the $j$-th column vector of an $N \times M$ matrix $X$.

III. LEAKAGE ACQUISITION AND DETECTION

A. Set-ups

In this work, we investigate the side-channel vulnerabilities of NVIDIA's GPUs to cache-collision attacks on an AES implementation. Specifically, we target a GeForce GT 620 GPU connected to a host computer through a PCIe bus. The device has one streaming multiprocessor of 48 SPs, an L2 cache of 64KiB, and it is equipped with an off-chip device memory

of 454MiB. Although the GPU is of the Fermi architecture, it is enough to show the vulnerability of GPUs.

The original AES implementation is ported to our Fermi GPU from a well-known open source library [12]. We do not change any code except some CUDA-specific operations, in order to make our attack more convincing. The GPU-based implementation follows the most general procedure of a CUDA program. First, the 32 plaintext blocks to be encrypted, the S-Box LUT of 256 bytes and the 11 subkeys of 176 bytes in total are transferred from the host memory to the GPU device memory with the memcpy(·) function before a kernel launch. Second, a kernel is launched with parameters of 1 block and 32 threads per block. Third, the 32 ciphertext blocks are copied from the GPU device memory back to the host memory with the memcpy(·) function after the kernel finishes. Note that we do not use any type of memory except global memory in the target GPU-based AES implementation.

B. Leakage Acquisition

Electro-magnetic emanation always accompanies electronic devices, so it can be captured without difficulty. However, it is not so easy to measure useful signals from electro-magnetic emanation in practical scenarios. In this work, the useful signals are the informative leakages of AES encryption on the GPU, which are crucial to a successful EM attack.

Compared with power analysis, EM analysis enables us to take advantage of localization effects, which makes our attack more efficient than its power-based counterpart. In this work, we use a small magnetic probe, Rohde & Schwarz RF B 3-2, instead of a larger one so as to probe localized leakages from near-field emanation [13]. Theoretically, the region located less than $1/(2\pi)$ of a wavelength away from the source is called the near field. All our probings in this work are conducted in this region. Specifically, the procedures are as follows:

To locate. A printed circuit board (PCB) such as a GPU card is generally composed of hundreds of electronic parts and components including chips, capacitors, resistors, inductors and so on, but it is unnecessary for our experiments to check all of these elements. Generally speaking, only the area right above the GPU chip and the capacitors on the back of the GPU chip should be considered, because these components or positions are more likely to produce informative leakages. This was confirmed afterwards in our experiment. We run our AES encryption program in a loop and adjust the EM probe on the candidate components within their near-field zone, until we find a position at which the oscilloscope (Agilent KeySight DSO9104A) captures a periodic signal. If some patterns in the periodic signal repeat 9 to 10 times when zooming in on it, a leakage position is found.

To capture. Although we have identified the target signal, it is still not easy to capture it without external triggers. In fact, it is impractical to provide an external trigger in real scenarios, so we have to exploit special patterns within the target signal as an internal trigger. We design a delicate trigger by using the minimal voltage (B in Fig.1) within the target signal. Specifically, the oscilloscope is configured in Window Exit trigger mode with Voltage High=-0.11 and Voltage Low=-0.13. Then, almost aligned traces are acquired.

To align. Although we have captured almost aligned EM traces with our delicate trigger, this is still not enough to perform a successful attack. More accurate alignment techniques are needed. First, we observe the special patterns on the trace and find a two-peak pattern (A in Fig.1) that is shared by all traces, so it is likely an ideal reference for aligning all traces in the time domain. Second, we match the pattern among several traces and find that the patterns in different traces are strongly correlated (Pearson Correlation Coefficient, PCC > 0.9). Third, for all traces, we search for the pattern by fixing one trace and sliding the others within a small range to find the position at which the pattern has the maximum PCC with the pattern in the fixed trace. We exclude traces whose maximum PCC is less than 0.90. All traces with a maximum PCC of no less than 0.90 can then be aligned properly.

[Figure 1: voltage (V) versus time (sec); the two-peak pattern is marked A, the minimal-voltage trigger point is marked B, and the final round encryption is labelled.]
Fig. 1. EM trace of GPU-based AES implementation.

C. Leakage Detection

Cache-collisions are ubiquitous in any processor with a cached-memory architecture, and they usually cause distinctive power consumption. For generalized simultaneous cache-collision on the GPU, we assume that block threads with collisions consume different power from those without collisions. Thus, internal information of the algorithm is leaked during execution.

In this work, Welch's t-test [14] is employed to detect leakages induced by generalized simultaneous cache-collisions on the GPU. The aim of the t-test is to provide a quantitative value as a probability that the means of two sets are different. In other words, a t-test gives a probability for examining the validity of the null hypothesis that the samples in both sets were drawn from the same population, i.e., that the two sets are not distinguishable.

In our leakage detection, $N$ 2-group plaintexts are encrypted:

\[
P = \left[ \overrightarrow{P_0^{(1)}}, \overrightarrow{P_1^{(1)}}, ..., \overrightarrow{P_{15}^{(1)}}, \overrightarrow{P_0^{(2)}}, \overrightarrow{P_1^{(2)}}, ..., \overrightarrow{P_{15}^{(2)}} \right] \tag{1}
\]

where $\overrightarrow{P_{\cdot}^{(\cdot)}}$ is an $N$-dimensional column vector. At the same time, we obtain $N$ 2-group ciphertexts and $N$ EM traces:

\[
C = \left[ \overrightarrow{C_0^{(1)}}, \overrightarrow{C_1^{(1)}}, ..., \overrightarrow{C_{15}^{(1)}}, \overrightarrow{C_0^{(2)}}, \overrightarrow{C_1^{(2)}}, ..., \overrightarrow{C_{15}^{(2)}} \right] \tag{2}
\]

\[
T = \left[ \overrightarrow{T_1}, \overrightarrow{T_2}, ..., \overrightarrow{T_M} \right] \tag{3}
\]

where $\overrightarrow{C_{\cdot}^{(\cdot)}}$ and $\overrightarrow{T_{\cdot}}$ are also $N$-dimensional column vectors, and $M$ is the number of sample points in the time domain.

We take the first intermediates $r_0^{(1)}$, $r_0^{(2)}$ of the final round of AES encryption as an example, and detect their leakage in three steps:

First, the $N$ EM traces are partitioned into two groups $G_0$ and $G_1$ with respect to the intermediates $r_0^{(1)}$, $r_0^{(2)}$ (precisely, the MSBs of the intermediates). $G_0$ contains the EM traces for which the S-Box accesses by $r_0^{(1)}$ and $r_0^{(2)}$ collide, while $G_1$ contains the rest of the EM traces. More formally,

\[
\begin{aligned}
G_0 &= \left\{ \{T\}_{i\cdot}^{N \times M} \;\middle|\; \left\{ \left[ \overrightarrow{r_0^{(1)}} \right]_1 \right\}_i^N = \left\{ \left[ \overrightarrow{r_0^{(2)}} \right]_1 \right\}_i^N \right\} \\
G_1 &= \left\{ \{T\}_{i\cdot}^{N \times M} \;\middle|\; \left\{ \left[ \overrightarrow{r_0^{(1)}} \right]_1 \right\}_i^N \neq \left\{ \left[ \overrightarrow{r_0^{(2)}} \right]_1 \right\}_i^N \right\}
\end{aligned}
\tag{4}
\]

where $[x]_1$ stands for the most significant bit (MSB) of $x$. So the t-statistic $\overrightarrow{t}$ between $G_0$ and $G_1$ is:

\[
\overrightarrow{t} = \frac{\overrightarrow{\mu_0} - \overrightarrow{\mu_1}}{\sqrt{\dfrac{\overrightarrow{s_0}^{2}}{\overrightarrow{n_0}} + \dfrac{\overrightarrow{s_1}^{2}}{\overrightarrow{n_1}}}} \tag{5}
\]

where $\overrightarrow{\mu_0}$, $\overrightarrow{\mu_1}$, $\overrightarrow{s_0}$, $\overrightarrow{s_1}$, $\overrightarrow{n_0}$, $\overrightarrow{n_1}$ and $\overrightarrow{t}$ are $M$-dimensional row vectors; $\overrightarrow{\mu_0}$, $\overrightarrow{\mu_1}$ are the column-wise means of $G_0$ and $G_1$; $\overrightarrow{s_0}$, $\overrightarrow{s_1}$ are the column-wise standard deviations of $G_0$ and $G_1$; and $\overrightarrow{n_0}$, $\overrightarrow{n_1}$ are the cardinalities of $G_0$ and $G_1$. In addition, all operations in Equ.5, including $\cdot + \cdot$, $\cdot - \cdot$, $\cdot / \cdot$, $(\cdot)^2$ and $\sqrt{\cdot}$, are component-wise.

Second, for comparison, the $N$ EM traces are randomly partitioned into two groups (of equal size here, although this is not necessary), specifically,

\[
\begin{aligned}
G_0' &= \left\{ \{T\}_{i\cdot}^{N \times M} \;\middle|\; i \in \{1, 2, ..., \tfrac{N}{2}\} \right\} \\
G_1' &= \left\{ \{T\}_{i\cdot}^{N \times M} \;\middle|\; i \in \{\tfrac{N}{2}+1, \tfrac{N}{2}+2, ..., N\} \right\}
\end{aligned}
\tag{6}
\]

and the t-statistic $\overrightarrow{t_R}$ between $G_0'$ and $G_1'$ is computed as above.

Third, an appropriate threshold is needed to decide the ACCEPT/REJECT status of the above t-test. Generally speaking, two groups of samples are assumed to be from different populations if $|t| > 4.5$ [15]. We evaluate $|t|_{\max} = \max_{i \in \{1, 2, ..., M\}} \{ |\overrightarrow{t}| \}_i^M$ with respect to the number of samples $N$ (Fig.2). $|t|_{\max}$ in the second setting (Equ.6) is below the threshold for any number of samples in $[50, 3000] \cap \mathbb{Z}$, while $|t|_{\max}$ in the first setting (Equ.4) is significantly above the threshold and keeps increasing with more samples. So it can be concluded that generalized simultaneous cache-collision does cause leakages.

[Figure 2: maximal |t| versus number of traces (500 to 3000) for traces grouped by MSBs and traces grouped at random, with the threshold t = 4.5.]
Fig. 2. Maximal |t| in the time domain vs. the number of sample traces.

By comparing $\overrightarrow{t}$ with $\overrightarrow{t_R}$ on 1,000 samples, 2,000 samples and 3,000 samples (Fig.3), it is clear that the three plots on the left show leakages caused by generalized simultaneous cache-collisions at many time points, and that these leakages increase remarkably with more samples, while the three plots on the right resemble each other and their t-statistics stay entirely within the threshold, as expected.

The same conclusions are reached after performing t-tests on sets grouped by MSBs or at random based on the other 15 intermediate bytes. So it is feasible to recover the 16-byte secret key of AES with DEMA based on the generalized simultaneous cache-collision leakage model.

IV. CACHE-COLLISION ATTACKS ON GPUS

Since the warp is the basic unit of execution on a GPU, we carry out our attacks on a single warp with chosen plaintexts. As mentioned above, generalized simultaneous cache-collision among threads happens when all threads in a warp read from the same cache line. There are 160 S-Box LUT READ operations in total for the 10 encryption rounds, and for every READ operation the threads of a warp may collide. Due to the diffusion effect in every single encryption round, collisions and non-collisions are randomly distributed over the S-Box accesses. Our leakage model is based on the basic hypothesis that the power consumption, which is directly correlated with electro-magnetic emanation, differs greatly between S-Box look-ups (cache accesses) with collisions and S-Box look-ups without collisions. Since our EM traces are aligned from the back, we make a trial on the 16 S-Box look-ups in the final round of AES encryption. The final round encryption is:

\[
c_{l'} \leftarrow \mathrm{SBox}(r_l) \oplus k_{l'}, \quad (l, l' \in \{0, 1, ..., 15\}) \tag{7}
\]

where $\mathrm{SBox}(\cdot)$ is the S-Box LUT, $k_{l'}$ and $c_{l'}$ are the $l'$-th byte of the final round key and the $l'$-th byte of the ciphertext, respectively, and $r_l$ is the intermediate value, which is used to model the predicted EM leakage. $l'$ is not necessarily equal to $l$ because of the ShiftRow operation. Additionally, the order of the 16 S-Box look-ups in the final round encryption is determined by the compiler.

For simplicity, we start with the 2-group case, in which at most two different intermediate values occur among the 32 values. The two intermediate values, named $r_l^{(1)}$ and $r_l^{(2)}$, can be easily calculated from the corresponding ciphertext bytes $c_{l'}^{(1)}$, $c_{l'}^{(2)}$ and a guessed key byte $k_{guess}$ as:

\[
\begin{aligned}
r_l^{(1)} &= \mathrm{SBox}^{-1}\left(c_{l'}^{(1)} \oplus k_{guess}\right) \\
r_l^{(2)} &= \mathrm{SBox}^{-1}\left(c_{l'}^{(2)} \oplus k_{guess}\right)
\end{aligned}
\tag{8}
\]
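As a rough illustration of Equ.7 and Equ.8, the following Python sketch recovers one key byte from simulated data in the 2-group case. It is only a sketch under stated assumptions, not the implementation used in this work: the single-sample "traces", the leakage levels `e0` and `e1`, the noise level, the seed and all function names are hypothetical, and the S-Box is generated from its standard algebraic definition rather than taken from the target library. The grouping rule (equal MSBs, i.e. both look-ups fall into the same 128-byte cache line) anticipates the difference-of-means distinguisher that is formalized in the remainder of this section.

```python
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox_entry(a):
    """AES S-Box entry: field inverse (a^254, with 0 mapped to 0), then the affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, a)

    def rot(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF

    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [sbox_entry(a) for a in range(256)]
INV_SBOX = [0] * 256
for value, substituted in enumerate(SBOX):
    INV_SBOX[substituted] = value


def collides(c1, c2, k_guess):
    """Equ.8, then the MSB test: do both predicted look-ups hit the same cache line?"""
    r1 = INV_SBOX[c1 ^ k_guess]
    r2 = INV_SBOX[c2 ^ k_guess]
    return (r1 >> 7) == (r2 >> 7)


def simulate(n_traces, key_byte, rng, e0=0.0, e1=1.0, sigma=0.5):
    """Hypothetical one-sample traces: level e0 on collision, e1 otherwise, plus Gaussian noise."""
    data = []
    for _ in range(n_traces):
        c1, c2 = rng.randrange(256), rng.randrange(256)
        level = e0 if collides(c1, c2, key_byte) else e1
        data.append((c1, c2, level + rng.gauss(0.0, sigma)))
    return data


def difference_of_means(data, k_guess):
    """|mean(colliding traces) - mean(non-colliding traces)| for one key guess."""
    sums, counts = [0.0, 0.0], [0, 0]
    for c1, c2, sample in data:
        group = 0 if collides(c1, c2, k_guess) else 1
        sums[group] += sample
        counts[group] += 1
    if 0 in counts:
        return 0.0
    return abs(sums[0] / counts[0] - sums[1] / counts[1])


def recover_key_byte(data):
    """Return the key guess with the largest difference of means."""
    return max(range(256), key=lambda k: difference_of_means(data, k))


rng = random.Random(2018)            # fixed seed so the run is repeatable
traces = simulate(1000, key_byte=0x3C, rng=rng)
recovered = recover_key_byte(traces)
print(hex(recovered))
```

Real traces contain many sample points and far more noise than this toy model, so the number of traces needed here says nothing about the measured results reported in this paper.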

Fig. 3. t statistic in the time domain.
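The t-statistics plotted in Fig. 3 follow Equ.5. A minimal sketch of that component-wise Welch's t-test in plain Python is given below. The function names and the toy data are illustrative assumptions, the unbiased sample variance is assumed for $s^2$, and the threshold of 4.5 is the one quoted from [15].

```python
import math
import random


def column_stats(group):
    """Return (n, per-column means, per-column unbiased variances) of equal-length traces."""
    n = len(group)
    m = len(group[0])
    means = [sum(trace[j] for trace in group) / n for j in range(m)]
    variances = [
        sum((trace[j] - means[j]) ** 2 for trace in group) / (n - 1)
        for j in range(m)
    ]
    return n, means, variances


def welch_t(g0, g1):
    """Component-wise t-statistic between two groups of traces (Equ.5)."""
    n0, mu0, var0 = column_stats(g0)
    n1, mu1, var1 = column_stats(g1)
    return [
        (a - b) / math.sqrt(v0 / n0 + v1 / n1)
        for a, b, v0, v1 in zip(mu0, mu1, var0, var1)
    ]


def leaky_points(t_values, threshold=4.5):
    """Indices of sample points whose |t| exceeds the threshold."""
    return [j for j, t in enumerate(t_values) if abs(t) > threshold]


# Tiny hand-checkable example: two traces per group, two sample points.
t_toy = welch_t([[1.0, 0.0], [3.0, 1.0]], [[2.0, 0.0], [6.0, 1.0]])

# Simulated groups (hypothetical data): only sample point 2 differs between the groups.
rng = random.Random(7)
group0 = [[rng.gauss(0.0, 0.3) for _ in range(4)] for _ in range(200)]
group1 = [[rng.gauss(1.0 if j == 2 else 0.0, 0.3) for j in range(4)] for _ in range(200)]
t_sim = welch_t(group0, group1)
print(leaky_points(t_sim))
```

In the toy example the first sample point gives $t = (2 - 4)/\sqrt{2/2 + 8/2} = -2/\sqrt{5}$, which can be checked by hand against Equ.5.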

In Equ.8, $\mathrm{SBox}^{-1}(\cdot)$ is the inverse S-Box LUT. Since $c_{l'}^{(1)}$ and $c_{l'}^{(2)}$ are known, $r_l^{(1)}$ and $r_l^{(2)}$ are determined by the key byte guess and are denoted $r_l^{(1)}(k)$ and $r_l^{(2)}(k)$, respectively. Our leakage model is defined as:

\[
E_l = E\left(r_l^{(1)}(k), r_l^{(2)}(k)\right) + B \tag{9}
\]

where $E_l$ is the predicted EM leakage, $B$ is assumed to be Gaussian noise, and

\[
E(x, y) =
\begin{cases}
E_0, & [x]_1 = [y]_1 \\
E_1, & [x]_1 \neq [y]_1
\end{cases}
\tag{10}
\]

$E_0$ and $E_1$ are assumed to be Gaussian variables with a significant difference-of-mean (DoM). In fact, the significant DoM between them has been verified in our leakage detection. The $N$ EM traces are grouped into two sets (Equ.12). One set, $G_0$, is composed of the EM traces in which generalized simultaneous cache-collisions happen, and the rest of the EM traces belong to the other set, $G_1$. With the correct key guess $k = k_{correct}$ and a sufficient number of EM traces, the following quantity will be significantly greater than with incorrect key guesses:

\[
\overrightarrow{\Delta(k)} = \left| \frac{1}{n_0} \sum_{\overrightarrow{g} \in G_0(k)} \overrightarrow{g} \; - \; \frac{1}{n_1} \sum_{\overrightarrow{g} \in G_1(k)} \overrightarrow{g} \right| \tag{11}
\]

where $\overrightarrow{g}$ is an $M$-dimensional row vector, $n_0$ and $n_1$ are the cardinalities of $G_0(k)$ and $G_1(k)$, respectively, and

\[
\begin{aligned}
G_0(k) &= \left\{ \{T\}_{i\cdot}^{N \times M} \;\middle|\; \left\{ \left[ \overrightarrow{r_l^{(1)}(k)} \right]_1 \right\}_i^N = \left\{ \left[ \overrightarrow{r_l^{(2)}(k)} \right]_1 \right\}_i^N \right\} \\
G_1(k) &= \left\{ \{T\}_{i\cdot}^{N \times M} \;\middle|\; \left\{ \left[ \overrightarrow{r_l^{(1)}(k)} \right]_1 \right\}_i^N \neq \left\{ \left[ \overrightarrow{r_l^{(2)}(k)} \right]_1 \right\}_i^N \right\}
\end{aligned}
\tag{12}
\]

So the correct key byte can be calculated by:

\[
k_{correct} = \operatorname*{argmax}_{k} \; \max_{i} \left\{ \overrightarrow{\Delta(k)} \right\}_i^M, \quad i \in \{1, 2, ..., M\}, \; k \in \{0, 1, ..., 255\} \tag{13}
\]

We recover all 16 key bytes in a divide-and-conquer manner (Algorithm 1) by assigning $l = 0, 1, 2, ..., 15$ in Equ.12, respectively.

In the 2-group case, for any two random intermediate values $r^{(1)}$ and $r^{(2)}$, the occurrence probability of generalized simultaneous cache-collision is $1/2$, but it drops to $1/2^{3}$, $1/2^{7}$, $1/2^{15}$ and $1/2^{31}$ for the 4-group, 8-group, 16-group and 32-group cases, respectively. Generalized simultaneous cache-collision scarcely happens in the 16-group and 32-group cases, so we do not attack in these cases. For the 4-group case, only $N/8$ out of $N$ EM traces collide in the final round encryption. In the 4-group case, we compute the DoM

between the following two groups of EM traces:

\[
\begin{aligned}
G_0(k) &= \left\{ \{T\}_{i\cdot}^{N \times M} \;\middle|\; \sum_{j \in [1,4] \cap \mathbb{Z}} \left\{ \left[ \overrightarrow{r_l^{(j)}(k)} \right]_1 \right\}_i^N = 0 \text{ or } 4 \right\} \\
G_1(k) &= T - G_0(k)
\end{aligned}
\tag{14}
\]

Similarly, in the 8-group case,

\[
\begin{aligned}
H_0(k) &= \left\{ \{T\}_{i\cdot}^{N \times M} \;\middle|\; \sum_{j \in [1,8] \cap \mathbb{Z}} \left\{ \left[ \overrightarrow{r_l^{(j)}(k)} \right]_1 \right\}_i^N = 0 \text{ or } 8 \right\} \\
H_1(k) &= T - H_0(k)
\end{aligned}
\tag{15}
\]

where all notations are defined as in Equ.12.

Algorithm 1 DEMA with Cache-Collision Leakage Model
Input: $N$ EM traces: $T = \left[ \overrightarrow{T_1}, \overrightarrow{T_2}, ..., \overrightarrow{T_M} \right]$, where $T$ is an $N \times M$ matrix, and $\overrightarrow{T_1}, \overrightarrow{T_2}, ..., \overrightarrow{T_M}$ are $N$-dimensional column vectors;
the 1st ciphertexts: $C^{(1)} = \left[ \overrightarrow{C_0^{(1)}}, \overrightarrow{C_1^{(1)}}, ..., \overrightarrow{C_{15}^{(1)}} \right]$, where $C^{(1)}$ is an $N \times 16$ matrix, and $\overrightarrow{C_0^{(1)}}, \overrightarrow{C_1^{(1)}}, ..., \overrightarrow{C_{15}^{(1)}}$ are $N$-dimensional column vectors;
the 2nd ciphertexts: $C^{(2)} = \left[ \overrightarrow{C_0^{(2)}}, \overrightarrow{C_1^{(2)}}, ..., \overrightarrow{C_{15}^{(2)}} \right]$, where $C^{(2)}$ is an $N \times 16$ matrix, and $\overrightarrow{C_0^{(2)}}, \overrightarrow{C_1^{(2)}}, ..., \overrightarrow{C_{15}^{(2)}}$ are $N$-dimensional column vectors.
Output: $K = [k_0, k_1, ..., k_{15}]$: 16-byte correct secret key.
    $D = \{0, 13, 10, 7, 4, 1, 14, 11, 8, 5, 2, 15, 12, 9, 6, 3\}$
    for $l \leftarrow 0$ to $15$ do
        for $k_{guess} \leftarrow 0$ to $255$ do
            $\overrightarrow{R^{(1)}} \leftarrow \mathrm{SBox}^{-1}\left( \overrightarrow{C_{D[l]}^{(1)}} \oplus k_{guess} \right)$
            $\overrightarrow{R^{(2)}} \leftarrow \mathrm{SBox}^{-1}\left( \overrightarrow{C_{D[l]}^{(2)}} \oplus k_{guess} \right)$
            $\overrightarrow{R^{(1)}} \leftarrow \overrightarrow{R^{(1)}} \gg 7$
            $\overrightarrow{R^{(2)}} \leftarrow \overrightarrow{R^{(2)}} \gg 7$
            $\overrightarrow{R} \leftarrow \overrightarrow{R^{(1)}} \oplus \overrightarrow{R^{(2)}}$
            $\overrightarrow{S} \leftarrow 1 - \overrightarrow{R}$
            $r_{sum} \leftarrow \sum_{i=1}^{N} \{\overrightarrow{R}\}_i$
            $s_{sum} \leftarrow \sum_{i=1}^{N} \{\overrightarrow{S}\}_i$
            for $m \leftarrow 1$ to $M$ do
                $\overrightarrow{U} \leftarrow \overrightarrow{T_m} \cdot \overrightarrow{R}$ /* component-wise multiply */
                $\overrightarrow{V} \leftarrow \overrightarrow{T_m} \cdot \overrightarrow{S}$ /* component-wise multiply */
                $u_{mean} \leftarrow \frac{1}{r_{sum}} \sum_{i=1}^{N} \{\overrightarrow{U}\}_i$
                $v_{mean} \leftarrow \frac{1}{s_{sum}} \sum_{i=1}^{N} \{\overrightarrow{V}\}_i$
                $\Delta_m \leftarrow |u_{mean} - v_{mean}|$
            $W_{k_{guess}} \leftarrow \max\{\Delta_1, \Delta_2, ..., \Delta_M\}$
        $W_{max} \leftarrow \max\{W_0, W_1, ..., W_{255}\}$
        $k_{D[l]} \leftarrow \operatorname{argmax}_{k}(W_{max})$
    return $K = [k_0, k_1, ..., k_{15}]$

V. EVALUATION RESULTS

In our experiment, we set up a chosen-plaintext attack scenario, in which attackers are capable of encrypting any plaintexts and obtain the corresponding ciphertexts as well as EM traces. We mount DEMA attacks on the GPU-based AES implementation in three cases, and analyze the effectiveness and efficiency of the proposed leakage model in these cases.

First, the relation between the DoMs and the number of EM traces is evaluated for all possible key byte candidates in all cases. We find that DEMA on the 6th key byte in the 2-group case performs the worst among all key bytes in all three cases. As shown in Fig.4, the secret key byte stands out from the 256 key byte candidates with approximately 8,000 EM traces. Since this is the worst case, recovering the other key bytes with DEMA requires far fewer EM traces.

Second, the efficiency of DEMA with the proposed leakage model is evaluated. Specifically, we investigate the global success rate (GSR) versus the number of EM traces in the 2/4/8-group cases (Fig.7). The evaluation results show that DEMA in the 4-group case outperforms DEMA in the other cases. As shown in Fig.5, the 16 correct key bytes are clearly visible in circles when attacked with DEMA in the 4-group case. That is to say, the 16-byte secret key of the GPU-based AES implementation can be recovered with 5,000 EM traces in chosen-plaintext scenarios. We also investigate the number of recovered key bytes for different numbers of EM traces (Fig.6) in the 2/4/8-group cases. Most of the 16 key bytes are recovered efficiently except some special ones, so we combine KEA (Key Enumeration Algorithm) with DEMA in our key-recovery attacks to improve performance. When assisted by KEA, 600 EM traces suffice, at the expense of less than 100 milliseconds for exhaustive key search.

VI. CONCLUSION

This paper presents a cache-collision attack on a GPU-based AES implementation using its electro-magnetic side-channel leakages. We propose a novel leakage model based on generalized simultaneous cache-collisions and mount a complete key-recovery attack with a KEA-assisted DEMA. Our attack needs far fewer traces than previous styles of power analysis attack.

As the first study of cache-collision attacks with electro-magnetic leakages against a GPU-based AES implementation, this work suggests that generalized simultaneous cache-collision within GPUs does cause leakages via electro-magnetic side-channels. So cache-collision attacks should be considered in the design of secure GPU-based cryptographic implementations.

VII. ACKNOWLEDGMENT

This work is supported in part by the National Natural Science Foundation of China (No. 61472416, 61632020 and 61602468), and the Fundamental Theory and Cutting Edge

Technology Research Program of Institute of Information Engineering, Chinese Academy of Sciences (No. Y7Z0401102). We would like to acknowledge their support.

[Figure 4: DoM (×10^-3) versus number of traces for the 6th key byte guess in the 2-group case, with the correct value marked.]
Fig. 4. The number of traces vs. DoM for all possible key candidates.

[Figure 5: sixteen panels, S-Box 1 to S-Box 16, each plotting DoM versus key guess. KG: key guess, DoM: difference of mean.]
Fig. 5. Evaluation results of DEMA in 4-group case.

[Figure 6: number of recovered key bytes versus number of traces for DEMA in the 2-group, 4-group and 8-group cases.]
Fig. 6. The number of recovered bytes vs. the number of traces.

[Figure 7: global success rate (GSR) versus number of traces for DEMA in the 2-group, 4-group and 8-group cases and for DEMA+KEA in the 4-group case.]
Fig. 7. GSR vs. the number of traces.

REFERENCES

[1] Y. Yang, Z. Guan, H. Sun, and Z. Chen, “Accelerating RSA with fine-grained parallelism using GPU,” in Information Security Practice and Experience - 11th International Conference, ISPEC 2015, Beijing, China, May 5-8, 2015. Proceedings, 2015, pp. 454–468.
[2] Q. Li, C. Zhong, K. Zhao, X. Mei, and X. Chu, “Implementation and analysis of AES encryption on GPU,” in 14th IEEE International Conference on High Performance Computing and Communication & 9th IEEE International Conference on Embedded Software and Systems, HPCC-ICESS 2012, Liverpool, United Kingdom, June 25-27, 2012, 2012, pp. 843–848.
[3] N. Nishikawa, H. Amano, and K. Iwai, “Implementation of bitsliced AES encryption on cuda-enabled GPU,” in Network and System Security - 11th International Conference, NSS 2017, Helsinki, Finland, August 21-23, 2017, Proceedings, 2017, pp. 273–287.
[4] A. Miele, “Buffer overflow vulnerabilities in CUDA: a preliminary analysis,” J. Computer Virology and Hacking Techniques, vol. 12, no. 2, pp. 113–120, 2016.
[5] H. Naghibijouybari, K. N. Khasawneh, and N. B. Abu-Ghazaleh, “Constructing and characterizing covert channels on gpgpus,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2017, Cambridge, MA, USA, October 14-18, 2017, 2017, pp. 354–366.
[6] C. Luo, Y. Fei, P. Luo, S. Mukherjee, and D. R. Kaeli, “Side-channel power analysis of a GPU AES implementation,” in 33rd IEEE International Conference on Computer Design, ICCD 2015, New York City, NY, USA, October 18-21, 2015, 2015, pp. 281–288.
[7] A. J. Duncan, S. Creese, and M. Goldsmith, “Insider attacks in cloud computing,” in 11th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2012, Liverpool, United Kingdom, June 25-27, 2012, 2012, pp. 857–862.
[8] Z. H. Jiang, Y. Fei, and D. R. Kaeli, “A complete key recovery timing attack on a GPU,” in 2016 IEEE International Symposium on High Performance Computer Architecture, HPCA 2016, Barcelona, Spain, March 12-16, 2016, 2016, pp. 394–405.
[9] Z. H. Jiang, Y. Fei, and D. R. Kaeli, “A novel side-channel timing attack on GPUs,” in Proceedings of the on Great Lakes Symposium on VLSI 2017, Banff, AB, Canada, May 10-12, 2017, 2017, pp. 167–172.
[10] A. Bogdanov, T. Eisenbarth, C. Paar, and M. Wienecke, “Differential cache-collision timing attacks on AES with applications to embedded cpus,” in Topics in Cryptology - CT-RSA 2010, The Cryptographers’ Track at the RSA Conference 2010, San Francisco, CA, USA, March 1-5, 2010. Proceedings, 2010, pp. 235–251.
[11] C. Lauradoux, “Collision attacks on processors with cache and countermeasures,” in WEWoRC 2005 - Western European Workshop on Research in Cryptology, July 5-7, 2005, Leuven, Belgium, 2005, pp. 76–85.
[12] PolarSSL, “An open source SSL library licensed by ARM limited,” https://[Link].
[13] D. Agrawal, B. Archambeault, J. R. Rao, and P. Rohatgi, “The EM side-channel(s),” in Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 13-15, 2002, Revised Papers, 2002, pp. 29–45.
[14] J. G. Goodwill, J. Jaffe, and P. Rohatgi, “A testing methodology for side-channel resistance validation,” in NIST non-invasive attack testing workshop, 2011, 2011.
[15] T. Schneider and A. Moradi, “Leakage assessment methodology - A clear roadmap for side-channel evaluations,” in Cryptographic Hardware and Embedded Systems - CHES 2015 - 17th International Workshop, Saint-Malo, France, September 13-16, 2015, Proceedings, 2015, pp. 495–513.
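As a supplementary note to Section III-B, the pattern-based alignment described under "To align" can be sketched as follows. This is a minimal illustration in plain Python and not the tooling used in this work: the function names, the window parameters and the toy traces are assumptions, and the shift is applied as a circular rotation purely to keep the example short. The 0.90 cut-off is the one stated in Section III-B.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_shift(reference, trace, start, length, max_shift):
    """Slide a window over `trace`; return (shift, pcc) with the highest PCC
    against the reference pattern reference[start:start + length]."""
    pattern = reference[start:start + length]
    best = (0, -1.0)
    for shift in range(-max_shift, max_shift + 1):
        lo = start + shift
        if lo < 0 or lo + length > len(trace):
            continue  # window would fall outside the trace
        pcc = pearson(pattern, trace[lo:lo + length])
        if pcc > best[1]:
            best = (shift, pcc)
    return best


def align(reference, traces, start, length, max_shift, min_pcc=0.90):
    """Align traces to the reference pattern and drop those below the PCC cut-off."""
    aligned = []
    for trace in traces:
        shift, pcc = best_shift(reference, trace, start, length, max_shift)
        if pcc < min_pcc:
            continue  # excluded, as in Section III-B
        # Circular rotation so that the pattern lands at `start` (a simplification).
        aligned.append(trace[shift:] + trace[:shift])
    return aligned


# Toy data: a two-peak pattern, the same trace delayed by two samples, and a flat trace.
reference = [0, 0, 0, 1, 5, 1, 4, 1, 0, 0, 0, 0]
delayed = [0, 0, 0, 0, 0, 1, 5, 1, 4, 1, 0, 0]
flat = [1] * 12

shift, pcc = best_shift(reference, delayed, start=3, length=5, max_shift=3)
kept = align(reference, [delayed, flat], start=3, length=5, max_shift=3)
print(shift, len(kept))
```

On measured traces the pattern position, window length and search range would have to be chosen from the two-peak pattern marked A in Fig. 1.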
