
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 60, NO. 8, AUGUST 2013, p. 517

Fault Rate Analysis: Breaking Masked AES Hardware Implementations Efficiently
An Wang, Man Chen, Zongyue Wang, and Xiaoyun Wang

Abstract—In 2011, Li et al. presented clockwise collision analysis on a nonprotected Advanced Encryption Standard (AES) hardware implementation. In this brief, we first propose a new clockwise collision attack, called fault rate analysis (FRA), on masked AES. Then, we analyze the critical and noncritical paths of the S-box and find that, for its three input bytes, namely, the input value, the input mask, and the output mask, the path relating to the output mask is much shorter than those relating to the other two inputs. Therefore, some sophisticated glitch cycles can be chosen such that the values in the critical path of the whole S-box are destroyed but this short path is not affected. As a result, the output mask does not offer protection to the S-box, which leads to a more efficient attack. Compared with three attacks on masking countermeasures presented at the Workshop on Cryptographic Hardware and Embedded Systems 2010 and 2011, our method costs only about 8% of their time and 4% of their storage space.

Index Terms—Collision attack, fault rate analysis (FRA), masking, path delay, side-channel attack.

Manuscript received March 26, 2013; accepted May 31, 2013. Date of publication July 16, 2013; date of current version August 10, 2013. This work was supported in part by the 973 Project of China under Grant 2013CB834205, by the National Natural Science Foundation of China under Grant 61133013, and by the open fund of the Science and Technology on Information Assurance Laboratory. This brief was recommended by Associate Editor C.-Y. Lee.
A. Wang is with the Institute for Advanced Study and the Institute of Microelectronics, Tsinghua University, Beijing 100084, China (e-mail: wanganl@tsinghua.edu.cn).
M. Chen and Z. Wang are with the Key Laboratory of Cryptologic Technology and Information Security, Ministry of Education, Shandong University, Jinan 250100, China (e-mail: manchen@mail.sdu.edu.cn; zongyue1988@sina.com).
X. Wang is with the Institute for Advanced Study, Tsinghua University, Beijing 100084, China (e-mail: xiaoyunwang@tsinghua.edu.cn).
Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSII.2013.2268379
1549-7747 © 2013 IEEE

I. INTRODUCTION

Since Kocher proposed the timing attack in 1996 [1], many cryptanalysts have focused on side-channel attacks. The side-channel information that leaks the secret key includes power consumption, timing, electromagnetic radiation, faulty output, and so on. Accordingly, many side-channel models such as collision [2]–[4], correlation coefficient [5], [6], template [7], and fault [8] were considered.

At the Workshop on Cryptographic Hardware and Embedded Systems 2010, Li et al. presented a new fault attack, called fault sensitivity analysis (FSA) [9], which exploited the fact that the critical path of an Advanced Encryption Standard (AES) S-box combinational circuit is data dependent. Thus, an illegal clock and a timing model can be used for recovering key bytes. Combining FSA with the correlation collision attack [10], Moradi et al. gave an attack on the masked AES hardware implementation by colliding timing characteristics [11]. However, the acquisition of the setup time was not efficient, and 512 setup-time samples were required by the correlation collision approach [10]. In 2011, Li et al. introduced the clockwise collision analysis [12], which was based on the fact that, when the inputs of a combinational circuit for two consecutive clock cycles are the same, little computation is required in the second clock cycle. In practice, some countermeasures such as delay blocks [13] and glitch detection are adopted so that a clock glitch cannot be injected into an embedded module easily. However, it is possible that an adversary will develop techniques to achieve this in the future. Moreover, most chips for resource-constrained environments are still vulnerable owing to the lack of countermeasures.

In this brief, a new side-channel collision attack, called fault rate analysis (FRA), on masked AES is proposed. We inject a clock glitch into masked S-box implementations and utilize the fault rate to recover the secret key. Specifically, we further study the inner structure of the S-box and select some sophisticated clock glitches to mount a more efficient attack. Because the attack relies only on common implementation characteristics of masking, it applies to a wide range of implementations.

II. PRELIMINARY

A. Fault Sensitivity Analysis

When some input bits of a combinational circuit transit, the outputs become stable only after a short delay. This delay time is called the setup time. Li et al. found that an illegal clock could cause a setup-time violation of an S-box combinational circuit since flip-flops were triggered before the output signal was fixed to a correct value [9]. This illegal clock is called the clock glitch, which can be generated by the digital clock manager inside the control module in another field-programmable gate array (FPGA). When a clock glitch is injected into round 10 of AES, an adversary can determine which S-box produces a mistake according to the wrong ciphertext. Therefore, the glitch cycle is gradually decreased until the S-box makes a mistake. At this moment, the glitch cycle is approximately equal to the setup time of the S-box corresponding to a certain input value.

For each key guess, the correlation coefficient between the two following lists can be computed. The Hamming weights of S-box inputs computed from the guessed key are regarded as a list of samples, whereas their corresponding setup times are the other list of samples. Because of the data dependence of the schemes proposed in [14], the key guess corresponding to the maximum correlation coefficient is the right one.

However, FSA seems to be inefficient. To get a setup-time value, the adversary shortens the glitch cycle until the probability
of a faulty output is higher than a threshold, which requires repeating the encryption 10^4 times [11]. Moreover, computing the correlation coefficient needs 512 setup-time values in all.

Fig. 1. Procedure of clockwise collision analysis, taking the validity of output as the distinguisher.
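The correlation step of FSA can be illustrated with a minimal sketch on synthetic data. It rests on assumptions that are not taken from the brief: the S-box input is modeled simply as a known byte XORed with the key byte (a real attack on round 10 would work back from the ciphertext through the inverse S-box), and the setup time is modeled as growing linearly with the Hamming weight of that input plus Gaussian noise. The names `hamming_weight`, `pearson`, and `recover_key_byte` are hypothetical helpers.

```python
import math
import random


def hamming_weight(value: int) -> int:
    """Number of set bits in a byte."""
    return bin(value).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def recover_key_byte(known_bytes, setup_times):
    """Return the key guess whose predicted Hamming weights correlate best."""
    best_guess, best_corr = 0, -2.0
    for guess in range(256):
        predicted = [hamming_weight(d ^ guess) for d in known_bytes]
        corr = pearson(predicted, setup_times)
        if corr > best_corr:
            best_guess, best_corr = guess, corr
    return best_guess


# Synthetic demonstration (assumed toy timing model, not measured data).
rng = random.Random(2013)
true_key = 0x3C
known_bytes = [rng.randrange(256) for _ in range(512)]  # 512 samples, as in the text
setup_times = [
    10.0 + 0.5 * hamming_weight(d ^ true_key) + rng.gauss(0.0, 0.1)
    for d in known_bytes
]
recovered = recover_key_byte(known_bytes, setup_times)
```

The sketch only shows why each key byte needs a full list of setup-time samples; every one of those samples is itself the result of many glitched encryptions, which is the cost the following sections try to avoid.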

B. Fault-Based Clockwise Collision Analysis

Block ciphers are usually implemented iteratively, with different values fed into the same combinational circuit during different clock cycles; this is called a serial implementation. When the inputs of an S-box circuit are the same in two consecutive clock cycles, there is no transition during the second cycle [12]. Thus, even if the second clock cycle is much shorter than the setup time, no fault will occur. This fact leads to a side-channel attack, named clockwise collision analysis, which is described in Fig. 1. If the setup-time violation of an S-box combinational circuit cannot be triggered in the second clock cycle, one can infer that the inputs of the two S-boxes are the same. Therefore, an adversary can determine the collision between two consecutive S-boxes according to the output results during a specific glitch. Here, the setup-time violation of the S-box in round 10 can be detected by verifying whether an error takes place in the corresponding ciphertext byte.

III. FRA ON MASKING

Here, we study whether the S-box circuit happens to have identical inputs in two consecutive clock cycles, which is similar to clockwise collision analysis. The only difference here is that each S-box includes an input mask and an output mask. To attack the protected S-boxes, we study the AES implementation proposed in [15]. There are only four S-box combinational circuits in one round, and every four S-boxes are computed one after another by the same combinational circuit. Assume that two S-box operations S1 and S2 in round 10 are processed by the same combinational circuit and that S1 is followed by S2. For mounting an attack, the adversary runs the whole AES circuit with the same plaintext twice, which is described in Fig. 2. For the second encryption, a glitch is injected into S2. According to the output ciphertexts with no masks, one can determine whether an error happens.

Fig. 2. Procedure of our attack on a serial S-box circuit, in which the correct rate of y2′ is taken as a distinguisher.

Let x1, x2 and y1, y2 denote the unmasked inputs and outputs of the two S-boxes; m1, m2 and w1, w2 denote the input and output masks of S1 and S2 in the first experiment; and m1′, m2′ and w1′, w2′ denote those in the second experiment, respectively, with y2′ denoting the output of S2 in the second experiment. There are two phenomena.

1) If x1 ⊕ m1′ = x2 ⊕ m2′, m1′ = m2′, and w1′ = w2′, then y2′ = y2 always holds. That is to say, the two outputs must be correct no matter how short the glitch cycle is.
2) If x1 ⊕ m1′ ≠ x2 ⊕ m2′ or m1′ ≠ m2′ or w1′ ≠ w2′, there must exist transitions in the combinational circuit during the second clock cycle. If the glitch cycle is shorter than the setup time, we define the average probability of y2′ = y2 (intuitively, it should not hold except for some infrequent coincidence) as p.

We can view the above properties from a new perspective: If x1 ≠ x2, y2′ should be correct with probability p. However, if x1 = x2, the expected probability of y2′ = y2 should be higher than p since it contains the case where x1 = x2, m1′ = m2′, and w1′ = w2′. Thus, the correct probability is about (1 + 65535p)/65536 ≈ p + 1/65536, which implies a probability advantage of 1/65536. Roughly, p is about 1/256 when the glitch cycle is close to 0; thus, the threshold ρ0 may be chosen as 1/256 + 1/(65536 × 2). Then, we construct an attack described in Algorithm 1. By repeating the online stage and applying the linear collision attack [3] in the key recovery stage, an adversary can shrink the key space to 2^32.

Algorithm 1 FRA on serial S-boxes with masks

Precomputation stage:
1: Determine the minimum setup time t0 of the S-box circuit.
2: Choose n glitch cycles t1, t2, ..., tn, s.t. ti << t0, i ∈ {1, ..., n}.
3: Determine threshold ρ0 of the distinguisher corresponding to the cycles t1, t2, ..., tn.
Online stage:
1: Encrypt a random plaintext P under the normal clock and record the correct result. (Two target ciphertexts and subkey bytes of round 10 are denoted by c1, c2 and k1, k2, respectively.)
2: For each ti, i ∈ {1, ..., n}:
3:     Encrypt P for n times and inject a glitch (cycle is ti) into S2.
4: Compare the n × n output bytes with the correct value c2.
5: Compute the correct rate ρ.
6: If ρ ≥ ρ0, then x1 = x2 is determined; else, x1 ≠ x2.
7: If collision is detected, execute the key recovery stage; else, repeat steps 1–6.
Key recovery stage:
1: For a detected collision xi = xj, ki ⊕ kj = ci ⊕ cj holds [3].
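The key recovery stage of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the grouping of the 16 subkey bytes into four serial S-box circuits (`groups`) is a made-up index layout rather than the one used in [15], each collision is assumed to come from its own encryption, and `key_differences` and `expand_key` are hypothetical helper names. The only fact taken from the text is that a detected collision xi = xj yields ki ⊕ kj = ci ⊕ cj.

```python
import random


def key_differences(collisions):
    """Map each detected collision (i, j, c_i, c_j) to k_i XOR k_j = c_i XOR c_j."""
    return {(i, j): c_i ^ c_j for i, j, c_i, c_j in collisions}


def expand_key(guessed, differences, size=16):
    """Propagate guessed key bytes along the XOR relations.

    guessed: dict {byte index: guessed value}.
    Returns a list of length `size`; unresolved bytes stay None.
    """
    key = dict(guessed)
    progress = True
    while progress:
        progress = False
        for (i, j), diff in differences.items():
            if i in key and j not in key:
                key[j] = key[i] ^ diff
                progress = True
            elif j in key and i not in key:
                key[i] = key[j] ^ diff
                progress = True
    return [key.get(index) for index in range(size)]


# Toy demonstration with an assumed byte-to-circuit grouping.
rng = random.Random(1)
true_key = [rng.randrange(256) for _ in range(16)]
groups = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

collisions = []
for group in groups:
    for i, j in zip(group, group[1:]):
        s = rng.randrange(256)  # common S-box output, since x_i == x_j
        collisions.append((i, j, s ^ true_key[i], s ^ true_key[j]))

differences = key_differences(collisions)
# Pretend the adversary's guess for the first byte of each group is right.
guessed = {group[0]: true_key[group[0]] for group in groups}
recovered = expand_key(guessed, differences)
```

With three collisions per circuit, twelve relations fix twelve bytes in terms of the remaining four, so an exhaustive search over those four bytes covers 2^32 candidates.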

Fig. 3. Brief hardware architecture of a masked S-box for AES. Intuitively, the path from wi to output is much shorter than the critical path.

The signal-to-noise ratio is low in practice; therefore, the adversary should perform experiments many times for an average success rate based on different glitch cycles. We continue to explore the internal regularity of the S-box circuit in the next section.
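The arithmetic behind the threshold of Algorithm 1, and a rough sense of why many experiments are needed, can be checked with a short sketch. The exact fractions follow the formulas in the text; the trial estimate is an added back-of-envelope calculation (normal approximation of a binomial rate, with an assumed safety factor `z`) and is not a figure from the brief. The function names are hypothetical.

```python
from fractions import Fraction
import math


def collision_correct_rate(p, mask_cases):
    """Expected correct rate when x1 == x2.

    One of `mask_cases` equally likely mask combinations is always correct;
    the others are correct with probability p.
    """
    return (1 + (mask_cases - 1) * p) / mask_cases


def rough_trials(p, gap, z=3.0):
    """Rough number of encryptions so that `gap` is about z standard errors
    of an estimated rate near p (normal approximation, an assumption)."""
    return math.ceil(z * z * float(p * (1 - p)) / float(gap) ** 2)


p = Fraction(1, 256)        # rough probability of a correct output by chance
mask_cases = 256 * 256      # input mask and output mask must both repeat
rate_collision = collision_correct_rate(p, mask_cases)
threshold = p + Fraction(1, 2 * mask_cases)  # rho_0 as suggested in the text
advantage = rate_collision - p
trials = rough_trials(p, threshold - p)

print("advantage:", advantage, "threshold:", threshold, "rough trials:", trials)
```

Under these assumptions the gap between p and the threshold is tiny compared with the statistical noise of a single batch, which is consistent with the remark that the signal-to-noise ratio is low.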

IV. IMPROVED FRA

Here, we study the case where x1 ⊕ m1′ = x2 ⊕ m2′ and m1′ = m2′ regardless of w1′ and w2′. If this case happens with a higher probability for some fixed x1 and x2, we can conclude that x1 = x2.

A. Output Mask Independence

Most masked hardware implementations of the S-box first process xi ⊕ mi and mi and then mix the intermediate value with output mask wi. Therefore, the timing delay from inputs xi ⊕ mi and mi to output yi ⊕ wi is usually longer than that from input wi to yi ⊕ wi. For example, we have implemented “a very compact perfectly masked S-box for AES” [16]. According to the simulation in Quartus II, the setup time of the path from xi ⊕ mi and mi to yi ⊕ wi is always longer than 15 ns, whereas that from wi to yi ⊕ wi is always shorter than 6 ns. Fig. 3 shows these characteristics intuitively.

One can imagine that a glitch of 9 ns is injected in the second cycle (noting that the glitch cycle must be longer than 6 ns so that wi can be involved in the operation correctly).

1) If x1 ⊕ m1′ = x2 ⊕ m2′ and m1′ = m2′, then y2′ = y2 always holds since w2′ gets enough time to get through the circuit.
2) If x1 ⊕ m1′ ≠ x2 ⊕ m2′ or m1′ ≠ m2′, there must be transitions in the path from x2 ⊕ m2′, m2′ to y2 ⊕ w2′. Since the glitch cycle is less than the setup time, y2′ = y2 holds with some probability p.

We employed Quartus II and ModelSim to confirm the above statement. For S1, x1 = 49, m1 = 151, and w1 = 244, and for S2, x2 = 49, whereas m2 and w2 traverse 0–255. The experimental result is shown in Fig. 4(a). The two horizontal axes represent m2 and w2, respectively, whereas the vertical axis expresses how soon the output of the S2 circuit (including subsequent round key addition and mask elimination of the ciphertext) will be stable when the transition from S1 to S2 takes place. For the experiment on noncollision pairs, we select x1 = 94, m1 = 215, w1 = 204, and x2 = 95, whereas m2 and w2 traverse 0–255. Fig. 4(b) shows the noncollision experimental result.

Fig. 4. Correlation among input and output masks and maximum delays of the critical path in the case of (a) collision and (b) noncollision, respectively.

The following can be easily observed from Fig. 4.

1) The setup time is independent of output masks w2.
2) If x1 = x2, the setup time in the case of m1 = m2 is remarkably shorter than that in the other 255 cases.
3) If x1 ≠ x2, there is no obvious difference between the setup times regardless of m2.

Therefore, if we set the glitch cycle to be longer than the short path, an efficient distinguisher can be established: The correct rate of y2′ is larger than p when x1 = x2, which is about (1 + 255p)/256 ≈ p + 1/256.

B. Attack Scenario

The adversary hopes that the distinguisher is independent of the output mask for a higher efficiency. Thus, based on the above analysis, a fast attack is described in Algorithm 2. Here, the threshold ρ0 may be chosen as 1/256 + 1/(256 × 2) or a little greater.
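The improved distinguisher can be illustrated with a small Monte Carlo sketch. It uses an idealized model that is an assumption, not a measurement: whenever the unmasked inputs and the input masks both repeat, the output is counted as correct, and in every other case it is correct with a fixed probability p = 1/256, independent of the glitch cycle. The glitch list (7.1 ns to 11.0 ns in 0.1 ns steps) mirrors the values used in the experiments of this brief, and `simulate_correct_rate` is a hypothetical name.

```python
import random
from fractions import Fraction


def simulate_correct_rate(x1, x2, p, glitch_cycles, encryptions, rng):
    """Idealized model of the improved distinguisher.

    For every glitch cycle, `encryptions` runs draw fresh input masks.
    The output counts as correct for sure when x1 == x2 and the two input
    masks coincide; otherwise it is correct with probability p.
    """
    correct = 0
    total = 0
    for _ in glitch_cycles:
        for _ in range(encryptions):
            m1 = rng.randrange(256)
            m2 = rng.randrange(256)
            if (x1 ^ m1) == (x2 ^ m2) and m1 == m2:
                correct += 1
            elif rng.random() < p:
                correct += 1
            total += 1
    return correct / total


p = 1 / 256
threshold = float(Fraction(1, 256) + Fraction(1, 2 * 256))  # rho_0 from the text
glitch_cycles = [7.1 + 0.1 * step for step in range(40)]    # between 6 ns and 15 ns

rng = random.Random(7)
rate_collision = simulate_correct_rate(49, 49, p, glitch_cycles, 5000, rng)
rate_noncollision = simulate_correct_rate(94, 95, p, glitch_cycles, 5000, rng)

print("collision:", rate_collision, "noncollision:", rate_noncollision,
      "threshold:", threshold)
```

In this toy model the two rates settle near p + 1/256 and p, so the suggested threshold separates them with far fewer encryptions than the output-mask-dependent distinguisher of Section III would need.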

Algorithm 2 Improved FRA on serial S-boxes with masks

Precomputation stage:
1: Evaluate the S-box circuit’s minimum setup time t01 from xi ⊕ mi and mi to yi ⊕ wi, and the maximum setup time t02 from wi to yi ⊕ wi.
2: Choose n glitch cycles t1, t2, ..., tn, such that t02 < ti < t01, i ∈ {1, ..., n}.
3: Determine threshold ρ0 of the distinguisher.
Online stage:
The same as Algorithm 1.
Key recovery stage:
The same as Algorithm 1.

C. Experiments

We employed an Altera DE2 board with an EP2C35F672C6 FPGA for implementing an S-box hardware circuit and a RIGOL DG4102 function generator for clock supply, as shown in Fig. 5. The two devices cost about $1000, which is much cheaper than the equipment needed for other kinds of side-channel attacks.

Fig. 5. The experiment environment of our attack consists of a function generator, an FPGA board, and a PC.

Fig. 6 describes the circuit structure of our attack. First, we implemented a masked AES S-box based on the tower field [16]. Then, an on-chip glitchy-clock generator [17] was implemented for generating expected glitches. A lookup table S-box and a comparator were used for deciding the correctness of the output of the masked S-box. As a result, the number of correct values was recorded by a counter, which could be obtained by the In-System Memory Content Editor of Quartus II.

Fig. 6. Circuit structure of our experiments. The number of correct values is saved, which can be acquired from Quartus II.

For verifying our attack, let x1 = 1, m1 = 30, and the glitch cycle be 7.1 ns, 7.2 ns, 7.3 ns, ..., 11 ns for the fault injection. When a glitch cycle is fixed, for each x2 ∈ {0, 1, 2, ..., 255}, we choose m2 from 0 to 255 randomly and encrypt 2000 times. Thus, 256 correct rates can be obtained, which form a column of points in Fig. 7. In this figure, the correct rates corresponding to x2 = 1 are plotted as stars, and those for x2 = 0, 2, ..., 255 are plotted as gray points. Although the points overlap with each other, one can still find that, when x1 = x2 = 1, the expectation of the correct rate, 0.79%, is higher than that in the case of noncollision. According to a threshold predetermined by a template-based method, collision can be decided successfully.

Fig. 7. Correct rates under different glitch cycles when x1 = 1 and x2 traverses 0–255. The stars and gray points stand for collision (x2 = 1) and noncollision (x2 ≠ 1), respectively.

According to our following experiment, an adversary can still distinguish collision from noncollision clearly even if x1 is an arbitrary value between 0 and 255. For each pair (x1, x2), x1, x2 ∈ {0, 1, ..., 255}, we choose m1 and m2 from 0 to 255 randomly and perform 40 experiments (each experiment consists of 2000 encryptions) during 40 different glitch cycles for each pair (m1, m2). Then, a total correct rate corresponding to the pair (x1, x2) is acquired, which is a point in Fig. 8. The horizontal axis shows the values of x1, and for each x1, the one star above corresponds to the case of x2 = x1, whereas the 255 gray points below correspond to the other 255 values of x2. It is clear that, for any pair (x1, x2), an adversary can distinguish collision from noncollision.

Fig. 8. Distinct correct rates between (stars) collision and (gray points) noncollision.

For a 16-byte subkey of round 10, after repeating Algorithm 2 12 times, 12 collisions can be decided, which are described by the 12 dashed lines in Fig. 9. Here, the offset in each row is due to the ShiftRows operation of AES. Thus, only the first four bytes should be guessed, and the other 12 bytes can be expressed according to the found collisions. Altogether, 2^32 searches can recover the whole key of AES.

D. Efficiency Comparison

Compared with the other three collision attacks on masked S-box implementations in Table I, our attacks have remarkable advantages in encryption time, computation complexity, and feasibility. Specifically, for the online encryption, the
fault-based colliding timing characteristic (CTC) [11] executes 10^6 encryptions in the FPGA, which cost about 100 s, and records 512 critical path delays, which occupy 2048 bytes. However, our attack executed in the FPGA only costs 8 s for 8 × 10^4 encryptions and 80 bytes for storage of 40 counters corresponding to 40 glitches, which is 8% and 4% of the CTC, respectively. In power-based attacks, it usually takes more than 100 ms to record one power trace, which is much longer than an encryption in the FPGA. Thus, the computations of correlation-enhanced power analysis (CEPA) [10] and collision-correlation power analysis (CCPA) [4] executed in the FPGA cost 2000 s for 2 × 10^4 encryptions and 1000 s for 10^4 encryptions, respectively. Moreover, their space cost, i.e., 2 × 10^6 and 8 × 10^7 bytes, is much higher than that of our attack.

The column of complexity shows the offline computation complexity for recovering a key byte. Here, Cρn represents the complexity of computing the correlation coefficient relating to n sample pairs, whereas Cdiv stands for the complexity of one division operation, which is used to compute the correct rate. In our attack, only one division is needed, which is much faster than the computation of the correlation coefficient.

From the point of view of attack capability, our attack, like CEPA and CTC, can break countermeasures with nonreused masks, whereas CCPA cannot. Furthermore, the execution of CEPA and CTC should satisfy an extra assumption that the trace can leak information after being averaged directly, but our attack does not need this kind of assumption.

TABLE I. COMPARISONS OF FOUR METHODS FOR COLLISION DETECTION

Fig. 9. Twelve detected collisions (dashed lines) corresponding to 16 subkey bytes (table cells) of round 10.

V. CONCLUSION AND COUNTERMEASURES

We have proposed a new side-channel collision attack on masked AES based on the fault rate when injecting a clock glitch into the S-box circuit. More interestingly, we first studied the paths of the S-box circuit and found a better method for fast key recovery. This phenomenon has potential applications to side-channel attacks on protected block ciphers. By comparison, previous FSA also considered the inner structure of the S-box circuit, but it only utilized the critical path, instead of other short paths.

Our method is universal for most masked hardware implementations. The effectiveness of our attacks depends on two factors: 1) the S-box should be executed in series; and 2) there should be a short path corresponding to the output mask. In practice, except for some special cases in which the speed of encryption is required, most implementations of S-boxes are in series due to the area limitation. Furthermore, because the output mask always joins the computation very late (when the input mask is being removed), all the hardware implementations of the S-box include a short path corresponding to the output mask. Therefore, our attacks are applicable to most hardware implementations of many block ciphers and do not strongly depend on the specific implementation or design.

For resisting our attacks, first, an illegal clock may be detected. A delayed combinational circuit can be connected in parallel with the S-box. When this circuit encounters errors, any output of AES is forbidden. Second, our attack relies on the equality of the previous and next masks. If the circuit can detect this case and drop the flawed masks, the collision of the 24-bit input values will never take place, and our attack can be avoided. We expect that skillful combinations of countermeasures can achieve relatively high security at low complexity.

REFERENCES

[1] P. C. Kocher, “Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems,” in Proc. CRYPTO, 1996, pp. 104–113.
[2] K. Schramm, T. J. Wollinger, and C. Paar, “A new class of collision attacks and its application to DES,” in Proc. FSE, 2003, pp. 206–222.
[3] A. Bogdanov, “Improved side-channel collision attacks on AES,” in Proc. SAC, 2007, pp. 84–95.
[4] C. Clavier, B. Feix, G. Gagnerot, M. Roussellet, and V. Verneuil, “Improved collision-correlation power analysis on first order protected AES,” in Proc. CHES, 2011, pp. 49–62.
[5] E. Brier, C. Clavier, and F. Olivier, “Correlation power analysis with a leakage model,” in Proc. CHES, 2004, pp. 16–29.
[6] E. Oswald, S. Mangard, C. Herbst, and S. Tillich, “Practical second-order DPA attacks for masked smart card implementations of block ciphers,” in Proc. CT-RSA, 2006, pp. 192–207.
[7] S. Chari, J. R. Rao, and P. Rohatgi, “Template attacks,” in Proc. CHES, 2002, pp. 13–28.
[8] E. Biham and A. Shamir, “Differential fault analysis of secret key cryptosystems,” in Proc. CRYPTO, 1997, pp. 513–525.
[9] Y. Li, K. Sakiyama, S. Gomisawa, T. Fukunaga, J. Takahashi, and K. Ohta, “Fault sensitivity analysis,” in Proc. CHES, 2010, pp. 320–334.
[10] A. Moradi, O. Mischke, and T. Eisenbarth, “Correlation-enhanced power analysis collision attack,” in Proc. CHES, 2010, pp. 125–139.
[11] A. Moradi, O. Mischke, C. Paar, Y. Li, K. Ohta, and K. Sakiyama, “On the power of fault sensitivity analysis and collision side-channel attacks in a combined setting,” in Proc. CHES, 2011, pp. 292–311.
[12] Y. Li, K. Ohta, and K. Sakiyama, “An extension of fault sensitivity analysis based on clockwise collision,” in Proc. Inscrypt, 2012, pp. 46–59.
[13] S. Endo, Y. Li, N. Homma, K. Sakiyama, K. Ohta, and T. Aoki, “An efficient countermeasure against fault sensitivity analysis using configurable delay blocks,” in Proc. FDTC, 2012, pp. 95–102.
[14] S. Morioka and A. Satoh, “An optimized S-box circuit architecture for low power AES design,” in Proc. CHES, 2002, pp. 172–186.
[15] S. Mangard, M. Aigner, and S. Dominikus, “A highly regular and scalable AES hardware architecture,” IEEE Trans. Comput., vol. 52, no. 4, pp. 483–491, Apr. 2003.
[16] D. Canright and L. Batina, “A very compact perfectly masked S-box for AES,” in Proc. ACNS, 2008, pp. 446–459.
[17] S. Endo, T. Sugawara, N. Homma, T. Aoki, and A. Satoh, “An on-chip glitchy-clock generator for testing fault injection attacks,” J. Cryptogr. Eng., vol. 1, no. 4, pp. 265–270, Dec. 2011.
